In this month’s column, we continue our discussion on natural language processing, focusing on automatic key phrase extraction.

2015-05-10

Over the past few months, we have been discussing natural language processing algorithms. Let us continue with a focus on automatic key phrase extraction. First, let us understand what automatic key phrase extraction is and why it matters in the context of natural language processing.

Automatic key phrase extraction refers to the automatic identification and selection of important and topical phrases contained in the document being examined. In layman’s terms, key phrases are the phrases in a document that represent the topic or subject matter being discussed. This definition is quite subjective, since it leaves us with the question, “What is important and topical for a given document?” The answer depends on the consumer of that information. Key phrases have multiple uses in information retrieval and text mining. They can be used to categorise documents and to automatically generate indexes, so that the documents can be found through user queries. Consider a large digital library to which documents are added in huge numbers; automatic key phrase extraction can be used to index and categorise them. Key phrase extraction can also be used for document summarisation and question answering.

As human beings, we unconsciously perform key phrase detection when we read documents or listen to spoken text. We pick out the key phrases that convey the important information or facts and ignore the rest, without any explicit effort on our part. However, doing this with a computer is not easy. Humans use their knowledge of the world and the context of the document to identify what information is ‘key’ and what is not. Computers cannot easily mimic such heuristics, because many of these filters are learnt and are based on our knowledge of the world. So how do computers detect key phrases automatically?

Automatic key phrase extraction involves three main steps. The first step is the candidate generation phase, during which a set of phrases is selected as ‘candidate key phrases’ using certain heuristics or rules. The second step is to determine which of the candidates from Step 1 should be retained as key phrases and which should be pruned. Typically, supervised or unsupervised learning algorithms are employed in this step. The third step is to rank the retained key phrases. Some approaches omit the third step and do not provide a ranking.
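To make the three steps concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a reference implementation: the tiny stopword list, the rule of splitting on stopwords and punctuation, the frequency threshold standing in for a learning algorithm in Step 2, and all the function names are hypothetical choices made for this example.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {
    "a", "an", "the", "of", "in", "on", "for", "to", "and", "or", "is", "are",
    "be", "can", "by", "with", "that", "this", "it", "as", "from", "we",
}


def generate_candidates(text, max_words=3):
    """Step 1: split on stopwords and punctuation to get candidate phrases."""
    candidates = []
    # Punctuation ends a phrase, so handle each punctuation-delimited chunk.
    for chunk in re.split(r"[^\w\s]+", text.lower()):
        phrase = []
        for word in chunk.split():
            if word in STOPWORDS:
                if phrase:
                    candidates.append(" ".join(phrase))
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            candidates.append(" ".join(phrase))
    # Heuristic: discard very long candidates.
    return [c for c in candidates if len(c.split()) <= max_words]


def select_key_phrases(candidates, min_count=2):
    """Step 2: prune candidates; keep those seen at least min_count times.

    A simple frequency threshold stands in for a learning algorithm here.
    """
    counts = Counter(candidates)
    return {p: n for p, n in counts.items() if n >= min_count}


def rank_key_phrases(selected):
    """Step 3: rank by frequency, breaking ties in favour of longer phrases."""
    return sorted(selected, key=lambda p: (-selected[p], -len(p.split()), p))


def extract_key_phrases(text):
    """Run the three steps in sequence."""
    return rank_key_phrases(select_key_phrases(generate_candidates(text)))


sample = (
    "Key phrase extraction is important. Key phrase extraction can be used "
    "for summarisation of documents. Summarisation of long documents is hard."
)
print(extract_key_phrases(sample))
# prints ['key phrase extraction', 'summarisation']
```

Keeping each step in its own function means that the pruning rule in Step 2 can later be swapped for a trained classifier or a graph-based scorer without touching candidate generation or ranking.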

Certain factors play a major role in key phrase extraction. For example, the length of the document determines the number of candidate key phrases generated in Step 1. The longer the document, the more candidate key phrases are generated, which makes Step 2 much more compute-intensive. Emails, news articles and scientific abstracts are typically much shorter than full scientific articles or technical reports and, hence, the number of candidate key phrases generated in Step 1 is much smaller for them.
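The effect of document length can be seen with a rough count. The sketch below assumes, purely for illustration, that Step 1 treats every contiguous run of one to three words as a candidate, with no filtering at all; real candidate generators prune far more aggressively, so the figure is an upper bound for that heuristic rather than a typical value. The function name and the word counts used are hypothetical.

```python
def count_ngram_candidates(num_words, max_n=3):
    """Count contiguous word n-grams (n = 1..max_n) in a text of num_words words.

    A text of num_words words contains num_words - n + 1 n-grams of length n
    (or none if the text is shorter than n words).
    """
    return sum(max(num_words - n + 1, 0) for n in range(1, max_n + 1))


# Illustrative word counts only, not measurements of real documents.
for length in (200, 8000):
    print(length, count_ngram_candidates(length))
# prints:
# 200 597
# 8000 23997
```

Under this assumption the number of candidates grows roughly in proportion to the number of words, so a document 40 times longer hands about 40 times as many candidates to Step 2.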

The structure of the document also plays a key role in key phrase extraction. Most documents have a certain structure, or at least a ‘semi-structure’, which can be exploited during extraction. It is well known in the language processing literature that the key ideas typically appear at the beginning and end of a document. For instance, in structured documents such as scientific
