OpenSource For You

In this month’s column, we continue our discussion on natural language processing, focusing on automatic key phrase extraction.

Over the past few months, we have been discussing natural language processing algorithms. Let us continue that discussion with a focus on automatic key phrase extraction. First, let us understand what exactly automatic key phrase extraction is and why it matters in natural language processing.

Automatic key phrase extraction refers to the automatic identification and selection of important and topical phrases contained in the document being examined. In layman’s terms, key phrases are the phrases in a document that represent the topic or subject matter being discussed in it. This definition is quite subjective, since it leaves us with the question, “What is important and topical for a given document?” The answer depends on the consumer of that information. Key phrases have multiple uses in information retrieval and text mining. They can be used to categorise documents and to automatically generate indexes of documents, so that the documents can be found through user queries. Consider a large digital library to which documents are added in huge numbers. Automatic key phrase extraction can be used to index and categorise them. Key phrase extraction can also be used for document summarisation and for question answering.

As human beings, we unconsciously perform key phrase detection when we read documents or listen to spoken text. We pick out the relevant key phrases, which convey the important information or facts, and ignore the rest automatically, without any explicit effort on our part. However, doing this with a computer is not easy. Humans use their knowledge of the world and the context of the document to identify what information is ‘key’ and what is not. Computers cannot readily mimic such human heuristics, because many of these filters are learnt and are based on our knowledge of the world. So how do computers detect key phrases automatically?

Automatic key phrase extraction involves three main steps. The first step is the candidate generation phase, during which a set of phrases is selected as ‘candidate key phrases’ using certain heuristics or rules. The second step is to determine which of the candidates from Step 1 should be retained as key phrases and which should be pruned. Typically, supervised or unsupervised learning algorithms are employed in this step. The third step is to rank the retained key phrases. Some approaches omit the third step and do not provide a ranking of key phrases.
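
To make the three steps concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the stopword list, the function names, the three-word limit and the frequency threshold are all hypothetical choices made for this example. In particular, Step 2 here is a simple frequency cut-off standing in for the supervised or unsupervised learning algorithms that real systems would typically use.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption made for this sketch).
STOPWORDS = {
    "a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "are",
    "for", "with", "by", "be", "can", "that", "this", "it", "as",
}


def generate_candidates(text, max_len=3):
    """Step 1: treat runs of non-stopwords (up to max_len words) as candidates."""
    candidates = []
    # Punctuation ends a candidate, so split the text into fragments first.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        run = []
        # The trailing None is a sentinel that flushes the last run.
        for word in fragment.split() + [None]:
            if word is None or word in STOPWORDS:
                if 0 < len(run) <= max_len:
                    candidates.append(" ".join(run))
                run = []
            else:
                run.append(word)
    return candidates


def select_key_phrases(candidates, min_count=2):
    """Step 2 (stand-in): keep candidates seen at least min_count times."""
    counts = Counter(candidates)
    return {phrase: n for phrase, n in counts.items() if n >= min_count}


def rank_key_phrases(selected):
    """Step 3: most frequent first; ties are broken alphabetically."""
    return sorted(selected, key=lambda phrase: (-selected[phrase], phrase))


def extract_key_phrases(text, max_len=3, min_count=2):
    candidates = generate_candidates(text, max_len)
    return rank_key_phrases(select_key_phrases(candidates, min_count))


sample_text = (
    "Key phrase extraction is useful. Key phrase extraction can index "
    "a digital library. A digital library is large."
)
print(extract_key_phrases(sample_text))
# prints ['digital library', 'key phrase extraction']
```

Keeping the three steps as separate functions mirrors the description above: an approach that omits ranking could simply return the output of the second function.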

Certain factors play a major role in key phrase extraction. For example, the length of the document determines the number of candidate key phrases generated in Step 1. The longer the document, the more candidate key phrases are generated, which makes Step 2 much more compute-intensive. Emails, news articles and scientific abstracts are typically much shorter and, hence, the number of candidate key phrases generated in Step 1 is much smaller for them than for long scientific articles or technical reports.
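
As a rough illustration of how document length drives the candidate count, the sketch below assumes the simplest possible Step 1 heuristic: every contiguous run of up to three words is a candidate. That heuristic, the function name and the token counts are assumptions made for this example. Real candidate generators filter far more aggressively, so these figures are only an upper bound.

```python
def count_ngram_candidates(num_tokens, max_len=3):
    """Count all contiguous word n-grams of length 1..max_len in a document."""
    # A document of N tokens contains N - n + 1 n-grams of length n.
    return sum(max(num_tokens - n + 1, 0) for n in range(1, max_len + 1))


# Hypothetical lengths: a short abstract versus a long technical report.
print(count_ngram_candidates(100))   # prints 297
print(count_ngram_candidates(5000))  # prints 14997
```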

The structure of the document also plays a key role in key phrase extraction. Most documents have a certain structure, or at least a ‘semi-structure’, which can be exploited in key phrase extraction. It is well known in the language processing literature that the key ideas of a document typically appear at its beginning and end. For instance, in structured documents such as scientific
