OpenSource For You

This month’s column continues the discussion of natural language processing.

For the past few months, we have been discussing information retrieval and natural language processing (NLP), as well as the algorithms associated with them. In this month’s column, let’s continue our discussion on NLP while also covering an important NLP application called ‘Named Entity Recognition’ (NER). As mentioned earlier, given a large number of text documents, NLP techniques are employed to extract information from the documents. One of the most common sources of textual information is newspaper articles. Let us consider a simple example wherein we are given all the newspaper articles that appeared in the last one year. The task assigned to us is related to the world of business: we are asked to find all the mergers and acquisitions of businesses. We need to extract information on which companies bought other firms, as well as which companies merged with each other. Our first rudimentary step towards getting this information will perhaps be a keyword-based search using terms such as ‘merger’ or ‘buys’. Once we find the sentences containing those keywords, we could then look for the names of the companies, if any occur in those sentences. Such a task requires us to identify all company names present in the document.
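
To make the keyword-based first step concrete, here is a minimal Python sketch. The function names, the naive sentence splitter and the sample text are assumptions made purely for illustration; only the two keywords come from the discussion above.

```python
import re

# The two keywords mentioned in the column; a real search would need many more.
KEYWORDS = ("merger", "buys")

def split_sentences(text):
    # Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def candidate_sentences(text, keywords=KEYWORDS):
    # Keep only the sentences that contain one of the keywords as a whole word.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [s for s in split_sentences(text) if pattern.search(s)]

# Made-up sample text, not taken from any real newspaper.
article = (
    "Acme Ltd buys Widget Corporation for an undisclosed sum. "
    "The weather was pleasant in Bengaluru. "
    "Analysts expect the merger to close next year."
)

for sentence in candidate_sentences(article):
    print(sentence)
```

The sketch only narrows down where to look. Picking out the company names within the selected sentences is the harder part, and that is where NER comes in.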

For a person reading the newspaper article, such a task seems simple and straightforward. Let us first try to list the ways in which a human being would identify the company names present in a text document. We would use heuristics such as: (a) Company names typically begin with capital letters; (b) They can contain words such as ‘Corporation’ or ‘Ltd’; (c) They can be represented by letters of the alphabet separated by full stops, such as I.B.M. We could also use contextual clues such as ‘X’s stock price went up’ to infer that X is a business or company. Now, the question we are left with is whether it is possible to convert our intuitive knowledge about how to look for a company’s name in a text document into rules that can be automatically checked by a program. This is the task faced by NLP applications that try to do Named Entity Recognition (NER). The point to note is that while the simple heuristics we use to identify names of companies work well in many cases, they can also fail to extract company names in certain other cases. For instance, consider the possibility of the company’s name being written as IBM instead of I.B.M., or as International Business Machines. A rule-based system could fail to recognise it. Similarly, consider a sentence like, “Indian Oil and Natural Gas Company decided that…” In this case, it is difficult to figure out whether two independent entities, namely ‘Indian Oil’ and ‘Natural Gas Company’, are being referred to, or a single entity whose name is ‘Indian Oil and Natural Gas Company’. It requires considerable knowledge about the business world to resolve the ambiguity. We could perhaps consult the World Wide Web or Wikipedia to clear our doubts. The use of such sources of knowledge is quite common in NER systems.
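
Heuristics (a) to (c) above can be turned into a small rule-based matcher. The following Python sketch is only one possible encoding, under the assumption that two regular expressions are enough; the pattern names, the function name and the sample sentence are invented for illustration.

```python
import re

# Heuristics (a) and (b): one or more capitalised words ending in a company suffix.
SUFFIXED = re.compile(r"\b(?:[A-Z][A-Za-z]*\s+)+(?:Corporation|Ltd)\b")
# Heuristic (c): single capital letters separated by full stops, such as I.B.M.
DOTTED = re.compile(r"\b(?:[A-Z]\.){2,}")

def find_company_names(text):
    # Collect (position, name) pairs from both rules and return names in text order.
    spans = [(m.start(), m.group()) for m in SUFFIXED.finditer(text)]
    spans += [(m.start(), m.group()) for m in DOTTED.finditer(text)]
    return [name for _, name in sorted(spans)]

# Made-up sample sentence for illustration.
sample = (
    "Shares of Wipro Ltd rose after I.B.M. and Acme Widget Corporation "
    "announced a deal, but IBM was not mentioned."
)
print(find_company_names(sample))
```

Note that the plain ‘IBM’ at the end of the sample is not picked up by either rule, which is exactly the kind of miss described above.
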
Now let us look a bit deeper into NER systems and their uses.

Types of entities

What are the types of entities that are of interest to an NER system? Named entities are, by definition, proper nouns, i.e., nouns that refer to a particular person, place, organisation, thing, date or time, such as Sandya, Star Wars, Pride and Prejudice, Cubbon Park, March, Friday, Wipro Ltd, Boy Scouts, and the Statue of Liberty. Note that a named entity can span more than one word, as in the case of ‘Cubbon Park’. Each of these entities is assigned a tag such as Person, Company, Location, Month, Day, Book, etc. If the above examples are tagged with entities, they will be tagged as <Person>Sandya</Person>, <Movie>Star Wars</Movie>, <Book>Pride and Prejudice</Book>, <Location>Cubbon Park</Location>, etc.
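
As a small illustration of this tagging format, the Python sketch below wraps already-identified entities in tags. It assumes the entity spans are known in advance (recognising them is the actual NER problem); the helper names and the sample sentence are hypothetical.

```python
def tag_entities(text, spans):
    """Wrap each span in <Type>...</Type> tags.

    spans: list of (start, end, entity_type) with non-overlapping character offsets.
    """
    out = []
    cursor = 0
    for start, end, etype in sorted(spans):
        out.append(text[cursor:start])                       # untagged text before the entity
        out.append(f"<{etype}>{text[start:end]}</{etype}>")  # the tagged entity
        cursor = end
    out.append(text[cursor:])                                # untagged text after the last entity
    return "".join(out)

def span_of(text, phrase, etype):
    # Helper for the example only: locate the first occurrence of a phrase.
    start = text.index(phrase)
    return (start, start + len(phrase), etype)

# Made-up sentence using entities from the column.
sentence = "Sandya watched Star Wars near Cubbon Park."
spans = [
    span_of(sentence, "Sandya", "Person"),
    span_of(sentence, "Star Wars", "Movie"),
    span_of(sentence, "Cubbon Park", "Location"),
]
print(tag_entities(sentence, spans))
```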

It is important not only that the NER system recognises a phrase correctly as an entity, but also that it labels it with the right entity type. Consider the sentence, “Washington Jr went to school in England, but for graduate studies, he moved to the United States and studied at Washington.” This sentence contains two references to the noun ‘Washington’, one as a person: ‘Washington Jr’ and another as a location: ‘Washington, United States’. While it may appear that an NER system with a list of all proper nouns can correctly extract all entities, in reality this is not true. Consider the two sentences, “Jobs are hard to find…” and “Jobs said that the employment rate is picking up…” Even if the NER system has an exhaustive list of proper nouns, it needs to figure out that the word ‘Jobs’ in the first sentence does not refer to an entity, whereas ‘Jobs’ in the second sentence does.

Given our discussion so far, it is clear that NER systems can be built in a number of ways, though no single method can be considered superior to the others, and a combination of techniques is often needed. Rule-based systems use typical pattern matching techniques to identify the entities. We saw that rule-based NER systems tend to be incomplete and have the disadvantage of requiring frequent manual extension. On the other hand, it is possible to extract features associated with named entities and use them to train classifiers that can tag entities, using machine learning techniques. Machine learning approaches for identifying entities can be based on: (a) supervised learning techniques; (b) semi-supervised learning techniques; and (c) unsupervised learning techniques.

A third kind of NER system is based on gazetteers, wherein a lexicon or gazetteer of names is constructed and made available to the NER system, which then tags the text, identifying entities based on the lexicon entries. Once a gazetteer is available, all that the NER system needs to do is to perform an efficient lookup in the gazetteer for each phrase it identifies in the text, and tag it based on the information it finds there. A gazetteer can also help to embed external world information, which can help in named entity resolution. But first, the gazetteer needs to be built for it to be available to the NER system. Building a gazetteer can consume considerable manual effort. One of the alternatives is to build the gazetteer itself through automatic means, which brings us back to the problem of recognising named entities automatically from various document sources. Typically, external world sources such as Wikipedia or Twitter can be used as the information sources from which the gazetteer can be built. Sometimes a combination of approaches can be used, with a lexicon working in conjunction with a rule-based or machine learning approach.
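
The gazetteer lookup described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a tiny hand-built gazetteer keyed by token tuples and a longest-match-first strategy; the entries, the function name and the sample sentence are made up, and a real gazetteer would be far larger.

```python
# A tiny hand-built gazetteer: token tuples mapped to entity types (illustrative entries).
GAZETTEER = {
    ("Wipro", "Ltd"): "Company",
    ("Cubbon", "Park"): "Location",
    ("Statue", "of", "Liberty"): "Location",
    ("Star", "Wars"): "Movie",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def gazetteer_tag(tokens):
    """Return (phrase, entity_type) pairs, preferring the longest match at each position."""
    results = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in GAZETTEER:
                results.append((" ".join(key), GAZETTEER[key]))
                i += n          # skip past the matched phrase
                matched = True
                break
        if not matched:
            i += 1              # no entry starts here; move on one token
    return results

tokens = "She walked from Cubbon Park to the Wipro Ltd office".split()
print(gazetteer_tag(tokens))
```

Since the dictionary lookup is a hash lookup, each position costs at most MAX_LEN probes, which keeps the tagging pass efficient even for a large gazetteer.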

While rule-based NER systems and gazetteer approaches work well for domain-specific NER, machine learning approaches generally perform well when applied across multiple domains. Many of the machine learning based approaches use supervised learning techniques, in which a large corpus of text is annotated manually with named entities and the annotated data is used to train the learner. These systems use statistical models and some form of feature identification to make predictions about named entities in unlabelled text, based on what they have learnt from the annotated text. Typically, supervised learning systems study the features of the positive and negative examples in the hand-annotated training set, i.e., phrases that have been tagged as named entities and phrases that have not. They use that information to come up with statistical models, which can predict whether a newly encountered phrase is a named entity or not. If it is a named entity, supervised learning systems predict its type as well. In the next column, we will continue our discussion on how hidden Markov models and maximum entropy models can be used to construct learner systems.
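
To give a feel for what ‘feature identification’ could look like, here is a minimal Python sketch that turns one token and its neighbours into a feature dictionary. The particular features and the function name are assumptions chosen for illustration, not a prescribed feature set; a supervised learner would be trained on such features extracted from the annotated corpus.

```python
def token_features(tokens, i):
    """Describe the token at position i, plus its immediate context, as features."""
    word = tokens[i]
    return {
        "word_lower": word.lower(),
        "is_capitalised": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "is_sentence_start": i == 0,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>",
    }

# The two 'Jobs' sentences discussed earlier in the column.
first = "Jobs are hard to find".split()
second = "Jobs said that the employment rate is picking up".split()

print(token_features(first, 0))
print(token_features(second, 0))
```

For the word ‘Jobs’, the two feature dictionaries differ only in the next word (‘are’ versus ‘said’). Contextual features of this kind are what allow a trained model to separate the common noun from the person’s name.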

My ‘must-read book’ for this month

This month’s book suggestion comes from one of our readers, Jayshankar, and his recommendation is very appropriate for this month’s column. He recommends an excellent resource for text mining: a book called ‘Taming Text’ by Ingersoll, Morton and Farris. The book describes different algorithms for text search, text clustering and classification. There is also a detailed chapter on Named Entity Recognition, which will be useful supplementary reading for this month’s column. Thank you, Jay, for sharing this book link.

If you have a favourite programming book or article that you think is a must-read for every programmer, please do send me a note with the book’s name, and a short write-up on why you think it is useful, so I can mention it in the column. This would help many readers who want to improve their software skills.

If you have any favourite programming questions or software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!

Sandya Mannarswamy
