OpenSource For You

In this month’s column, we discuss information extraction.

In the last couple of columns, we have been discussing computer science interview questions. This month, let’s return to the subject of natural language processing. In particular, let’s focus on information extraction.

Given a piece of text, the goal is to extract the information contained in it, using natural language processing techniques. We know that vast amounts of information are present on the World Wide Web. However, much of this information lies hidden in the form of unstructured and semi-structured text. Today, much of this information is presented to us through search engine results, as Web pages, and we have to spend considerable time reading through the individual Web pages in order to find the specific information we are interested in.

For example, consider the following query presented on a Web search engine: “Who were the presidents of the United States that were assassinated?” Google returns a boxed result with the exact answer, naming the four US presidents who were assassinated while in office. Questions of the type ‘What, where, when, who…’ are known as ‘Wh*’ factual questions. Search engines use knowledge bases such as Freebase, Wikipedia, etc, to find answers to ‘Wh*’ questions.

On the other hand, let us try another example query: “Why did Al Gore concede the election?” This time, Google returns thousands of search engine result pages (SERPs) and does not point us to an exact answer. Perhaps this was a tough question; so let us try another: “Why did Al Pacino refuse his Oscar award?” Again, Google is unable to provide a succinct answer; instead, it showers us with thousands of SERPs and makes us hunt for the answer ourselves. These two examples may make us think that search engines are very good at answering questions in the ‘What, Who, Where, When’ category, but not so good when it comes to answering ‘Why’ questions. But that is not actually true. As a final example, consider the question: “Why is the sky blue?” Google returns the exact answer in a box right on top of the search results. So not all ‘Why’ questions are difficult.

One possible reason Google could easily answer the last question is that the information needed to answer it was extracted from the relevant Web pages with enough confidence for the search engine to say that this is probably the right answer. On the other hand, for questions such as “Why did Al Gore concede the election?”, it could not extract the relevant information with enough confidence to provide a succinct answer. However, if we peruse the search engine results, the first link, namely the Wikipedia article on Al Gore’s presidential campaign, does indeed contain the answer to the question: “Gore strongly disagreed with the Court’s decision, but said: ‘For the sake of our unity as a people and the strength of our democracy, I offer my concession.’” This information could not be effectively extracted by the search engine to offer the exact answer, though it was present in the very first search engine result. Hence the question we want to consider in today’s column is: What makes information extraction difficult and how can we effectively extract the information we need to answer our queries?

In earlier columns, we discussed closed and open information extraction techniques. Basically, given a set of text documents, information extraction systems are intended to analyse this unstructured text and extract information in a structured form. The structure supporting the informational needs of the particular domain is the domain ontology. In layman’s terms, we can think of the ontology as the structure or schema of the structured database. The ontology defines the entity types, their relationship types and also, in certain cases, relevant entities of interest. For instance, a medical ontology could include DISEASE, SYMPTOM and DRUG as entity types; and DISEASE causes SYMPTOM, DRUG treats DISEASE, etc, as relationship types. It can also optionally include concrete entities of interest, such as diabetes, asthma and heart disease.
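
To make the idea of an ontology as a schema more concrete, here is a minimal Python sketch of the medical example above. The class names (RelationType, Ontology) and the method is_valid_fact are hypothetical and chosen only for illustration; the entities ‘wheezing’ and ‘insulin’ are also our own additions, included so that each relationship type has something to connect. Real ontologies are usually far richer than this.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class RelationType:
    """A relationship type such as 'DRUG treats DISEASE' (hypothetical class)."""
    name: str
    subject_type: str
    object_type: str


@dataclass
class Ontology:
    """A toy ontology: entity types, relationship types, optional entities."""
    entity_types: Set[str]
    relation_types: List[RelationType]
    # Maps a concrete entity of interest to its entity type.
    entities: Dict[str, str] = field(default_factory=dict)

    def is_valid_fact(self, subject: str, relation: str, obj: str) -> bool:
        """Check whether (subject, relation, obj) fits some relationship type."""
        s_type = self.entities.get(subject)
        o_type = self.entities.get(obj)
        return any(
            r.name == relation
            and r.subject_type == s_type
            and r.object_type == o_type
            for r in self.relation_types
        )


medical = Ontology(
    entity_types={"DISEASE", "SYMPTOM", "DRUG"},
    relation_types=[
        RelationType("causes", "DISEASE", "SYMPTOM"),
        RelationType("treats", "DRUG", "DISEASE"),
    ],
    entities={
        "diabetes": "DISEASE",
        "asthma": "DISEASE",
        "heart disease": "DISEASE",
        "wheezing": "SYMPTOM",  # illustrative addition, not from the column
        "insulin": "DRUG",      # illustrative addition, not from the column
    },
)

# A fact extracted from text is kept only if it fits the schema.
print(medical.is_valid_fact("insulin", "treats", "diabetes"))
print(medical.is_valid_fact("diabetes", "treats", "insulin"))
```

An extraction system could use such a check to discard candidate facts whose entity types do not match any relationship type in the ontology, much as a database rejects rows that violate its schema.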
