In this month’s column, we discuss information extraction.
In the last couple of columns, we have been discussing computer science interview questions. This month, let’s return to the subject of natural language processing. In particular, let’s focus on information extraction.
Given a piece of text, the goal is to extract the information contained in it, using natural language processing techniques. We know that vast amounts of information are present on the World Wide Web. However, much of this information lies hidden in the form of unstructured and semi-structured text. Today, much of this information is presented to us through search engine results, as Web pages, and we have to spend considerable time reading through the individual Web pages to find the specific information we are interested in.
For example, consider the following query presented to a Web search engine: “Who were the presidents of the United States that were assassinated?” Google returns a boxed result with the exact answer, naming the four US presidents who were assassinated while in office. Questions of the type ‘What, where, when, who…’ are known as ‘Wh*’ factual questions. Search engines use knowledge bases such as Freebase, Wikipedia, etc., to find answers to ‘Wh*’ questions.
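As a toy illustration of this categorisation (my own sketch, not how any real search engine classifies queries), a question can be bucketed by its leading question word:

```python
import re

# Question words that typically signal 'Wh*' factual questions.
WH_WORDS = {"what", "where", "when", "who", "which", "whom", "whose"}

def question_type(question: str) -> str:
    """Classify a question as 'wh-factual', 'why' or 'other'
    by looking at its first word (a deliberately naive heuristic)."""
    words = re.findall(r"[a-z']+", question.lower())
    if not words:
        return "other"
    head = words[0]
    if head in WH_WORDS:
        return "wh-factual"
    if head == "why":
        return "why"
    return "other"

print(question_type("Who were the presidents of the United States that were assassinated?"))
print(question_type("Why did Al Gore concede the election?"))
```

A real system would of course go far beyond the first word, but even this crude split separates the two classes of examples discussed in this column.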
On the other hand, let us try another example query: “Why did Al Gore concede the election?” This time, Google returns thousands of search engine result pages (SERPs) and does not point us to an exact answer. Perhaps this was a tough question, so let us try another: “Why did Al Pacino refuse his Oscar award?” Again, Google is stumped; instead of providing a succinct answer, it showers us with thousands of SERPs and makes us hunt for the answer ourselves. These two examples may make us think that search engines are very good at answering questions in the ‘What, Who, Where, When’ category, but not so good when it comes to answering ‘Why’ questions. But that is not actually true. As a final example, consider the question: “Why is the sky blue?” Google returns the exact answer in a box right on top of the search results. So not all ‘Why’ questions are difficult.
One of the reasons Google could easily answer the last question is possibly that the information needed to answer it was extracted from the relevant Web pages with a sufficient confidence threshold for the search engine to say that this is probably the right answer. On the other hand, for questions such as “Why did Al Gore concede the election?”, it could not extract the relevant information with sufficient confidence to provide a succinct answer. However, if we peruse the search engine results, the first link, namely the Wikipedia article on Al Gore’s presidential campaign, does indeed contain the answer to the question: “Gore strongly disagreed with the Court’s decision, but said: ‘For the sake of our unity as a people and the strength of our democracy, I offer my concession.’” However, this information could not be effectively extracted by the search engine to offer an exact answer, even though it was present in the very first search engine result. Hence the question we want to consider in today’s column is: what makes information extraction difficult, and how can we effectively extract the information we need to answer our queries?
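The confidence-threshold behaviour described above could be sketched as follows (a purely illustrative model: the threshold value and function names are my own assumptions, and real answer-ranking pipelines are far more complex):

```python
# Assumed cut-off for showing a boxed answer; purely illustrative.
CONFIDENCE_THRESHOLD = 0.8

def answer_or_serp(candidates):
    """Given (answer_text, confidence) pairs from an extraction system,
    return a boxed answer if the best candidate is confident enough,
    otherwise fall back to plain search results."""
    best = max(candidates, key=lambda c: c[1], default=None)
    if best and best[1] >= CONFIDENCE_THRESHOLD:
        return ("boxed_answer", best[0])
    return ("serp", None)

# High-confidence extraction: the engine shows a boxed answer.
print(answer_or_serp([("Sunlight is scattered by air molecules...", 0.93)]))
# Low-confidence extraction: the engine only shows result pages.
print(answer_or_serp([("I offer my concession...", 0.41)]))
```

Under this model, the “Why is the sky blue?” case corresponds to a candidate above the threshold, while the Al Gore case corresponds to candidates that exist but never clear it.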
In earlier columns, we discussed closed and open information extraction techniques. Basically, given a set of text documents, information extraction systems are intended to analyse this unstructured text and extract information in a structured form. The structure supporting the informational needs of a particular domain is the domain ontology. In layman’s terms, we can think of the ontology as the structure or schema of the structured database. The ontology defines the entity types, their relationship types and, in certain cases, relevant entities of interest. For instance, a medical ontology could include DISEASE, SYMPTOM and DRUG as entity types, and DISEASE causes SYMPTOM, DRUG treats DISEASE, etc., as relationship types. It can also optionally include entities such as diabetes, asthma and heart disease as concrete entities of interest.
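Such an ontology can be represented as a simple data structure. The sketch below encodes the medical ontology just described; the entity-type and relationship names come from the text, while the class design itself is my own illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Ontology:
    """A minimal in-memory domain ontology: entity types,
    typed relationships, and optional concrete entities."""
    entity_types: set = field(default_factory=set)
    # Relationship types as (subject type, relation, object type) triples.
    relation_types: set = field(default_factory=set)
    # Concrete entities of interest, mapped to their entity type.
    entities: dict = field(default_factory=dict)

med = Ontology()
med.entity_types |= {"DISEASE", "SYMPTOM", "DRUG"}
med.relation_types |= {("DISEASE", "causes", "SYMPTOM"),
                       ("DRUG", "treats", "DISEASE")}
for disease in ("diabetes", "asthma", "heart disease"):
    med.entities[disease] = "DISEASE"

print(med.entities["asthma"])
```

An information extraction system would then populate a database whose schema mirrors this ontology, filling in concrete DISEASE–SYMPTOM and DRUG–DISEASE facts found in the text.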