OpenSource For You

CODE SPORT

In this month's column, we continue our discussion on ‘data science' and focus on the topic of information retrieval.

- Sandya Mannarswamy

Data science is a multi-disciplinary subject requiring knowledge of machine learning, information retrieval, statistics, data mining, etc. We start our journey into data science by focusing on information retrieval.

What is information retrieval?

Here’s the textbook definition of information retrieval (IR): “Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).” In today’s context, this definition can be extended beyond text to cover multimedia content and images as well. While a simple example of information retrieval would be looking up a phone number in your phone book, the best-known example we all encounter often is Web search.

For instance, let us assume that we are looking for ‘Italian restaurants which serve vegetarian food in Bangalore’. We would type the following query as a set of keywords into Google’s search bar, ‘Italian restaurant vegetarian Bangalore’, in response to which we get a set of Web pages. Some of the pages returned may be relevant to our query, and some may not. While our information need actually involved a question, we translated that need into an approximation, namely, the set of keywords appearing in that question, and used those query terms to search for information on the World Wide Web.

Note that approximating our information need as a set of keywords can actually degrade the quality of the results returned to us. For instance, the query possibly returned all Italian restaurants that served both vegetarian and non-vegetarian food. Instead of expressing our information need as a strict Boolean query, we approximated it as a set of keywords. The Boolean formulation of the above query can be of the form: “(location == Bangalore) AND (restaurant type == ITALIAN) AND (food type == VEG)”. The Boolean query is a much stricter and more accurate formulation of our information need, but its disadvantage is that the user needs to be familiar with Boolean query syntax, unlike the natural language representation of the query using just the keywords.
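The contrast between the two formulations can be sketched in a few lines of Python. The restaurant records and the field names below are entirely hypothetical, invented only to show how the strict Boolean query filters out the non-vegetarian match that a looser keyword query might return:

```python
# Toy records for illustration; the field names are made up.
restaurants = [
    {"name": "Toscana", "location": "Bangalore", "type": "ITALIAN", "food": "VEG"},
    {"name": "Roma Grill", "location": "Bangalore", "type": "ITALIAN", "food": "NON-VEG"},
    {"name": "Dilli Darbar", "location": "Delhi", "type": "NORTH-INDIAN", "food": "VEG"},
]

def boolean_query(records):
    """(location == Bangalore) AND (type == ITALIAN) AND (food == VEG)"""
    return [r["name"] for r in records
            if r["location"] == "Bangalore"
            and r["type"] == "ITALIAN"
            and r["food"] == "VEG"]

print(boolean_query(restaurants))  # ['Toscana']
```

Note how ‘Roma Grill’, which a keyword search for ‘Italian restaurant Bangalore’ would happily return, is excluded by the strict AND of the three conditions.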

Now let us try to understand how the keyword query actually retrieves a set of Web pages as its result. For the sake of simplicity, let us consider the set of Web pages containing details about restaurants as a set of text documents (this is actually an unrealistic assumption, given that Web pages are in HTML form and much of the information presented in a Web page can be generated on the fly by JavaScript code, which makes it difficult for crawlers to get a complete and static view of the page’s content).

We can now restate our problem as follows: we are given a set of text documents D = {dj}, where j varies from 1 to N, N being the total number of documents in the set D; ‘j’ is the unique index by which a document in the collection can be identified. Each document contains a number of words or terms; I use ‘terms’ and ‘words’ interchangeably in our discussion. The set of distinct terms in our document collection is {t1, t2, …, tk}, where k is the number of distinct terms. This set of terms is also known as the ‘vocabulary’. The query ‘q’ also consists of a set of words or terms.

Our example query consists of the four keywords: Italian, restaurant, vegetarian and Bangalore. By examining each of the documents in our collection, we can build a term-document incidence matrix of size (k X N), where each row represents a particular term and each column represents a document in the collection. The matrix entry M[i][j] is 1 if the term ti appears in document dj; else, it is 0. For our simple example, if N = 10^6 and k = 10^3, the size of the matrix is 10^9 entries. In reality, this matrix can be very large and, hence, efficient ways of representing the term-document incidence matrix are needed.
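For a tiny collection, the term-document incidence matrix described above can be built directly. The three one-line documents below are made up purely for illustration:

```python
# A minimal sketch of a term-document incidence matrix.
# The tiny document collection is invented for illustration.
docs = [
    "italian restaurant in bangalore",
    "vegetarian italian restaurant",
    "chinese restaurant in delhi",
]

# The vocabulary is the set of distinct terms across all documents.
vocabulary = sorted({term for d in docs for term in d.split()})

# M[i][j] is 1 if term i appears in document j, else 0.
M = [[1 if term in d.split() else 0 for d in docs] for term in vocabulary]

for term, row in zip(vocabulary, M):
    print(f"{term:12s} {row}")
```

Even here most entries are 0, which hints at the sparsity problem discussed next.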

A key point to note is that not all documents contain all terms. Therefore, many of the matrix entries are zero or, in other words, the matrix is sparse. Hence, we need an efficient representation of the sparse matrix. The standard data structure used for representing the term-document incidence information is an ‘inverted index’. The inverted index consists of a dictionary that contains all the terms and, for each term, a list recording the documents of our collection in which that term occurs, along with the positions in each document at which it appears; this list is known as the ‘postings list’ or ‘pos list’, in short. The dictionary and the ‘pos lists’ together constitute the inverted index data structure for our document collection. Given the query terms of our query, we can retrieve the pos list for each term and, by intersecting the pos lists, obtain the list of documents in which all four terms appear.
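The dictionary-plus-postings structure can be sketched as below. The three documents are invented for illustration; for brevity, the intersection here uses Python sets rather than the merge-based method discussed next:

```python
from collections import defaultdict

# A minimal inverted index: for each term, the postings list records
# (document ID, position) tuples. The documents are made up.
docs = {
    0: "italian restaurant bangalore",
    1: "vegetarian restaurant bangalore",
    2: "italian vegetarian restaurant bangalore",
}

index = defaultdict(list)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        index[term].append((doc_id, pos))

# Documents containing every query term: intersect the doc-ID sets.
query = ["italian", "restaurant", "vegetarian", "bangalore"]
result = set(docs)
for term in query:
    result &= {doc_id for doc_id, _ in index[term]}
print(sorted(result))  # [2] — only document 2 contains all four terms
```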

Given that each ‘pos list’ can be maintained as a linked list or array of tuples, with each tuple containing a document ID and the position of the term in that document, how can we efficiently compute the intersection of the pos lists? Assume that each postings list is sorted by document ID and that the dictionary is sorted alphabetically by term. We can use a modified version of the ‘sorted lists merge’ to perform the intersection. I leave it as an exercise to the reader to come up with the actual algorithm.
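For readers who would like to check their answer to the exercise, one possible shape of the merge is sketched below. It assumes, for simplicity, at most one posting per document in each list; the postings data is made up:

```python
# One possible 'sorted lists merge' intersection of two postings
# lists sorted by document ID. Each posting is (doc_id, position);
# only document IDs are compared. Assumes at most one posting per
# document in each list, for simplicity.
def intersect(p1, p2):
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i][0] == p2[j][0]:      # same document: keep it
            result.append(p1[i][0])
            i += 1
            j += 1
        elif p1[i][0] < p2[j][0]:     # advance the list that is behind
            i += 1
        else:
            j += 1
    return result

italian = [(1, 0), (4, 2), (7, 5)]
bangalore = [(2, 3), (4, 6), (7, 1), (9, 0)]
print(intersect(italian, bangalore))  # [4, 7]
```

Because both lists are sorted by document ID, each pointer moves forward only, giving a single linear pass over the two lists. To intersect four postings lists, the same routine can be applied pairwise.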

We have glossed over the question of how we build the vocabulary for our document collection. This requires ‘tokenisation’, which is essentially the process of breaking a document into tokens. This is not a simple process, given that documents can be in different formats, and the structure and delineation of a document may not be well defined. Tokenisation requires that the document be in a form that can be treated as a sequence of characters, which can then be tokenised and parsed to yield the terms of the document. For instance, if your document collection consists of MS Word documents, they are in a specific Microsoft document format and need to be decoded into a sequence of characters before they can be tokenised.

Delineation of documents is the task of identifying the unit of text that needs to be considered as an individual document. At first glance, it seems simple to consider each file as a document. But in certain cases, a file needs to be split into multiple documents, such as a mail storage file that stores multiple email messages; here, we would like to treat each mail message as a separate document. Given the wide variety of file formats available today and the various use cases of information retrieval, character sequence decoding and document delineation can be quite complex.

Given the character sequence and the defined document unit, tokenisation is the process of splitting the character sequence into tokens. For instance, given the phrase, “Four score and seven years ago, our fathers brought forth on this continent,” the tokens are ‘four’, ‘score’, ‘and’, ‘seven’, ‘years’, ‘ago’, ‘our’, ‘fathers’, ‘brought’, ‘forth’, ‘on’, ‘this’, and ‘continent’. The tokens are then analysed and pruned to remove common terms such as ‘the’, ‘and’, ‘on’, etc. Two important processes related to tokenisation are ‘stemming’, which chops off the ends of words to reduce them to a common base form, and ‘lemmatisation’, a more formal process that removes the ending to return a word form found in the dictionary. We will explore tokenisation in more detail in the next column.
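A rough sketch of such a pipeline, using the phrase above, might look like the following. The stop-word list and the suffix-chopping rules here are deliberately crude simplifications, invented for illustration; real stemmers such as Porter's algorithm are far more careful:

```python
import re

# A deliberately crude tokenisation pipeline: lowercase, split on
# non-letter characters, drop a few common stop words, and chop off
# simple suffixes. Stop words and suffix rules are made up.
STOP_WORDS = {"the", "and", "on", "this", "our"}

def crude_stem(token):
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tokenise(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

phrase = "Four score and seven years ago, our fathers brought forth on this continent"
print(tokenise(phrase))
# ['four', 'score', 'seven', 'year', 'ago', 'father',
#  'brought', 'forth', 'continent']
```

Notice how ‘years’ and ‘fathers’ are reduced to ‘year’ and ‘father’, while ‘brought’ is left untouched; mapping ‘brought’ back to ‘bring’ is the kind of dictionary-based reduction that lemmatisation handles.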

My ‘must-read book’ for this month

This month’s book suggestion comes from me, and is on the subject of information retrieval. The book is ‘Information Retrieval: Implementing and Evaluating Search Engines’ by Stefan Buettcher, Charles L A Clarke and Gordon V Cormack. It provides a good overview of information retrieval in the context of Web search engines.

If you have a favourite programming book or article that you think is a must-read for every programmer, please do send me a note with the book’s name and a short write-up on why you think it is useful, so I can mention it in the column. This would help many readers who want to improve their software skills.

If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!
