CODE SPORT
In this month's column, we continue our discussion on ‘data science’ and focus on the topic of information retrieval.
Data science is a multi-disciplinary subject requiring knowledge of machine learning, information retrieval, statistics, data mining and related fields. We start our journey into data science by focusing on information retrieval.
What is information retrieval?
Here’s the textbook definition of information retrieval (IR): “Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).” In today’s context, this definition can be extended to cover images and other multimedia content as well, apart from just text. While a simple example of information retrieval would be looking up a phone number in your phone book, the most well-known example we all encounter often is Web search.
For instance, let us assume that we are looking for ‘Italian restaurants which serve vegetarian food in Bangalore’. So we would type the following query as a set of keywords into Google’s search bar, ‘Italian restaurant vegetarian Bangalore’, in response to which we get a set of Web pages. Some of the Web pages returned may be relevant to our query and some may not be very relevant. While our information need actually involved a question, we translated that information need into an approximation, namely, a set of keywords which appear in that question, and used the set of query terms to search for information on the World Wide Web.
Note that approximating our information need as a set of keywords can actually degrade the quality of the results returned to us. For instance, the query possibly returned all Italian restaurants that served both vegetarian and non-vegetarian food. Instead of expressing our information need as a strict Boolean query, we approximated it as a set of keywords. The Boolean formulation of the above query can be of the form: “(location == Bangalore) AND (restaurant type == ITALIAN) AND (food type == VEG)”. The Boolean query is a much stricter and more accurate formulation of our information need, but its disadvantage is that the user needs to be familiar with the Boolean query syntax, unlike the natural language style of simply typing in keywords.
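To see how strict the Boolean formulation is, here is a minimal sketch in Python. The restaurant records, names and field names are invented purely for illustration; a real system would evaluate such queries over an index, not a list.

```python
# Toy restaurant records (invented for illustration).
restaurants = [
    {"name": "Trattoria A", "location": "Bangalore", "cuisine": "ITALIAN", "food": "VEG"},
    {"name": "Trattoria B", "location": "Bangalore", "cuisine": "ITALIAN", "food": "NON-VEG"},
    {"name": "Trattoria C", "location": "Chennai",   "cuisine": "ITALIAN", "food": "VEG"},
]

# Strict Boolean formulation: every condition must hold for a record to match.
matches = [r for r in restaurants
           if r["location"] == "Bangalore"
           and r["cuisine"] == "ITALIAN"
           and r["food"] == "VEG"]

for r in matches:
    print(r["name"])
```

Unlike the keyword query, which can return restaurants serving both vegetarian and non-vegetarian food, the Boolean conjunction returns only records satisfying all three conditions.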
Now let us try to understand how the keyword query actually retrieves a set of Web pages as a result to that query. For the sake of simplicity, let us consider the set of Web pages containing details about restaurants as a set of text documents (this is actually an unrealistic assumption given that Web pages are in HTML form and much of the information that is presented in a Web page can be generated on the fly by JavaScript code, which makes it difficult for crawlers to get the complete and static view of the page’s content).
We can now restate our problem as follows: we are given a set of text documents D = {dj}, where j varies from 1 to N, and N is the total number of documents in the collection; ‘j’ is the unique index by which a document in the collection is identified. Each document contains a number of words or terms (I use ‘terms’ and ‘words’ interchangeably in our discussion). The set of distinct terms across our document collection is {t1, t2, …, tk}, where k is the total number of distinct terms. This set of terms is also known as the ‘vocabulary’. The query ‘q’ also consists of a set of words or terms.
Our example query for Italian restaurants consists of the four keywords: Italian, restaurant, vegetarian and Bangalore. By examining each of the documents in our collection, we can build a term-document incidence matrix of size (k X N), where each row represents a particular term and each column represents a document in the collection. The matrix entry M[i][j] is 1 if the term ti appears in document dj; else, it is 0. For our simple example, if N = 10^6 and k = 10^3, the matrix has 10^9 entries. In reality, this matrix can be very huge and, hence, efficient ways of representing the term-document incidence information are needed.
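A quick sketch of how such a matrix can be built from a toy collection (the three documents below are invented for illustration):

```python
# Toy document collection (invented for illustration).
docs = [
    "italian restaurant in bangalore",
    "vegetarian restaurant bangalore",
    "italian vegetarian recipes",
]

# The vocabulary: all distinct terms across the collection.
vocabulary = sorted({term for doc in docs for term in doc.split()})

# M[i][j] is 1 if term vocabulary[i] appears in document j, else 0.
M = [[1 if term in doc.split() else 0 for doc in docs]
     for term in vocabulary]

for term, row in zip(vocabulary, M):
    print(f"{term:12s} {row}")
```

Even in this tiny example, most entries are 0 — a hint of the sparsity discussed next.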
A key point to note is that not all documents contain all terms. Therefore, many of the matrix entries are zero or, in other words, the matrix is sparse. Hence, we need an efficient representation of this sparse matrix. The standard data structure used for representing the term-document incidence information is an ‘inverted index’. The inverted index consists of a dictionary containing all the terms and, for each term, a list recording the documents in our collection in which that term occurs, along with the positions in each document at which it appears; this list is known as the ‘postings list’ or ‘pos list’, in short. The dictionary and the pos lists together constitute the inverted index data structure for our document collection. Given the set of query terms for our query, we can retrieve the pos list for each term in the query. And by doing an intersection of the pos lists, we can get the list of documents in which all four terms appear.
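A minimal sketch of building such an inverted index in Python (the two documents are invented for illustration; each posting is a (document ID, position) tuple):

```python
from collections import defaultdict

# Toy document collection: document ID -> text (invented for illustration).
docs = {
    1: "italian restaurant in bangalore",
    2: "vegetarian restaurant bangalore",
}

# Inverted index: term -> postings list of (document ID, position) tuples.
index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for pos, term in enumerate(text.split()):
        index[term].append((doc_id, pos))

print(index["restaurant"])   # [(1, 1), (2, 1)]
print(index["bangalore"])    # [(1, 3), (2, 2)]
```

Because we iterate over documents in increasing order of document ID, each postings list comes out sorted by document ID — a property the intersection algorithm below relies on.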
Given that each pos list can be maintained as a linked list or an array of tuples, with each tuple containing the document ID and the position of the term in the document, how can we efficiently compute the intersection of the pos lists? Assume that each postings list is sorted by document ID and that the dictionary is sorted alphabetically by term. We can use a modified version of ‘sorted lists merge’ to perform the intersection. I leave it as an exercise to the reader to come up with the actual algorithm.
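For readers who want to check their answer, here is one possible sketch of the sorted-lists-merge intersection, operating on postings lists of document IDs (the document IDs below are invented):

```python
def intersect(p1, p2):
    """Intersect two postings lists of document IDs,
    both assumed sorted in increasing order."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # document contains both terms
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer with the smaller ID
            i += 1
        else:
            j += 1
    return result

# Invented postings for the four query terms:
postings = {
    "italian":    [1, 4, 7, 9],
    "restaurant": [1, 2, 4, 9],
    "vegetarian": [1, 4, 5, 9],
    "bangalore":  [1, 3, 4, 8, 9],
}

# Intersect the lists two at a time, across all query terms.
result = postings["italian"]
for term in ("restaurant", "vegetarian", "bangalore"):
    result = intersect(result, postings[term])
print(result)   # [1, 4, 9]
```

Each pairwise intersection runs in time linear in the combined length of the two lists, since both pointers only ever move forward.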
We glossed over the question of how we build the vocabulary for our document collection. This requires ‘tokenisation’, which is essentially the process of breaking a document into tokens. This is not a simple process, given that documents can be in different formats, and the structure and delineation of a document may not be very well defined. Tokenisation requires that the document be in a form that can be treated as a sequence of characters, which can then be tokenised and parsed to yield the terms of the document. For instance, if your document collection consists of MS Word documents, these are in a specific Microsoft document format and need to be decoded into a sequence of characters before they can be tokenised. Delineation of documents is the task of identifying the unit of text that is to be considered an individual document. At first glance, it seems simple to consider each file as a document. But in certain cases, a file needs to be split into multiple documents, as in the case of a mail storage file, which stores multiple email messages. In this case, we would like to treat each mail message as a separate document. Given the wide variety of file formats available today and the various use cases of information retrieval, character sequence decoding and document delineation can be quite complex.
Given the character sequence and the defined document unit, tokenisation is the process of splitting the character sequence into tokens. For instance, given the phrase, “Four score and seven years ago, our fathers brought forth on this continent,” the tokens would be ‘four’, ‘score’, ‘and’, ‘seven’, ‘years’, ‘ago’, ‘our’, ‘fathers’, ‘brought’, ‘forth’, ‘on’, ‘this’ and ‘continent’. The tokens are then analysed and pruned to remove common terms such as ‘the’, ‘and’, ‘on’, etc. Two important normalisation steps that accompany tokenisation are ‘stemming’, which is the process of chopping off the ends of words to reach a common base form, and ‘lemmatisation’, which is a more formal process of removing inflectional endings to return the base form of a word as found in the dictionary. We will explore tokenisation in more detail in the next column.
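The whole pipeline on our example phrase can be sketched as follows. Note that the stop-word list and the suffix-stripping rules here are toy choices for illustration; real systems use carefully designed stemmers (such as the Porter stemmer) rather than the crude suffix chopping shown here.

```python
import re

# Toy stop-word list (illustrative, not exhaustive).
STOP_WORDS = {"the", "and", "on", "this", "our"}

def tokenise(text):
    """Lowercase the text, split on non-letter characters,
    and drop common stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS]

def crude_stem(token):
    """A deliberately naive stemmer: chop a couple of common suffixes."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

phrase = ("Four score and seven years ago, our fathers "
          "brought forth on this continent")
tokens = [crude_stem(t) for t in tokenise(phrase)]
print(tokens)
```

Running this yields terms such as ‘year’ and ‘father’ in place of ‘years’ and ‘fathers’, with the stop words removed — the kind of normalised vocabulary the inverted index is built over.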
My ‘must-read book’ for this month
This month’s book suggestion comes from me, and is on the subject of information retrieval. The book is ‘Information Retrieval: Implementing and Evaluating Search Engines’ by Stefan Buettcher, Charles L A Clarke and Gordon V Cormack. It provides a good overview of information retrieval in the context of Web search engines.
If you have a favourite programming book or article that you think is a must-read for every programmer, please do send me a note with the book’s name, and a short writeup on why you think it is useful so I can mention it in the column. This would help many readers who want to improve their software skills.
If you have any favourite programming questions/ software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!