OpenSource For You

CODE SPORT

In this month’s column, we continue our discussion on information retrieval.

- Sandya Mannarswamy

Last month, I focused on the basic problem of information retrieval, namely, ‘ad-hoc information retrieval’, where, given a collection of documents, the user can arbitrarily query for any information, which may or may not be contained in the given document collection. A well-known example of an ad-hoc information retrieval system is the Web search engine, in which the user can pose a query on an arbitrary topic, and the document collection is the set of all searchable Web pages on the ‘World Wide Web’.

A related but different kind of information retrieval is what is known as a ‘filtering’ task. In the case of filtering, a user profile is created, and a set of documents that may be of relevance or interest to the user is filtered from the document collection and presented to the user. A well-known example of information filtering is the personalisation of news delivery based on the user’s profile. Typically, in the case of the information filtering task, the set of documents in the document collection keeps changing dynamically (for instance, new news stories arrive and old ones are retired from the collection), while the user query (for finding documents of interest based on the user profile) remains relatively stable. This is in contrast to ‘ad-hoc information retrieval’, wherein the document collection remains stable but queries keep changing dynamically.

In last month’s column, we discussed the Boolean model of ad-hoc information retrieval, wherein user information needs are expressed in terms of a keyword-based query, and the documents in the collection are looked up using a Boolean model for the presence of the keywords in the query. For example, given that the document collection is the entire works of Shakespeare, the query ‘Julius Caesar’ would find the documents that contain both the query terms ‘Julius’ and ‘Caesar’. In order to facilitate such queries, a data structure known as an inverted index is generally built. The inverted index consists of the dictionary of terms in the document collection and a ‘postings list’ for each term, listing the document IDs of the documents that contain that term. Typically, the postings list holds not just the document IDs of the documents containing the term, but also the ‘positions’ in the document where the term appears. Given the set of documents in the document collection, the pre-processing step is to construct the inverted index. So how do we construct this index?
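To make the structure concrete, here is a minimal in-memory sketch of a positional inverted index and a Boolean AND query over it. The tiny document collection, the function names and the naive whitespace tokenisation are all illustrative, not part of any real search engine:

```python
from collections import defaultdict

def build_index(docs):
    """Build a positional inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def boolean_and(index, terms):
    """Return the doc IDs that contain every query term (Boolean AND)."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "julius caesar was a roman general",
    2: "caesar salad recipe",
    3: "the tragedy of julius caesar",
}
index = build_index(docs)
print(sorted(boolean_and(index, ["julius", "caesar"])))  # [1, 3]
```

The query intersects the postings lists of ‘julius’ and ‘caesar’, so only documents 1 and 3 match; the stored positions would additionally let us check that the two terms appear adjacently for a phrase query.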

As I mentioned before, each document is scanned and broken into tokens or terms. There are two ways of constructing an inverted index. The first one is known as the ‘Blocked Sort-Based Indexing’ (BSBI) method. Till now, when discussing how inverted indexes are built, we made an implicit assumption that the data can fit in the main memory. But that assumption does not hold true for large document collections and, obviously, for a Web search. Hence, we need to consider the case when the document collection cannot be analysed in its entirety in the main memory to construct the inverted index. That’s where BSBI comes in.

BSBI divides the document collection into blocks, where each block can fit in the main memory. It then analyses one block, gets the terms in that block, and creates a sorted list of (term ID, doc ID) pairs. We can think of the sorting as being done with the ‘term ID’ as the primary key and the ‘doc ID’ as the secondary key. Note that since the block can fit in the main memory, the sorting can be done in memory for that block, and the results of the sorting are written back to stable storage. Once all the blocks are processed, we have an ‘inverted index’ corresponding to each block of the document collection, stored on disk in intermediate files. These intermediate ‘inverted indices’ need to be merged in order to construct the final ‘inverted index’ for the document collection. Let us assume that we have divided the document collection into 10 blocks. After the intermediate index construction step, we have 10 intermediate index files on disk. Now, in the final merging process, we open the 10 files, read data from each file into a read buffer, and open a write buffer for writing back the results. During each step, we select the smallest term ID among those present in the read buffers and process it. All the document IDs corresponding to that term ID are processed, added to the ‘postings list’ for that term, and written back to disk. We can view this process as the equivalent of merging 10 individual sorted lists into one final sorted list. I leave it as an exercise for the reader to compute the complexity associated with BSBI.
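The two BSBI phases can be sketched in a few lines. In this toy version, in-memory lists stand in for the on-disk intermediate files, raw terms stand in for term IDs, and `heapq.merge` plays the role of the multi-way merge over the read buffers; all the names and the sample documents are illustrative:

```python
import heapq
from itertools import groupby

def invert_block(block):
    """Phase 1: sort a block's (term, doc ID) pairs with the term as the
    primary key and the doc ID as the secondary key."""
    pairs = set()
    for doc_id, text in block:
        for term in text.lower().split():
            pairs.add((term, doc_id))
    return sorted(pairs)  # stands in for one on-disk intermediate index

def merge_blocks(runs):
    """Phase 2: k-way merge of the sorted runs into one final index,
    collecting the doc IDs for each term into its postings list."""
    merged = heapq.merge(*runs)
    return {term: sorted({doc_id for _, doc_id in group})
            for term, group in groupby(merged, key=lambda p: p[0])}

blocks = [[(1, "to be or not to be"), (2, "brevity is the soul of wit")],
          [(3, "et tu brute"), (4, "to thine own self be true")]]
runs = [invert_block(b) for b in blocks]
index = merge_blocks(runs)
print(index["to"])  # [1, 4]
```

Because each run is already sorted by term, the merge only ever needs the head of each run in memory, which is exactly why BSBI scales beyond RAM.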

Note that BSBI is a multi-pass algorithm, since it breaks up the document collection into multiple blocks, processes each block individually in a sorting phase, and then merges the sorted results of each block in the final step. The alternative to BSBI is ‘Single-Pass In-Memory Indexing’ (SPIMI). While BSBI processes each block and creates a sorted list of (term ID, doc ID) pairs, SPIMI does not have the pre-processing step of creating this sorted list.

Similar to BSBI, SPIMI also divides the collection into blocks, but the division into blocks is governed by the memory available. As long as there is enough free memory, SPIMI continues to process one term at a time. On encountering a term, SPIMI checks to see if the term is already present in the current block’s dictionary. If not, it creates a new term entry in the dictionary, along with a ‘postings list’ associated with that term. If the term is already present in the dictionary, it retrieves the postings list associated with that term and adds the current ‘doc ID’ to it. When there is not enough free memory, it stops processing terms, sorts the current term dictionary, writes it to disk and starts processing the next block. After all the terms are processed, there are a number of on-disk block dictionaries which need to be combined to create the final ‘inverted index’.
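A minimal sketch of the SPIMI block-building loop follows. Here the memory budget is simulated by a cap on the number of distinct terms per block, and the ‘write to disk’ step just appends the sorted block dictionary to a list; the function name, the cap and the sample token stream are all illustrative assumptions:

```python
def spimi_invert(token_stream, max_terms):
    """SPIMI: add each doc ID directly to its term's postings list in the
    current block's dictionary; when the memory budget (here: a cap on
    distinct terms) is exhausted, sort and flush the block, then start anew."""
    blocks, dictionary = [], {}
    for doc_id, term in token_stream:
        if term not in dictionary:
            if len(dictionary) >= max_terms:  # "out of memory": flush the block
                blocks.append(dict(sorted(dictionary.items())))
                dictionary = {}
            dictionary[term] = []             # new term entry with empty postings
        dictionary[term].append(doc_id)
    if dictionary:                            # flush the last partial block
        blocks.append(dict(sorted(dictionary.items())))
    return blocks  # stands in for the on-disk block dictionaries to be merged

stream = [(1, "to"), (1, "be"), (1, "or"), (1, "not"), (2, "be")]
print(spimi_invert(stream, max_terms=3))
```

Note the contrast with BSBI: no (term ID, doc ID) pairs are accumulated and sorted up front; postings grow in place, and sorting happens only once per block, at flush time.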

Note that, till now, we have been discussing the creation of an inverted index for a document collection on a single computer system, and the only hardware constraint we have considered so far is the amount of main memory available on the computer system where the document collection is being processed. However, consider the case of constructing an inverted index for the document collection that comprises the ‘World Wide Web’. This task cannot be done on a single computer system and needs to be carried out on a very large cluster of machines. Hence, a distributed ‘inverted index’ creation method is needed. This is typically done in search engines using the ‘MapReduce’ paradigm, which I will cover in my next column.

In the earlier discussion on BSBI and SPIMI, we assumed that documents can be scanned to get a token stream, and that each token can be considered a term of the document collection. Similarly, a user query is also broken down into query terms, and we look for documents which contain the query terms using the inverted index. We have glossed over many of the issues that arise when creating terms from documents or queries. Documents typically contain many common words such as ‘the’, ‘an’, ‘a’, etc. Such words do not contribute to deciding which documents are more relevant to a user query. Hence, such common terms need to be kept out of the term dictionary. This is typically achieved by what is known as a ‘stop word list’: any word in the stop word list is skipped when it is encountered in a document.
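Stop word removal fits naturally into the tokenisation step. The sketch below uses a small illustrative stop word list (real systems tune this list to their collection and language):

```python
# An illustrative stop word list; production systems use larger, tuned lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words before indexing."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(tokenize("The Tragedy of Julius Caesar"))  # ['tragedy', 'julius', 'caesar']
```

The same `tokenize` function must be applied to both documents and queries, so that a query term and an indexed term always take the same form.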

There are further complications in tokenisation. How do we handle hyphenated terms such as ‘anti-discriminatory’ or ‘well-known’? Do we break them into individual pieces or treat them as a single term? How do we handle the case sensitivity of terms in the document collection and in queries? Case sensitivity can be useful in distinguishing between two semantically different query terms such as ‘Windows’ and ‘windows’. For instance, ‘Windows’ could potentially refer to the Windows operating system, whereas ‘windows’ could refer to the windows of a house. But maintaining case sensitivity in an inverted index makes the index bloat up and, hence, most search engines/information retrieval systems do not support it.

So far, we have been considering simple queries where the user types the exact query without any spelling mistakes. But, frequently, users do misspell query terms. Misspelled queries need to be handled intelligently by information retrieval systems such as search engines. We have often seen search engines give us intelligent hints such as, “Did you mean ‘carrot’?” when the user misspells ‘carrot’ as ‘carot’. One way of handling misspelled queries is to compute the edit distance from the misspelled word to the words closest to it in the dictionary, and then use that set of words as the search terms. Misspelled queries become more difficult to detect and deal with for proper nouns, as in cases like ‘Tom Cruise’ getting misspelled as ‘Tom Cruse’. The major challenge is to identify the subset of words in the dictionary for which we want to compute the edit distance with the query term.
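The edit-distance approach can be sketched with the standard Levenshtein dynamic program. The naive `suggest` function below scores every dictionary word, which is exactly the inefficiency the paragraph above points out; the tiny dictionary and the distance threshold are illustrative:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b via row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / match
        prev = curr
    return prev[-1]

def suggest(query, dictionary, max_dist=2):
    """Return dictionary words within max_dist edits of the query, nearest first.
    Naive: scores every word, so it does not scale to a real term dictionary."""
    scored = sorted((edit_distance(query, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_dist]

words = ["carrot", "parrot", "cart", "caret", "comet"]
print(suggest("carot", words))  # ['caret', 'carrot', 'cart', 'parrot']
```

Real systems avoid scanning the whole dictionary by pruning candidates first, for example with character n-gram overlap, before computing exact edit distances on the survivors.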

Another challenge for information retrieval systems is the issue of handling wild card queries. I leave it as a take-away question to the reader to come up with an algorithm for handling wild card queries.

My ‘must-read book’ for this month

This month’s book suggestion comes from one of our readers, Nita Kumar. Her recommendation is the book, ‘Beautiful Data: The Stories Behind Elegant Data Solutions’ by Toby Segaran and Jeff Hammerbacher. It comprises a series of essays on elegant data solutions, encompassing a wide spectrum of topics such as the analysis of opinion poll data, the data analysis of Mars images and data storage on the cloud. By the way, Toby Segaran is the author of the famous book, ‘Programming Collective Intelligence’, which discusses how to mine the data on the Web. Thank you, Nita, for your recommendation.

If you have a favourite programming book/article that you think is a must-read for every programmer, please do send me a note with the book’s name, and a short write-up on why you think it is useful, so I can mention it in this column. This would help many readers who want to improve their software skills.

If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!

The author is an expert in systems software and is currently working with Hewlett Packard India Ltd. Her interests include compilers, multi-core and storage systems. If you are preparing for systems software interviews, you may find it useful to visit Sandya’s LinkedIn group, Computer Science Interview Training India, at http://www.linkedin.com/groups?home=&gid=2339182
