OpenSource For You

In this month’s column, we continue our discussion on information retrieval.


In last month’s column, we discussed algorithms for constructing an inverted index, which is the basic data structure used in all information retrieval systems. We discussed two algorithms for index construction: the Blocked Sort-Based Indexing (BSBI) method and the Single-Pass In-Memory Indexing (SPIMI) method. Both algorithms break up the document collection into blocks, which are processed one at a time, since it is typically not possible to hold the entire document collection in main memory while building the index. This introduces considerable complexity compared with the usual sort and search algorithms we come across in everyday programming, where the complete data set can be held in main memory, and it also poses different performance challenges.
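To make the block-at-a-time approach concrete, here is a minimal single-block inverter written in Python. It is only a sketch of the idea behind SPIMI-style block processing, with a naive whitespace tokeniser and a simple text file as the on-disk run format; the function and file names are hypothetical rather than taken from any particular implementation.

from collections import defaultdict

def invert_block(block_docs, run_path):
    """Build an in-memory index for one block and flush it as a sorted run.

    block_docs: iterable of (doc_id, text) pairs belonging to this block.
    run_path: file to which the sorted run for this block is written.
    """
    index = defaultdict(list)              # term -> list of doc ids
    for doc_id, text in block_docs:
        for term in text.lower().split():  # naive tokenisation
            postings = index[term]
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    # Write the block's dictionary as a sorted run; the runs produced for
    # all the blocks are later merged into the final inverted index.
    with open(run_path, "w") as run:
        for term in sorted(index):
            run.write(term + "\t" + ",".join(map(str, index[term])) + "\n")

Calling invert_block() once per block, and then merging the resulting sorted runs, mirrors the block-wise processing described above.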

In our day-to-day programming, we typically use algorithms that operate on data in main memory, so we are mostly concerned with the impact of the CPU and memory on the performance of our code. However, in the case of index construction algorithms like BSBI or SPIMI, we are concerned with a third factor: the secondary storage where we store intermediate results and from which we read data for further processing. We need to treat disk access speed and disk bandwidth as important factors when designing these algorithms.

So far, we have been discussing index construction algorithms that work on a single machine, and the only hardware constraint we have considered is the amount of main memory available on the computer system where the document collection is being processed. However, consider the task of constructing an inverted index for the document collection that comprises the World Wide Web. This cannot be done on a single computer system and, hence, needs to be carried out on a very large cluster of machines. Therefore, a distributed method for inverted index creation is needed, by which index construction can be spread across multiple nodes of the cluster, thereby utilising the CPU and physical memory available on each of these nodes. A well-known distributed computational model is the MapReduce paradigm, and it has been adapted to the task of distributed index creation.

I am sure our readers are familiar with the basic idea of MapReduce, in which a task is split into a ‘map’ phase and a ‘reduce’ phase. During the map phase, the mapper task is run on multiple nodes of the cluster, with the master node assigning work to each mapper node. After the map phase is completed, the reduce phase is run on the nodes of the cluster. In the case of distributed index creation, the document collection is first split into blocks by the master node, which assigns these splits to mapper nodes. During the map phase, each split is parsed by its mapper node and, as in the single-node index creation algorithm, a list of <term id, doc id> pairs is constructed for each split at each mapper node.
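As an illustration, a mapper for this scheme might look like the Python sketch below. The term_id_of lookup and the comment about segment files are assumptions made here for clarity; in a real system, the MapReduce framework itself drives this function and handles the output.

def map_split(split_docs, term_id_of):
    """Parse one split block and emit (term id, doc id) pairs.

    split_docs: iterable of (doc_id, text) pairs in this split.
    term_id_of: function mapping a term to its term id.
    """
    pairs = []
    for doc_id, text in split_docs:
        for term in text.lower().split():
            pairs.append((term_id_of(term), doc_id))
    # In a real system these pairs would be written to local segment files,
    # partitioned by term id range, for the reducers to pick up.
    return pairs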

Note that the split blocks are not assigned statically to mapper nodes. Mapper nodes can fail or become unresponsive during the course of the computation. Hence, the master node assigns one split block at a time to a mapper node; only after successfully processing the current split is the mapper node assigned the next one. If a mapper node remains unresponsive after a time-out period has elapsed, the master node declares it to have failed and removes it from the MapReduce computation. The split that was being processed by the failed node is reassigned to a live node. As mentioned before, each mapper node produces a list of <term id, doc id> pairs at the end of the map phase. These are written into intermediate local files on each mapper node, typically known as segment files.
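The master’s one-split-at-a-time assignment and time-out handling can be pictured with the following schematic Python loop. The send_task() and poll_done() callbacks, and the time-out value, are placeholders standing in for whatever RPC layer and failure detector the cluster actually provides.

import time

def schedule_splits(splits, workers, send_task, poll_done, timeout=60.0):
    pending = list(splits)        # splits not yet assigned
    in_flight = {}                # worker -> (split, start time)
    done = []
    while pending or in_flight:
        if not workers:
            raise RuntimeError("all mapper nodes have failed")
        # Hand out one split at a time to each idle worker.
        for w in list(workers):
            if w not in in_flight and pending:
                split = pending.pop(0)
                send_task(w, split)
                in_flight[w] = (split, time.time())
        # Collect results; requeue the split of any worker that has timed out.
        for w, (split, started) in list(in_flight.items()):
            if poll_done(w):
                done.append(split)
                del in_flight[w]
            elif time.time() - started > timeout:
                workers.remove(w)         # declare the node failed
                pending.append(split)     # reassign its split to a live node
                del in_flight[w]
    return done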

During the ‘reduce’ phase, we would like to process the <term id, doc id> lists produced during the ‘map’ phase and construct the postings list, which gives the final <term id, list of doc IDs in which the term appears> for the entire document collection. This requires that all lists corresponding to the same term ID be processed at a single node, so that inter-node communication is minimised.
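A reducer for this step could look like the sketch below. The grouping of pairs by term id is what the framework’s shuffle phase normally delivers to each reducer node; the explicit sort here simply makes that step visible in a self-contained example.

from collections import defaultdict

def reduce_pairs(pairs):
    """Build postings lists from the (term id, doc id) pairs belonging to
    the range of term ids assigned to this reducer node."""
    postings = defaultdict(list)
    # Sorting the de-duplicated pairs orders them by term id first and then
    # by doc id, so each postings list comes out in increasing doc id order.
    for term_id, doc_id in sorted(set(pairs)):
        postings[term_id].append(doc_id)
    return postings   # term id -> sorted list of doc ids containing the term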

