In this month’s column, we continue our discussion on information retrieval.

OpenSource For You - CODE SPORT

In last month’s column, we discussed algorithms for constructing an inverted index, the basic data structure used in all information retrieval systems. We covered two algorithms: Blocked Sort-Based Indexing (BSBI) and Single-Pass In-Memory Indexing (SPIMI). Both algorithms break the document collection into blocks that are processed one at a time, since it is typically not possible to hold the entire collection in main memory while building the index. This imposes considerable complexity compared with the usual sort and search algorithms we come across while programming, in which the complete data set can be held in main memory, and it also brings different performance challenges.
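To make the block-at-a-time idea concrete, here is a minimal sketch of the SPIMI inner loop for a single block. The function name `spimi_invert` and the in-memory representation are my own illustrative choices, not code from last month’s column; a real implementation would write each completed block to disk as a sorted run rather than return it.

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """Build an in-memory index for one block of <term, doc_id> pairs.
    Postings grow as pairs arrive; terms are sorted only once, when the
    finished block is written out as a run (here, simply returned)."""
    index = defaultdict(list)          # term -> postings list
    for term, doc_id in token_stream:
        postings = index[term]
        # skip duplicate doc IDs when a term repeats within a document
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return sorted(index.items())

block = [("cat", 1), ("dog", 1), ("cat", 2), ("ant", 2)]
run = spimi_invert(block)
# run == [("ant", [2]), ("cat", [1, 2]), ("dog", [1])]
```

The key point is that no global sort of all <term, doc ID> pairs is needed: each block’s dictionary is built incrementally and sorted only at write-out time.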

In our day-to-day programming, we typically use algorithms that operate on data in main memory, so we are mostly concerned with the impact of the CPU and memory on the performance of our code. With index construction algorithms like BSBI or SPIMI, however, a third factor comes in: the secondary storage where we write intermediate results and read data back for further processing. Disk access speeds and disk bandwidth therefore have to be treated as first-class factors when designing these algorithms.
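The disk-friendly way to combine the intermediate runs written by each block is a k-way merge that reads every run strictly sequentially. A minimal sketch, assuming each run is already sorted by term (the `merge_runs` name and list-based runs are illustrative; on disk these would be files read line by line):

```python
import heapq
from itertools import groupby

def merge_runs(runs):
    """Merge sorted runs of (term, postings) into one index, reading
    each run sequentially -- the access pattern disks handle best."""
    merged = heapq.merge(*runs)        # k-way merge, O(N log k) comparisons
    index = []
    for term, group in groupby(merged, key=lambda pair: pair[0]):
        postings = sorted({d for _, plist in group for d in plist})
        index.append((term, postings))
    return index

run_a = [("ant", [2]), ("cat", [1, 2])]
run_b = [("cat", [3]), ("dog", [1])]
final = merge_runs([run_a, run_b])
# final == [("ant", [2]), ("cat", [1, 2, 3]), ("dog", [1])]
```

Because every run is consumed front to back, the merge costs one sequential pass over the data rather than many random seeks, which is exactly the property these algorithms are designed around.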

So far, we have been discussing index construction algorithms that work on a single machine, and the only hardware constraint we have considered is the amount of main memory available on the system processing the document collection. However, consider constructing an inverted index for the document collection that comprises the World Wide Web. This task cannot be done on a single computer system and, hence, needs to be carried out on a very large cluster of machines. A distributed method for inverted index creation is therefore needed, by which index construction is spread across multiple nodes of the cluster, utilising the CPU and physical memory available on each node. A well-known distributed computation model is the ‘map-reduce’ paradigm, and it has been adapted to the task of distributed index creation.

I am sure that our readers are familiar with the basic idea of map-reduce, where a task is split into a ‘map’ phase and a ‘reduce’ phase. During the map phase, mapper tasks run on multiple nodes of the cluster, with the master node assigning work to each mapper node. After the map phase is completed, the reduce phase runs on the nodes of the cluster. In the case of distributed index creation, the master node first splits the document collection into ‘blocks’, which it assigns to mapper nodes. During the map phase, each mapper node parses its split block and, as in the single-node index creation algorithm, constructs a list of <term ID, doc ID> pairs for that block.
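The mapper’s job is essentially the parsing step of the single-node algorithm applied to one split. A minimal sketch, assuming a split is a list of (doc ID, text) pairs and using a deliberately naive tokeniser (the `map_phase` name is my own):

```python
import re

def map_phase(split_block):
    """Mapper: parse one split of (doc_id, text) documents and emit
    <term, doc_id> pairs, which would be written to local segment files."""
    pairs = []
    for doc_id, text in split_block:
        for term in re.findall(r"[a-z]+", text.lower()):
            pairs.append((term, doc_id))
    return pairs

split = [(1, "the cat sat"), (2, "the dog ran")]
pairs = map_phase(split)
# pairs == [("the", 1), ("cat", 1), ("sat", 1),
#           ("the", 2), ("dog", 2), ("ran", 2)]
```

A production mapper would additionally map terms to term IDs via a shared dictionary, but the shape of the output — a flat list of pairs per split — is the same.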

Note that the split blocks are not assigned statically to mapper nodes, because mapper nodes can fail or become unresponsive during the course of the computation. The master node therefore assigns one split block at a time to a mapper node; only after the current split block has been processed successfully is the mapper node assigned the next block. If a mapper node remains unresponsive after a time-out period has elapsed, the master node declares it to have failed and removes it from the map-reduce computation, and the split block that was being processed by the failed node is reassigned to an ‘alive’ node. As mentioned before, each mapper node produces a list of <term ID, doc ID> pairs at the end of the map phase. These are written into intermediate local files on each mapper node, typically known as segment files.
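The master’s assignment-and-reassignment policy can be sketched as a simple loop. This is a toy, single-process model of the idea — `run_master` and the `process` callback (a stand-in for the real RPC to a mapper node) are hypothetical names, and real systems track in-flight work asynchronously rather than one split at a time:

```python
import time

def run_master(splits, mappers, process, timeout=5.0):
    """Master loop: hand one split at a time to a mapper; if a mapper
    fails or exceeds the timeout, drop it and requeue its split."""
    pending = list(splits)             # splits not yet completed
    alive = list(mappers)
    results = []
    while pending:
        if not alive:
            raise RuntimeError("no alive mapper nodes left")
        split = pending.pop(0)
        for node in list(alive):
            started = time.monotonic()
            ok, out = process(node, split)   # stand-in for the real RPC
            if not ok or time.monotonic() - started > timeout:
                alive.remove(node)           # failed or timed out: drop node
            else:
                results.append(out)
                break                        # split done; move to the next
        else:
            pending.append(split)            # every try failed: requeue split
    return results

process = lambda node, split: (node != "bad", split.upper())
results = run_master(["a", "b"], ["bad", "good"], process)
# results == ["A", "B"]; node "bad" is dropped after its first failure
```

The essential properties from the text are both here: a node gets its next split only after finishing the current one, and a failed node’s split goes back into the queue for an alive node.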

During the ‘reduce’ phase, we process the <term ID, doc ID> lists produced during the ‘map’ phase and construct the postings list, which gives, for each term ID, the list of doc IDs in which that term appears across the entire document collection. This requires that all lists corresponding to the same term ID be processed at a single node, so that inter-node communication is minimised.
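Once all pairs for a given term range have been collected at one reducer node, building the postings lists is straightforward. A minimal sketch (the `reduce_phase` name is mine; a real reducer streams pairs from the mappers’ segment files rather than taking a list):

```python
from collections import defaultdict

def reduce_phase(segment_pairs):
    """Reducer: given all <term, doc_id> pairs for its key range,
    gathered from every mapper's segment files, build the postings
    list -- a sorted list of doc IDs -- for each term."""
    postings = defaultdict(set)
    for term, doc_id in segment_pairs:
        postings[term].add(doc_id)     # sets drop duplicate sightings
    return {term: sorted(docs) for term, docs in postings.items()}

# pairs for one reducer's key range, gathered from several mappers
pairs = [("cat", 2), ("ant", 1), ("cat", 1), ("dog", 3), ("cat", 2)]
index = reduce_phase(pairs)
# index == {"cat": [1, 2], "ant": [1], "dog": [3]}
```

Partitioning terms into key ranges (say, a–f on one reducer, g–p on another) is what guarantees that every pair for a given term lands on the same node, so no reducer ever needs postings held elsewhere.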
