CodeSport

In this month's column, we discuss a new text mining task called machine reading comprehension.

Sandya Mannarswamy

Over the past couple of months, we have been discussing the problem of detecting duplicate questions in community question answering (CQA) forums using the Quora question pair data set. In this month's column, let's discuss a closely related problem, namely, machine reading comprehension.

This is an important language-related task in artificial intelligence (AI).

Artificial intelligence has seen rapid progress over the last few years, with a number of state-of-the-art advances in vision, speech and language. Computing intelligence has evolved from simple number-crunching tasks to perceptual computing tasks, whereby machines can take sensory inputs such as vision and speech from the surrounding environment. The next step in the evolution of computing intelligence is cognitive intelligence computing, where machines can read language/text, understand meanings, reason based on the information they have read, and make decisions even in complex scenarios, on par with human intelligence. This requires that machines have the ability to read text, understand it, extract meaning from it and reason based on that understanding.

There are a number of language-related tasks pertaining to AI. These include information extraction, information search, dialogue and, of course, question answering (QA). It is possible to frame all the other tasks as QA. For instance, information search can be framed as asking questions/queries, with the machine retrieving the relevant answer in an interactive fashion from the Web or external knowledge bases. Dialogue can be framed as conversational QA, of which a popular example is Apple's Siri. Question answering is an AI-complete problem, in the sense that finding a good solution to QA will enable solutions to other language-related AI tasks.

QA became a popular AI task when IBM demonstrated it in its AI Grand Challenge. In the popular TV game show 'Jeopardy!', the machine intelligence of IBM Watson was pitted against human contestants Ken Jennings and Brad Rutter. IBM Watson used many natural language processing techniques, such as similar-question retrieval, summarisation, key passage identification and answer sentence extraction, to beat the humans. This was possibly the first well-known demonstration of machines performing the QA task, and it fuelled the public's imagination.

Machine QA has a number of flavours. Given a question, the machine can look through external knowledge bases and try to find an answer. For this problem, it needs to convert the natural language question into a logical form, which can be used to query the external knowledge base. An NLP technique known as semantic parsing is used to convert a natural language query into a logical form specific to the schema of the external knowledge base. The logical query can then be used to retrieve results from the knowledge base. While this works for structured external knowledge bases, there are also a number of unstructured text documents that contain the information required to answer questions. For instance, many factual questions can be answered using Wikipedia, which contains a number of articles on different topics. Answering questions related to these topics requires reading the Wikipedia text; the machine can then answer questions based on the text it has read. This has given rise to a specific form of the AI QA task known as machine reading comprehension (MRC).
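To make the idea of semantic parsing concrete, here is a toy, hand-coded sketch in Python. The patterns and predicate names are invented purely for illustration; real semantic parsers are learned from data and target the schema of a specific knowledge base, rather than using hand-written rules.

# A toy illustration of semantic parsing: mapping a natural language
# question to a logical form that could be run against a structured
# knowledge base. Patterns and predicate names here are invented.
import re

PATTERNS = [
    # 'Who is the president of X?' -> president_of('X')
    (re.compile(r"who is the president of (.+)\?", re.I),
     lambda m: "president_of('{}')".format(m.group(1))),
    # 'What is the capital of X?' -> capital_of('X')
    (re.compile(r"what is the capital of (.+)\?", re.I),
     lambda m: "capital_of('{}')".format(m.group(1))),
]

def to_logical_form(question):
    """Return a logical-form string for a question, or None."""
    for pattern, builder in PATTERNS:
        match = pattern.match(question)
        if match:
            return builder(match)
    return None

print(to_logical_form("What is the capital of France?"))
# capital_of('France')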

Here is the MRC task: given a passage P and a question Q related to that passage, the QA system needs to provide an answer A to the question. There can be more than one question, as is typically the case in reading comprehension tasks. The task has two variants. The simpler one is the answer extraction task, where the exact answer A to question Q is present verbatim in the passage P. The more complicated variant is the answer generation task, where the exact answer is not present directly in the passage P; instead, the program has to synthesise/generate the answer A based on the information contained in the passage.

Here is an example of the answer extraction task:

The passage in the table above shows the relevant sentences marked for answering the question Q. Note that there are a number of locations/cities mentioned in the passage P. Hence, in order to correctly answer the question, 'Which city is Alyssa in?', the program needs to make sense of the whole paragraph before it can identify the relevant sentence, which is, 'She is now in Miami.' From the relevant sentence, it should then extract the exact answer, namely 'Miami'. If you look carefully at the passage P, it does not mention the word 'city' anywhere; only the question Q contains it. So when the program retrieves the answer 'Miami', it needs to make sure that this is actually a city and not a state, county or continent. While named entity recognition can help classify 'Miami' as a location, the program also needs knowledge about the external world, including city names, to conclude that Miami is definitely a city.
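As an illustration of the named entity recognition step, here is a short sketch using the spaCy library (this assumes spaCy is installed and the en_core_web_sm English model has been downloaded):

# Named entity recognition over the answer sentence, using spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She is now in Miami.")

for ent in doc.ents:
    # spaCy tags 'Miami' as GPE (geo-political entity), i.e., a location.
    # Deciding that it is specifically a city still needs world knowledge,
    # e.g., a gazetteer or a knowledge base lookup.
    print(ent.text, ent.label_)
# Miami GPE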

Given that MRC is an important AI task, what are the data sets available for this problem? Prior to 2015, the only good data set available for the MRC task was the MCTest data set. It had only around 2,600 questions, which made it difficult to use deep learning approaches for the task. In recent years, a number of new MRC data sets have been released. These include the Stanford Question Answering Dataset (popularly known as SQuAD), the CNN/Daily Mail QA data set, the NewsQA data set, and Microsoft's MS MARCO data set. Of these, SQuAD is one of the most popular, and a number of deep learning machine comprehension approaches have been applied to it.

SQuAD contains more than 100,000 questions, with passages retrieved from Wikipedia articles. It is a machine comprehension data set for the answer extraction task. You can explore SQuAD at https://rajpurkar.github.io/SQuAD-explorer/. The SQuAD leaderboard also links to several of the techniques that have been applied to it. A popular data set for the answer generation variant of the MRC task is the MS MARCO data set.
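For readers who want to experiment, here is a small sketch of iterating over the SQuAD v1.1 JSON file. The file name is an assumption; the nesting follows the publicly documented layout (data, then paragraphs, then qas):

# Reading (passage, question, answer-span) triples out of the SQuAD
# JSON file (v1.1 layout, downloadable from the SQuAD explorer site).
import json

def iter_squad_examples(path):
    """Yield (context, question, answer_text, answer_start) tuples."""
    with open(path) as f:
        dataset = json.load(f)["data"]
    for article in dataset:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield (context, qa["question"],
                           answer["text"], answer["answer_start"])

# Example usage (file name is assumed):
# for context, question, answer, start in iter_squad_examples("train-v1.1.json"):
#     ...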

Now that we have understood the MRC problem and the data sets available for it, let us quickly think about how we can approach it. Reading comprehension is a task done easily by most humans. We generally do a quick parse of the passage P first, then go over the question Q once, and then scan the passage P again to find the relevant sentences that answer Q. We can design a simple baseline neural QA system for the MRC task based on a similar approach.

As we did in the duplicate question detection task, we first create a fixed-length representation of the question Q by passing it through a recurrent neural network such as an LSTM. We also build a fixed-length representation of the passage P by passing it through another LSTM. We then combine the two fixed-length representations through an encoder. The combined representation is then used to predict the answer's location in the passage. This simple approach is shown in Figure 1.
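Here is a minimal sketch of this baseline, written in PyTorch as one possible framework. All the layer sizes are illustrative assumptions, and a real system would add padding, masking and training code:

# A minimal sketch of the baseline in Figure 1: encode the question
# and the passage with two LSTMs into fixed-length vectors, combine
# them, and predict the start and end positions of the answer span.
import torch
import torch.nn as nn

class BaselineReader(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, max_len=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.q_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.p_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Combine the two fixed-length encodings and predict a start
        # and an end index over the passage positions.
        self.start_out = nn.Linear(2 * hidden_dim, max_len)
        self.end_out = nn.Linear(2 * hidden_dim, max_len)

    def forward(self, passage, question):
        # The final hidden state of each LSTM is its fixed-length encoding.
        _, (q_enc, _) = self.q_lstm(self.embed(question))
        _, (p_enc, _) = self.p_lstm(self.embed(passage))
        combined = torch.cat([q_enc[-1], p_enc[-1]], dim=-1)
        return self.start_out(combined), self.end_out(combined)

model = BaselineReader(vocab_size=50000)
passage = torch.randint(0, 50000, (2, 300))   # a batch of 2 passages
question = torch.randint(0, 50000, (2, 20))   # a batch of 2 questions
start_logits, end_logits = model(passage, question)
print(start_logits.shape, end_logits.shape)   # torch.Size([2, 300]) twice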

While this approach is conceptually simple and easy to understand, it has a number of drawbacks. By first encoding the question and the passage into fixed-length representations, we compress the information contained in the text. This results in a loss of information when matching the question to the relevant sentences in the passage P that contain the answer. One way of solving this problem is to adopt a two-stage approach. The first stage is to analyse the passage along with the question Q, to identify the key supporting fact in the passage P that can help answer the question. In the second stage, we build a representation of the passage in which the key supporting fact is given the highest weightage among all the sentences contained in it. The encoded passage representation is then combined with the question and passed through the decoder to predict the answer words present in the passage. This technique is known as 'memory networks', and we will discuss it in detail in next month's column.
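To give a flavour of the first stage, here is a hedged sketch of scoring per-sentence encodings against the question encoding using dot-product attention, which is the weighting idea underlying memory networks. The tensors are random stand-ins for the LSTM outputs of the baseline above:

# Stage one sketch: score each sentence of the passage against the
# question and up-weight the key supporting fact via softmax attention.
import torch
import torch.nn.functional as F

hidden_dim = 128
num_sentences = 6

sent_encs = torch.randn(num_sentences, hidden_dim)  # one vector per sentence
q_enc = torch.randn(hidden_dim)                     # question encoding

# Dot-product relevance score of each sentence to the question,
# turned into weights that sum to one.
scores = sent_encs @ q_enc
weights = F.softmax(scores, dim=0)

# A passage representation dominated by the key supporting fact.
passage_rep = (weights.unsqueeze(1) * sent_encs).sum(dim=0)
print(weights)            # the highest weight marks the supporting sentence
print(passage_rep.shape)  # torch.Size([128])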

If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Wishing all our readers happy coding until next month!
