In this month’s col­umn, we con­tinue our dis­cus­sion on de­tect­ing du­pli­cate ques­tions in com­mu­nity ques­tion an­swer­ing fo­rums.

OpenSource For You. By: Sandya Mannarswamy

The author is an expert in systems software and is currently working as a research scientist at Conduent Labs India (formerly Xerox India Research Centre). Her interests include compilers, programming languages, file systems and nat

Based on our readers’ requests to take up a real-life ML/NLP problem with a sufficiently large data set, we have started on the problem of detecting duplicate questions in community question answering (CQA) forums, using the Quora Question Pair Dataset.

Let’s first define our task as follows: Given a pair of questions <Q1, Q2>, the task is to identify whether Q2 is a duplicate of Q1, that is, whether satisfying the informational needs expressed in Q1 would also satisfy the informational needs of Q2. In simpler terms, we can say that Q1 and Q2 are duplicates from a lay person’s perspective if both of them are asking the same thing in different surface forms.

An al­ter­na­tive def­i­ni­tion is to con­sider that Q1 and Q2 are du­pli­cates if the an­swer to Q1 will also pro­vide the an­swer to Q2. How­ever, we will not con­sider the sec­ond def­i­ni­tion since we are con­cerned only with analysing the in­for­ma­tional needs ex­pressed in the ques­tions them­selves and have no ac­cess to an­swer text. There­fore, let’s de­fine our task as a bi­nary clas­si­fi­ca­tion prob­lem, where one of the two la­bels (du­pli­cate or non-du­pli­cate) needs to be pre­dicted for each given ques­tion pair, with the re­stric­tion that only the ques­tion text is avail­able for the task and not an­swer text.
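To make the task definition concrete, here is a minimal sketch of how labelled question pairs could be held in memory. The column names question1, question2 and is_duplicate are an assumption based on the publicly released Quora file, and the QuestionPair class, the load_pairs helper and the two sample rows (taken from the examples discussed in this column) are purely illustrative.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class QuestionPair:
    q1: str
    q2: str
    is_duplicate: bool  # True = duplicate, False = non-duplicate


# Tiny in-memory stand-in for the tab-separated data set file.
# Assumption: the real file uses these column names.
SAMPLE_TSV = (
    "question1\tquestion2\tis_duplicate\n"
    "Do omega-3 fatty acids, normally available as fish oil supplements, "
    "help prevent cancer?\tDo omega-3 fatty acids help prevent cancer?\t1\n"
    "What are the ways of investing in the share market?\t"
    "What are the ways of investing in the share market in India?\t0\n"
)


def load_pairs(handle):
    """Read tab-separated rows into QuestionPair objects (only question text and label)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        QuestionPair(row["question1"], row["question2"], row["is_duplicate"] == "1")
        for row in reader
    ]


pairs = load_pairs(io.StringIO(SAMPLE_TSV))
for pair in pairs:
    print(pair.is_duplicate, "|", pair.q1, "|", pair.q2)
```

Note that only the question text and the binary label are kept, which mirrors the restriction that no answer text is available for the task.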

As I pointed out in last month’s column, a number of NLP problems are closely related to duplicate question detection. The general consensus is that duplicate question detection can be solved as a by-product of the techniques used for these problems. Detecting semantic text similarity and recognising textual entailment are the closest in nature to duplicate question detection. However, given that the goal of each of these problems is distinct from that of duplicate question detection, they fail to solve the latter problem adequately. Let me illustrate this with a few example question pairs.

Example 1

Q1a: What are the ways of in­vest­ing in the share mar­ket?

Q1b: What are the ways of in­vest­ing in the share mar­ket in In­dia?

One of the state-of-the-art tools available online for detecting semantic text similarity is SEMILAR (…manticsimilar…). A freely available state-of-the-art tool for entailment recognition is the Excitement Open Platform or EOP (http://hltser…dex.php). SEMILAR gave a semantic similarity score of 0.95 for the above pair, whereas EOP reported it as textual entailment. However, Q1b asks specifically about India while Q1a does not, so these two questions have different information needs and hence are not duplicates of each other.
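The effect is easy to reproduce with a much cruder measure. The sketch below is not SEMILAR; it is a plain bag-of-words cosine similarity, swapped in only to show how a single aggregated score can come out high (above 0.9 here) for a pair that is not a duplicate. The function names tokens and bow_cosine are illustrative.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bow_cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(tokens(text_a)), Counter(tokens(text_b))
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


q1a = "What are the ways of investing in the share market?"
q1b = "What are the ways of investing in the share market in India?"

score = bow_cosine(q1a, q1b)
print(f"bag-of-words cosine similarity: {score:.3f}")
```

The one word that changes the information need, "India", barely moves the score, which is exactly the weakness of a coarse-grained similarity measure.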

Ex­am­ple 2

Q2a: In which year did McEn­roe beat Becker, who went on to be­come the youngest win­ner of the Wim­ble­don fi­nals?

Q2b: In which year did Becker beat McEn­roe and go on to be­come the youngest win­ner in the fi­nals at Wim­ble­don?

SEMILAR reported a similarity score of 0.972 and EOP marked this question pair as entailment, indicating that Q2b is entailed by Q2a. Again, these two questions are about two entirely different events, and hence are not duplicates. We hypothesise that humans are quick to see the difference by extracting the relations that are being sought in the two questions. In Q2a, the relational event is <McEnroe (subject), beat (predicate), Becker (object)>, whereas in Q2b, the relational event is <Becker (subject), beat (predicate), McEnroe (object)>, which is a different relation from that in Q2a. By scanning for a relational match/mismatch at the cross-sentence level, humans quickly mark this pair as non-duplicate, even though there is considerable textual similarity across the text pair. It is also possible that the entailment system gets confused due to sub-clauses being entailed across the two questions (namely, the clause, “Becker went on to become youngest winner”). This lends weight to our claim that while semantic similarity matching and textual entailment are closely related to the duplicate question detection task, they cannot be used directly as solutions for it.
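The contrast between surface overlap and relational mismatch can be sketched in a few lines. The triples below are written by hand from the discussion above; a real system would need a relation extractor to produce them, which is not shown here. The helper names bag_of_words and jaccard are illustrative.

```python
import re


def bag_of_words(text):
    """Set of lower-cased word tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    return len(set_a & set_b) / len(set_a | set_b)


q2a = ("In which year did McEnroe beat Becker, who went on to become "
       "the youngest winner of the Wimbledon finals?")
q2b = ("In which year did Becker beat McEnroe and go on to become "
       "the youngest winner in the finals at Wimbledon?")

# Hand-written <subject, predicate, object> triples (assumption: in practice
# these would come from a relation extraction step).
relations_q2a = {("McEnroe", "beat", "Becker")}
relations_q2b = {("Becker", "beat", "McEnroe")}

word_overlap = jaccard(bag_of_words(q2a), bag_of_words(q2b))
shared_relations = relations_q2a & relations_q2b
relations_match = relations_q2a == relations_q2b

print(f"word overlap (Jaccard): {word_overlap:.2f}")
print("shared relations:", shared_relations)
print("relations match:", relations_match)
```

Most of the words are shared, yet no relation is common to the two questions, because a triple is order-sensitive while a bag of words is not.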

There are subtle but important differences in the relations between entities, and in the cross-sentence word-level interactions between the two sentences, which mark them as non-duplicates when examined by humans. We can hypothesise that humans use these additional checks on top of the coarse-grained similarity comparison they do in their minds when they look at these questions in isolation, and then arrive at the decision of whether they are duplicates or not. If we consider the example we discussed in Q2a and Q2b, the relation between the entities in Question 2a does not hold in Question 2b and, hence, if these cross-sentence semantic relations are checked, it would be possible to determine that this pair is not a duplicate. It is also important to note that not all mismatches are equally important. Let us consider another example.

Ex­am­ple 3

Q3a: Do omega-3 fatty acids, nor­mally avail­able as fish oil sup­ple­ments, help pre­vent can­cer?

Q3b: Do omega-3 fatty acids help pre­vent can­cer?

Though Q3b does not mention the fact that omega-3 fatty acids are typically available as fish oil supplements, its information needs are satisfied by the answer to Q3a, and hence these two questions are duplicates. From a human perspective, we hypothesise that the phrase “normally available as fish oil supplements” is not seen as essential to the overall compositional meaning of Q3a; so we can quickly discard this information when we refine our representation of the first question while making a pass over the second question. Also, we can hypothesise that humans use cross-sentence word-level interactions to quickly check whether similar information needs are being met in the two questions.

Ex­am­ple 4

Q4a: How old was Becker when he won the first time at Wim­ble­don?

Q4b: What was Becker’s age when he was crowned as the youngest win­ner at Wim­ble­don?

Though the surface forms of the two questions are quite dissimilar, humans tend to compare cross-sentence word-level interactions such as (<old, age>, <won, crowned>) in the context of the entity in question, namely, Becker, to conclude that these two questions are duplicates. Hence, any system which attempts to solve the task of duplicate question detection should not depend blindly on a single aggregated coarse-grained similarity measure to compare the sentences, but instead should consider the following:

Do re­la­tions that ex­ist in the first ques­tion hold true for the sec­ond ques­tion?

Are there word level in­ter­ac­tions across the two ques­tions which cause them to have dif­fer­ent in­for­ma­tional needs (even if the rest of the ques­tion is pretty much iden­ti­cal across the two sen­tences)?
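One way to look at word-level interactions across two questions is to compute a similarity score for every pair of words, one from each question. The sketch below does this for a few content words of Q4a and Q4b. The three-dimensional vectors are invented by hand for illustration only and do not come from any trained embedding; the function names are likewise hypothetical.

```python
import math

# Hand-made toy vectors (assumption: real systems would use trained embeddings).
TOY_VECTORS = {
    "old": (0.9, 0.1, 0.0),
    "age": (0.8, 0.2, 0.0),
    "won": (0.0, 0.9, 0.1),
    "crowned": (0.1, 0.8, 0.2),
    "becker": (0.0, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def interaction_matrix(words_a, words_b, vectors):
    """Rows follow words_a, columns follow words_b, cells hold cosine similarity."""
    return [[cosine(vectors[a], vectors[b]) for b in words_b] for a in words_a]


def best_matches(words_a, words_b, vectors):
    """For each word of the first question, the most similar word of the second."""
    matrix = interaction_matrix(words_a, words_b, vectors)
    return {
        a: words_b[max(range(len(words_b)), key=row.__getitem__)]
        for a, row in zip(words_a, matrix)
    }


q4a_words = ["old", "becker", "won"]
q4b_words = ["becker", "age", "crowned"]

matches = best_matches(q4a_words, q4b_words, TOY_VECTORS)
print(matches)
```

With these toy vectors, each content word of Q4a finds a close counterpart in Q4b. A word with no good counterpart would be a candidate for the kind of mismatch that changes the information need, although, as Example 3 shows, not every unmatched phrase matters.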

Now that we have a good idea of the re­quire­ments for a rea­son­able du­pli­cate ques­tion de­tec­tion sys­tem, let’s look at how we can start im­ple­ment­ing this so­lu­tion. For the sake of sim­plic­ity, let us as­sume that our data set con­sists of sin­gle sen­tence ques­tions. Our sys­tem for du­pli­cate de­tec­tion first needs to create a rep­re­sen­ta­tion for each in­put sen­tence and then feed the rep­re­sen­ta­tions for each of the two ques­tions to a clas­si­fier, which will decide whether they are du­pli­cates or not, by com­par­ing the rep­re­sen­ta­tions. The high-level block di­a­gram of such a sys­tem is shown in Fig­ure 1.

First, we need to create an input representation for each question sentence. We have a number of choices for this module. As is common in most neural network based approaches, we use word embeddings to create a sentence representation. We can either use pre-trained word embeddings such as Word2Vec or GloVe embeddings, or we can train our own word embeddings using the training data as our corpus. For each word in a sentence, we look up its corresponding word embedding vector and form the sentence matrix. Thus, each question (sentence) is represented by its sentence matrix, in which each row is the word embedding vector of the corresponding word in the sentence. We now need to convert the sentence matrix into a fixed length input representation vector.
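The lookup step can be sketched as follows. The embedding table here is a hand-made toy with three dimensions per word, standing in for Word2Vec, GloVe or corpus-trained vectors; the values and the sentence_matrix helper are assumptions made for illustration.

```python
# Toy embedding table (assumption: invented values, 3 dimensions per word).
TOY_EMBEDDINGS = {
    "do": (0.1, 0.0, 0.2),
    "omega-3": (0.7, 0.3, 0.1),
    "fatty": (0.6, 0.4, 0.0),
    "acids": (0.5, 0.5, 0.1),
    "help": (0.0, 0.2, 0.6),
    "prevent": (0.1, 0.1, 0.8),
    "cancer": (0.3, 0.9, 0.4),
}


def sentence_matrix(sentence, embeddings):
    """Return one row per word; each row is that word's embedding vector.

    Words missing from the table raise KeyError in this sketch.
    """
    words = sentence.lower().rstrip("?").split()
    return [list(embeddings[word]) for word in words]


q3b = "Do omega-3 fatty acids help prevent cancer?"
matrix = sentence_matrix(q3b, TOY_EMBEDDINGS)
print(len(matrix), "rows x", len(matrix[0]), "columns")
```

The number of rows varies with the length of the question while the number of columns stays fixed, which is why a further step is needed to obtain a fixed length vector.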

One of the popular ways of representing an input sentence is by creating a sequence-to-sequence representation using recurrent neural networks. Given a sequence of input words (this constitutes the sentence), we pass this sequence through a recurrent neural network (RNN), which produces an output sequence. While an RNN generates an output for each input in the sequence, we are only interested in the final aggregated representation of the input sequence. Hence, we take the output at the last time step of the RNN and use it as our sentence representation. We can use either vanilla RNNs, or gated recurrent units (GRU), or long short-term memory (LSTM) units for creating a fixed length representation from a given input sequence. Given that LSTMs have been used quite successfully in many NLP tasks, we decided to use LSTMs to create the fixed length representation of the question.
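To show how the last output of a recurrent network yields a fixed length vector, here is a minimal sketch in plain Python. For brevity it uses a vanilla RNN cell with a tanh activation rather than an LSTM, and its weights are random and untrained, so the numbers it produces carry no meaning. The names make_rnn and rnn_encode are illustrative.

```python
import math
import random


def make_rnn(input_dim, hidden_dim, seed=0):
    """Create random (untrained) weights for a vanilla RNN cell."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_x": rand_matrix(hidden_dim, input_dim),   # input-to-hidden weights
        "W_h": rand_matrix(hidden_dim, hidden_dim),  # hidden-to-hidden weights
        "b": [0.0] * hidden_dim,
    }


def rnn_encode(sentence_matrix, params):
    """Run the cell over every word vector and return the final hidden state."""
    hidden = [0.0] * len(params["b"])
    for x in sentence_matrix:
        # h_t = tanh(W_x x_t + W_h h_{t-1} + b)
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(wx_row, x))
                + sum(w * hi for w, hi in zip(wh_row, hidden))
                + bias
            )
            for wx_row, wh_row, bias in zip(params["W_x"], params["W_h"], params["b"])
        ]
    return hidden


params = make_rnn(input_dim=3, hidden_dim=4)
short_question = [[0.1, 0.2, 0.3], [0.0, 0.5, 0.1]]
long_question = short_question + [[0.4, 0.1, 0.0], [0.2, 0.2, 0.2], [0.9, 0.0, 0.3]]

short_rep = rnn_encode(short_question, params)
long_rep = rnn_encode(long_question, params)
print(len(short_rep), len(long_rep))
```

Both questions end up as vectors of the same length (the hidden size), regardless of how many words they contain. An LSTM replaces the single tanh update with gated updates but is used in the same way.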

The last stage output from each of the two LSTMs (one LSTM for each of the two questions) serves as the representation of the corresponding input question. We then feed the two representations to a multi-layer perceptron (MLP) classifier. An MLP classifier is nothing but a fully connected multi-layer feed-forward neural network. Given that we have a two-class prediction problem, the last stage of the MLP classifier is a two-unit softmax, the output of which gives the probabilities for each of the two output classes. This is shown in the overall block diagram in Figure 1.
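The classifier stage can be sketched in plain Python as well. The sketch assumes that the two question representations are simply concatenated before being fed to the MLP, which is one common choice and not something fixed by the description above. It uses a single hidden layer with ReLU activation and random, untrained weights; the names make_mlp and mlp_predict are illustrative.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def make_mlp(input_dim, hidden_dim, seed=0):
    """Random (untrained) weights for one hidden layer and a two-unit output."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": rand_matrix(hidden_dim, input_dim), "b1": [0.0] * hidden_dim,
        "W2": rand_matrix(2, hidden_dim), "b2": [0.0, 0.0],
    }


def mlp_predict(rep_q1, rep_q2, params):
    """Return [P(non-duplicate), P(duplicate)] for a pair of question vectors."""
    features = rep_q1 + rep_q2  # assumption: plain concatenation of the two vectors
    hidden = [max(0.0, h) for h in dense(features, params["W1"], params["b1"])]  # ReLU
    return softmax(dense(hidden, params["W2"], params["b2"]))


rep_q1 = [0.2, -0.1, 0.4, 0.0]   # stand-ins for the two LSTM outputs
rep_q2 = [0.3, -0.2, 0.5, 0.1]
probs = mlp_predict(rep_q1, rep_q2, make_mlp(input_dim=8, hidden_dim=5))
print(probs)
```

Until the weights are trained on labelled pairs, the two probabilities are arbitrary; the point is only that the two-unit softmax always returns a valid distribution over the two classes.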

Now that we have discussed the overall structure of our implementation, I request our readers to implement this using a deep learning library of their choice. I would recommend using TensorFlow, PyTorch or Keras. We will discuss the TensorFlow code for this problem in next month’s column. Here are a few questions for our readers to consider in their implementation:

How would you handle ‘out of vocabulary’ words in the test data? Basically, if there are words which do not have embeddings in Word2Vec or GloVe, or even in your corpus-trained embeddings, how would you represent them?

Given that question sentences can be of different lengths, how would you handle the variable length sentences?

On what basis would you decide how many hidden layers should be present in the MLP classifier and the number of hidden units in each layer?
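As a starting point for the first two questions, here is a sketch of one common option among several: map every unseen word to a shared unknown token, and pad or truncate every question to a fixed number of word ids. The token names, the build_vocab and encode helpers, and the maximum length of 6 are all assumptions chosen for illustration, and readers are encouraged to try alternatives.

```python
PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def build_vocab(sentences):
    """Assign an integer id to every word seen in the training sentences."""
    vocab = {PAD: 0, UNK: 1}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab, max_len):
    """Map words to ids (unseen words become UNK), then truncate or pad to max_len."""
    ids = [vocab.get(word, vocab[UNK]) for word in sentence.lower().split()]
    ids = ids[:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


vocab = build_vocab(["how old was becker"])
seen_in_training = encode("how old was becker", vocab, max_len=6)
with_unseen_word = encode("how old was federer", vocab, max_len=6)
too_long = encode("how old was becker how old was becker", vocab, max_len=6)
print(seen_in_training, with_unseen_word, too_long)
```

The unknown and padding ids would each get their own row in the embedding table, and many libraries can additionally mask padded positions so that they do not influence the final LSTM output.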

I suggest that our readers (specifically those who have just started exploring ML and NLP) try implementing the solution and share the results in a Python Jupyter notebook. Please do send me the pointer to your notebook and we can discuss it later in this column.

If you have any favourite pro­gram­ming ques­tions/ soft­ware top­ics that you would like to dis­cuss on this fo­rum, please send them to me, along with your so­lu­tions and feed­back, at sandyas­m_AT_ya­hoo_DOT_­com. Wish­ing all our read­ers a very happy and pros­per­ous new year!

Sandya Man­nar­swamy

Fig­ure 1: Block di­a­gram for du­pli­cate ques­tion de­tec­tion sys­tem
