In this month’s column, we continue our discussion on detecting duplicate questions in community question-answering forums.

OpenSource For You, by Sandya Mannarswamy

Let’s continue exploring the topic we started out on in last month’s column, in which we discussed the problem of detecting duplicate questions in community question answering (CQA) forums using Quora’s question pair data set.

Given a pair of questions <Q1, Q2>, the task is to identify whether Q2 is a duplicate of Q1. Our system for duplicate detection first needs to create a representation for each input sentence, and then feed the representations of the two questions to a classifier, which decides whether they are duplicates by comparing the representations.

In this month’s column, I will provide some of the skeleton code functions which can help to implement this solution. I have deliberately not provided the complete code for the problem, as I would like readers to build their own solutions and become familiar with creating simple neural network models from scratch.

As discussed in last month’s column, while there are multiple methods of creating a sentence representation, we can simply use a concatenation of the word embeddings of the individual words in the question sentence to create a question embedding representation. We can either learn word embeddings specific to the task of duplicate question detection (provided our corpus is large and general enough), or we can use pre-trained word embeddings such as Word2vec.

In our example, we will use Word2vec embeddings and concatenate the embeddings of the individual words to represent the question sentence. Assuming that we use Word2vec embeddings of dimension D, each input question can be represented by an N x D matrix, where N is the number of words in the question and D is the embedding size. We need to feed the input representation for each of the two questions being compared to the neural network model.
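The concatenation of word embeddings into an N x D question matrix can be sketched in plain NumPy. The embedding table below is a hypothetical toy used only for illustration; the real code would load pretrained Word2vec vectors instead:

```python
import numpy as np

# Toy embedding table; a real system would load pretrained
# Word2vec vectors (typically D = 300) instead.
D = 4
w2v_wordmap = {w: np.random.rand(D)
               for w in ["what", "is", "machine", "learning"]}

def question_matrix(tokens):
    # Stack the D-dimensional vector of each word row by row,
    # giving an N x D matrix for an N-word question
    return np.stack([w2v_wordmap[t] for t in tokens])

m = question_matrix(["what", "is", "machine", "learning"])
print(m.shape)  # (4, 4): N words by D embedding dimensions
```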

Before we look at the actual code for the neural network model, we need to decide which deep learning framework we will use to implement it. There are a number of choices, such as Theano, MXNet, Keras, CNTK, Torch, PyTorch, Caffe, TensorFlow, etc. A brief comparison of some of the popular deep learning frameworks is covered in the article at https://deeplearning4j.org/compare-dl4j-tensorflow-pytorch.

In selecting a deep learning framework for a project, one needs to consider ease of programming, maintainability of code, and long-term support for the framework. Some of the frameworks, such as Theano, come from academic groups; hence their support may be time-limited. For this project, we decided to go with TensorFlow, given its widespread adoption in the industry (it is sponsored by Google) and its ease of use. We will assume that our readers are familiar with TensorFlow (a quick introduction to it can be found at https://www.tensorflow.org/get_started/get_started).

Let us assume that we have a binary file which contains the word-to-embedding mapping for the Word2vec embeddings (pre-trained word embeddings are available from https://code.google.com/archive/p/word2vec/ if you want to use the Word2vec model, or https://nlp.stanford.edu/projects/glove/ if you want to use the GloVe vectors).

First, let’s read our training corpus and build a vocabulary list that contains all the words in our training corpus. Next, let’s build a map which maps each word to a valid Word2vec embedding. Shown below is the skeleton code for this:

def create_word_map(w2v_file, vocab_list):
    # Load the pretrained embedding array and map each
    # vocabulary word to its embedding vector
    data = np.load(w2v_file)
    glove_array = data['glove']
    for word in vocab_list:
        idx = vocab_list.index(word)
        w2v_wordmap[word] = glove_array[idx]
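The vocabulary list itself can be built with one pass over the training questions. Here is a minimal sketch, assuming a naive whitespace tokeniser (a real tokeniser should also strip punctuation):

```python
def build_vocab(questions):
    # Collect the set of distinct tokens across all training questions
    vocab = set()
    for q in questions:
        vocab.update(q.lower().split())
    return sorted(vocab)

vocab_list = build_vocab(["What is AI", "What is ML"])
print(vocab_list)  # ['ai', 'is', 'ml', 'what']
```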

Next, we will create the sentence matrix associated with each question, as follows:

def question2sequence(ques):
    tokens = get_tokens(ques)
    rows = []  # represents the sentence embedding matrix
    # Greedy search for tokens
    for token in tokens:
        assert (token in w2v_wordmap), token + " not found in w2v_map"
        rows.append(w2v_wordmap[token])
    if len(tokens) < MAX_QN_LEN:
        # question is too short; we need to pad the question
        # up to max sequence length with the _UNK embedding
        j = MAX_QN_LEN - len(tokens)
        word = _UNK
        while j > 0:
            rows.append(w2v_wordmap[word])
            j = j - 1
    return rows

Now, let’s convert this sentence matrix into a fixed-length vector representation of the question. Given a sequence of input words (this constitutes the question sentence), we pass this sequence through a recurrent neural network (RNN) and create an output sequence. We can use either vanilla RNNs, gated recurrent units (GRUs) or long short-term memory (LSTM) units for creating a fixed-length representation from a given input sequence. Given that LSTMs have been used quite successfully in many NLP tasks, we decided to use them to create the fixed-length representation of the question.

import tensorflow as tf

question1 = tf.placeholder(tf.float32, [N, l_q1, D], 'question1')
lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_lstm_units)
value, state = tf.nn.dynamic_rnn(lstm_cell, question1, dtype=tf.float32)

While the RNN generates an output for each input in the sequence, we are only interested in the final aggregated representation of the input sequence. Hence, we take the output of the LSTM at the last time step and use it as our sentence representation. Note that the last time step corresponds to the last word of the sentence being fed to the LSTM, so the LSTM output at that step is an aggregated representation of the current word and all the words that come before it; it therefore represents the complete sentence.

value1 = tf.transpose(value, [1, 0, 2])
lstm_output = tf.gather(value1, int(value1.get_shape()[0]) - 1)

Just as we obtained a fixed-length representation for Question 1, we also create a fixed-length representation for Question 2, using a second LSTM. The last-stage output from each of the two LSTMs (one LSTM for each of the two questions) represents the corresponding input question. We can then concatenate these two representations and feed the result to the multi-layer perceptron (MLP) classifier.
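In the TensorFlow graph this concatenation would be a tf.concat along the feature axis; in plain NumPy terms it simply joins the two fixed-length vectors into one feature vector. A sketch with a hypothetical LSTM size:

```python
import numpy as np

num_lstm_units = 64  # hypothetical LSTM size
q1_repr = np.random.rand(num_lstm_units)  # last-step output, question 1
q2_repr = np.random.rand(num_lstm_units)  # last-step output, question 2

# The MLP classifier sees one vector of length 2 * num_lstm_units
mlp_input = np.concatenate([q1_repr, q2_repr])
print(mlp_input.shape)  # (128,)
```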

An MLP classifier is nothing but a fully connected multi-layer feed-forward neural network. Given that we have a two-class prediction problem, the last stage of the MLP classifier is a two-unit softmax, whose output gives the probabilities for each of the two output classes. Here is the skeletal code for an MLP classifier with three densely connected feed-forward layers, of 256, 128 and 2 units respectively:

predict_layer_one_out = tf.layers.dense(lstm_output, units=256,
                                        name="prediction_layer_one")
# (a dropout layer applied to predict_layer_one_out is elided here)
predict_layer_two_out = tf.layers.dense(dropout_predict_layer_one_out,
                                        units=128,
                                        name="prediction_layer_two")
predict_layer_logits = tf.layers.dense(predict_layer_two_out,
                                       units=2,
                                       name="final_output_layer")

Here are a couple of questions for our readers to think about:

How do you decide on the number of hidden layers and the number of units in each hidden layer for your MLP classifier?

In our problem, we are doing binary classification, as we need to predict whether a question is a duplicate or not. How would this code change if you had to predict one out of K different classes (for example, if you are trying to predict which category a particular question may belong to)?

The last layer output is used as the input to a TensorFlow loss computation node, which computes the cross-entropy loss using the ground truth labels, as shown below:

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=predict_layer_logits, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)

The network is then trained with ground truth labels during the training phase to select network weights such that the cross-entropy loss is minimised.

We also need to add code that can compute the accuracy during the training phase.
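One way to compute accuracy, sketched here in NumPy (the TensorFlow version would apply tf.argmax and tf.equal to the logits and label tensors), is:

```python
import numpy as np

def batch_accuracy(logits, labels):
    # A prediction is correct when the argmax of the logits agrees
    # with the argmax of the one-hot ground-truth label
    preds = np.argmax(logits, axis=1)
    truth = np.argmax(labels, axis=1)
    return float(np.mean(preds == truth))

logits = np.array([[2.0, 1.0], [0.2, 0.9], [1.5, 0.1]])
labels = np.array([[1, 0], [0, 1], [0, 1]])
print(batch_accuracy(logits, labels))  # 2 correct out of 3
```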

As we had discussed in earlier columns on neural networks, gradient descent techniques are typically used to learn the network parameters/weights. Typically, mini-batch gradient descent is used for weight updates during the training process. Here are a couple of questions for our readers:

Why do we prefer mini-batch gradient descent over full-batch gradient descent or stochastic gradient descent?

How do you choose a good batch size for your implementation?

Once we have built the neural network model, we can train it with labelled examples. Remember that each training loop consists of going over all the training samples once; this is typically known as an ‘epoch’. Each epoch consists of several batch-sized runs, wherein at the end of each batch, the computed gradients are used to update the network weights/parameters. In order to ensure that the network is learning correctly, we need to measure the total loss at the end of each epoch and verify that it decreases with each successive epoch.
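The epoch/batch structure described above can be sketched as a plain Python driver loop; train_step is a hypothetical stand-in for whatever function runs one optimiser update on a batch and returns its loss:

```python
import random

def run_training(samples, num_epochs, batch_size, train_step):
    for epoch in range(num_epochs):
        random.shuffle(samples)  # fresh sample order each epoch
        epoch_loss = 0.0
        # one epoch = one pass over all samples, in batch-sized chunks
        for i in range(0, len(samples), batch_size):
            epoch_loss += train_step(samples[i:i + batch_size])
        # total loss per epoch should fall if the network is learning
        print("epoch %d: total loss %.4f" % (epoch, epoch_loss))
```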

Now we need to decide when we should stop the training process. One simple but naïve way is to stop after a fixed number of epochs. Another option is to stop training after we have reached a training accuracy of 100 per cent.

Here is a question for our readers: while we can decide to stop training after some fixed number of epochs or after reaching a training accuracy of 100 per cent, what are the disadvantages associated with each of these approaches? We will discuss more on this topic, as well as on the inference phase, in next month’s column.

If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Wishing all our readers happy coding until next month!
