OpenSource For You

CodeSport

In this month’s column, we continue our discussion on detecting duplicate questions in community question answering forums.

By: Sandya Mannarswamy. The author is an expert in systems software and is currently working as a research scientist at Conduent Labs India (formerly Xerox India Research Centre). Her interests include compilers, programming languages, file systems and natural language processing.

Based on our readers’ requests to take up a real-life ML/NLP problem with a sufficiently large data set, we started on the problem of detecting duplicate questions in community question answering (CQA) forums, using the Quora Question Pairs dataset.

Let’s first define our task as follows: given a pair of questions <Q1, Q2>, the task is to identify whether Q2 is a duplicate of Q1, in the sense that satisfying the informational need expressed in Q1 would also satisfy the informational need expressed in Q2. In simpler terms, Q1 and Q2 are duplicates, from a lay person’s perspective, if both of them are asking the same thing in different surface forms.

An alternative definition is to consider Q1 and Q2 duplicates if the answer to Q1 also provides the answer to Q2. However, we will not use this second definition, since we are concerned only with analysing the informational needs expressed in the questions themselves and have no access to the answer text. Therefore, let’s define our task as a binary classification problem, in which one of two labels (duplicate or non-duplicate) must be predicted for each given question pair, with the restriction that only the question text is available for the task and not the answer text.

As I pointed out in last month’s column, a number of NLP problems are closely related to duplicate question detection, and the general consensus is that duplicate question detection can be solved as a by-product of the techniques developed for them. Detecting semantic text similarity and recognising textual entailment are the closest in nature to duplicate question detection. However, given that the goal of each of these problems is distinct from that of duplicate question detection, they fail to solve the latter adequately. Let me illustrate this with a few example question pairs.

Example 1

Q1a: What are the ways of investing in the share market?

Q1b: What are the ways of investing in the share market in India?

One of the state-of-the-art tools available online for detecting semantic text similarity is SEMILAR (http://www.semanticsimilarity.org/). A freely available state-of-the-art tool for entailment recognition is the Excitement Open Platform, or EOP (http://hltservices4.fbk.eu/eop/index.php). SEMILAR gave a semantic similarity score of 0.95 for the above pair, whereas EOP reported it as textual entailment. However, these two questions have different information needs and hence are not duplicates of each other.

Example 2

Q2a: In which year did McEnroe beat Becker, who went on to become the youngest winner of the Wimbledon finals?

Q2b: In which year did Becker beat McEnroe and go on to become the youngest winner in the finals at Wimbledon?

SEMILAR reported a similarity score of 0.972, and EOP marked this question pair as entailment, indicating that Q2b is entailed by Q2a. Again, these two questions are about two entirely different events, and hence are not duplicates. We hypothesise that humans are quick to see the difference by extracting the relations being sought in the two questions. In Q2a, the relational event is <McEnroe (subject), beat (predicate), Becker (object)>, whereas in Q2b it is <Becker (subject), beat (predicate), McEnroe (object)>, which is a different relation from that in Q2a. By quickly scanning for a relational match/mismatch at the cross-sentence level, humans mark this pair as non-duplicate, even though there is considerable textual similarity across the text pair. It is also possible that the entailment system gets confused because sub-clauses are entailed across the two questions (namely, the clause “Becker went on to become the youngest winner”). This lends weight to our claim that while semantic similarity matching and textual entailment are problems closely related to the duplicate question detection task, they cannot be used directly as solutions for it.
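As an aside, the relational check hypothesised above can be roughly approximated in code. The sketch below is not part of the system we will build in this column; it assumes the spaCy library and its small English model are installed, and the function name is my own.

from typing import List, Tuple

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_relations(question: str) -> List[Tuple[str, str, str]]:
    """Return <subject, predicate, object> triples found in the question."""
    doc = nlp(question)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            # Collect the subjects and objects attached to this verb.
            subjects = [c.text for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c.text for c in token.children if c.dep_ in ("dobj", "obj")]
            for subj in subjects:
                for obj in objects:
                    triples.append((subj, token.lemma_, obj))
    return triples

# extract_relations("In which year did McEnroe beat Becker?") would be expected
# to yield something like [("McEnroe", "beat", "Becker")], while the same call on
# Q2b would yield the reversed triple, exposing the relational mismatch.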

There are subtle but important differences in the relations between entities, and in the cross-sentence word-level interactions between the two sentences, which mark them as non-duplicates when examined by humans. We can hypothesise that humans apply these additional checks on top of the coarse-grained similarity comparison they make when looking at the questions in isolation, and then arrive at the decision of whether the pair is a duplicate or not. In the example of Q2a and Q2b, the relation between the entities in Q2a does not hold good in Q2b; hence, if cross-sentence semantic relations are checked, it is possible to determine that this pair is not a duplicate. It is also important to note that not all mismatches are equally important. Let us consider another example.

Example 3

Q3a: Do omega-3 fatty acids, normally available as fish oil supplements, help prevent cancer?

Q3b: Do omega-3 fatty acids help prevent cancer?

Though Q3b does not mention the fact that omega-3 fatty acids are typically available as fish oil supplements, its information needs are satisfied by the answer to Q3a, and hence these two questions are duplicates. From a human perspective, we hypothesise that the fragment “normally available as fish oil supplements” is not seen as essential to the overall compositional meaning of Q3a; we quickly discard this information when we refine the representation of the first question while doing a pass over the second question. We can also hypothesise that humans use cross-sentence word-level interactions to quickly check whether similar information needs are being met in the two questions.

Example 4

Q4a: How old was Becker when he won the first time at Wimbledon?

Q4b: What was Becker’s age when he was crowned as the youngest winner at Wimbledon?

Though the surface forms of the two questions are quite dissimilar, humans tend to compare cross-sentence word-level interactions such as <old, age> and <won, crowned>, in the context of the entity in question (namely, Becker), to conclude that these two questions are duplicates. Hence, any system which attempts to solve the task of duplicate question detection should not depend blindly on a single aggregated coarse-grained similarity measure to compare the sentences, but should instead consider the following:

Do relations that exist in the first question hold true for the second question?

Are there word-level interactions across the two questions which cause them to have different informational needs (even if the rest of the question is pretty much identical across the two sentences)?

Now that we have a good idea of the requirements for a reasonable duplicate question detection system, let’s look at how we can start implementing a solution. For the sake of simplicity, let us assume that our data set consists of single-sentence questions. Our system for duplicate detection first needs to create a representation for each input sentence, and then feed the representations of the two questions to a classifier, which decides whether they are duplicates or not by comparing the representations. The high-level block diagram of such a system is shown in Figure 1.

First, we need to create an input representation for each question sentence. We have a number of choices for this module. As is common in most neural network based approaches, we use word embeddings to create a sentence representation. We can either use pre-trained word embeddings, such as Word2Vec or GloVe embeddings, or train our own word embeddings using the training data as our corpus. For each word in a sentence, we look up its corresponding word embedding vector and form the sentence matrix. Thus, each question (sentence) is represented by its sentence matrix, a matrix in which each row is the word-embedding vector of the corresponding word in the sentence. We then need to convert this sentence-embedding matrix into a fixed-length input representation vector.
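As a minimal sketch of this lookup step (the dictionary name embeddings, the helper sentence_matrix and the 300-dimension size are illustrative assumptions, not the column’s prescribed code), the sentence matrix can be formed as follows:

import numpy as np

def sentence_matrix(sentence, embeddings):
    """Stack the embedding vectors of the words in `sentence`, one row per word.
    `embeddings` is assumed to map words to fixed-size numpy vectors (e.g.,
    loaded from pre-trained GloVe/Word2Vec files). Words missing from the
    vocabulary are a question we return to at the end of this column."""
    return np.vstack([embeddings[word] for word in sentence.lower().split()])

# Example with hypothetical 3-d embeddings, just to show the shapes:
# embeddings = {"what": np.ones(3), "is": np.zeros(3), "ai": np.full(3, 0.5)}
# sentence_matrix("What is AI", embeddings).shape  ->  (3, 3)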

One of the popular ways of representing an input sentence is to create a sequence representation using recurrent neural networks. Given the sequence of input words that constitutes the sentence, we pass this sequence through a recurrent neural network (RNN) and create an output sequence. While the RNN generates an output for each element of the input sequence, we are only interested in the final aggregated representation of that sequence. Hence, we take the output of the last unit of the RNN and use it as our sentence representation. We can use vanilla RNNs, gated recurrent units (GRUs) or long short-term memory (LSTM) units for creating a fixed-length representation from a given input sequence. Given that LSTMs have been used quite successfully in many NLP tasks, we decided to use LSTMs to create the fixed-length representation of the question.
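A hedged sketch of this choice in Keras (tf.keras) is shown below; the sequence length, embedding size and number of LSTM units are assumptions made purely for illustration, not values prescribed in this column.

import tensorflow as tf

MAX_LEN, EMBEDDING_DIM = 30, 300  # assumed question length and embedding size

question = tf.keras.Input(shape=(MAX_LEN, EMBEDDING_DIM))

# With return_sequences=True the LSTM would emit one output vector per word...
per_step_outputs = tf.keras.layers.LSTM(128, return_sequences=True)(question)

# ...whereas the default (return_sequences=False) gives only the output of the
# last unit, which is the fixed-length sentence representation we want.
sentence_vector = tf.keras.layers.LSTM(128)(question)

encoder = tf.keras.Model(question, sentence_vector)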

The output of the last stage of each of the two LSTMs (one LSTM for each of the two questions) is the corresponding input question representation. We then feed the two representations to a multi-layer perceptron (MLP) classifier. An MLP classifier is nothing but a fully connected multi-layer feed-forward neural network. Given that we have a two-class prediction problem, the last stage of the MLP classifier is a two-unit softmax, the output of which gives the probabilities of the two output classes. This is shown in the overall block diagram in Figure 1.
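Putting the pieces together, here is one possible hedged sketch of the block diagram in Figure 1 using Keras (tf.keras); the layer sizes, the choice of two separate (non-weight-sharing) LSTMs and the optimiser are my own assumptions, not specifications from the column.

import tensorflow as tf

MAX_LEN, EMBEDDING_DIM, HIDDEN = 30, 300, 128

# The two inputs are the (padded) sentence-embedding matrices of Q1 and Q2.
q1_in = tf.keras.Input(shape=(MAX_LEN, EMBEDDING_DIM), name="question1")
q2_in = tf.keras.Input(shape=(MAX_LEN, EMBEDDING_DIM), name="question2")

# One LSTM per question, as described in the text; sharing weights between the
# two encoders (a Siamese setup) would be another reasonable design choice.
q1_vec = tf.keras.layers.LSTM(HIDDEN)(q1_in)
q2_vec = tf.keras.layers.LSTM(HIDDEN)(q2_in)

# MLP classifier: fully connected layers over the two concatenated question
# representations, ending in a two-unit softmax.
merged = tf.keras.layers.Concatenate()([q1_vec, q2_vec])
hidden = tf.keras.layers.Dense(64, activation="relu")(merged)
output = tf.keras.layers.Dense(2, activation="softmax")(hidden)

model = tf.keras.Model(inputs=[q1_in, q2_in], outputs=output)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

The softmax output gives the probabilities of the duplicate and non-duplicate classes for each question pair.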

Now that we have discussed the overall structure of our implementation, I request our readers to implement it using a deep learning library of their choice. I would recommend using TensorFlow, PyTorch or Keras. We will discuss the TensorFlow code for this problem in next month’s column. Here are a few questions for our readers to consider in their implementation:

How would you handle ‘out of vocabulary’ words in the test data? Basically, if there are words which do not have embeddings in Word2Vec/GloVe, or even in the corpus-trained embeddings, how would you represent them? (One possible approach to this and to the next question is sketched after these questions.)

Given that question sentences can be of different lengths, how would you handle variable-length sentences?

On what basis would you decide how many hidden layers should be present in the MLP classifier, and the number of hidden units in each layer?
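As a hedged hint towards the first two questions above (certainly not the only answer, nor the column’s prescribed one), unknown words can be mapped to a shared, randomly initialised <UNK> vector, and questions can be padded or truncated to a fixed length before being fed to the LSTM. Names and sizes below are illustrative assumptions.

import numpy as np

EMBEDDING_DIM, MAX_LEN = 300, 30
rng = np.random.default_rng(0)
UNK_VECTOR = rng.normal(scale=0.1, size=EMBEDDING_DIM)  # shared vector for unknown words

def encode_question(sentence, embeddings, max_len=MAX_LEN):
    """Look up each word's embedding, falling back to the <UNK> vector, then
    pad/truncate the result to a fixed (max_len x EMBEDDING_DIM) matrix."""
    vectors = [embeddings.get(word, UNK_VECTOR) for word in sentence.lower().split()]
    vectors = vectors[:max_len]                                     # truncate long questions
    padding = [np.zeros(EMBEDDING_DIM)] * (max_len - len(vectors))  # zero-pad short ones
    return np.vstack(vectors + padding)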

I suggest that our readers (specifically those who have just started exploring ML and NLP) try implementing the solution and share the results in a Python Jupyter notebook. Please do send me a pointer to your notebook, and we can discuss it later in this column.

If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Wishing all our readers a very happy and prosperous new year!

Figure 1: Block diagram for duplicate question detection system
