In this month’s column, we continue our discussion on detecting duplicate questions in community question answering forums.
Based on our readers’ requests to take up a real-life ML/NLP problem with a sufficiently large data set, we started on the problem of detecting duplicate questions in community question answering (CQA) forums, using the Quora Question Pairs dataset.
Let’s first define our task as follows: given a pair of questions <Q1, Q2>, the task is to identify whether Q2 is a duplicate of Q1, in the sense that the informational need expressed in Q2 is the same as that expressed in Q1. In simpler terms, Q1 and Q2 are duplicates from a lay person’s perspective if both are asking the same thing in different surface forms.
An alternative definition is to consider Q1 and Q2 duplicates if the answer to Q1 also answers Q2. However, we will not use this second definition, since we are concerned only with analysing the informational needs expressed in the questions themselves and have no access to answer text. Therefore, let’s define our task as a binary classification problem, where one of two labels (duplicate or non-duplicate) needs to be predicted for each given question pair, with the restriction that only the question text is available for the task and not the answer text.
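To make the binary classification framing concrete, here is a small sketch of what the labelled data looks like. The hand-made rows below mimic the shape of the Quora Question Pairs release (question pairs plus an is_duplicate label); the helper function and exact row contents are illustrative, not part of the actual dataset.

```python
# Illustrative rows in the shape of the Quora Question Pairs data:
# two question texts plus a binary is_duplicate label.
pairs = [
    {"qid1": 1, "qid2": 2,
     "question1": "What are the ways of investing in the share market?",
     "question2": "What are the ways of investing in the share market in India?",
     "is_duplicate": 0},
    {"qid1": 3, "qid2": 4,
     "question1": "How old was Becker when he won the first time at Wimbledon?",
     "question2": "What was Becker's age when he was crowned as the youngest winner at Wimbledon?",
     "is_duplicate": 1},
]

def label_counts(rows):
    """Count duplicate vs non-duplicate pairs in a list of rows."""
    dup = sum(r["is_duplicate"] for r in rows)
    return {"duplicate": dup, "non_duplicate": len(rows) - dup}

print(label_counts(pairs))  # {'duplicate': 1, 'non_duplicate': 1}
```

A classifier for this task consumes the two question strings and predicts the label; everything that follows in this column is about how to build that classifier.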
As I pointed out in last month’s column, a number of NLP problems are closely related to duplicate question detection, and the general consensus is that duplicate question detection can be solved as a by-product of the techniques developed for them. Detecting semantic text similarity and recognising textual entailment are the closest in nature to duplicate question detection. However, given that the goal of each of these problems is distinct from that of duplicate question detection, they fail to solve the latter adequately. Let me illustrate this with a few example question pairs.
Example 1:
Q1a: What are the ways of investing in the share market?
Q1b: What are the ways of investing in the share market in India?
One of the state-of-the-art tools available online for detecting semantic text similarity is SEMILAR (http://www.semanticsimilarity.org/). A freely available state-of-the-art tool for entailment recognition is the Excitement Open Platform or EOP (http://hltservices4.fbk.eu/eop/index.php). SEMILAR gave a semantic similarity score of 0.95 for the above pair, and EOP reported it as textual entailment. However, these two questions have different information needs (Q1b restricts the question to India) and hence they are not duplicates of each other.
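It is easy to see why coarse-grained similarity measures are fooled by this pair. The sketch below uses plain token-level Jaccard similarity (a much simpler measure than what SEMILAR computes, used here only to illustrate the point): the two questions share nine of their ten distinct tokens, so any surface-overlap measure scores them as near-identical even though "in India" changes the information need.

```python
def jaccard(q1, q2):
    """Token-level Jaccard similarity: |intersection| / |union|."""
    s1, s2 = set(q1.lower().split()), set(q2.lower().split())
    return len(s1 & s2) / len(s1 | s2)

q1a = "what are the ways of investing in the share market"
q1b = "what are the ways of investing in the share market in india"

score = jaccard(q1a, q1b)
print(score)  # 0.9 -- high overlap, yet the pair is NOT a duplicate
```

The single extra token "india" barely dents the score, which is exactly why a system must look beyond aggregate similarity.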
Example 2:
Q2a: In which year did McEnroe beat Becker, who went on to become the youngest winner of the Wimbledon finals?
Q2b: In which year did Becker beat McEnroe and go on to become the youngest winner in the finals at Wimbledon?
SEMILAR reported a similarity score of 0.972, and EOP marked this question pair as entailment, indicating that Q2b is entailed by Q2a. Again, these two questions are about two entirely different events, and hence are not duplicates. We hypothesise that humans are quick to see the difference by extracting the relations being sought in the two questions. In Q2a, the relational event is <McEnroe (subject), beat (predicate), Becker (object)>, whereas in Q2b the relational event is <Becker (subject), beat (predicate), McEnroe (object)>, which is a different relation from that in Q2a. By quickly scanning for a relational match/mismatch at the cross-sentence level, humans quickly mark this pair as non-duplicate,
even though there is considerable textual similarity across the text pair. It is also possible that the entailment system gets confused because a sub-clause is entailed across the two questions (namely, the clause “Becker went on to become the youngest winner”). This lends weight to our claim that while semantic similarity matching and textual entailment are problems closely related to the duplicate question detection task, they cannot be used directly as solutions for the duplicate detection problem.
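The relation-extraction intuition can be made concrete. A real system would use a dependency parser (spaCy, Stanford CoreNLP, etc.) to pull out subject-predicate-object triples; the toy heuristic below is purely illustrative, assuming the predicate is known and that the nearest capitalised tokens around it are the subject and object. On the Becker/McEnroe pair it is enough to expose the relation mismatch.

```python
def toy_svo(question, predicate="beat"):
    """Toy subject-predicate-object extraction: take the nearest
    capitalised token before the predicate as subject and the first
    capitalised token after it as object. Illustration only; real
    systems should use a dependency parser instead."""
    tokens = question.replace(",", "").split()
    i = tokens.index(predicate)
    subj = next(t for t in reversed(tokens[:i]) if t[0].isupper())
    obj = next(t for t in tokens[i + 1:] if t[0].isupper())
    return (subj, predicate, obj)

q2a = ("In which year did McEnroe beat Becker, who went on to become "
       "the youngest winner of the Wimbledon finals?")
q2b = ("In which year did Becker beat McEnroe and go on to become "
       "the youngest winner in the finals at Wimbledon?")

print(toy_svo(q2a))  # ('McEnroe', 'beat', 'Becker')
print(toy_svo(q2b))  # ('Becker', 'beat', 'McEnroe')
```

The extracted triples disagree on who beat whom, giving a strong non-duplicate signal despite the high surface similarity of the pair.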
There are subtle but important differences in the relations between entities, and in cross-sentence word level interactions between two sentences, which mark them as non-duplicates when examined by humans. We can hypothesise that humans apply these additional checks on top of the coarse-grained similarity comparison they perform in their minds when they look at these questions in isolation, and then arrive at the decision of whether they are duplicates or not. If we consider the example we discussed in Q2a and Q2b, the relation between the entities in Q2a does not hold in Q2b and, hence, if these cross-sentence level semantic relations are checked, it is possible to determine that this pair is not a duplicate. It is also important to note that not all mismatches are equally important. Let us consider another example.
Example 3:
Q3a: Do omega-3 fatty acids, normally available as fish oil supplements, help prevent cancer?
Q3b: Do omega-3 fatty acids help prevent cancer?
Though Q3b does not mention the fact that omega-3 fatty acids are typically available as fish oil supplements, its information needs are satisfied by the answer to Q3a, and hence these two questions are duplicates. From a human perspective, we hypothesise that the word fragment “normally available as fish oil supplements” is not seen as essential to the overall compositional meaning of Q3a; so we can quickly discard this information while refining the overall representation of the first question during a pass over the second question. We can also hypothesise that humans use cross-sentence word level interactions to quickly check whether similar information needs are being met in the two questions.
Example 4:
Q4a: How old was Becker when he won the first time at Wimbledon?
Q4b: What was Becker’s age when he was crowned as the youngest winner at Wimbledon?
Though the surface forms of the two questions are quite dissimilar, humans tend to compare cross-sentence word level interactions such as <old, age> and <won, crowned>, in the context of the entity in question, namely Becker, to conclude that these two questions are duplicates. Hence, any system that attempts to solve the task of duplicate question detection should not depend blindly on a single aggregated coarse-grained similarity measure to compare the sentences, but should instead consider the following:
Do relations that exist in the first question hold true for the second question?
Are there word level interactions across the two questions which cause them to have different informational needs (even if the rest of the question is pretty much identical across the two sentences)?
Now that we have a good idea of the requirements for a reasonable duplicate question detection system, let’s look at how we can start implementing this solution. For the sake of simplicity, let us assume that our data set consists of single sentence questions. Our system for duplicate detection first needs to create a representation for each input sentence and then feed the representations for each of the two questions to a classifier, which will decide whether they are duplicates or not, by comparing the representations. The high-level block diagram of such a system is shown in Figure 1.
First, we need to create an input representation for each question sentence. We have a number of choices for this module. As is common in most neural network based approaches, we use word embeddings to create a sentence representation. We can either use pre-trained word embeddings such as Word2Vec or GloVe embeddings, or we can train our own word embeddings using the training data as our corpus. For each word in a sentence, we look up its corresponding word embedding vector and form the sentence matrix. Thus, each question (sentence) is represented by its sentence matrix, whose rows correspond to the words of the sentence, each row being the word-embedding vector for that word. We now need to convert this sentence-embedding matrix into a fixed length input representation vector.
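The embedding lookup step can be sketched in a few lines of NumPy. The tiny two-dimensional embedding table below is made up purely for illustration (real Word2Vec/GloVe vectors have 100-300 dimensions); unknown words are mapped to a zero vector, which is one common fallback.

```python
import numpy as np

# Toy embedding table; vectors are invented for illustration only.
emb = {
    "how":    np.array([0.1, 0.2]),
    "old":    np.array([0.3, 0.1]),
    "was":    np.array([0.0, 0.5]),
    "becker": np.array([0.7, 0.9]),
}
unk = np.zeros(2)  # one common choice for out-of-vocabulary words

def sentence_matrix(sentence):
    """Stack per-word embedding vectors into an (n_words, dim) matrix."""
    return np.stack([emb.get(w, unk) for w in sentence.lower().split()])

m = sentence_matrix("How old was Becker")
print(m.shape)  # (4, 2): one row per word, one column per embedding dim
```

This matrix is what we feed, row by row, into the recurrent encoder described next.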
One of the popular ways of representing an input sentence is to encode it with a recurrent neural network. Given the sequence of input words constituting the sentence, we pass this sequence through a recurrent neural network (RNN) to produce an output sequence. While the RNN generates an output for each input in the sequence, we are only interested in the final aggregated representation of the input sequence. Hence, we take the output of the last unit of the RNN and use it as our sentence representation. We can use vanilla RNNs, gated recurrent units (GRU), or long short-term memory (LSTM) units for creating a fixed length representation from a given input sequence. Given that LSTMs have been used quite successfully in many NLP tasks, we decided to use LSTMs to create the fixed length representation of the question.
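To show exactly what “take the output of the last unit” means, here is a vanilla RNN forward pass in NumPy (a simpler cell than the LSTM the column settles on, chosen so the sketch stays dependency-free; the weight shapes and sizes are arbitrary).

```python
import numpy as np

def rnn_last_state(x, W_xh, W_hh, b_h):
    """Run a vanilla RNN over the rows of x (one word vector per row)
    and return only the final hidden state as the sentence encoding:
    h_t = tanh(x_t @ W_xh + h_{t-1} @ W_hh + b_h)."""
    h = np.zeros(W_hh.shape[0])
    for x_t in x:                      # one step per word
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
    return h                           # last state only

rng = np.random.default_rng(0)
x    = rng.normal(size=(5, 4))   # sentence matrix: 5 words, 4-dim embeddings
W_xh = rng.normal(size=(4, 3))   # input-to-hidden weights
W_hh = rng.normal(size=(3, 3))   # hidden-to-hidden weights
b_h  = np.zeros(3)

enc = rnn_last_state(x, W_xh, W_hh, b_h)
print(enc.shape)  # (3,): a fixed-length vector, whatever the sentence length
```

An LSTM replaces the single tanh update with gated cell-state updates, but the interface is the same: variable-length sentence matrix in, fixed-length vector out.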
The last stage output from each of the two LSTMs (one LSTM for each of the two questions) serves as that question’s representation. We then feed the two representations to a multi-layer perceptron (MLP) classifier. An MLP classifier is simply a fully connected multi-layer feed-forward neural network. Given that we have a two-class prediction problem, the last stage of the MLP classifier is a two-unit softmax, whose output gives the probabilities for each of the two output classes. This is shown in the overall block diagram in Figure 1.
Now that we have discussed the overall structure of our implementation, I request our readers to implement it using a deep learning library of their choice. I would recommend using TensorFlow, PyTorch or Keras. We will discuss the TensorFlow code for this problem in next month’s column. Here are a few questions for our readers to consider in their implementation:
How would you handle ‘out of vocabulary’ words in the test data? That is, if there are words which have no embedding in Word2Vec/GloVe, or even in your corpus-trained embeddings, how would you represent them?
Given that question sentences can be of different lengths, how would you handle the variable length sentences?
On what basis would you decide how many hidden layers should be present in the MLP classifier, and the number of hidden units in each layer?
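As a hint for the first two questions above, one common (but by no means the only) answer is to map every word to an integer id, send out-of-vocabulary words to a shared UNK id, and pad or truncate every sentence to a fixed maximum length with a reserved PAD id. The vocabulary and length below are made up for illustration.

```python
PAD, UNK = 0, 1                       # reserved ids
vocab = {"how": 2, "old": 3, "was": 4, "becker": 5}  # toy vocabulary

def encode(sentence, max_len=6):
    """Map words to integer ids, sending out-of-vocabulary words to UNK,
    then pad (or truncate) to a fixed length so every input has the
    same shape."""
    ids = [vocab.get(w, UNK) for w in sentence.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))

print(encode("How old was Boris Becker"))  # [2, 3, 4, 1, 5, 0]
```

The RNN can then either run over the padded positions and ignore them via masking, or be given the true sequence lengths; most deep learning libraries support both.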
I suggest that our readers (specifically those who have just started exploring ML and NLP) try implementing the solution and share the results in a Python Jupyter notebook. Please do send me a pointer to your notebook, and we can discuss it later in this column.
If you have any favourite programming questions/ software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Wishing all our readers a very happy and prosperous new year!
Figure 1: Block diagram for duplicate question detection system