In this month’s column, we continue our discussion on detecting duplicate questions in community question-answering forums.
Last month’s column introduced the problem of detecting duplicate questions in community question answering (CQA) forums, using Quora’s question pair data set. Let’s continue exploring that topic.
Given a pair of questions <Q1, Q2>, the task is to identify whether Q2 is a duplicate of Q1. Our system for duplicate detection first needs to create a representation for each input sentence, and then feed the representations for each of the two questions to a classifier which will decide whether they are duplicates or not by comparing the representations.
In this month’s column, I will provide some of the skeleton code functions which can help to implement this solution. I have deliberately not provided the complete code for the problem, as I would like readers to build their own solutions and become familiar with creating simple neural network models from scratch.
As discussed in last month’s column, while there are multiple methods of creating a sentence representation, we can simply concatenate the word embeddings of the individual words in the question sentence to create a question embedding representation. We can either learn word embeddings specific to the task of duplicate question detection (provided our corpus is large and general enough), or we can use pre-trained word embeddings such as Word2vec.
In our example, we will use Word2vec embeddings and concatenate the embeddings of individual words to represent the question sentence. Assuming that we use Word2vec embeddings of dimension D, each input question can be represented by an N x D matrix, where N is the number of words in the question and D is the embedding size. We need to feed the input representation for each of the two questions we are comparing to the neural network model.
Before we look at the actual code for the neural network model, we need to decide which deep learning framework we will use to implement this model. There are a number of choices such as Theano, MxNet, Keras, CNTK, Torch, PyTorch, Caffe, TensorFlow, etc. A brief comparison of some of the popular deep learning frameworks is covered in the article at https://deeplearning4j.org/comparedl4jtensorflowpytorch.
In selecting a deep learning framework for a project, one needs to consider ease of programming, maintainability of code and long-term support for the framework. Some of the frameworks, such as Theano, are from academic groups; hence their support may be time-limited. For this project, we decided to go with TensorFlow, given its widespread adoption in the industry (it is sponsored by Google) and its ease of use. We will assume that our readers are familiar with TensorFlow (a quick introduction to it can be found at https://www.tensorflow.org/get_started/get_started).
Let us assume that we have a binary file which contains the word-to-embedding mapping for the Word2vec embeddings (pretrained word embeddings are available from either https://code.google.com/archive/p/word2vec/ if you want to use the Word2vec model or https://nlp.stanford.edu/projects/glove/ if you want to use the GloVe vectors).
First, let’s read our training corpus and build a vocabulary list that contains all the words in our training corpus. Next, let’s build a map which maps each word to a valid Word2vec embedding. Shown below is the skeleton code for this:
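import numpy as np

w2v_wordmap = {}  # maps each vocabulary word to its embedding vector

def create_word_map(w2v_file, vocab_list):
    # assumes the file stores an array under the key 'glove' whose rows
    # are in the same order as the (distinct) words in vocab_list
    data = np.load(w2v_file)
    glove_array = data['glove']
    for idx, word in enumerate(vocab_list):
        w2v_wordmap[word] = glove_array[idx]
    return w2v_wordmap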
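The vocabulary list itself is not shown in the skeleton. A minimal sketch of how it could be built is given below; the function name build_vocab, the min_count parameter and the whitespace tokenisation are assumptions made for illustration, not part of the original code.

```python
from collections import Counter

def build_vocab(questions, min_count=1):
    """Return a sorted list of the distinct lowercase tokens in `questions`."""
    counts = Counter()
    for question in questions:
        # Whitespace tokenisation is a simplification for this sketch.
        counts.update(question.lower().split())
    # Keep only words seen at least `min_count` times.
    return sorted(word for word, count in counts.items() if count >= min_count)

vocab_list = build_vocab(["How do I learn Python", "How can I learn Python fast"])
frequent_words = build_vocab(
    ["How do I learn Python", "How can I learn Python fast"], min_count=2)
```

Raising min_count is one simple way to keep very rare words out of the vocabulary; such words would then need to be mapped to an unknown-word token before the embedding lookup.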
Next, we will create the sentence matrix associated with each question, as follows:
def question2sequence(ques):
    tokens = get_tokens(ques)
    rows = []  # represents the sentence embedding matrix
    # Greedy search for tokens
    for token in tokens:
        assert token in w2v_wordmap, token + " not found in w2v_map"
        rows.append(w2v_wordmap[token])
    if len(tokens) < MAX_QN_LEN:
        # question is too short;
        # we need to pad the question up to max sequence length
        j = MAX_QN_LEN - len(tokens)
        word = _UNK
        while j > 0:
            rows.append(w2v_wordmap[word])
            j = j - 1
    return rows
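The skeleton above pads short questions but does not show what happens to questions longer than MAX_QN_LEN. One common option is to truncate them. The sketch below illustrates padding and truncation together; the helper name pad_or_truncate, the toy embeddings and the use of a zero vector for padding (the skeleton pads with the _UNK embedding instead) are all assumptions.

```python
def pad_or_truncate(rows, max_len, pad_vector):
    """Return exactly `max_len` rows: cut long inputs, pad short ones."""
    rows = list(rows[:max_len])                          # drop rows beyond max_len
    rows.extend([pad_vector] * (max_len - len(rows)))    # adds nothing if already full
    return rows

D = 3  # toy embedding size
toy_map = {"how": [0.1, 0.2, 0.3], "are": [0.4, 0.5, 0.6], "you": [0.7, 0.8, 0.9]}
pad = [0.0] * D

short = pad_or_truncate([toy_map[w] for w in ["how", "are"]], 4, pad)
long_ = pad_or_truncate([toy_map[w] for w in ["how", "are", "you"]], 2, pad)
```

Either way, every question ends up as a matrix with the same number of rows, which is what allows a batch of questions to be fed to the network as a single tensor.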
Now, let’s convert this sentence matrix into a fixed-length vector representation of the question. Given a sequence of input words (this constitutes the question sentence), we pass this sequence through a recurrent neural network (RNN) and create an output sequence. We can use vanilla RNNs, gated recurrent units (GRU) or long short-term memory (LSTM) units to create a fixed-length representation from a given input sequence. Given that LSTMs have been used quite successfully in many NLP tasks, we decided to use them to create the fixed-length representation of the question.
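import tensorflow as tf  # TensorFlow 1.x API

# here N is the batch size, l_q1 the number of words (time steps)
# in question 1, and D the embedding size
question1 = tf.placeholder(tf.float32, [N, l_q1, D], 'question1')
lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_lstm_units)
value, state = tf.nn.dynamic_rnn(lstm_cell, question1, dtype=tf.float32)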
While an RNN generates an output for each input in the sequence, we are only interested in the final aggregated representation of the input sequence. Hence, we take the output of the LSTM at the last time step and use it as our sentence representation. Note that the last time step corresponds to the last word in the sentence being fed to the LSTM. The LSTM output at that step is therefore an aggregated representation of the current word and all the words that come before it; in other words, it represents the complete sentence.
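# value has shape [batch, time, units]; move the time axis to the front
value1 = tf.transpose(value, [1, 0, 2])
# select the output at the last time step
lstm_output = tf.gather(value1, int(value1.get_shape()[0]) - 1)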
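To see why the output at the last time step can stand for the whole sentence, consider the toy sketch below. It uses a plain tanh RNN written in pure Python rather than an LSTM, and the function name rnn_last_state, the weights and the input vectors are all made up for illustration. Changing only the first word changes the final state, because each state is computed from the current word and the previous state.

```python
import math
import random

def rnn_last_state(inputs, w_in, w_rec):
    """Run a tanh RNN over `inputs` and return the final hidden state."""
    hidden_size = len(w_rec)
    h = [0.0] * hidden_size
    for x in inputs:  # one time step per word vector
        # The new state mixes the current word with the previous state.
        h = [
            math.tanh(
                sum(w_in[i][k] * x[k] for k in range(len(x)))
                + sum(w_rec[i][j] * h[j] for j in range(hidden_size))
            )
            for i in range(hidden_size)
        ]
    return h

random.seed(0)
D, H = 3, 4  # toy embedding size and hidden size
w_in = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(H)]
w_rec = [[random.uniform(-1, 1) for _ in range(H)] for _ in range(H)]

sentence = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
altered = [[0.9, 0.1, 0.5]] + sentence[1:]  # change only the first word

final_a = rnn_last_state(sentence, w_in, w_rec)
final_b = rnn_last_state(altered, w_in, w_rec)
```

Both final states have length H regardless of how many words were fed in, which is the fixed-length property the classifier relies on.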
Just as we obtained a fixed-length representation for Question 1, we also create a fixed-length representation for Question 2 using a second LSTM. The last time-step output from each of the two LSTMs (one LSTM for each of the two questions) represents the corresponding input question. We can then concatenate these two representations and feed the result to the multilayer perceptron (MLP) classifier.
An MLP classifier is nothing but a fully connected multilayer feed forward neural network. Given that we have a two-class prediction problem, the last stage of the MLP classifier is a two-unit softmax, whose output gives the probabilities for each of the two output classes. Here is the skeletal code for an MLP classifier with three densely connected feed forward layers, with 256, 128 and two units respectively:
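# the unit counts follow the text (256, 128, 2); the activation and dropout
# lines are assumptions, since the listing is truncated at those points
predict_layer_one_out = tf.layers.dense(lstm_output, units=256,
    activation=tf.nn.relu,
    name="prediction_layer_one")
dropout_predict_layer_one_out = tf.layers.dropout(predict_layer_one_out)
predict_layer_two_out = tf.layers.dense(dropout_predict_layer_one_out, units=128,
    activation=tf.nn.relu,
    name="prediction_layer_two")
predict_layer_logits = tf.layers.dense(predict_layer_two_out, units=2)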
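The two-unit softmax mentioned above turns the two raw scores (logits) from the last layer into probabilities. A minimal pure-Python sketch follows; the logit values and the ordering of the two classes are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits, assumed here to be ordered (not duplicate, duplicate).
probs = softmax([2.0, 0.5])
even = softmax([0.0, 0.0])
```

The larger logit always receives the larger probability, and equal logits give a 50:50 split.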
Here are a couple of questions for our readers to think about:
How do you decide on the number of hidden layers and the number of units in each hidden layer for your MLP classifier?
In our problem, we are doing binary classification, as we need to predict whether a question is a duplicate or not. How would this code change if you had to predict one out of K different classes (for example, if you are trying to predict which category a particular question may belong to)?
The output of the last layer is used as input to a TensorFlow loss computation node, which computes the cross-entropy loss against the ground truth labels, as shown below:
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=predict_layer_logits, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)
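To make the loss concrete, here is a small pure-Python sketch of softmax cross-entropy for one-hot labels, averaged over a batch. It is only an illustration of the quantity being minimised, not TensorFlow's implementation; the function names and toy values are assumptions.

```python
import math

def cross_entropy(logits, one_hot_label):
    """Softmax cross-entropy for one example, computed via log-sum-exp."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    log_probs = [v - log_sum_exp for v in logits]  # log of the softmax output
    return -sum(y * lp for y, lp in zip(one_hot_label, log_probs))

def mean_cross_entropy(batch_logits, batch_labels):
    """Average the per-example losses over a batch."""
    losses = [cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)

confident_right = cross_entropy([4.0, -4.0], [1, 0])
confident_wrong = cross_entropy([4.0, -4.0], [0, 1])
batch_loss = mean_cross_entropy([[4.0, -4.0], [4.0, -4.0]], [[1, 0], [0, 1]])
```

A confident correct prediction gives a loss close to zero, while a confident wrong prediction is penalised heavily, which is what pushes the weights towards the ground truth labels.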
The network is then trained with ground truth labels during the training phase to select network weights such that cross-entropy loss is minimised.
We also need to add code that can compute the accuracy during the training phase.
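As a starting point, accuracy is simply the fraction of examples for which the highest-scoring class matches the label. The sketch below works on plain Python lists; the function name accuracy and the toy logits and one-hot labels are assumptions for illustration.

```python
def accuracy(batch_logits, batch_labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    correct = 0
    for logits, one_hot in zip(batch_logits, batch_labels):
        predicted = max(range(len(logits)), key=lambda i: logits[i])
        actual = max(range(len(one_hot)), key=lambda i: one_hot[i])
        correct += int(predicted == actual)
    return correct / len(batch_logits)

# Three toy examples; the third one is misclassified.
acc = accuracy([[2.0, 0.1], [0.3, 1.5], [1.2, 0.9]],
               [[1, 0], [0, 1], [0, 1]])
```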
As we had discussed in earlier columns on neural networks, gradient descent techniques are typically used to learn the network parameters/weights. Typically, mini-batch gradient descent, in which each weight update is computed from a small batch of training samples, is used during the training process. Here are a couple of questions for our readers:
Why do we prefer mini-batch gradient descent over full-batch gradient descent or stochastic gradient descent?
How do you choose a good batch size for your implementation?
Once we have built the neural network model, we can train our model with labelled examples. Remember that each training loop consists of going over all the training samples once. This is typically known as an ‘epoch’. Each epoch consists of several batch-sized runs, wherein at the end of each batch, the gradients computed are used to update the network weights/parameters. In order to ensure that the network is learning correctly, we need to measure the total loss at the end of each epoch and verify that the total loss is decreasing at the end of each successive epoch.
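The epoch and batch structure described above can be sketched on a much smaller problem. The example below trains a one-feature logistic regression with mini-batch gradient descent in pure Python and records the total loss per epoch. The model, data, learning rate and batch size are toy assumptions; only the shape of the loop carries over to the duplicate-detection network.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=20, batch_size=4, lr=0.5, seed=0):
    """Mini-batch gradient descent on a one-feature logistic regression."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    epoch_losses = []
    indices = list(range(len(samples)))
    for _ in range(epochs):  # one epoch = one pass over all samples
        rng.shuffle(indices)
        total_loss = 0.0
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                p = sigmoid(w * samples[i] + b)
                total_loss += -(labels[i] * math.log(p)
                                + (1 - labels[i]) * math.log(1.0 - p))
                grad_w += (p - labels[i]) * samples[i]
                grad_b += p - labels[i]
            # Update the weights once per batch, using the averaged gradient.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
        epoch_losses.append(total_loss)  # loss accumulated over this epoch
    return w, b, epoch_losses

xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b, losses = train(xs, ys)
```

Printing or plotting losses after training is the check described above: if the values do not trend downwards from one epoch to the next, something is wrong with the model, the data pipeline or the learning rate.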
Now we need to decide when we should stop the training process. One simple but naïve way of stopping the training process is after a fixed number of epochs. Another option is to stop training after we have reached a training accuracy of 100 per cent.
Here is a question for our readers: While we can decide to stop training after some fixed number of epochs or after a training accuracy of 100 per cent, what are the disadvantages associated with each of these approaches? We will discuss more on this topic as well as on the inference phase in next month’s column.
If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Wishing all our readers happy coding until next month!