CODE SPORT
In this month’s column, we discuss a real-life NLP problem, namely, detecting duplicate questions in community question-answering forums.
While we have been discussing many questions in machine learning (ML) and natural language processing (NLP), I have had a number of requests from readers to take up a real-life ML/NLP problem with a sufficiently large data set, discuss the issues specific to that problem and then design a solution. I think it is a very good suggestion. Hence, over the next few columns, we will focus on one specific real-life NLP problem: detecting duplicate questions in community question-answering (CQA) forums.
There are a number of popular CQA forums such as Yahoo Answers, Quora and StackExchange where netizens post their questions and get answers from domain experts. CQA forums serve as a common means of distilling crowd intelligence and sharing it with millions of people. From a developer perspective, sites such as StackOverflow fill an important need by providing guidance and help across the world, 24x7. Given the enormous number of people who use such forums, and their varied skill levels, many questions get asked again and again.
Since many users have similar informational needs, answers to new questions can typically be found, in whole or in part, in the existing question-answer archive of these forums. Hence, given a new incoming question, these forums typically display a list of similar or related questions, which could immediately satisfy the user's information needs without the user having to wait for the new question to be answered by others. Many of these forums use simple keyword- or tag-based techniques for detecting duplicate questions.
Often, however, the lists these forums generate automatically are not accurate, which frustrates users looking for answers. Given the challenges in identifying duplicate questions, some forums put in manual effort to tag them. This approach does not scale, given the rate at which new questions are posted and the domain expertise needed to judge whether a question is a duplicate. Hence, there is a strong need for automated techniques that can identify the questions that are duplicates of an incoming question.
Note that identifying duplicate questions is different from identifying ‘similar/related’ questions. Identifying similar questions is somewhat easier, as it only requires considerable similarity between the two questions in a pair. Two questions are duplicates, on the other hand, only if the answer to one can serve as the answer to the other. Identifying duplicates therefore requires stricter and more rigorous analysis.
At first glance, it appears that we can use the various text similarity measures in NLP to identify duplicate questions. However, given that people express their information needs in widely different forms, identifying exact duplicate questions automatically is a big challenge. For example, let us consider the following two questions:
Q1: I am interested in trying out local cuisine. Can you please recommend some local cuisine restaurants that are wallet-friendly in Paris?
Q2: I like to try local cuisine whenever I travel. I would like some recommendations for restaurants which are not too costly, but serve authentic local cuisine in Athens?
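To see how little a surface-level measure is moved by the one detail that matters here, consider a minimal sketch that takes Q1 and changes only the city name. The tokeniser, the Jaccard word-overlap score and the variable q1_athens are assumptions made purely for illustration; they are not the measures any particular forum uses.

```python
import re


def tokens(text):
    """Lower-case the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of the word sets of two texts (0.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


q1 = ("I am interested in trying out local cuisine. Can you please recommend "
      "some local cuisine restaurants that are wallet-friendly in Paris?")

# Hypothetical variant of Q1: identical wording, different city.
q1_athens = q1.replace("Paris", "Athens")

# Words that appear in only one of the two questions.
print(sorted(tokens(q1) ^ tokens(q1_athens)))

# Word-overlap score for two questions that are clearly not duplicates.
print(jaccard(q1, q1_athens))
```

The two questions differ in a single word, so the overlap score stays close to 1 even though an answer about Paris is of no use to someone asking about Athens. A measure of this kind weighs every word equally and has no notion that the location carries most of the meaning.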
Now consider applying different forms of text similarity measures. The above two questions score very high on various similarity measures: lexical, syntactic and semantic. While it is quite easy for humans to focus on the one dissimilarity, which is that the two questions are about different locations, it is not easy to teach machines that ‘some dissimilarities are more important than other dissimilarities.’ It also raises the question of whether the two words ‘Paris’ and ‘Athens’ would be considered as extremely