OpenSource For You

CODE SPORT

In this month’s column, we discuss a real-life NLP problem: detecting duplicate questions in community question-answering forums.

While we have been discussing many questions in machine learning (ML) and natural language processing (NLP), I have had a number of requests from readers to take up a real-life ML/NLP problem with a sufficiently large data set, discuss the issues related to this specific problem and then go into designing a solution. I think it is a very good suggestion. Hence, over the next few columns, we will focus on one specific real-life NLP problem: detecting duplicate questions in community question-answering (CQA) forums.

There are a number of popular CQA forums such as Yahoo Answers, Quora and StackExchange, where netizens post their questions and get answers from domain experts. CQA forums serve as a common means of distilling crowd intelligence and sharing it with millions of people. From a developer's perspective, sites such as StackOverflow fill an important need by providing guidance and help across the world, 24x7. Given the enormous number of people who use such forums, and their varied skill levels, many questions get asked again and again.

Since many users have similar informational needs, answers to new questions can typically be found, in whole or in part, in the existing question-answer archive of these forums. Hence, given a new incoming question, these forums typically display a list of similar or related questions, which could immediately satisfy the information needs of users without them having to wait for their new question to be answered by other users. Many of these forums use simple keyword/tag based techniques for detecting duplicate questions.
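A keyword-based lookup of this kind can be sketched as follows. This is only an illustration under stated assumptions, not the method used by any particular forum: the stop-word list, the sample archive and the helper names (`keywords`, `jaccard`, `related_questions`) are all hypothetical, and Jaccard overlap is just one simple way of scoring shared keywords.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard resource).
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "how", "do", "i",
              "what", "of", "for", "my", "can"}


def keywords(text):
    """Lower-case the text and return its non-stop-word tokens as a set."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def jaccard(a, b):
    """Jaccard overlap of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related_questions(new_question, archive, top_k=3):
    """Return up to top_k (score, question) pairs that share keywords
    with new_question, highest score first."""
    new_kw = keywords(new_question)
    scored = [(jaccard(new_kw, keywords(q)), q) for q in archive]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical archive of previously asked questions.
archive = [
    "How do I reverse a list in Python?",
    "What is the fastest way to sort a list in Python?",
    "How can I install Python on Ubuntu?",
]

results = related_questions("How to reverse a Python list?", archive)
for score, question in results:
    print(f"{score:.2f}  {question}")
```

Every archived question that mentions "Python" is returned here, although only the first is a true duplicate. Keyword overlap finds related questions easily, but it cannot tell whether the same answer would serve both.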

However, the automated lists returned by the forums are often inaccurate, frustrating users looking for answers. Given the challenges in identifying duplicate questions, some forums put in manual effort to tag duplicate questions. This is not scalable, given the rate at which new questions get generated and the need for specific domain expertise to tag a question as a duplicate. Hence, there is a strong requirement for automated techniques that can help in identifying questions that are duplicates of an incoming question.

Note that identifying duplicate questions is different from identifying ‘similar/related’ questions. Identifying similar questions is somewhat easier, as it only requires considerable similarity between the two questions in a pair. Two questions are duplicates, on the other hand, only if the answer to one question can serve as the answer to the other. Identifying duplicates therefore requires stricter and more rigorous analysis.

At first glance, it appears that we can use various text similarity measures in NLP to identify duplicate questions. However, given that people express their information needs in widely different forms, it is a big challenge to identify exact duplicate questions automatically. For example, let us consider the following two questions:

Q1: I am interested in trying out local cuisine. Can you please recommend some local cuisine restaurants that are wallet-friendly in Paris?

Q2: I like to try local cuisine whenever I travel. I would like some recommendations for restaurants which are not too costly, but serve authentic local cuisine in Athens?
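The two questions ask about different cities, so the answer to one cannot serve as the answer to the other. As a rough illustration of how a simple lexical measure sees this pair, the sketch below computes a bag-of-words cosine similarity between Q1 and Q2, then repeats the comparison after replacing ‘Athens’ with ‘Paris’ in Q2. The tokenisation and the helper names (`bag_of_words`, `cosine`) are assumptions made for this example. It covers only lexical similarity, not the syntactic or semantic measures.

```python
import math
import re
from collections import Counter

Q1 = ("I am interested in trying out local cuisine. Can you please recommend "
      "some local cuisine restaurants that are wallet-friendly in Paris?")
Q2 = ("I like to try local cuisine whenever I travel. I would like some "
      "recommendations for restaurants which are not too costly, but serve "
      "authentic local cuisine in Athens?")


def bag_of_words(text):
    """Count lower-cased alphabetic tokens in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Original pair: the questions are about different cities.
score_different_city = cosine(bag_of_words(Q1), bag_of_words(Q2))

# Hypothetical variant: Q2 asks about Paris, making it a true duplicate of Q1.
q2_same_city = Q2.replace("Athens", "Paris")
score_same_city = cosine(bag_of_words(Q1), bag_of_words(q2_same_city))

gap = score_same_city - score_different_city
print(f"different cities: {score_different_city:.3f}")
print(f"same city:        {score_same_city:.3f}")
print(f"gap:              {gap:.3f}")
```

The score moves only slightly when the city changes, because the measure treats the place name as one token among many. A human reader sees at once that the city decides whether the questions are duplicates.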

Now consider applying different forms of text similarity measures. The above two questions score very high on various similarity measures: lexical, syntactic and semantic similarity. While it is quite easy for humans to focus on the one dissimilarity, which is that the locations discussed in the two questions are different, it is not easy to teach machines that ‘some dissimilarities are more important than other dissimilarities.’ It also raises the question of whether the two words ‘Paris’ and ‘Athens’ would be considered as extremely
