Open Source For You

CodeSport

In this month’s column, we discuss a few computer science interview questions.


This month, let’s discuss a set of computer science interview questions, focusing on natural language processing and machine learning, as well as on traditional programming problems. Please note that it is difficult to discuss complete solutions to the questions, given the number of questions we cover in this column. Instead, I encourage you to send me your solutions to the questions posed, directly, so that I can provide feedback on them.

1. You are given the problem of disambiguating entities in a given set of documents. For example, for the word ‘Apple’, you will need to disambiguate between its use as a corporate entity and its use as a word that describes a fruit. A more complicated example is the following:

S1: The White House announced its intention to support the newly formed government in Vanaru.
S2: The New Year Party will be held at the White House, and will be hosted by Michelle Obama.

In sentence S1, the White House is an entity that represents the American government. In sentence S2, the White House is a location. Can you come up with an algorithm that can disambiguate such mentions of this entity to their appropriate types, based on the context?
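As a starting point for question 1, here is a minimal supervised sketch that classifies an entity mention by the bag of words surrounding it. The training sentences, labels and variable names below are purely illustrative; a real system would need far more labelled data and richer features than a plain bag of words.

# Minimal sketch of context-based entity-type disambiguation.
# The labelled examples and variable names here are toy placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_contexts = [
    "The White House announced its intention to support the government",
    "The party will be held at the White House on Saturday",
]
train_labels = ["ORGANIZATION", "LOCATION"]

# A bag of words over the sentence containing the mention serves as the
# context signal for the classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_contexts, train_labels)

print(model.predict(["The White House issued a statement on the new policy"]))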

2. In information retrieval, you are given a set of documents and a query, and you are asked to retrieve the set of documents most relevant to that query. Many of the common information retrieval techniques are built around the measure of TF-IDF, where TF stands for term frequency, i.e., the number of times a query keyword appears in a document. For example, given the query ‘Tendulkar the Indian Cricketer’, I am asked to find all documents relevant to this query. It is obvious that one can compute the term frequency for the query terms, and return a ranked list of the documents in which the query terms appear most frequently. If this suffices to return the most relevant documents, why do we need the measure of the inverse document frequency (IDF)? Can you illustrate with an example why IDF is needed in information retrieval? (A small numerical sketch follows question 4 below.)

3. Parsing is one of the major components of a natural language processing pipeline. Depending on the nature of the task, different types of parsing of the input text may be required. Can you explain the difference between shallow parsing, dependency parsing and constituency parsing? Given the sentence, “John went to New York yesterday by the midnight flight,” can you provide the output of each of these parses? On a related note, when would you use POS tagging instead of parsing?

4. You have been asked to build an application that analyses the conversations in call centres. These conversations typically happen between the agent and the customer. An automatic speech recognition (ASR) system has been used to convert the voice conversations into text. The first stage of your application needs to analyse the voice transcripts, and correct any mistakes in the words introduced by the speech recognition system. Can you explain how you would build such a transcript correction system?
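Returning to question 2, here is a minimal sketch on a toy corpus of why IDF matters: the query term ‘the’ occurs in almost every document, so term frequency alone rewards documents for a word that carries no information, while the IDF weight suppresses it. The documents and query are illustrative.

# Why TF alone is not enough (question 2): score documents by TF * IDF
# and note how the uninformative query term 'the' is down-weighted.
import math
from collections import Counter

docs = [
    "tendulkar is an indian cricketer",
    "the indian team won the match",
    "the weather in london is cold",
]
query = ["tendulkar", "the", "indian", "cricketer"]

N = len(docs)
# Document frequency: in how many documents does each query term occur?
df = {t: sum(t in d.split() for d in docs) for t in query}
idf = {t: math.log(N / df[t]) if df[t] else 0.0 for t in query}

for d in docs:
    tf = Counter(d.split())
    score = sum(tf[t] * idf[t] for t in query)
    print(round(score, 3), d)
# The first document ranks highest even though 'the' appears twice in
# the second one, because IDF gives 'the' a very low weight.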

5. An easier problem, related to the question above, is analysing chat text in call centres. Chat communication happens between the agent and the customer in call centres, and you are given the chat text. Since this is text typed directly by the agents and customers, there can be frequent spelling mistakes, non-standard annotations/greetings, etc. The first stage of your application needs to analyse the chat text, find the misspelt words and replace them with their corrected versions. Your first thought is to use a standard dictionary to identify words that are not in the dictionary, and then to find the words that are the best replacements. Can you come up with an algorithm that finds the best replacement for each misspelt word? How would you handle misspelt words that are domain specific?
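One simple baseline for question 5 is edit-distance-style matching against the dictionary, with domain-specific terms merged into the vocabulary so that valid jargon is not ‘corrected’ away. The word lists below are toy stand-ins for a real dictionary; a minimal sketch using Python’s standard difflib:

# Dictionary-based spelling correction baseline (question 5).
import difflib

dictionary = {"please", "reset", "my", "password", "and", "check", "account"}
domain_terms = {"sim", "dongle", "mifi"}   # jargon a standard dictionary lacks
vocabulary = dictionary | domain_terms

def correct(word):
    if word in vocabulary:
        return word
    # Return the closest in-vocabulary word, if any is similar enough.
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.7)
    return matches[0] if matches else word

print([correct(w) for w in "plese reset my pasword and chek my acount".split()])
# ['please', 'reset', 'my', 'password', 'and', 'check', 'my', 'account']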

6. You are given a corpus of documents, and you are asked to find the important phrases in the corpus. The important phrases can be unigrams, bigrams or trigrams. Can you come up with a simple algorithm that can do this? If you want to compare the performance of your algorithm with a standard one, you can use gensim.models.phrases. For more details, refer to https://radimrehurek.com/gensim/models/phrases.html and check how your algorithm performs compared to that.
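One simple algorithm for question 6 is to rank bigrams by pointwise mutual information (PMI), so that word pairs which co-occur far more often than chance (such as ‘new york’) surface as phrases; gensim’s Phrases model uses a related count-based score. A toy sketch, where the corpus and the minimum-count threshold of 2 are arbitrary choices:

# Rank candidate bigram phrases by PMI over a toy corpus (question 6).
import math
from collections import Counter

sentences = [
    "he flew to new york last week".split(),
    "new york is a big city".split(),
    "the city has a big airport".split(),
]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    unigrams.update(s)
    bigrams.update(zip(s, s[1:]))
word_total = sum(unigrams.values())
pair_total = sum(bigrams.values())

def pmi(a, b):
    # log of the observed co-occurrence probability over chance co-occurrence
    p_ab = bigrams[(a, b)] / pair_total
    p_a, p_b = unigrams[a] / word_total, unigrams[b] / word_total
    return math.log(p_ab / (p_a * p_b))

# A minimum count filter (like gensim's min_count) keeps rare noise out.
candidates = [p for p in bigrams if bigrams[p] >= 2]
for pair in sorted(candidates, key=lambda p: pmi(*p), reverse=True):
    print(pair, round(pmi(*pair), 2))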

7. Naïve Bayes and logistic regression are two popular algorithms used in machine learning. When would you prefer to use Naïve Bayes instead of logistic regression, and vice versa?
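One empirical way into question 7 is to train both models on the same text classification task while varying the amount of training data: the classical observation is that the generative Naïve Bayes model often holds up better on very small training sets, while the discriminative logistic regression tends to catch up and overtake it as data grows. A sketch, where the dataset choice and the sample sizes are arbitrary (the dataset is downloaded on first use):

# Compare Naive Bayes and logistic regression as training data grows.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
for n in (50, 500, len(data.data)):
    X_tr, X_te, y_tr, y_te = train_test_split(
        data.data[:n], data.target[:n], test_size=0.3, random_state=0)
    for model in (MultinomialNB(), LogisticRegression(max_iter=1000)):
        pipe = make_pipeline(CountVectorizer(), model).fit(X_tr, y_tr)
        print(n, type(model).__name__, round(pipe.score(X_te, y_te), 3))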

8. Topic models are very popular in natural language processing to identify the topics associated with a set of documents, and the identified topics can be used to classify/cluster the documents. Can you explain the basic idea behind the topic model? Can you differentiate between topic modelling and latent semantic indexing?
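For question 8, here is a minimal sketch that fits latent Dirichlet allocation (LDA), the most common topic model, on a toy corpus using scikit-learn and prints the top words per topic. Real corpora need far more documents than this, and the number of topics is a modelling choice.

# Fit a two-topic LDA model on a toy corpus and show top words per topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the team won the cricket match",
    "the batsman scored a century in the match",
    "the election results were announced today",
    "the party won a majority in the election",
]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]          # indices of the heaviest words
    print("topic", k, [terms[i] for i in top])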

9. Continuing with topic models, general topic model approaches output a set of unnamed topics, with a set of keywords associated with each of these unnamed topics. Then a domain expert needs to manually annotate these topics with their topic names. Is it possible to avoid this manual labelling of the topics generated by topic modelling techniques?

10. One of the major issues in supervised learning is overfitting of the model to training data. What causes overfitting? How can you avoid it?
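To make question 10 concrete, the sketch below fits polynomials of two different degrees to the same noisy data; the high-degree model scores far better on the training split than on the held-out split, which is the signature of overfitting. The data generation and the degrees are arbitrary choices.

# Demonstrate overfitting: compare train and test scores for a simple
# model (degree 3) and an overly flexible one (degree 15).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 40))[:, None]
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          round(model.score(X_tr, y_tr), 2),   # training R^2
          round(model.score(X_te, y_te), 2))   # held-out R^2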

11. Distributional semantics is a popular concept in natural language processing and, of late, word embeddings based on the concept of distributional semantics are widely used in NLP. Word2vec and GloVe are two popular techniques for generating word vectors or word embeddings. Can you explain either of these two techniques?

12. You are given a corpus, and you have created the word embeddings for your corpus using the popular Word2Vec technique. Now you need to represent each sentence in your corpus by means of a vector; that is, you need to compose the word vectors of the words present in a sentence to generate a sentence vector representation. What are the possible compositional models you can apply to generate the sentence vector?
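For questions 11 and 12 together, here is a minimal sketch: train Word2Vec on a toy corpus with gensim (the 4.x API, with vector_size rather than size, is assumed), then build a sentence vector by simply averaging its word vectors, the most basic compositional model. TF-IDF-weighted averaging is a common refinement.

# Train Word2Vec and compose sentence vectors by averaging word vectors.
import numpy as np
from gensim.models import Word2Vec

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are animals".split(),
]
model = Word2Vec(corpus, vector_size=50, min_count=1, seed=0)

def sentence_vector(tokens):
    # Average the vectors of the in-vocabulary words of the sentence.
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0)

v = sentence_vector("the cat sat".split())
print(v.shape)   # one 50-dimensional vector for the whole sentence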

13. You are given the problem of identifying different types of relations in a text. You have come up with two different classifiers that are performing equally well, and you are unable to decide, based on your currently available data, which classifier to select. How would you decide which of the two classifiers to use?
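One standard answer to question 13 is a paired significance test on the two classifiers’ predictions over the same held-out examples, such as McNemar’s test, which looks only at the cases where the two classifiers disagree. A sketch in its exact binomial form, with hypothetical predictions (requires scipy 1.7+ for binomtest):

# McNemar's test (exact binomial form) on two classifiers' predictions.
from scipy.stats import binomtest

# Hypothetical predictions on the same 10 test examples.
truth  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
pred_a = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
pred_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 0]

# Count only the discordant cases: one classifier right, the other wrong.
a_only = sum(a == t != b for a, b, t in zip(pred_a, pred_b, truth))
b_only = sum(b == t != a for a, b, t in zip(pred_a, pred_b, truth))

result = binomtest(a_only, a_only + b_only, 0.5)
print(a_only, b_only, result.pvalue)  # a large p-value: no evidence of a difference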

14. What is ensemble learning? When would you use an ensemble of classifiers as opposed to a single classifier? Among the different ensemble learning methods, can you explain bagging and boosting?
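As a companion to question 14, the sketch below runs scikit-learn’s off-the-shelf bagging and boosting ensembles on the same dataset: bagging averages independently trained high-variance models over bootstrap samples, while boosting fits weak learners sequentially, reweighting the examples the previous learners got wrong. The dataset is an arbitrary choice.

# Bagging vs boosting on the same data, evaluated by cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=0)
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))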

15. You are asked to build a system that can do ‘parts of speech tagging’ for a given sentence. While parts of speech (POS) tagging assigns an independent label to each word in the sentence, it uses the sequence of words seen in the context to decide on the current label. For example, consider the sentence: “Robert saw the saw on the table and took it outside to the shed.” The first occurrence of ‘saw’ in the above sentence is a verb, whereas the second occurrence of the word ‘saw’ is a noun. By looking at the sequence of words occurring around a specific word, a POS tagging system can distinguish between the tags it should assign to each of the two occurrences of the word ‘saw’. Hence, POS tagging is an example of sequence labelling. Two of the important algorithms for sequence labelling are Hidden Markov Models and Conditional Random Fields. Can you explain each of them?
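Before implementing an HMM or CRF tagger for question 15, it can help to watch an off-the-shelf sequence tagger handle the example sentence: NLTK’s default tagger distinguishes the two occurrences of ‘saw’ from context. Depending on your NLTK version, you may first need to run nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').

# Tag the example sentence and compare the two occurrences of 'saw'.
import nltk

sentence = "Robert saw the saw on the table and took it outside to the shed."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Expected: the first 'saw' is tagged VBD (past-tense verb), the second NN (noun).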

16. Consider the two sentences:

S1: The bank will remain closed on Saturday.
S2: He sat on the bank and watched the sunset.

The word ‘bank’ has two meanings: one as a commercial institution and another as the sand bank by the side of a river. Depending on the frame of reference, the appropriate sense has to be assigned to the word ‘bank’. This is the problem of word sense disambiguation. Can you come up with an algorithm for word sense disambiguation using a supervised technique (provided that you are given sufficient labelled data)? How would your system handle words whose multiple senses are not present in the training data? What are the trade-offs in using a supervised approach vs an unsupervised approach to word sense disambiguation?
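A minimal supervised sketch for question 16: treat each labelled occurrence of ‘bank’ as a bag of its surrounding words and train a classifier over the senses. The training sentences and sense labels below are toy data; a real system would use a sense-annotated corpus such as one derived from WordNet senses.

# Supervised word sense disambiguation as context classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

contexts = [
    "the bank approved my loan application",
    "she deposited the cheque at the bank",
    "they had a picnic on the bank of the river",
    "the boat drifted towards the muddy bank",
]
senses = ["FINANCE", "FINANCE", "RIVER", "RIVER"]

wsd = make_pipeline(TfidfVectorizer(), MultinomialNB())
wsd.fit(contexts, senses)
print(wsd.predict(["he sat on the bank and watched the sunset"]))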

17. Distributional semantics is based on the idea that semantic similarities can be characterised by the distributional properties of words in a large corpus. This is popularly expressed by the statement, “A word is known by the company it keeps.” If two words have similar distributions, or similar contexts, then they have similar meanings. Can you give a couple of examples of the applications of distributional semantics in natural language processing?

18. Machine translation is the automatic translation of text from a source language to a target language. Machine translation can be driven either by rule based techniques or by statistical techniques. Can you explain the advantages and disadvantages of both rule based machine translation and statistical machine translation?

19. What is transfer learning? Can you illustrate it with an example?

20. Many of the standard NLP systems have been designed for documents of reasonable size, be it for automatic key phrase recognition, topic modelling or summarisation. Consider that you have been asked to design an NLP pipeline to analyse tweet text. Tweets are typically very short texts, and are characterised by the extensive use of non-standard abbreviations. What are the major NLP challenges in analysing such short text?

If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!

Sandya Mannarswamy
