CodeSport
In this month’s column, we discuss a few computer science interview questions.
This month, let’s discuss a set of computer science interview questions, focusing on natural language processing and machine learning, as well as on traditional programming problems. Please note that it is difficult to discuss the complete solutions due to the large number of questions we cover in this column. Instead, I encourage you to send me your solutions to the questions posed, directly, so that I can provide feedback on them.

1. You are given the problem of disambiguating entities in a given set of documents. For example, for the word ‘Apple’, you will need to disambiguate between its use as a corporate entity and its use as a word that describes a fruit. A more complicated example would be the following:

S1: The White House announced its intention to support the newly formed government in Vanaru.
S2: The New Year Party will be held at the White House, and will be hosted by Michelle Obama.

In sentence S1, the White House is an entity that actually represents the American government. In sentence S2, the White House is a location. Can you come up with an algorithm that can disambiguate such mentions of this entity to their appropriate types, based on the context?

2. In information retrieval, you are given a set of documents and a query, and you are asked to retrieve the set of documents most relevant to the query. Many common information retrieval techniques are built around the measure of TF-IDF, where TF stands for the term frequency, i.e., the number of times a query keyword appears in a document. For example, given the query ‘Tendulkar the Indian Cricketer’, I am asked to find all documents relevant to this query. It is obvious that one can compute the term frequency for the query terms, and return a ranked list of documents in which the query terms appear most frequently. If this suffices to return the most relevant documents, why do we need the measure of the inverse document frequency (IDF)? Can you illustrate with an example why IDF is needed in information retrieval?

3. Parsing is one of the major components of a natural language processing pipeline. Depending on the nature of the task, different types of parsing of the input text may be required. Can you explain the differences between shallow parsing, dependency parsing and constituency parsing? Given the sentence, “John went to New York yesterday by the midnight flight,” can you provide the output for each of these parses? On a related note, when would you use POS tagging instead of any parsing?

4. You have been asked to build an application that analyses the conversations in call centres. These conversations typically happen between the agent and the customer. An automatic speech recognition system has been used to convert the voice conversations into text.
The first stage of your application needs to analyse the voice transcripts, and correct any mistakes in the words caused by the automatic speech recognition system. Can you explain how you would build the transcript correction system?

5. An easier problem, related to the question above, is analysing chat text in call centres. Chat communication happens between the agent and the customer in call centres, and you are given the chat text. Since this is text typed directly by the agents and customers, there can be frequent spelling mistakes, non-standard abbreviations/greetings, etc. The first stage of your application needs to analyse the chat text, find the misspelt words and replace them with their corrected versions. Your first thought is to use a standard dictionary to identify words that are not in the dictionary, and then find the words that are the best replacements. Can you come up with an algorithm that finds the best replacement for each misspelt word? How would you handle misspelt words that are domain specific?
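One common starting point for question 5 is edit (Levenshtein) distance against a dictionary. The following is a minimal sketch, not a complete answer; the tiny dictionary and the distance threshold of 2 are illustrative assumptions.

```python
# A minimal sketch of dictionary-based spelling correction using
# edit distance. The dictionary and threshold are illustrative.

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct(word, dictionary, max_dist=2):
    """Return the in-dictionary word closest to `word`, or `word` itself."""
    if word in dictionary:
        return word
    best = min(dictionary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word

dictionary = {"the", "refund", "account", "payment"}
print(correct("teh", dictionary))      # nearest dictionary word
print(correct("acount", dictionary))
```

For the domain-specific part of the question, one option is to augment the dictionary with frequent in-domain terms mined from the chat corpus itself, so that valid jargon is not flagged as misspelt.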
6. You are given a corpus of documents, and you are asked to find the important phrases in the corpus. The important phrases can be either unigrams, bigrams or trigrams. Can you come up with a simple algorithm that can do this? If you want to compare the performance of your algorithm with a standard one, you can use gensim.models.phrases. For more details, you can refer to https://radimrehurek.com/gensim/models/phrases.html and check how your algorithm is performing compared to that.
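One simple approach to question 6 is to score candidate bigrams by pointwise mutual information (PMI), which is similar in spirit to the scoring used by gensim.models.phrases. The toy corpus and the minimum-count cutoff below are illustrative assumptions.

```python
# A minimal sketch of collocation (phrase) detection using PMI.
import math
from collections import Counter

corpus = [
    "new york is a big city".split(),
    "i flew to new york yesterday".split(),
    "the city was busy".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
total = sum(unigrams.values())

def pmi(a, b):
    """Pointwise mutual information of the bigram (a, b)."""
    p_ab = bigrams[(a, b)] / total
    p_a, p_b = unigrams[a] / total, unigrams[b] / total
    return math.log(p_ab / (p_a * p_b))

# Rank bigrams that occur at least twice by PMI; frequent pairs of
# otherwise rare words score highest.
phrases = sorted((bg for bg, c in bigrams.items() if c >= 2),
                 key=lambda bg: pmi(*bg), reverse=True)
print(phrases)
```

The same counting generalises to trigrams; the minimum-count filter keeps one-off co-occurrences from dominating the ranking.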
7. Naïve Bayes and logistic regression are two popular algorithms used in machine learning. When would you prefer to use Naïve Bayes instead of logistic regression, and vice versa?
8. Topic models are very popular in natural language processing to identify the topics associated with a set of documents, and the identified topics can be used to classify/cluster the documents. Can you explain the basic idea behind the topic model? Can you differentiate between topic modelling and latent semantic indexing?
9. Continuing with topic models, general topic model approaches output a set of unnamed topics, with a set of keywords associated with each of these unnamed topics. Then a domain expert needs to manually annotate these topics with their topic names. Is it possible to avoid this manual labelling of the topics generated by the topic modelling techniques?
10. One of the major issues in supervised learning is overfitting of the model to training data. What causes overfitting? How can you avoid it?
11. Distributional semantics is a popular concept in natural language processing and, of late, word embeddings that are based on the concept of distributional semantics are widely used in NLP. Word2Vec and GloVe are two popular techniques for generating word vectors or word embeddings. Can you explain either of these two techniques?

12. You are given a corpus and you have created the word embeddings for your corpus using the popular Word2Vec technique. Now you need to represent each sentence in your corpus by means of a vector. You need to compose the word vectors of the words present in your sentence to generate a sentence vector representation. What are the possible compositional models you can apply to generate the sentence vector?
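The simplest compositional model for question 12 is averaging the word vectors of the sentence. The sketch below uses made-up two-dimensional "embeddings" purely for illustration; in practice the vectors would come from a trained Word2Vec model.

```python
# A minimal sketch of sentence representation by averaging word
# vectors. The tiny embeddings are illustrative, not trained.

def average_vector(sentence, embeddings):
    """Average the vectors of the in-vocabulary words of `sentence`."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

embeddings = {"good": [1.0, 2.0], "movie": [3.0, 4.0]}
print(average_vector("good movie", embeddings))   # [2.0, 3.0]
```

Other compositional models worth mentioning in an interview include TF-IDF weighted averaging and element-wise max pooling over the word vectors.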
13. You are given the problem of identifying different types of relations in a text. You have come up with two different classifiers that are performing equally well, and you are unable to decide, based on your currently available data, which classifier to select. How would you decide which of the two classifiers to use?
14. What is ensemble learning? When would you use an ensemble of classifiers as opposed to a single classifier? Among the different ensemble learning methods, can you explain bagging and boosting?
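To make the bagging part of question 14 concrete: bagging trains the same base learner on bootstrap resamples of the data and combines the models by majority vote. The 1-nearest-neighbour "base classifier" and the toy one-dimensional data below are illustrative assumptions.

```python
# A minimal sketch of bagging: bootstrap resamples + majority vote.
import random
from collections import Counter

def one_nn(train, x):
    """Predict the label of the closest training point (1-NN)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def bagging_predict(data, x, n_models=15, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap: sample with replacement, same size as the data.
        sample = [rng.choice(data) for _ in data]
        votes.append(one_nn(sample, x))
    # Majority vote over the ensemble's predictions.
    return Counter(votes).most_common(1)[0][0]

data = [(0.1, "a"), (0.2, "a"), (0.9, "b"), (1.0, "b")]
print(bagging_predict(data, 0.15))
```

Boosting differs in that the resampling (or reweighting) is not uniform: each successive model focuses on the examples the previous models got wrong, and the final vote is weighted.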
15. You are asked to build a system that can do ‘parts of speech tagging’ for a given sentence. While parts of speech (POS) tagging assigns a separate label to each word in the sentence, it uses the sequence of words around the current word to decide on its label. For example, consider the sentence: “Robert saw the saw on the table and took it outside to the shed.” The first occurrence of ‘saw’ in the above sentence is a verb, whereas the second occurrence of the word ‘saw’ is a noun. By looking at the sequence of words occurring around the specific word, a POS tagging system can distinguish between the tags it should assign for each of the two occurrences of the word ‘saw’. Hence, POS tagging is an example of sequence labelling. Two of the important algorithms for sequence labelling are Hidden Markov Models and Conditional Random Fields. Can you explain each of them?
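For the HMM half of question 15, the key algorithm is Viterbi decoding: find the tag sequence maximising the product of transition and emission probabilities. The sketch below uses made-up probabilities over a three-tag set; in practice these would be estimated from a tagged corpus.

```python
# A minimal Viterbi-decoding sketch for HMM-based POS tagging.
# The probabilities are illustrative assumptions, not trained values.

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words`."""
    # best[t] = (probability, path) of the best path ending in tag t.
    best = {t: (start_p[t] * emit_p[t].get(words[0], 1e-8), [t]) for t in tags}
    for w in words[1:]:
        best = {
            t: max(
                ((p * trans_p[prev][t] * emit_p[t].get(w, 1e-8), path + [t])
                 for prev, (p, path) in best.items()),
                key=lambda x: x[0],
            )
            for t in tags
        }
    return max(best.values(), key=lambda x: x[0])[1]

tags = ["NOUN", "VERB", "DET"]
start_p = {"NOUN": 0.4, "VERB": 0.2, "DET": 0.4}
trans_p = {
    "NOUN": {"NOUN": 0.2, "VERB": 0.6, "DET": 0.2},
    "VERB": {"NOUN": 0.3, "VERB": 0.1, "DET": 0.6},
    "DET":  {"NOUN": 0.8, "VERB": 0.1, "DET": 0.1},
}
emit_p = {
    "NOUN": {"robert": 0.4, "saw": 0.3, "table": 0.3},
    "VERB": {"saw": 0.8, "took": 0.2},
    "DET":  {"the": 0.9, "a": 0.1},
}
print(viterbi("robert saw the saw".split(), tags, start_p, trans_p, emit_p))
# -> ['NOUN', 'VERB', 'DET', 'NOUN']
```

Note how the two occurrences of ‘saw’ receive different tags purely because of the tags of their neighbours, which is exactly the sequence-labelling behaviour the question describes.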
16. Consider the two sentences:
S1: The bank will remain closed on Saturday.
S2: He sat on the bank and watched the sunset.
The word ‘bank’ has two meanings—one as a commercial institution and another as the sand bank by the side of a river. Depending on the context, the appropriate sense has to be assigned to the word ‘bank’. This is the problem of word sense disambiguation. Can you come up with an algorithm for word sense disambiguation using a supervised technique (provided that you are given sufficient labelled data)? How would your system handle words whose multiple senses are not present in the training data? What are the trade-offs in using a supervised approach vs an unsupervised approach towards word sense disambiguation?
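A classic supervised answer to question 16 is a Naïve Bayes classifier over the context words of each labelled occurrence. The tiny labelled data set and the sense names below are illustrative assumptions.

```python
# A minimal sketch of supervised word sense disambiguation using a
# Naive Bayes classifier over context words. The data is illustrative.
import math
from collections import Counter, defaultdict

train = [
    ("deposit money in the bank account", "FINANCE"),
    ("the bank raised interest rates", "FINANCE"),
    ("sat on the river bank fishing", "RIVER"),
    ("the bank of the river flooded", "RIVER"),
]

sense_counts = Counter(sense for _, sense in train)
word_counts = defaultdict(Counter)
for text, sense in train:
    word_counts[sense].update(text.split())
vocab = len(set(w for c in word_counts.values() for w in c))

def disambiguate(context):
    """Pick the sense maximising log P(sense) + sum log P(word|sense)."""
    scores = {}
    for sense in sense_counts:
        total = sum(word_counts[sense].values())
        score = math.log(sense_counts[sense] / len(train))
        for w in context.split():
            # Laplace smoothing so unseen words do not zero the score.
            score += math.log((word_counts[sense][w] + 1) / (total + vocab))
        scores[sense] = score
    return max(scores, key=scores.get)

print(disambiguate("opened an account at the bank"))
print(disambiguate("walked along the river bank"))
```

Senses absent from the training data cannot be predicted by such a classifier at all, which is precisely the limitation the question asks you to discuss.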
17. Distributional semantics is based on the idea that semantic similarities can be characterised based on the distributional properties of the words in a large corpus. This is popularly expressed by the statement, “A word is known by the company it keeps.” If two words have similar distributions or similar contexts, then they have similar meanings. Can you give a couple of examples of the applications of distributional semantics in natural language processing?

18. Machine translation is the automatic translation of text from a source language to a target language. Machine translation can be driven either by rule based techniques or by statistical techniques. Can you explain the advantages and disadvantages of both rule based machine translation and statistical machine translation?

19. What is transfer learning? Can you illustrate it with an example?

20. Many of the standard NLP systems have been designed for reasonable document sizes, be it automatic key phrase recognition, topic modelling or summarisation. Consider that you have been asked to design an NLP pipeline to analyse tweet texts. Tweets are typically very short texts and are characterised by the extensive use of non-standard abbreviations. What are the major NLP challenges in analysing such short texts?

If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!