CodeSport
In this month’s column, we discuss a few computer science interview questions.
Over the past couple of months, we have been discussing the problem of detecting duplicate questions in community question answering (CQA) forums using the Quora Question Pair Dataset. In this month’s column, we take a break from this topic and, as requested by our readers, discuss a few interview questions, focusing on algorithms, machine learning and text analytics.
1. In machine learning, there are three classes of errors – namely, training error, validation error and test error. Can you differentiate between these three classes of errors? During training, we focus on reducing the training error. In real-world deployment, where the classifier is applied to unseen data, how does reducing the training error help it predict better?
2. You have trained two classifiers for a given image classification problem. Classifier A has a training error of 4.2 per cent and Classifier B has a training error of 0.8 per cent. If you evaluate Classifiers A and B on a held-out test data set, what can you say about their test accuracies? Would B have a better test accuracy than A? Provide the rationale behind your answer.
3. Can you explain the concept of generalisation in the context of machine learning? What are the measures that can be used to explain the generalisation ability of learning systems?
4. What is meant by the Vapnik–Chervonenkis (VC) dimension in machine learning? Why is it an important concept? What is the VC dimension of a linear classifier?
5. Autoencoders are a class of unsupervised deep learning systems that take an input X and produce an output Y that is as close to X as possible. An autoencoder can be a deep multi-layer neural network. Since autoencoders try to reproduce at the output the input given to them, are they merely learning an identity function? What is the rationale behind your answer?
6. As you know, while training a deep learning network, the optimisation objective is to minimise the loss function defined for that classification problem. Given that non-convex optimisation functions can have multiple local minima, do first-order gradient descent based optimisation methods ensure that we reach a global minimum? If you think they do, explain how. If not, explain why.
7. Distributional semantics is based on the idea that semantic similarities can be characterised based on the distributional properties of words in a large corpus. This is popularly expressed by the statement, “A word is known by the company it keeps.” If two words have similar distributions or similar contexts, then they have similar meanings. Can you give a couple of examples of the applications of distributional semantics in natural language processing?
8. Traditionally, text inputs to statistical NLP systems have used the one hot vector representation of words. For instance, given a vocabulary of size V, each word is represented by a vector of size V in which only the index corresponding to that word is set to one, and all the remaining indices are set to zero. This is known as the ‘one hot vector representation’. However, of late, vector space word models have become quite popular as alternative input representations. Models such as Word2Vec (https://radimrehurek.com/gensim/models/word2vec.html) and GloVe (http://nlp.stanford.edu/projects/glove/) are widely used these days for input representation in statistical NLP systems, as opposed to representing words as one hot vectors. What is the advantage of using vector space word models over the traditional one hot vector representation? Are there any disadvantages to such vector space models?
9. You are given a corpus and you have created the word embeddings for your corpus using the popular Word2Vec technique. Now you need to represent each sentence in your corpus by means of a vector. You need to compose the word vectors of the words in your sentence to generate a sentence vector representation. What are the possible compositional models you can apply to generate the sentence vector?
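One of the simplest compositional models is additive composition, where the sentence vector is the element-wise average of its word vectors. Here is a minimal sketch; the tiny three-dimensional embeddings are invented for illustration, whereas a real system would load vectors from a trained Word2Vec or GloVe model.

```python
def average_vector(sentence, embeddings):
    """Compose a sentence vector by averaging the vectors of its words."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy 3-dimensional embeddings, made up purely for illustration.
embeddings = {
    "the": [0.1, 0.0, 0.2],
    "cat": [0.9, 0.3, 0.1],
    "sat": [0.2, 0.8, 0.5],
}

print(average_vector("the cat sat", embeddings))
```

Averaging ignores word order, which is one of its well-known limitations; weighted sums, recurrent networks and other compositional models try to address this.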
10. In information retrieval, you are given a set of documents and a query. You are asked to retrieve the set of documents that are most relevant to the query. Many of the common information retrieval techniques are built around the measure of TF-IDF, where TF stands for the term frequency, i.e., the number of times a query keyword appears in a document. It is obvious that one can compute the term frequency for the query terms, and return a ranked list of documents in which the query terms appear most frequently. If this suffices to return the most relevant documents, why do we need the measure of the inverse document frequency (IDF)? Can you illustrate with an example why IDF is needed in information retrieval?
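The point of the question can be sketched in a few lines of code. This toy example assumes a log-scaled IDF of log(N/df); the three one-line "documents" are invented for illustration. A word like 'the' appears in every document, so its IDF is zero and its TF-IDF score vanishes, whereas raw term frequency alone would overrate it.

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the sun is bright",
]

def tf(term, doc):
    """Term frequency: fraction of the document's words equal to term."""
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    """Inverse document frequency: log(N / number of docs containing term)."""
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# 'the' occurs in all three documents, so IDF = log(3/3) = 0.
print(tf_idf("the", docs[0], docs))   # 0.0
# 'cat' occurs in only two documents, so it gets a positive score.
print(tf_idf("cat", docs[0], docs))
```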
11. You have been given the problem of sentiment analysis, where, given a sentence, you need to classify it as bearing positive, negative or neutral sentiments. You have been given large quantities of labelled data of example sentences, with their target sentiment polarity annotated. Given that there can be words in the test input sentences that are not present in the training set examples, how would your approach handle the ‘out of vocabulary’ words in the test sentences? If you decide to just ignore the ‘out of vocabulary’ words, what would be the impact on the test set classification accuracy? Can you devise a better approach to handling this issue?
12. In machine learning, the metrics used to compare classifier performance across different models are precision, recall, the area under the Receiver Operating Characteristic (ROC) curve, and the confusion matrix. Can you explain each of these terms?
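As a quick refresher, here is a minimal sketch that computes a 2x2 confusion matrix, precision and recall from binary labels (1 = positive); the label lists are invented for illustration.

```python
def confusion(y_true, y_pred):
    """Return (TP, FP, FN, TN) counts for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion(y_true, y_pred)
precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found
print(precision, recall)    # 0.75 0.75
```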
13. You are asked to build a system that can do ‘parts of speech’ (POS) tagging for a given sentence. While POS tagging assigns a label to each word in the sentence, it uses the sequence of words seen in the context to decide on the current label. For example, consider the sentence: “Robert found the saw on the table and was taking it outside when he saw the snake.” The first occurrence of ‘saw’ in the above sentence is a verb, whereas the second occurrence is a noun. By looking at the sequence of words occurring around a specific word, a POS tagging system can distinguish between the tags it should assign to each of the two occurrences of the word ‘saw’. Hence, POS tagging is an example of sequence labelling. Two of the important algorithms for sequence labelling are Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs). Can you explain each of them?
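For the HMM part of the answer, decoding is typically done with the Viterbi algorithm, which finds the most probable tag sequence given transition and emission probabilities. The two-tag model and all the probabilities below are invented for illustration; a real tagger would estimate them from a labelled corpus.

```python
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.6, "saw": 0.1, "cats": 0.3},
          "VERB": {"dogs": 0.1, "saw": 0.8, "cats": 0.1}}

def viterbi(words):
    """Return the most probable tag sequence for the given word sequence."""
    # V[t][s] = (best probability of any path ending in state s at step t,
    #            backpointer to the previous state on that path)
    V = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for t in range(1, len(words)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][ps][0] * trans_p[ps][s] * emit_p[s].get(words[t], 0.0), ps)
                for ps in states)
            V[t][s] = (prob, prev)
    # Trace back from the best final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(words) - 1, 0, -1):
        best = V[t][best][1]
        path.append(best)
    return list(reversed(path))

print(viterbi(["dogs", "saw", "cats"]))  # ['NOUN', 'VERB', 'NOUN']
```

Note how the middle ‘saw’ is tagged VERB only because of the surrounding context, which is exactly the sequence-labelling point made above.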
14. If you are asked to do sequence labelling with a deep neural network, what type of neural network would you employ? Explain the reason behind your choice.
15. Given an array of N integers, you are asked to find out whether the array is an arithmetic series. An arithmetic series is a sequence of numbers with a constant difference between any two consecutive numbers. What is the complexity of your function? Now, you are given a modified problem: your task is to convert a given array into an arithmetic sequence by dropping those array elements that prevent it from being one. How would you solve this problem? What is the time complexity of your function?
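The first part admits a single O(N) pass: compute the difference of the first pair and verify that every consecutive pair matches it. A minimal sketch:

```python
def is_arithmetic(a):
    """Return True if a is an arithmetic series (constant consecutive difference)."""
    if len(a) <= 2:
        return True  # trivially arithmetic
    d = a[1] - a[0]
    return all(a[i + 1] - a[i] == d for i in range(len(a) - 1))

print(is_arithmetic([3, 7, 11, 15]))  # True
print(is_arithmetic([3, 7, 10, 15]))  # False
```

The second part (dropping the fewest elements to leave an arithmetic sequence) is harder, since the common difference of the surviving elements is not known in advance.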
16. You are given a list of N integers, where the numbers can only be positive. You are asked to write code that creates two sub-lists of size N/2 such that the sums of the elements of the two sub-lists are equal, if it is possible to create such sub-lists at all. Your function should return the sum of the sub-list elements if the original list can be split into two such sub-lists, or zero if it cannot. What is the time complexity of your function? (Note that you can reorder the elements of the list.) You are now told that the original list can contain either positive or negative integers, and asked to solve the same problem. What changes would you make to your earlier solution?
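For positive integers and small N, a brute-force sketch is to try every choice of N/2 elements and check whether it accounts for exactly half the total sum. This is exponential in N; a dynamic-programming table over (count of elements used, sum reached) would be needed for larger inputs, and the negative-integer variant changes the bounds of that table.

```python
from itertools import combinations

def equal_split_sum(nums):
    """Return the common sub-list sum if nums can be split into two
    equal-size, equal-sum halves, else 0. Assumes positive integers."""
    n = len(nums)
    total = sum(nums)
    if n % 2 or total % 2:
        return 0  # odd length or odd total: no equal split possible
    half = total // 2
    # Brute force: test every way of picking n//2 element positions.
    for combo in combinations(range(n), n // 2):
        if sum(nums[i] for i in combo) == half:
            return half
    return 0

print(equal_split_sum([1, 2, 3, 4, 6, 2]))  # 9, e.g. [1, 2, 6] and [3, 4, 2]
print(equal_split_sum([1, 1, 1, 3]))        # 0
```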
17. Given that the same word can have different meanings (this is known as polysemy), how do information retrieval systems figure out the specific information being searched for? For example, consider a user who searches for ‘windows’. How would the search engine determine whether he/she is looking for Microsoft Windows or glass windows? Blindly displaying search results for both meanings is likely to hurt the user experience. If you were the designer of the search engine, how would you approach this problem? Consider a related problem: when the user types in the search query ‘automobile’, how does the search engine include Web pages that talk about cars as well?
18. What is an ensemble of classifiers? We find that, in many cases, an ensemble of classifiers achieves better performance than any individual classifier model. Can you explain the intuition behind the performance improvements achieved by ensemble classifiers?
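The intuition can be demonstrated with a toy majority-vote ensemble. The three hard-coded prediction lists below stand in for three trained models that each make one independent mistake; the vote corrects any error made by only a minority of the models.

```python
from collections import Counter

truth  = [1, 0, 1, 1, 0, 1]
pred_a = [1, 0, 1, 0, 0, 1]   # wrong on example 3
pred_b = [1, 1, 1, 1, 0, 1]   # wrong on example 1
pred_c = [0, 0, 1, 1, 0, 1]   # wrong on example 0

def majority_vote(*prediction_lists):
    """Combine classifiers by taking the most common label per example."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*prediction_lists)]

ensemble = majority_vote(pred_a, pred_b, pred_c)
print(ensemble == truth)  # True: every single error was out-voted
```

The benefit depends on the individual models making uncorrelated errors; an ensemble of classifiers that all fail on the same examples gains nothing.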
19. A well-known phrase used in machine learning and statistics is ‘correlation is not the same as causation’. Can you give an example to illustrate this statement?
20. There are three kinds of analytics – namely, prescriptive analytics, descriptive analytics and predictive analytics. Can you explain each of these terms and how they are different from each other?
If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Wishing all our readers happy coding until next month!