OpenSource For You

CodeSport

In this month’s column, we discuss a few computer science interview questions.

- By: Sandya Mannarswamy. The author is an expert in systems software and is currently working as a research scientist at Xerox India Research Centre. Her interests include compilers, programming languages, file systems and natural language processing.

1. Containers are becoming the de facto method for application deployment, moving away from virtual machines. Docker is a popular container platform that most of you would be familiar with (https://www.docker.com/). Docker is an open source project built on top of Linux containers, which allows applications to be isolated from each other. Can you explain how Docker differs from virtual machines in providing isolation to applications? Can you compare the two methods of application deployment, namely, hosting an application on a VM vs hosting it on a container, with regard to the following criteria: (a) security, (b) isolation, (c) cost, and (d) fine-grained resource control and sharing?

2. When we input a query into a search engine such as Google, we get 'auto-complete' suggestions for that query, based on past queries that have been tried on the search engine. Can you explain how you would design an auto-complete mechanism that scales well across the millions of queries that can arrive concurrently? What data structure would you use to store past queries and automatically provide suggestions? Would you store the complete query as is, or would you break it into its constituent words? Given that scalability and correctness are key to this problem, can you explain how your choice of data structure addresses both?

3. In information retrieval, you are given a set of documents and a query, and are asked to retrieve the set of documents that are most relevant to the query. Many common information retrieval techniques are built around the TF-IDF measure, where TF stands for term frequency, i.e., the number of times a query keyword appears in a document. For example, given the query 'Mahendra Singh Dhoni's test match appearances', I am asked to find all the documents relevant to it. One can obviously compute the term frequency for the query terms, and return a ranked list of documents in which the query terms appear most frequently. If this suffices to return the most relevant documents, why do we need the inverse document frequency (IDF)? Can you illustrate, with an example, why IDF is needed in information retrieval? If you ignore IDF during relevant document retrieval, would it impact precision, recall, or both?
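To make the TF-IDF question concrete, one common form of the weighting (several variants exist) is:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}
```

where N is the total number of documents in the collection and df(t) is the number of documents containing term t. For the example query above, terms like 'test' and 'match' occur in a large fraction of cricket documents (high df(t), hence low IDF), while 'Dhoni' is comparatively rare; the IDF factor therefore shifts the ranking towards documents actually about Dhoni rather than about test matches in general.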

4. With elections coming up in five Indian states, poll prediction has become quite a hot topic. However, poll predictions have had mixed results. For instance, the past year has been quite unsuccessful for many of the data scientists who are into poll prediction: neither Brexit nor Trump's victory was anticipated by most pollsters. Given this context, you are asked to predict the party that will win the majority in each of the five states going to the polls in the coming months. For any prediction, you need to build models that can analyse the data and make the prediction. What are the various data inputs that you would be interested in getting? Or, in simpler terms, what are some of the obvious features that you would like to extract from the data for predicting the winning party in a state election? Is this a classification problem or a clustering problem? Why do you think poll prediction is a hard problem?

5. You are given a set of documents and asked to cluster them based on document-level similarity. You are familiar with different clustering algorithms such as k-means, so you are confident that you can do the task. However, the interviewer now adds a second part to the problem, asking you to label the clusters. How would you generate labelled clusters? What are the means of generating meaningful labels for the clusters? Can the clusters be labelled after, prior to, or during the clustering process itself? Describe your choice of solution and why you made it.

6. You are given a set of articles and asked to identify the key phrases corresponding to each of the articles. Assume that each article is written in English and contains approximately 10,000 words. You are asked to find up to seven key phrases for each of the documents. How would you design a solution for this problem? Would you propose a supervised or unsupervised learning method for this problem? Justify your choice in terms of its advantages and disadvantages. Now you are told that each of the articles has a fixed structure with components such as an abstract, introduction, description, results and conclusion. How would you use this information in your solution?

7. Consider the above question. Now you are told that you are given a black box algorithm which can do topic modelling for a set of documents. How can you use this in your solution for identifying key phrases?

8. Word embeddings and sentence/paragraph embeddings have become popular of late with the arrival of Google's word2vec package. Can you explain the intuition behind embeddings? Given a corpus of documents, can you explain briefly how one could generate word embeddings? Given a set of word embeddings generated from a large corpus such as Wikipedia, can you explain how they can be used in different natural language processing tasks? What is the major advantage of the embedding representation over other representations such as the count-based co-occurrence matrix?

9. Geo-spatial applications are being developed in large numbers these days. Consider a taxi-fleet management company that needs to store the trajectories of its taxis in a database. Spatial applications should be able to query the database for a specific trajectory or for segments of a trajectory. For instance, given a set of location points, it should be possible to retrieve all the taxis whose trajectories overlapped at those points. What kind of database would you use for storing trajectory data? To speed up query performance, what kind of indexing would you use for trajectory data?

10. You are given an array A of N integers and a number X. You are asked to find the element in A that is closest to X, under the constraint that you should not sort the array A. (a) Can you write a C program to solve this problem? (b) If you are told that the array A consists of floating point values instead of integers, how would you modify your solution for (a)?

11. You are given two arrays and asked to find the common elements between them. (a) Assume that the two arrays only contain integer elements. (b) Assume that the two arrays contain floating point values. Would your solution for (a) work for (b) as well?

12. A diagnostic test has been developed for dengue fever. The new test is 81 per cent reliable, i.e., it misses 19 per cent of cases where the patient actually has the disease. It can wrongly label a patient as having dengue fever in 10 per cent of the cases where the patient actually does not have the disease. You are told nothing about the patient's symptoms or any other clinical information, except that the patient has tested positive on the dengue diagnostic test. Can you figure out whether the patient actually has dengue from the above information? If yes, how? If not, why not?
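As a worked illustration of what question 12 is probing, apply Bayes' rule with an assumed disease prevalence p (the prior that the question deliberately withholds):

```latex
P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)}
            = \frac{0.81\,p}{0.81\,p + 0.10\,(1 - p)}
```

If, purely for illustration, p = 0.01, then P(D|+) ≈ 0.076: even after a positive test the patient is unlikely to have dengue. Without knowing the prevalence p, the posterior cannot be computed at all, which is the point of the question.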

13. We are all familiar with different sorting and searching algorithms. However, these algorithms are typically designed for data that fits in physical memory. Consider that you are given two matrices A and B of dimensions M×N and N×R. The matrix dimensions are such that neither of the matrices will fit in main memory. You are asked to write an algorithm for multiplying the two matrices and storing the result in a matrix C of dimensions M×R on disk. Can you outline the approach you would take to minimise the number of disk accesses that your matrix_multiply function performs? Now consider the same problem, but you are told that matrices A and B are sparse, i.e., most of their elements are zero. You are again asked to compute the product of these two matrices. Can you come up with an algorithm that minimises the number of disk accesses?

14. Consider that you are running your application on a desktop which has 16GB of physical memory and runs the Linux operating system. You want to determine the maximum memory footprint of your application. Explain how you would find out how much physical memory is being used by your application.

15. You are running a server application which is latency sensitive. You want to ensure that the virtual address space of the server process is locked into physical memory (essentially, none of its pages should be paged out). How would you ensure this? Assume that during the course of its run, the server application can map further memory pages through dynamic memory allocations or through the mmap call. You need to ensure that any future pages mapped by the server also remain locked in physical memory. Explain how this can be achieved.

If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month with the answers to the above questions, I wish all our readers a wonderful and productive new year ahead!
