CodeS­port

In this month’s column, we dis­cuss a few com­puter sci­ence in­ter­view ques­tions.

OpenSource For You - - Contents - By: Sandya Man­nar­swamy The au­thor is an ex­pert in sys­tems soft­ware and is cur­rently work­ing as a re­search sci­en­tist at Xerox In­dia Re­search Cen­tre. Her in­ter­ests in­clude com­pil­ers, pro­gram­ming lan­guages, file sys­tems and nat­u­ral lan­guage pro­cess­ing. If yo

1. Con­tain­ers are be­com­ing the de facto method for ap­pli­ca­tion de­ploy­ment, mov­ing away from vir­tual ma­chines. Docker is a pop­u­lar con­tainer ap­pli­ca­tion which most of you would be fa­mil­iar with (https:// www.docker.com/). Docker is an open source project built on top of Linux con­tain­ers, which al­lows ap­pli­ca­tions to be iso­lated from each other. Can you ex­plain how Docker dif­fers in pro­vid­ing isolation to ap­pli­ca­tions as com­pared to vir­tual ma­chines? Can you com­pare the two meth­ods of ap­pli­ca­tion de­ploy­ment, namely, host­ing an ap­pli­ca­tion on a VM vs host­ing it on a con­tainer, with re­gard to the fol­low­ing cri­te­ria: (a) se­cu­rity, (b) isolation, (c) cost, and (d) fine grained re­source con­trol and shar­ing. 2. When we in­put a query into a search engine such as Google, we get an ‘auto com­plete’ sug­ges­tion for that query, based on past queries that have been tried on the search engine. Can you ex­plain how you would de­sign an auto-com­plete mech­a­nism that would scale well across the mil­lions of queries that can hap­pen con­cur­rently at a time? What is the data struc­ture that you would use to store past queries and au­to­mat­i­cally pro­vide sug­ges­tions? Would you store the com­plete query as is, or would you break it into its con­stituent words? Given that scal­a­bil­ity and cor­rect­ness is the key to this prob­lem, can you ex­plain how the choice of your data struc­ture will ad­dress both?

3. In in­for­ma­tion re­trieval, you are given a set of doc­u­ments and a query. You are asked to re­trieve the set of doc­u­ments that are most rel­e­vant to the query. Many of the com­mon in­for­ma­tion re­trieval tech­niques are built around the mea­sure of TFIDF, where TF stands for the term fre­quency, i.e., the num­ber of times a query key­word ap­pears in a doc­u­ment. For ex­am­ple, given the query ‘Ma­hen­dra Singh Dhoni’s test match ap­pear­ances’, I am asked to find all the doc­u­ments rel­e­vant to it. It is ob­vi­ous that one can com­pute the Term Fre­quency for the query terms, and re­turn a ranked list of doc­u­ments in which the query terms ap­pear most fre­quently. If this suf­fices to re­turn the most rel­e­vant doc­u­ments, why do we need the mea­sure of In­verse Doc­u­ment Fre­quency (IDF)? Can you il­lus­trate, with an ex­am­ple, why IDF is needed in in­for­ma­tion re­trieval? If you ig­nore IDF for rel­e­vant doc­u­ment re­trieval, would it im­pact the pre­ci­sion or re­call, or both?

4. With elec­tions com­ing around in five In­dian states, poll pre­dic­tions have be­come quite a hot topic. How­ever, there have been mixed re­sults for poll pre­dic­tions. For in­stance, the past year has been quite un­suc­cess­ful for many of the data sci­en­tists who are into poll prediction. Nei­ther Brexit nor Trump’s vic­tory were an­tic­i­pated by most poll­sters. Given this con­text, you are asked to pre­dict the party that will win the ma­jor­ity in each of the five states go­ing to the polls in the com­ing months. For any prediction, you need to build mod­els which can an­a­lyse the data and make the prediction. What are the var­i­ous data in­puts that you would be in­ter­ested in get­ting? Or, in sim­plis­tic terms, what are some of the ob­vi­ous fea­tures that you would like to ex­tract from the data for win­ning party pre­dic­tions in a state elec­tion? Is this a clas­si­fi­ca­tion prob­lem or a clus­ter­ing prob­lem? Why do you think poll prediction is a hard prob­lem?

5. You are given a set of doc­u­ments and asked to per­form clus­ter­ing of th­ese doc­u­ments based on doc­u­ment level sim­i­lar­ity. You are fa­mil­iar with dif­fer­ent clus­ter­ing al­go­rithms such as ‘kmeans’, so you are con­fi­dent that you can do the task. How­ever, the in­ter­viewer now adds a sec­ond part to the prob­lem, want­ing you to la­bel the clus­ters. How would you gen­er­ate la­belled clus­ters? What are the means of gen­er­at­ing mean­ing­ful la­bels for the clus­ters? Can the clus­ters be la­belled af­ter, prior to, or dur­ing the clus­ter­ing process it­self? De­scribe your choice of so­lu­tion and why you made it.

6. You are given a set of ar­ti­cles and asked to iden­tify the key phrases cor­re­spond­ing to each of the ar­ti­cles. As­sume that each ar­ti­cle is writ­ten in English and con­tains ap­prox­i­mately 10,000 words. You are asked to find up to seven key phrases for each of the doc­u­ments. How would you de­sign a so­lu­tion for this prob­lem? Would you pro­pose a su­per­vised or un­su­per­vised learn­ing method for this prob­lem? Jus­tify your choice in terms of its ad­van­tages and dis­ad­van­tages. Now you are told that each of the ar­ti­cles has a fixed struc­ture with com­po­nents such as an ab­stract, in­tro­duc­tion, de­scrip­tion, re­sults and con­clu­sion. How would you use this in­for­ma­tion in your so­lu­tion?

7. Con­sider the above ques­tion. Now you are told that you are given a black box al­go­rithm which can do topic mod­el­ling for a set of doc­u­ments. How can you use this in your so­lu­tion for iden­ti­fy­ing key phrases?

8. Word em­bed­ding and sen­tence/para­graph em­bed­ding have be­come pop­u­lar of late with the ar­rival of Google’s word2vec pack­age. Can you ex­plain the in­tu­ition be­hind em­bed­ding? Given a cor­pus of doc­u­ments, can you ex­plain briefly how one could gen­er­ate word em­bed­ding? Given a set of word em­bed­ding that has been gen­er­ated us­ing a large cor­pus such as Wikipedia, can you ex­plain how it can be used in dif­fer­ent Nat­u­ral Lan­guage Pro­cess­ing tasks? What is the ma­jor ad­van­tage of em­bed­ding rep­re­sen­ta­tion over other forms of rep­re­sen­ta­tion such as the count based co-oc­cur­rence ma­trix?

9. Geo-spa­tial ap­pli­ca­tions are get­ting de­vel­oped in large num­bers th­ese days. Con­sider a taxi-fleet man­age­ment com­pany for which the tra­jec­to­ries of its taxis need to be stored in a data­base. Spa­tial ap­pli­ca­tions should be able to query the data­base for a spe­cific tra­jec­tory or seg­ments of a tra­jec­tory. It should be pos­si­ble to an­swer queries of the form, where given a set of lo­ca­tion points, it should be pos­si­ble to re­trieve all the taxis whose tra­jec­to­ries had over­lapped at the lo­ca­tion points. What kind of data­base would you use for stor­ing tra­jec­tory data? To speed up query per­for­mance, what kind of in­dex­ing would you per­form for tra­jec­tory data?

10. You are given an ar­ray A of N in­te­gers. You are also given a num­ber X. You are asked to find the el­e­ment in A that is clos­est to X. You are given the con­straint that you should not sort the ar­ray A. (a) Can you write a C pro­gram to solve this prob­lem? (b) If you are told that the set S con­sists of float­ing point val­ues in­stead of in­te­gers, how would you mod­ify your so­lu­tion for (a) to solve (b)?

11. You are given two ar­rays and asked to find the com­mon el­e­ments be­tween them. (a) As­sume that the two ar­rays only con­tain in­te­ger el­e­ments. (b) As­sume that the two ar­rays con­tain float­ing point val­ues. Would your so­lu­tion for (a) work for (b) as well?

12. For dengue fever, a di­ag­nos­tic test has been de­vel­oped. The new test is 81 per cent re­li­able (i.e., it misses 19 per cent of cases where the pa­tient ac­tu­ally has the dis­ease). It can wrongly la­bel a pa­tient as hav­ing dengue fever in 10 per cent of cases when the pa­tient ac­tu­ally does not have the dis­ease. You are told noth­ing about the pa­tient’s symp­toms or any other clin­i­cal in­for­ma­tion, ex­cept that the pa­tient has tested pos­i­tive for the dengue di­ag­nos­tic test. Can you fig­ure out whether the pa­tient ac­tu­ally has the dengue ill­ness from the above given in­for­ma­tion? If yes, how? If not, why is that so?

13. We are all fa­mil­iar with dif­fer­ent sort­ing and searching al­go­rithms. How­ever, th­ese al­go­rithms typ­i­cally are de­signed for data that fits in phys­i­cal mem­ory. Con­sider that you are given two ma­tri­ces A and B of di­men­sions MXN and NXR. The ma­trix di­men­sions are such that nei­ther of the ma­tri­ces will fit in the main mem­ory. You are asked to write an al­go­rithm for mul­ti­ply­ing the two ma­tri­ces and stor­ing the re­sult into a ma­trix C of di­men­sions MXR on disk. Can you out­line the ap­proach you would take for min­imis­ing the num­ber of disk ac­cesses that your ma­trix_­mul­ti­ply func­tion would per­form? Now con­sider the same prob­lem men­tioned in ques­tion (1) above. You are told that ma­tri­ces A and B are such that most of their el­e­ments are zero. You are again asked to com­pute the ma­trix mul­ti­pli­ca­tion of th­ese two ma­tri­ces. Can you come up with an al­go­rithm which min­imises the num­ber of disk ac­cesses?

14. Con­sider that you are run­ning your ap­pli­ca­tion on a desk­top which has 16GB phys­i­cal mem­ory and has the Linux op­er­at­ing system in­stalled on it. You want to de­ter­mine the max­i­mum mem­ory foot­print of your ap­pli­ca­tion. Ex­plain how you will find out how much phys­i­cal mem­ory is be­ing used by your ap­pli­ca­tion?

15. You are run­ning a server ap­pli­ca­tion which is la­tency sen­si­tive. You want to en­sure that the vir­tual ad­dress space of the server process is locked into phys­i­cal mem­ory (es­sen­tially, none of its pages should be paged out). How would you en­sure this? As­sume that dur­ing the course of its run, the server ap­pli­ca­tion can map fur­ther mem­ory pages by dy­namic mem­ory al­lo­ca­tions or through mmap call. You need to en­sure that any fu­ture pages mapped by the server also re­main locked in phys­i­cal mem­ory. Ex­plain how this can be achieved.

If you have any favourite pro­gram­ming ques­tions/soft­ware top­ics that you would like to dis­cuss on this fo­rum, please send them to me, along with your so­lu­tions and feed­back, at sandyas­m_AT_ya­hoo_DOT_­com. Till we will meet again next month with the an­swers to the above ques­tions, I wish all our read­ers a won­der­ful and pro­duc­tive new year ahead!

Sandya Man­nar­swamy

Newspapers in English

Newspapers from India

© PressReader. All rights reserved.