CodeSport

In this month's column, we discuss solutions to some of the questions featured in last month's column.

As you know, it is our practice to regularly feature computer science interview questions in this column, to help our student readers prepare for interviews. A few student readers have requested that I provide the answers to these questions as well. Since this column is meant to kindle the reader's curiosity and encourage them to seek out solutions rather than hand them canned answers, I generally do not provide solutions to the questions featured here. So, instead of giving complete solutions as requested, it may be more helpful to discuss the high-level approaches in greater detail, while readers work out the concrete answers themselves. Hence, in this month's column, we will discuss some of the topics that the featured questions are based on.

One of the questions from last month's column was on the difference between virtual machines and containers. Virtual machines are intended to facilitate server virtualisation, wherein the physical hardware resources are virtualised by a special piece of software known as the virtual machine monitor (VMM) or hypervisor, which controls the virtual machines that run on top of it. Each virtual machine has its own copy of an operating system. Typically, a VMM or hypervisor runs on top of the host operating system (though it is possible to have a bare metal hypervisor that runs directly on the firmware itself). Each VM runs its own operating system, known as the guest operating system, and these guest operating systems can differ from one another. Containers, on the other hand, run on top of a single host operating system, which is why they are sometimes described as providing operating system virtualisation, whereas virtual machines are described as providing server virtualisation. Typically, multiple containers sit on top of a single host operating system, which itself runs on a physical server. All the containers share the same host operating system kernel and libraries.

Now, let us consider the case in which there is a bug in an operating system. If this operating system version is running as a guest OS in a VM environment, the impact of the bug is confined to the VM running that guest OS. On the other hand, assume that this OS is the host OS on top of which all the containers are hosted. All the containers would be impacted by the same bug, since there is no OS isolation in a container environment. What, then, is the advantage of containers? The very fact that containers share the same OS components makes them very lightweight and easy to deploy. Container density is much higher than VM density for the same physical configuration. A detailed discussion of the security issues around containers can be found in the blog post at https://opensource.com/business/14/7/docker-security-selinux.

Question (3) from last month's column was about information retrieval. Readers were asked why we need inverse document frequency as a measure for retrieving relevant documents, given a query. As we know, TF-IDF is primarily used as the measure for information retrieval. TF stands for term frequency, a measure of how frequently a term occurs in a single document. If T1 is a query term and it occurs with a high frequency in document D1, it seems obvious that D1 may be a relevant document when a user searches for T1. However, suppose our document collection is a set of documents associated with cricket, and our query is 'wisden player'. Many documents are likely to contain the query term 'player', and so its TF will be high for many documents. Hence, the term 'player' may not help us discriminate the set of documents that are most relevant to the user's query. The term 'wisden', however, is more significant, as it is likely to be present only in the relevant documents. So we want to retrieve the documents that contain the discriminating term 'wisden', and not all the documents that contain the non-discriminating term 'player'.

This is where inverse document frequency comes into play. Inverse document frequency (IDF) is inversely related to the number of documents that contain a term. A discriminating term is present in fewer documents, so its IDF is much higher, and this helps offset the effect of non-discriminating terms, whose IDF is much lower. Typically, both TF and IDF are scaled and normalised appropriately. IDF is typically defined as IDF(t) = log(N/Nt), where N is the total number of documents in the corpus and Nt is the number of documents in which the term t is present. I leave it to the reader to figure out whether IDF helps to improve precision, recall, or both.
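To make this concrete, here is a minimal Python sketch of the TF-IDF weighting described above (the toy corpus and function name are my own, purely for illustration):

import math
from collections import Counter

def tf_idf(docs):
    # TF is the raw count of a term in a document; IDF is log(N/Nt),
    # where N is the number of documents in the corpus and Nt is the
    # number of documents containing the term, as defined above.
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

# Toy cricket corpus: 'player' occurs in every document, 'wisden' in only
# one, so 'wisden' ends up with a much higher weight than 'player'.
corpus = [
    'the player scored a century'.split(),
    'the player was injured in the match'.split(),
    'wisden named the player cricketer of the year'.split(),
]
for doc_weights in tf_idf(corpus):
    print(doc_weights)

Note that with this definition, a term that appears in every document, such as 'player' here, gets a weight of zero, which is exactly the discriminating behaviour discussed above.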

Question (4) in last month's column was about predicting the results of election polls in states. This problem can be analysed in multiple ways. Let us assume that we want to predict the winning party for a particular state, given that multiple parties are contesting the election. We also know the past history of elections in that state, as well as various other features collected from past data. This can be modelled as a multi-class classification problem, where the winner is one of the N parties contesting the election.

On the other hand, if we are considering all the constituencies of a state, can we cluster the constituencies such that each cluster represents the set of constituencies won by a specific party? We opt for clustering if we do not have labelled data for the winners of previous elections. In that case, we simply use the features/attributes of the different constituencies and cluster them. Note, however, that although we get a set of clusters, we do not know what these clusters represent; a human expert needs to label them. Once the clusters are labelled, an unlabelled constituency can be labelled by assigning it the label of the cluster closest to it. So this question can be modelled in multiple ways, either as a classification problem or as a clustering problem, depending on the data available for analysis.
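As a rough sketch of the two framings (the features and labels below are entirely synthetic, and the choice of scikit-learn estimators is mine, not the column's):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical feature matrix: one row per constituency
# (e.g., turnout, past vote shares, demographic attributes).
X = rng.random((200, 5))

# Framing 1: multi-class classification, if past winners are known.
y = rng.integers(0, 3, size=200)              # winner: one of 3 parties
clf = RandomForestClassifier().fit(X, y)
print(clf.predict(X[:5]))                     # predicted winning party per constituency

# Framing 2: clustering, if no winner labels are available; each cluster
# must later be labelled by a human expert.
km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.labels_[:5])                         # cluster assignment per constituency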

Question (8) was about word embeddings, currently the hottest topic in natural language processing. Frequency or count based approaches were earlier widely used in natural language processing; for instance, latent semantic analysis (or latent semantic indexing) is one of the major approaches in this direction. These approaches have typically been based on global counts, wherein for each word we consider all the words that have co-occurred with it in order to construct its representation. Hence, the entire corpus has to be processed before we get a co-occurrence count based representation for each word.

On the other hand, word embeddings are based on a small local context window that is moved over all the contexts in the corpus; at any point in time, a continuous vector representation is available for each word, and it gets updated as we examine more contexts. Note that co-occurrence based counts are typically sparse representations, since a word co-occurs with only a small fraction of the words in the vocabulary. Embeddings, by contrast, are dense representations, since they are of much lower dimension.
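To see why the count based representations are sparse, consider this small Python sketch (my own illustration, not from the column) that builds co-occurrence counts using a symmetric context window:

from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    # Count, for each word, how often every other word appears within
    # `window` positions of it across the whole corpus.
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

sentences = [
    'the cat sat on the mat'.split(),
    'the dog sat on the rug'.split(),
]
print(dict(cooccurrence_counts(sentences)['sat']))

Even in this tiny example, 'sat' co-occurs with only a handful of vocabulary words; over a realistic vocabulary, most entries of such a representation are zero, whereas a learnt embedding packs the information into a few hundred dense dimensions.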

A detailed discussion of the relation between embeddings and global count based representations can be found in the paper 'GloVe: Global Vectors for Word Representation' by Pennington et al, available at http://nlp.stanford.edu/pubs/glove.pdf. It would be good for readers to experiment both with the Google word2vec embeddings available online (these include word vectors for a vocabulary of three million words, trained on a Google News dataset) and with the GloVe vectors available at http://nlp.stanford.edu/projects/glove/. Word embeddings, and subsequently paragraph and document embeddings, are widely used as inputs to various natural language processing tasks; in fact, they serve as the input representation for many deep learning based text processing systems. A number of word2vec implementations are available, a popular and widely used one being in the Gensim package (https://radimrehurek.com/gensim/models/word2vec.html). It would be useful for readers to understand how word embeddings are learnt using neural networks without requiring any labelled data. While word embeddings are typically used in deep neural networks, the word2vec model that generates the embeddings is not a deep architecture but a shallow neural network. More details can be found in the word2vec paper at https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.
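For readers who want to experiment, a minimal gensim sketch might look like the following (the toy sentences are mine, and the parameter names assume a recent gensim 4.x release, where the embedding dimension is called vector_size):

from gensim.models import Word2Vec

# Toy corpus; in practice word2vec is trained on a large tokenised corpus.
sentences = [
    'the batsman played a fine innings'.split(),
    'the bowler took five wickets in the match'.split(),
    'wisden named the batsman cricketer of the year'.split(),
]

# Train a small skip-gram model (sg=1). As noted above, word2vec is a
# shallow neural network, not a deep architecture.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv['batsman'][:5])                   # first few dimensions of the dense vector
print(model.wv.most_similar('batsman', topn=3))  # nearest neighbours in embedding space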

If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, here's wishing all our readers a wonderful and productive year ahead!

By: Sandya Mannarswamy
The author is an expert in systems software and is currently working as a research scientist at Xerox India Research Centre. Her interests include compilers, programming languages, file systems and natural language processing. If you are preparing for systems software interviews, you may find it useful to visit Sandya's LinkedIn group 'Computer Science Interview Training India' at http://www.linkedin.com/groups?home=&gid=2339182
