Machines Learn in Many Different Ways

This article gives the reader a bird's eye view of machine learning models, and solves a use case using SFrames and Python.


'Data is the new oil' is not an empty expression doing the rounds within the tech industry. Nowadays, the strength of a company is also measured by the amount of data it has. Facebook and Google offer their services free in lieu of the vast amount of data they get from their users. These companies analyse the data to extract useful information. For instance, Amazon keeps suggesting products based on your buying trends, and Facebook always suggests friends and posts you might be interested in. Data in its raw form is like crude oil: you need to refine crude oil to make petrol and diesel. Similarly, you need to process data to get useful insights, and this is where machine learning comes in handy.

Machine learning has different models such as regression, classification, clustering and similarity, matrix factorisation, deep learning, etc. In this article, I will briefly describe these models and also solve a use case using Python.

Linear regression: Linear regression is a model for understanding the relationship between input and output numerical values. The representation is a linear equation that combines a specific set of input values (x), the solution to which is the predicted output for that set of input values. Learning the model means estimating the values of the coefficients used in the representation from the data we have available. For example, in a simple regression problem (a single x and a single y), the form of the model is:

y = B0 + B1*x

Using this model, the price of a house can be predicted based on the data available on nearby homes.
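To make this concrete, here is a minimal sketch of estimating B0 and B1 by ordinary least squares in Python with NumPy. The house areas and prices below are invented purely for illustration.

import numpy as np

# Hypothetical data: house area (sq. ft) and price (in lakh rupees)
x = np.array([1000.0, 1500.0, 1800.0, 2400.0, 3000.0])
y = np.array([40.0, 58.0, 69.0, 95.0, 119.0])

# Ordinary least-squares estimates of the coefficients
B1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
B0 = y.mean() - B1 * x.mean()

# Predict the price of a 2000 sq. ft house using y = B0 + B1*x
print(B0 + B1 * 2000.0)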

Classification model: The classification model helps identify the sentiment of a particular post. For example, a user review can be classified as positive or negative based on the words used in the comments. Given one or more inputs, a classification model will try to predict the value of one or more outcomes. Outcomes are labels that can be applied to a data set. Emails can be categorised as spam or not based on these models.
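As a rough sketch of how such a model can be built (this is separate from the GraphLab workflow used later, assumes scikit-learn is installed, and uses invented toy reviews):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: reviews labelled 1 (positive) or 0 (negative)
reviews = ['great product, loved it', 'awful, waste of money',
           'excellent quality, very happy', 'terrible and disappointing']
labels = [1, 0, 1, 0]

# Represent each review as a vector of word counts
vectoriser = CountVectorizer()
features = vectoriser.fit_transform(reviews)

# Fit a logistic regression classifier on the word counts
model = LogisticRegression()
model.fit(features, labels)

# Predict the sentiment of a new comment
print(model.predict(vectoriser.transform(['loved the quality'])))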

Clustering and similarity: This model helps when we are trying to find similar objects. For example, if I am interested in reading articles about football, this model will search for documents with certain high-priority words and suggest articles about football. It will also find articles on Messi or Ronaldo, as they are associated with football. TF-IDF (term frequency - inverse document frequency) is used to weight the words in this model.

Deep learning: This is also known as deep structured learning or hierarchical learning. It is used for product recommendations and image comparison based on pixels.
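As a toy illustration of pixel-based comparison only (real deep learning models involve far more machinery), two small images can be compared by the distance between their raw pixel values; the arrays below are made up:

import numpy as np

# Hypothetical 2x2 greyscale images as arrays of pixel intensities
image_a = np.array([[0.1, 0.9], [0.8, 0.2]])
image_b = np.array([[0.2, 0.8], [0.7, 0.3]])

# Euclidean distance between the flattened pixel vectors;
# a smaller distance means the images are more similar
print(np.linalg.norm(image_a.ravel() - image_b.ravel()))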

Now, let's explore the concept of clustering and similarity, and try to find documents that interest us. Let's assume that we want to read an article on soccer. We like one article, and would like to retrieve another article that we may be interested in reading.

The question is: how do we do this? There are lots and lots of articles out there that we may or may not be interested in, so we have to think of a mechanism that suggests articles that interest us. One way is to take a word count of the article we liked, and suggest articles that share the highest number of words with it. But there is a problem with this approach: document lengths vary widely, and unrelated documents can be fetched simply because they contain many of the same common words. For example, articles on football players' private lives may also get suggested, which we are not interested in. This is where the TF-IDF model comes in: it prioritises informative words when finding related articles, as the sketch below illustrates.
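Here is a minimal pure-Python sketch of TF-IDF scoring over a toy corpus (the documents are invented); the GraphLab toolkit used below computes this kind of score for us:

import math
from collections import Counter

# Toy corpus: each document is a list of words
docs = [['football', 'match', 'goal', 'goal'],
        ['messi', 'football', 'goal'],
        ['election', 'vote', 'campaign']]

def tf_idf(doc, corpus):
    tf = Counter(doc)  # term frequency within this document
    scores = {}
    for word, count in tf.items():
        # Document frequency: how many documents contain the word
        df = sum(1 for d in corpus if word in d)
        # Words that are rare across the corpus get a higher weight
        scores[word] = count * math.log(len(corpus) / df)
    return scores

# 'match' and 'goal' now outweigh the corpus-wide word 'football'
print(tf_idf(docs[0], docs))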

Let's get hands-on with document retrieval. The first thing you need to do is install GraphLab Create, on which the Python commands can be run. GraphLab Create can be downloaded from https://turi.com/ by filling in a simple form, which asks for a few details such as your name, email ID, etc. GraphLab Create ships with the IPython notebook, which is used to write the Python commands. The IPython notebook is similar to any other notebook, with the advantage that it can display graphs in its console.

Open the IPython notebook, which runs in the browser at http://localhost:8888/. Import GraphLab using the Python command:

import graphlab

Next, load the data into an SFrame and view it using the following commands:

people = graphlab.SFrame('people_wiki.gl/')
people.head()

This displays the top few rows in the console.

Each row of the data contains a URL, the name of a person, and the text of that person's Wikipedia entry.

I will now list some of the Python commands that can be used to search for articles related to former US President Barack Obama.

1. To explore the entry for Obama, use the command:

obama = people[people['name'] == 'Barack Obama']

2. Now, look at the word counts for the Obama article. First compute the word counts for the entry, and then turn the resulting dictionary of word counts into a table:

obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])
obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name=['word', 'count'])

3. To sort the word counts to show the most common words at the top, type:

obama_word_count_table.sort('count', ascending=False)

4. Next, compute the TF-IDF for the whole corpus. To give more weight to informative words, we score them by their TF-IDF values, as follows:

people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people['tfidf'] = graphlab.text_analytics.tf_idf(people['word_count'])
people.head()

5. To examine the TF-IDF for the Obama article, give the following commands:

obama = people[people['name'] == 'Barack Obama']
obama[['tfidf']].stack('tfidf', new_column_name=['word', 'tfidf']).sort('tfidf', ascending=False)

Words with the highest TF-IDF scores are much more informative. For the Obama article, the top TF-IDF words are terms closely tied to him, such as 'iraq' and 'control', rather than common words.
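From here, retrieving related articles is a matter of finding the entries whose TF-IDF vectors are closest to Obama's. One possible next step (a sketch, not shown in the figures) is to build a nearest-neighbours model on the TF-IDF column with GraphLab and query it with the Obama entry:

# Build a nearest-neighbours model on the TF-IDF representation
knn_model = graphlab.nearest_neighbors.create(people, features=['tfidf'], label='name')

# Retrieve the entries whose TF-IDF vectors are closest to Obama's
knn_model.query(obama)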

Machine learning is not a new technology. It has been around for years, but is gaining popularity only now as many companies have started using it.

Figure 1: The people data loaded in SFrames
Figure 2: Data generated for the Obama article
Figure 3: Sorting the word count
Figure 4: Compute TF-IDF for the corpus
Figure 5: TF-IDF for the Obama article
