Data Science and Machine Learning: Working Together to Make Machines Smarter

This article clarifies the role of data science in relation to machine learning, and discusses various aspects of both.


Google is now a synonym for online search. Most of us will agree with this, because whenever we try to get information on something we don’t know, we say, “Google it!” Have you ever given a thought to how Google comes up with the answers to different questions in an instant? Google and other such search engines make use of different data science algorithms and machine learning techniques to come up with the best results for all our search queries, and that too, in less than a second. Google processes more than 20 petabytes of structured and unstructured data daily and, even then, is able to instantly flash our search results. Had there been no data science and machine learning, Google would not have been able to perform all this, and it would not have been the all-pervasive search engine we all depend on today. Data science is one of the roots that support the tree of the digital world.

Data science is also known as data-driven science, as it deals with various scientific processes, methods and systems to extract knowledge or insights from large sets of data, either unstructured or structured. We all have access to huge amounts of data about many aspects of our lives – related to communication, shopping, reading the news, searching for information, expressing our opinions, and so on. All of this is used to extract useful insights by employing different data science techniques. Data science is basically a concept that unifies statistics and data analysis in order to analyse and relate real-world activities to data. It employs techniques and theories drawn from many fields within the broad areas of statistics, mathematics, information science and computer science, besides the various sub-domains of machine learning, cluster analysis, classification, databases, data mining and visualisation.

According to the Turing Award winner Jim Gray, data-intensive science is the fourth paradigm of science. Gray asserts that everything about science is changing rapidly because of the impact of information technology and the data deluge. Data science plays a crucial role in transforming the information collected during datafication and adding value to it. Datafication is nothing but the process of taking different aspects of life and turning them into data. For instance, Twitter datafies stray thoughts and LinkedIn datafies professional networks. We take the help of different data science techniques to extract the useful parts of the information collected during datafication.

Drew Conway is famous for his Venn diagram definition of data science. He applied it to study and analyse one of the biggest problems of the globe: terrorism. If we take a look at his Venn diagram definition, data science is the intersection of hacking skills, statistical and mathematical knowledge, and substantive expertise in the specific subject. According to him, data science is the civil engineering of data: it requires a practical knowledge of different tools and materials, coupled with a theoretical understanding of what’s possible.

The workflow for data science

Data science comprises a sequence of processes that are followed to deduce useful insights from a raw set of data. This ultimately helps the system to make decisions.

Let us have a look at the different processes followed in data science.

Collection of raw data: This is the first step in data science and deals with the collection of the actual raw data on which the different data science operations need to be performed. There are broadly two ways to do this:

1. We can pick one or more tools that collect data automatically from different data sources. This option is widely used to collect data from large data sources. We just need to copy-paste a small code snippet into our website and we are ready to go (e.g., Hotjar, Google Analytics, etc.).

2. We can also collect the data ourselves, using a JavaScript code snippet that sends the data as a .csv plain text file to the server. This is a bit more difficult to implement as it requires some coding skills, but in the long term this solution pays off better.
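To make the second option more concrete, here is a minimal server-side sketch in Python (using Flask; the endpoint name, form fields and file name are assumptions made purely for illustration) that a JavaScript snippet on a website could post events to, with each event appended to a .csv plain text file on the server.

```python
# Minimal sketch of a collection endpoint (hypothetical fields and file name):
# the website's JavaScript snippet POSTs simple events, and each event is
# appended as one row of a plain-text CSV file on the server.
import csv
from datetime import datetime, timezone

from flask import Flask, request

app = Flask(__name__)
DATA_FILE = "events.csv"  # hypothetical output file

@app.route("/collect", methods=["POST"])
def collect():
    # Expect simple form fields such as 'page' and 'action' from the snippet.
    row = [
        datetime.now(timezone.utc).isoformat(),
        request.form.get("page", ""),
        request.form.get("action", ""),
    ]
    with open(DATA_FILE, "a", newline="") as f:
        csv.writer(f).writerow(row)
    return "", 204  # no content; the browser does not need a response body

if __name__ == "__main__":
    app.run(port=8000)
```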

Data processing: This refers to the refinement of the raw data collected during the data collection process. Raw data is unprocessed and unorganised; it needs to be arranged and organised so that it becomes easier to perform operations on it. Once the data is processed, we get output data that is the processed, categorised and summarised version. Data processing is required in most experiments and surveys. The collected raw data sometimes contains too much information to analyse sensibly, especially when the research is done using computers, which may produce very large data sets. The data then needs to be organised, or broken down (deconstructed), into a form that can be analysed.

Data set cleaning: This is the process of removing unwanted data from the processed data set and keeping only what is required for analysis. It helps to reduce a large data set to a smaller one by removing inconsistent or incorrect data, and makes it easier to perform different analysis tasks on it.
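As a rough illustration, the following sketch uses pandas to clean a collected data set (the file and column names are assumptions carried over from the collection example above): duplicates, incomplete rows and inconsistent records are dropped so that only analysable data remains.

```python
# A minimal cleaning sketch using pandas (column names are assumptions):
# drop duplicates, remove rows with missing values, and filter out
# inconsistent records so that only analysable data remains.
import pandas as pd

raw = pd.read_csv("events.csv", names=["timestamp", "page", "action"])

cleaned = (
    raw.drop_duplicates()   # remove repeated records
       .dropna()            # remove incomplete records
)
# Keep only rows whose 'action' value is one we recognise.
cleaned = cleaned[cleaned["action"].isin(["click", "view", "purchase"])]

cleaned.to_csv("events_clean.csv", index=False)
```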

Exploratory data analysis: This approach is used to analyse data sets in order to summarise their important characteristics, often with the help of visual methods. A statistical model can also be used for analysis but, primarily, exploratory data analysis is for visualising what the data can tell us beyond the formal modelling or hypothesis testing task. The approach was promoted by John Tukey to encourage data scientists to explore the data and possibly formulate hypotheses that could lead to new methods of data collection and new experiments. It is different from initial data analysis, which focuses mostly on checking the assumptions required for model fitting and hypothesis testing, handling missing values and making transformations of variables as required.
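A minimal exploratory analysis might look like the sketch below, which assumes the cleaned events file from the previous steps and uses pandas and matplotlib to summarise the columns and visualise activity over time.

```python
# An exploratory data analysis sketch using pandas and matplotlib
# (the data set and column names are assumptions): summary statistics
# plus a simple visual look at activity over time.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("events_clean.csv", parse_dates=["timestamp"])

print(df.describe(include="all"))    # summary of every column
print(df["action"].value_counts())   # how often each action occurs

# Visualise events per day to spot trends or anomalies before modelling.
df.set_index("timestamp").resample("D").size().plot(kind="bar")
plt.xlabel("day")
plt.ylabel("number of events")
plt.title("Events per day")
plt.tight_layout()
plt.show()
```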

Models and algorithms: Once the data is cleansed, some data sets will need exploratory analysis, whereas others can be used directly for the selection of data models and algorithms. This phase of data science deals with selecting the most appropriate algorithm on the basis of the data set obtained after cleaning, and of the knowledge gained about it during exploratory data analysis, so that the chosen algorithm is the most efficient one for the available data. It also includes the design, development and selection of the data models used to perform the required operations on the data and obtain the required data product.
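The sketch below illustrates this model selection step with scikit-learn on a synthetic data set (an assumption, since the real data would come from the earlier steps): cross-validation scores are compared across two candidate algorithms to pick the one that suits the data best.

```python
# A sketch of choosing between candidate models with scikit-learn:
# cross-validation scores guide which algorithm suits the data best.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned data set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```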

Report communication: This is the part of data science that deals with generating and developing visual reports in the form of graphs and pie charts, which can be used by data scientists to analyse the data patterns and make the appropriate decisions. This decision is the final output, which is then utilised in different applications.
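A simple reporting step can be as small as the sketch below, which draws a pie chart of hypothetical event counts (the numbers are made up for illustration) of the kind a data scientist might include in a visual report.

```python
# A small reporting sketch: a pie chart summarising how events split across
# actions. The totals are illustrative assumptions.
import matplotlib.pyplot as plt

counts = {"view": 640, "click": 230, "purchase": 45}   # hypothetical totals
plt.pie(list(counts.values()), labels=list(counts.keys()), autopct="%1.1f%%")
plt.title("Share of user actions")
plt.show()
```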

Data product: This is the final data product, which is used to continuously improve and change the application system whose data is analysed. It can be considered the end product, representing the whole set of operations performed on the collected raw data.

What is machine learning?

Machine learning is a part of computer science that gives a system the ability to learn on its own without being explicitly programmed. It makes a machine learn in much the same way as human beings learn by themselves. Just as we learn about a system on the basis of our experience and the knowledge gained after analysing it, machines too can analyse and study a system’s behaviour or its output data and learn how to take decisions on that basis. This is the backbone of artificial intelligence: it puts machines into a self-learning mode without any explicit programming, so that when a machine is fed with new data, it learns, grows and changes by itself.

Machine learning has evolved from the concepts of pattern recognition and computational learning theory in artificial intelligence. It explores the study and construction of algorithms that can learn from data and make predictions on it. These algorithms do not follow static program instructions, but make data-driven predictions or decisions by building a model from sample inputs.

There are three types of machine learning, differentiated on the basis of the learning signal available to the learning system.

1. Supervised learning: The machine is presented with a few example inputs and their desired outputs, which are given by a teacher. The goal is to learn a general rule that maps inputs to outputs.

2. Unsupervised learning: No labels are given to the learning algorithm, leaving it to find the structure in its input on its own.

3. Reinforcement learning: A computer program interacts with a dynamic environment in which it must achieve a specific goal (for example, driving a vehicle or playing a game against an opponent).
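The difference between the first two types can be seen in a few lines of scikit-learn code on the built-in Iris data set (the choice of algorithms here is purely illustrative): the classifier is trained with labels, while the clustering algorithm receives none.

```python
# A compact illustration of supervised versus unsupervised learning with
# scikit-learn on the built-in Iris data set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the desired outputs (labels) are provided during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier().fit(X_train, y_train)
print("supervised accuracy:", clf.score(X_test, y_test))

# Unsupervised: no labels are given; the algorithm finds structure on its own.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == k).sum()) for k in range(3)])
```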

How is machine learning related to data science?

Machine learning is very closely related to (and sometimes overlaps with) data science and computational statistics, as all of them focus on making predictions with the help of machines or computers. It has strong ties with mathematical optimisation, which provides methods and theory for optimising learning systems. Machine learning is also often applied within data science for exploratory data analysis; when this is done without labelled outputs, it is known as unsupervised learning.

If we talk specifically about the field of data science, machine learning is used to devise the complex models and algorithms that lend themselves to prediction; this is also known as predictive analytics. These analytical models allow data scientists, researchers, engineers and analysts to produce reliable and repeatable decisions, and to uncover hidden insights, by learning from the historical relationships and trends in large sets of data.

Data analysis has traditionally been characterised by a trial-and-error approach, which becomes impossible to use when the data sets to be analysed are large and heterogeneous. The more data that becomes available, the more difficult it is to build predictive models that remain accurate. Traditional statistical solutions work for static analysis, which is limited to the analysis of samples frozen in time. Machine learning has emerged as a solution to all this chaos, proposing clever alternatives for analysing huge volumes of data. It is able to produce accurate results and analyses by using efficient, fast-working algorithms for the real-time processing of data.

Some applications of machine learning

Machine learning has been implemented in a number of applications. Some of them are:

1. Google’s self-driving car

2. Online recommendation engines, such as friend recommendations on Facebook

3. Various offer recommendations from Amazon

4. Cyber fraud detection

5. Optical character recognition (OCR)

The role of machine learning in data science

1. Machine learning helps to analyse large chunks of data easily, and hence eases the work of data scientists by automating the process.

2. Machine learning has changed the way data interpretation and extraction work, by introducing automatic sets of generic methods that have largely replaced manual statistical techniques.

3. It provides insights that help to create applications that are more intelligent and data-driven, and hence improves their operation and business processes, leading to easier decision making.

4. Machine learning software systems improve their performance as more people use them, because the algorithms they are built on learn from the large sets of data generated by users’ behaviour.

5. It helps in inventing new ways to solve sudden and unexpected challenges in a system, on the basis of the experience gained by the machine while analysing its large data sets and behaviour.

6. The increasing use of machine learning in industry acts as a catalyst that makes data science increasingly relevant.

Some of the machine learning techniques used in data science

1. Decision tree learning: This machine learning technique uses a decision tree as the predictive model, which maps observations about an item to conclusions about the item’s target value.
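A minimal example with scikit-learn’s DecisionTreeClassifier is sketched below (the Iris data set is used only for illustration); the printed rules show how observations about an item are mapped to a conclusion about its target value.

```python
# A minimal decision tree sketch with scikit-learn: observations (the Iris
# measurements) are mapped to a conclusion about the target value (the species).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Print the learned rules so the mapping from observations to conclusions
# is visible.
print(export_text(tree, feature_names=load_iris().feature_names))
print("prediction for one item:", tree.predict(X[:1]))
```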

2. Association rule learning: This is a method for discovering interesting relations between variables in large databases.
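The tiny sketch below computes the support and confidence of one candidate rule over a handful of made-up shopping transactions; this is the kind of relation that association rule learning discovers at scale.

```python
# A tiny association-rule sketch in plain Python (transactions are made up):
# compute support and confidence for the rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support = both / n          # how often bread and butter appear together
confidence = both / bread   # how often butter appears, given bread
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```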

3. Artificial neural networks: These learning algorithms, usually just called neural networks, are inspired by the structure and functional aspects of biological neural networks. Computations are structured in terms of interconnected groups of artificial neurons, which process information using a connectionist approach to computation. Modern neural networks are essentially non-linear statistical tools for data modelling. They are usually used to model complex relationships between inputs and outputs, and to find patterns in data.

4. Inductive logic programming (ILP): This approach uses logic programming as a uniform representation for the input examples, the background knowledge and the hypotheses. Given an encoding of the known background knowledge and a set of examples represented as a logical database of facts, an ILP system derives a hypothesised logic program that entails all the positive examples and none of the negative ones. Inductive programming is a related field that considers any kind of programming language for representing hypotheses, such as functional programs.
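To illustrate item 3 above, the sketch below trains a small artificial neural network with scikit-learn’s MLPClassifier on a non-linear (XOR-style) pattern that a linear model could not capture; the data is generated purely for illustration.

```python
# A small artificial neural network modelling a non-linear relationship.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)   # non-linear XOR pattern

net = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
net.fit(X, y)
print("training accuracy:", net.score(X, y))
```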

5. Clustering: Cluster analysis is a technique for assigning a set of observations to subsets (called clusters) so that observations within the same cluster are similar according to one or more predesignated criteria, whereas observations drawn from different clusters are dissimilar. Different clustering techniques make different assumptions about the structure of the data, often defined by some similarity metric and evaluated, for example, by internal compactness and the separation between clusters. Clustering is an unsupervised learning method and one of the common techniques used in statistical data analysis.

6. Bayesian networks: A Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional independencies using a directed acyclic graph. For instance, a Bayesian network can represent the probabilistic relationships between diseases and their symptoms: given the symptoms, the network can compute the probabilities of the presence of various diseases. Efficient algorithms exist to perform inference and learning in Bayesian networks.
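To illustrate item 6 above, the sketch below hand-rolls a two-node Bayesian network (disease -> symptom) with made-up probabilities and uses Bayes’ rule to compute the probability of the disease once the symptom is observed.

```python
# A hand-rolled two-node Bayesian network in plain Python.
# The probabilities are made-up assumptions for illustration.
p_disease = 0.01                    # prior P(disease)
p_symptom_given_disease = 0.9       # P(symptom | disease)
p_symptom_given_healthy = 0.05      # P(symptom | no disease)

# Total probability of observing the symptom.
p_symptom = (p_symptom_given_disease * p_disease
             + p_symptom_given_healthy * (1 - p_disease))

# Posterior P(disease | symptom), by Bayes' rule.
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom
print(f"P(disease | symptom) = {p_disease_given_symptom:.3f}")
```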

7. Reinforcement learning: This technique is concerned with how an agent ought to take actions in an environment so as to maximise some notion of long-term reward. Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states. It differs from the supervised learning problem in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected.
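The sketch below shows tabular Q-learning, one of the simplest reinforcement learning algorithms, on a made-up one-dimensional corridor environment; the agent gradually learns a policy that moves it towards the rewarding state.

```python
# Tabular Q-learning on a tiny corridor: states 0..4, reward only at state 4.
import random

n_states, goal = 5, 4                      # states 0..4; reward only at state 4
actions = [-1, +1]                         # move left or move right
Q = [[0.0, 0.0] for _ in range(n_states)]  # Q[state][action index]
alpha, gamma, epsilon = 0.5, 0.9, 0.1      # learning rate, discount, exploration
random.seed(0)

def choose(state):
    # Epsilon-greedy action selection, breaking ties randomly.
    if random.random() < epsilon or Q[state][0] == Q[state][1]:
        return random.randrange(2)
    return 0 if Q[state][0] > Q[state][1] else 1

for _ in range(500):                       # training episodes
    state = 0
    while state != goal:
        a = choose(state)
        next_state = min(max(state + actions[a], 0), n_states - 1)
        reward = 1.0 if next_state == goal else 0.0
        # Q-learning update towards the reward plus discounted future value.
        Q[state][a] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][a])
        state = next_state

policy = ["left" if q[0] > q[1] else "right" for q in Q[:goal]]
print("learned policy:", policy)           # expected: 'right' in every state
```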

Figure 1: A Venn diagram definition of data science by Drew Conway (Image credit: googleimages.com)

Figure 2: The top algorithms and methods used by data scientists (Image credit: googleimages.com)

Figure 3: Data science workflow (Image credit: googleimages.com)

Figure 4: Different components of data science in a healthcare organisation (Image credit: googleimages.com)

Figure 5: Machine learning and data science (Image credit: googleimages.com)

Figure 6: The machine learning process (Image credit: googleimages.com)
