Analytics Wish List

Companies want more than they’re getting today from Big Data analytics. But small and big vendors are working to solve the key problems.

InformationWeek | By Doug Henschen (@henschen)


Are we rising to the peak of the Big Data hype cycle, or are we headed into the trough of disillusionment? Your position on that Gartner curve depends on your own company’s progress. Has your company identified any use cases for Big Data analytics? Have you kicked the tires on new platforms such as Hadoop? If you’ve gotten this far, it’s a good bet you’ve also developed a wish list of Big Data capabilities or of problems you’ve yet to solve. It’s this wish list that stands between just storing a pile of useless information and unlocking valuable business insights.

The techniques discussed here — distributed computing, stream processing, machine learning, graph analysis — promise to increase analytics performance, affordability and accessibility. With distributed computing and stream processing, companies are taking on analytics work that demands unprecedented scale and speed — like a bank sizing up every bit of data it has on a customer in a split second in order to serve more relevant ads on a website. We’re seeing machine learning take on complex analyses. For example, Memorial Sloan-Kettering Cancer Center is experimenting with machine learning to continually monitor medical literature and offer cancer treatment suggestions to supplement doctors’ assessments.

And we’re wit­ness­ing the emer­gence of open source tech­nolo­gies, in­clud­ing Apache Hadoop and R, that let com­pa­nies use larger and more di­verse data types, and ap­ply them to new busi­ness anal­y­sis prob­lems. Mu­tual fund com­pany Amer­i­can Century, for ex­am­ple, is writ­ing its own R-based mod­els that use graph anal­y­sis tech­niques to map con­nec­tions among com­pa­nies — much like Face­book stud­ies con­nec­tions among people — to im­prove its fore­casts of fi­nan­cial re­sults.

At this point, IT’s wish list for the next-generation analytics market is long. Most companies still want to see proven analytical tools and methods rather than beta-stage projects. They want easy and familiar SQL or SQL-style analysis, not limited query capabilities and batch-oriented, far-from-real-time performance. The piles of data keep growing, and the variety of data sources companies want to make sense of keeps expanding. Meanwhile, analytics startups are trying to address the shortcomings of emerging Big Data platforms such as Hadoop. What follows is an interim report on the latest and most promising efforts to make sense of the data.


Apache Hadoop, the distributed data processing framework now synonymous with Big Data, is widely accepted as a platform for building high-scale, distributed computing applications. Hadoop lets organizations store huge volumes and varieties of data quickly without all the management work demanded by relational databases. Still to be worked out, however, are the best use cases and techniques for running analytics on top of Hadoop.

With current technology, companies can program algorithms in MapReduce, use Hadoop’s HBase NoSQL database to extract data sets and exploit the Hive data warehousing infrastructure for SQL-like querying. But early users have identified shortcomings: MapReduce programming is complex, HBase isn’t yet entirely stable or easy to manage, and Hive is slow, with limited SQL-style analysis capabilities.
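The MapReduce model itself is simple in outline, even if production jobs in Java are not. A minimal sketch of the map and reduce phases in plain Python (this illustrates the shape of the computation, not Hadoop’s actual API):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs -- here, one pair per word
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/reduce: group values by key, then aggregate each group
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

logs = ["smart meter reading", "smart thermostat reading"]
counts = reduce_phase(map_phase(logs))  # e.g. counts["smart"] == 2
```

In Hadoop, the same two functions would run across many machines, with the framework handling the shuffle between them.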

Within the Hadoop community, contributors and a growing ecosystem of startup vendors are working to improve tools such as Hive, the Apache Pig language for MapReduce programming, and the Apache Mahout project for deploying machine-learning algorithms.

These startups are finding Hadoop users eager to pioneer new methods. Opower, for instance, sells systems that electrical utilities use to let their customers track their power use. Opower uses Hadoop to combine smart-meter data from millions of utility customers with thermostat, weather and other data. One report shows customers their power consumption versus the average for similar-size homes in the area. Customers can also access bill forecasts online and get alerts that predict their next utility bill.

Consumers armed with that knowledge can do something about their energy use — turn down their thermostats, install programmable thermostats or shift their power use toward lower-rate, off-peak hours. Opower, founded in 2007, says the 15 million (and growing) utility customers that use its service have cut their electricity usage by more than 2 terawatt-hours, collectively saving more than USD 220 million.

But the data crunching behind the service isn’t easy. Like many Hadoop practitioners, Opower has developed custom MapReduce processes in Java to extract and process data from HBase and then apply analytical models. Seeking to simplify matters, Opower is deploying off-the-shelf software from WibiData for its HBase analytical work. The software will make two steps much easier, says Drew Hylbert, Opower’s director of infrastructure engineering and a former Yahoo employee who ventured into MapReduce data processing even before Hadoop was invented.

“WibiData will allow us to handle data corrections, which is something we punted on with our homegrown [HBase] schema, and it will allow us to more gracefully add data to HBase in the future as needed,” Hylbert says.

WibiData is one of dozens of startups sprouting up to support Hadoop. Launched by Cloudera co-founder Christophe Bisciglia, WibiData provides Kiji libraries for HBase schema development that the company makes available as free, open source software. Those libraries make it easier to store data in and extract data from very large HBase databases. The vendor also provides open source analytic MapReduce models and tools that run on top of HBase. The company makes its money on consulting, enterprise support and training.

“The idea behind WibiData is that you can skip the manual MapReduce development process,” says Hylbert. “Rather than going from research engineer to MapReduce engineer to production output, you can apply [repeatable] abstractions for generating insights across multiple applications.” Customers reuse software instead of having to constantly develop new MapReduce jobs for every new insight required.

Opower is counting on another startup, Platfora, to help it with Big Data visualization — another branch of analytics.

WibiData is geared toward the engineers who look at raw data sets and do their work with statistical models, but other Opower employees need to “see data, plot it out, and slice and dice it in different ways,” Hylbert explains. “Platfora gives us data visualization and data exploration on top of Hadoop and HBase.”

Opower is just starting to deploy Platfora. But if it lives up to its billing, it could replace a SQL-based approach in which Opower extracts aggregated data sets from Hadoop, moves them to an Infobright columnar SQL database and then uses Pentaho data visualization tools for analysis. The combination of Infobright and Pentaho software is “snappy and easy to use,” Hylbert says, but he would rather skip the process of moving data from Hadoop to a SQL database. Platfora would eliminate that step because it works directly on top of Hadoop.


There are lots of reasons to stick with mature SQL technology rather than go with Hadoop and related NoSQL alternatives. Vendors offer a vast array of SQL databases, data integration tools, business intelligence software and analytical tools. There are legions of experienced, well-trained SQL database administrators, data analysts, and BI and analytics experts.

But in our latest Analytics, BI and Information Management Survey, 36 percent of the 517 respondents say their companies’ need to manage massive volumes of data is driving their interest in NoSQL. An equal percentage cite the need to manage unstructured data. The percentage of respondents who don’t see a role for NoSQL fell from 47 percent in October 2011 to 37 percent in October 2012.

Opower’s Hylbert says eliminating the SQL database for analysis purposes and consolidating onto a single Hadoop platform reduces operational complexity while leveraging Hadoop’s scalability. “If you have multiple systems, you end up scaling one before the other and you get into coordination efforts, so yes, I’m all for putting everything on the same data resources,” Hylbert explains.

Looking for the best of both worlds, a slew of vendors are working to bring standard SQL and SQL-like querying to Hadoop. That list includes at least five projects from Hadoop software distributors: Cloudera’s Impala project, MapR’s Apache Drill, IBM’s Big SQL, Hortonworks’ Stinger and EMC’s Pivotal HD with HAWQ SQL query capabilities. If they succeed, they’ll make it easier for companies to do analytics on Hadoop using well-established SQL-based tools and SQL-trained people.
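The payoff these projects chase is the familiarity of declarative SQL. As a rough sketch of the style of analysis involved, here is an ordinary aggregate query, with Python’s built-in sqlite3 standing in for a SQL-on-Hadoop engine (the table and columns are invented for illustration):

```python
import sqlite3

# Stand-in for a Hadoop-resident table; the engines named above aim to
# run queries like this directly over data in HDFS or HBase.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE meter_readings (customer_id INT, kwh REAL)")
con.executemany("INSERT INTO meter_readings VALUES (?, ?)",
                [(1, 12.5), (1, 9.0), (2, 30.0)])

# Aggregate usage per customer: plain SQL, no MapReduce code required
rows = con.execute(
    "SELECT customer_id, SUM(kwh) FROM meter_readings "
    "GROUP BY customer_id ORDER BY customer_id").fetchall()
# rows -> [(1, 21.5), (2, 30.0)]
```

The same query text, written against an SQL-on-Hadoop engine, would be answered from Hadoop storage rather than a separate SQL database.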


Another item on the Big Data analytics wish list is real-time performance. For 4-year-old marketing analytics software vendor Causata, real time means making decisions in less than 50 milliseconds. Customers need that kind of speed to change content, banner ads and marketing offers while their customers are still active on websites and mobile devices.

One Causata customer (the vendor declined to identify it) operates an online banking platform used by midsize banks. The platform provider uses Causata to bring together data from multiple sources: website clickstreams, mobile clickstreams, e-mail interactions with customers, banking transactions and other information about customers and banks’ interactions with them.

Causata doesn’t care what format all that data is in because it uses Hadoop’s HBase NoSQL database for storage. That’s the multistructured-data advantage of Hadoop in general. Marketing-related data might include clickstreams, campaign-response data and CRM records. HBase isn’t good at real-time querying, however, so Causata runs Java-based algorithms on its proprietary query engine to improve performance.

“The data is all stored in one place, so when a banking customer logs in, we pull up the profile, run a predictive model against it, identify the probability of interest in one of 10 products or services, and then deliver the right content through an integration with the content management system,” says Brian Stone, Causata’s VP of marketing.

HStreaming is another startup working on high-speed Big Data analysis. It uses stream-processing technology that’s conceptually similar to the event-processing engines used by financial trading operations, such as those offered by SAP (Sybase Aleri), Tibco (Complex Event Processing) and Progress Software (Apama). HStreaming says its platform can handle even higher volumes and velocities of data than trading platforms, processing and analyzing some 16 million events per second.

HStreaming takes data directly from always-on sources such as video surveillance cameras, cell towers and sensors and spots patterns in that data while it’s still in flight. Insights are derived even before the data is stored on disk. When the data does get stored, it’s in Hadoop, and HStreaming’s technology offers a form of extract, transform and load for storing raw or transformed streaming data in Hadoop. This stored version of the data can be used for historical analysis; HStreaming can also commit its data-analysis results to Hadoop.

The company cites video surveillance, network optimization and mobile advertising as its top applications. In all three cases, real-time insights are the most valuable. HStreaming says national security agencies (it declined to identify them) are working to combine continuous video streams from scores, even hundreds, of cameras with real-time facial-recognition algorithms and police records to spot criminals and alert security personnel.

For network optimization, HStreaming can monitor thousands of remote devices (such as cell towers), spot anomalies and initiate actions such as preventive maintenance. In advertising, HStreaming makes up for the lack of cookies on mobile devices by analyzing behavioral patterns and then serving targeted ads. “We can develop very rich profiles because we know where you are [based on geospatial data], where you will be in half an hour, if you follow a certain pattern every day, what apps you have ... and what you like,” says Jana Uhlig, HStreaming’s CEO. (HStreaming declined to cite customers using these three scenarios.)
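A toy illustration of the kind of in-flight pattern spotting described above (pure Python, not HStreaming’s actual technology): flag a reading as anomalous when it strays far from a sliding window of recent values, before anything is written to disk.

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(stream, window=5, threshold=3.0):
    # Keep a sliding window of recent readings; flag any value that
    # deviates sharply from the window's mean as it arrives.
    recent = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(stream):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append((i, value))
        recent.append(value)
    return anomalies

# Steady cell-tower latency readings with one spike (invented numbers)
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 48.0, 10.1]
alerts = detect_anomalies(readings)  # -> [(5, 48.0)]
```

A production stream processor distributes this kind of windowed logic across machines and feeds alerts to downstream actions such as maintenance tickets.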

Causata and HStreaming are pioneers in putting analytics to work in real time on a Big Data platform, and both are working with the HBase database. MapR, a startup that promises one “Big Data platform” spanning Hadoop, HBase and streaming applications, said that it landed USD 30 million in new venture funding. IBM is going after this same market with InfoSphere Streams, and there’s little doubt Oracle, SAP and Tibco will adapt their event-processing technologies to Big Data. In HStreaming’s case, the customer supplies the analytics, ranging from “simple rules to identifying outliers for diagnostics to advanced analytics that prescribe optimal actions to take based on real-time clustering and segmentation,” Uhlig says.


Developing analytics algorithms and predictive models demands hard-to-find, expensive talent. That scarcity is one reason Big Data, analytics and BI vendors are developing machine-learning approaches.

Today, machine learning shows up in optical character recognition, spam filtering and computer security threat detection. Learning algorithms are “trained” using real-world data to recognize the digital signatures of scanned text characters, unsolicited e-mail messages or virus bots and malware. Armed with trained models, computers can spot similar patterns in new data. Once a spam model knows what a get-rich-quick spam appeal looks like, the model can keep spotting similar appeals without human assistance.
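A minimal sketch of that training idea, assuming nothing about any vendor’s implementation: a tiny naive Bayes classifier learns word patterns from labeled examples and then labels new messages on its own (the example messages are invented).

```python
from collections import Counter
from math import log

def train(examples):
    # examples: list of (text, label); learn per-class word counts
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in examples:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    # Naive Bayes with add-one smoothing over the learned vocabulary
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in counts:
        score = log(totals[label] / sum(totals.values()))
        n = sum(counts[label].values())
        for word in text.lower().split():
            score += log((counts[label][word] + 1) / (n + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

examples = [("get rich quick", "spam"), ("win cash now", "spam"),
            ("meeting agenda attached", "ham"),
            ("quarterly report draft", "ham")]
model = train(examples)
label = classify("get cash quick", *model)  # -> "spam"
```

Real spam filters use far larger feature sets and training corpora, but the mechanism is the same: patterns are learned from data rather than hand-coded.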

Algorithms can also continue to learn from the data streaming in from operational systems. Amazon and Netflix, for example, use algorithms to spot patterns in customer transactions so they can recommend other books or movies. When a new book or movie starts racking up sales and rentals, the site can start recommending it as soon as the system discerns a preference pattern in the data.
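The transaction-pattern idea can be sketched as simple item co-occurrence counting (a toy stand-in, not Amazon’s or Netflix’s actual method): recommend the item most often bought alongside what the customer just bought.

```python
from collections import defaultdict

def build_cooccurrence(orders):
    # Count how often each pair of items appears in the same order
    co = defaultdict(lambda: defaultdict(int))
    for order in orders:
        for a in order:
            for b in order:
                if a != b:
                    co[a][b] += 1
    return co

def recommend(item, co):
    # Suggest the item most frequently co-purchased with `item`
    if item not in co:
        return None
    return max(co[item], key=co[item].get)

# Invented purchase histories
orders = [["book_a", "book_b"],
          ["book_a", "book_b", "book_c"],
          ["book_b", "book_c"]]
co = build_cooccurrence(orders)
suggestion = recommend("book_a", co)  # -> "book_b"
```

Because the counts update as new transactions stream in, a newly popular title starts surfacing in recommendations as soon as a pattern forms.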

The traditional, human-powered way to build such models is to have Ph.D.-level or highly trained data experts create them using R, SAS or SPSS software. Machine learning promises to take the modeler at least partly out of the process. Machine learning dates back to the late 1950s, when it was defined by computing pioneer Arthur Samuel, an IBM employee later turned Stanford professor, as a field in which computers learn without being explicitly programmed. Machine learning techniques are a big part of cognitive computing, a movement IBM CEO Ginni Rometty predicts will define the next wave of computing. In the first wave, computers were used to tabulate data. The second wave saw the development of programmable computers that could execute instructions.

“The third wave will be about computers that learn,” Rometty told business and government leaders in a March speech in New York. Computers have to learn by themselves, she said, “because information is too big and growing too fast, so you can’t program for it.”

The connection at IBM is Watson, the Jeopardy-playing, cognitive-computing machine now being trained to serve as a medical adviser for oncologists, among other applications. Over the last year, Watson has been trained on more than 600,000 pieces of medical evidence and 2 million pages of text from 42 medical journals and clinical trials in the field of oncology. IBM partner Memorial Sloan-Kettering added details on 1,500 lung cancer cases, including physicians’ notes, lab results and clinical research on specialized treatments based on the genetics of tumors.

Combining general knowledge of cancers and accepted treatment regimens with the 1,500-plus specific case examples, Watson can make predictions about new lung cancer cases and suggest treatments. Doctors interact with Watson through a tablet app that lets them review each patient’s case. The app serves up a prioritized list of recommended tests and treatments, along with confidence scores. For example, given a case of N-stage lung cancer in a patient of X age, Y genomic makeup and Z symptoms, treatment A is recommended with 95 percent confidence, treatment B with 75 percent confidence and treatment C with 65 percent confidence. Sloan-Kettering is testing this technology with lung cancer patients, but it has yet to enter full production deployment.

It’s a far more profound and sophisticated use of cognitive computing than targeting ads or optimizing cell phone networks, but IBM is also training Watson for more prosaic roles in financial services and call center operations.


Social networks are contributing to the scale and variability of data companies now collect and encounter. Facebook is among the pioneers using graph analysis to uncover the web of user relationships, studying nodes (representing people, companies, locations and so on) and edges (the often complex relationships among those nodes).

Graph analysis, like many of the techniques discussed here, has been kicking around for decades, but Facebook has elevated it to new heights of scale and sophistication. It uses graph analysis to uncover the relationships within its 1 billion-person social network, whether among friends, classmates, colleagues or people who share a liking for Rihanna or Red Bull.

Graph analysis isn’t a well-developed domain like the SQL relational domain because it’s not suitable for a broad range of uses, says Jay Parikh, VP of Infrastructure Engineering at Facebook. But for its sweet spot of understanding network relationships, graph analysis is compelling.

“For Facebook, it’s all about how to manage more data and keep it up to date because friendships, relationships, check-ins, photos and all of those edges [among them] are constantly changing and being created all the time,” Parikh says. “We need to derive insights and wrap a rich user experience around that.” But if you think this technique is only relevant to vast public social networks, consider the use case of American Century Investments, the mutual fund company that includes the Livestrong family of funds. American Century uses graph analysis to predict the performance of the companies its fund managers invest in. It started experimenting with the technique about 18 months ago as part of a revamp of its analytics infrastructure.

American Century had used a variety of proprietary analytics tools and frameworks from financial services industry IT and information suppliers such as Thomson Reuters. But the company wanted the flexibility to work with more data and to develop a wider range of specialized analytics to set its research, and therefore its investments, apart from the competition’s, says Tal Sansani, a portfolio manager and quantitative analyst at the firm. To optimize its investment portfolios, for example, the company performs simulations, scenario analyses and financial stress tests that weren’t supported by any single third-party tool or framework.

“We need to calibrate our models in specific ways and not be held back by a limited list of capabilities,” Sansani says. “We wanted to build the models ourselves, so we did that in R rather than rely too much on third-party frameworks.”

American Century still buys plenty of proprietary data from Thomson Reuters and other sources. Using the open source R statistical programming language gives the company more freedom to develop broad analytics capabilities. It has essentially built its own customized framework, and Sansani says other financial services firms are taking the same approach. In American Century’s case, the software for running R-based models comes from commercial software and support provider Revolution Analytics.

American Century started rolling out its R-based production deployments within the last three months, and one of the first is a graph analysis application based on R’s igraph package. The application tracks revenue flows among manufacturers and their suppliers. Apple, for example, has suppliers of chips and screens, just as car manufacturers have suppliers of components and parts. American Century combines public and proprietary data on those buying relationships and applies graph analyses to get a clearer understanding of the likely performance of suppliers. These forecasts are more accurate than forecasts based on quarters-old public financial reports, Sansani says.
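American Century’s models are written in R. As a language-neutral sketch of the underlying idea (the company names and revenue shares here are invented), a directed graph of supplier-to-customer revenue dependence lets a first-order impact estimate propagate from a manufacturer’s sales forecast to its suppliers:

```python
# Directed graph: edge supplier -> customer, weighted by the share of
# the supplier's revenue that comes from that customer (invented data)
revenue_share = {
    "ChipCo":   {"PhoneMaker": 0.6, "CarMaker": 0.2},
    "ScreenCo": {"PhoneMaker": 0.8},
}

def exposure(supplier, customer, graph):
    # How exposed is `supplier` to demand from `customer`?
    return graph.get(supplier, {}).get(customer, 0.0)

# If PhoneMaker's sales forecast shifts by -10 percent, estimate the
# first-order revenue impact on each supplier along the graph's edges
impact = {s: -0.10 * exposure(s, "PhoneMaker", revenue_share)
          for s in revenue_share}
# impact -> roughly {"ChipCo": -0.06, "ScreenCo": -0.08}
```

A real model would walk the graph several tiers deep (component makers behind ChipCo, and so on), which is exactly the kind of traversal graph packages such as igraph are built for.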


All these analytical techniques around Hadoop and R might leave data professionals feeling betwixt and between — not sure what to do with the tried-and-true versus the promising new approaches. The way forward will require a mix of techniques, data, applications and tools.

Facebook uses myriad systems and techniques, including its own derivations of Hadoop, Cassandra and emerging real-time analytical technologies. Parikh describes graph analysis as “yet another piece of cool technology that allows enterprises to carve off a couple of applications and optimize them,” but he warns that the use cases are limited. The tough part is finding the right mix of technologies and techniques, since building Big Data systems raises the risk that you “either waste a lot of money or miss huge opportunities in your business,” Parikh says.

“Threading that needle is what every tech-driven company in the world will have to do, and most companies won’t be able to do it well.”

On the wasting-money extreme, companies might store too much information with little sense of what they’re trying to analyze. Or they might build blindingly fast analysis engines to chase insights that don’t translate into higher sales or profits. On the missing-opportunities extreme, companies might fail to capture information. Or that information may be so partitioned among business units that companies won’t be able to pull together the key insights. Threading that needle will require a blend of approaches to get at one practical analytical success at a time.
