Analytics Wish List
Companies want more than they’re getting today from Big Data analytics. But small and big vendors are working to solve the key problems.
Are we rising to the peak of the Big Data hype cycle, or are we headed into the trough of disillusionment? Your position on that Gartner curve depends on your own company’s progress. Has your company identified any use cases for Big Data analytics? Have you kicked the tires on new platforms such as Hadoop? If you’ve gotten this far, it’s a good bet you’ve also developed a wish list of Big Data capabilities or of problems you’ve yet to solve. It’s this wish list that stands between just storing a pile of useless information and unlocking valuable business insights.
The techniques discussed here — distributed computing, stream processing, machine learning, graph analysis — promise to increase analytics performance, affordability and accessibility. With distributed computing and stream processing, companies are taking on analytics work that demands unprecedented scale and speed — like a bank sizing up every bit of data it has on a customer in a split second in order to serve more relevant ads on a website. We’re seeing machine learning taking on complex analyses. For example, Memorial Sloan-Kettering Cancer Center is experimenting with machine learning to continually monitor medical literature and offer cancer treatment suggestions to supplement doctors’ assessments.
And we’re witnessing the emergence of open source technologies, including Apache Hadoop and R, that let companies use larger and more diverse data types, and apply them to new business analysis problems. Mutual fund company American Century, for example, is writing its own R-based models that use graph analysis techniques to map connections among companies — much like Facebook studies connections among people — to improve its forecasts of financial results.
At this point, IT’s wish list for the next-generation analytics market is long. Most companies still want to see proven analytical tools and methods rather than beta-stage projects. They want easy and familiar SQL or SQL-style analysis, not limited query capabilities and batch-oriented, far-from-real-time performance. The piles of data keep growing, and the variety of data sources companies want to make sense of keeps expanding. Meanwhile, analytics startups are trying to address the shortcomings of emerging Big Data platforms such as Hadoop. So what follows is an interim report on the latest and most promising efforts to make sense of the data.
OPEN SOURCE FILLING THE GAPS
Apache Hadoop, the distributed data processing framework now synonymous with Big Data, is widely accepted as a platform for building high-scale, distributed computing applications. Hadoop lets organizations store huge volumes and varieties of data quickly without all the management work demanded by relational databases. Still to be worked out, however, are the best use cases and techniques for running analytics on top of Hadoop.
With current technology, companies can program algorithms in MapReduce, use Hadoop’s HBase NoSQL database to extract data sets and use the Hive data warehousing infrastructure for SQL-like querying. But early users have identified shortcomings: MapReduce programming is complex, HBase isn’t yet stable or easy to manage, and Hive is slow, with only limited SQL-style analysis capabilities.
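To see why early users call MapReduce programming complex, consider what a single aggregation takes. The hypothetical Java sketch below (the class names and comma-separated input format are our assumptions, not any vendor’s code) computes a per-key sum of the kind Hive expresses in one line:

```java
// A minimal MapReduce sketch (hypothetical, not any vendor's code): summing a
// numeric value per key, which Hive expresses as a one-line GROUP BY query.
// Input lines are assumed to look like "meter_id,kwh".
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class UsagePerMeter {

    public static class UsageMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Parse "meter_id,kwh" and emit (meter_id, kwh); malformed lines
            // would need real error handling here.
            String[] fields = line.toString().split(",");
            context.write(new Text(fields[0]),
                          new DoubleWritable(Double.parseDouble(fields[1])));
        }
    }

    public static class UsageReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text meterId, Iterable<DoubleWritable> readings,
                              Context context)
                throws IOException, InterruptedException {
            double total = 0.0;
            for (DoubleWritable kwh : readings) {
                total += kwh.get();  // accumulate all readings for this meter
            }
            context.write(meterId, new DoubleWritable(total));
        }
    }
    // A driver class to configure and submit the job is still needed on top.
}
```

The Hive equivalent is roughly SELECT meter_id, SUM(kwh) FROM readings GROUP BY meter_id, and closing that gap between SQL brevity and MapReduce boilerplate is exactly what the tooling efforts below are chasing.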
Within the Hadoop community, contributors and a growing ecosystem of startup vendors are working to improve tools such as Hive, the Apache Pig language for MapReduce programming, and the Apache Mahout project for deploying machine-learning algorithms.
These startups are finding Hadoop users eager to pioneer new methods. Opower, for instance, sells systems that electrical utilities use to let their customers track their power use. Opower uses Hadoop to combine smart-meter data from millions of utility customers with thermostat, weather and other data. One report shows customers their power consumption versus the average for similar-size homes in the area. Customers can also access bill forecasts online and get alerts that predict their next utility bill.
Consumers armed with that knowledge can do something about their energy use: turn down their thermostats, install programmable thermostats or skew their power use toward lower-rate off-peak hours. Opower, founded in 2007, says the 15 million (and growing) utility customers that use its service have cut their electricity usage by more than 2 terawatt-hours, collectively saving more than USD 220 million.
But the data crunching behind the service isn’t easy. Like many Hadoop practitioners, Opower has developed custom MapReduce processes in Java to extract and process data from HBase and then apply analytical models. Seeking to simplify matters, Opower is deploying off-the-shelf software from WibiData for its HBase analytical work. The software will make two steps much easier, says Drew Hylbert, Opower’s director of infrastructure engineering and a former Yahoo employee who ventured into MapReduce data processing even before Hadoop was invented.
“WibiData will allow us to handle data corrections, which is something we punted on with our homegrown [HBase] schema, and it will allow us to more gracefully add data to HBase in the future as needed,” Hylbert says.
WibiData is one of dozens of startups sprouting to support Hadoop. Launched by Cloudera founder Christophe Bisciglia, WibiData provides Kiji libraries for HBase schema development that the company makes available as free, open source software. Those libraries make it easier to store and extract data from very large HBase databases. The vendor also provides open source analytic MapReduce models and tools that run on top of HBase. The company makes its money on consulting, enterprise support and training.
“The idea behind WibiData is that you can skip the manual MapReduce development process,” says Hylbert. “Rather than going from research engineer to MapReduce engineer to production output, you can apply [repeatable] abstractions for generating insights across multiple applications.” Customers reuse software instead of having to constantly develop new MapReduce jobs for every new insight required.
Opower is counting on another startup, Platfora, to help it with Big Data visualization — another branch of analytics.
WibiData is geared toward the engineers who look at raw data sets and do their work with statistical models, but other Opower employees need to “see data, plot it out, and slice and dice it in different ways,” Hylbert explains.
“Platfora gives us data visualization and data exploration on top of Hadoop and HBase.”
Opower is just starting to deploy Platfora. But if it lives up to its billing, it could replace a SQL-based approach in which Opower extracts aggregated data sets from Hadoop, moves them to an Infobright columnar SQL database and then uses Pentaho data visualization tools for analysis. The combination of Infobright and Pentaho software is “snappy and easy to use,” Hylbert says, but he would rather skip the process of moving data from Hadoop to a SQL database. Platfora would eliminate that step because it works directly on top of Hadoop.
There are lots of reasons to stick with the mature SQL technology rather than go with Hadoop and related NoSQL alternatives. Vendors offer a vast array of SQL databases, data integration tools, business intelligence software and analytical tools. There are legions of experienced, well-trained SQL database administrators, data analysts, and BI and analytics experts.
But in our latest Analytics, BI and Information Management Survey, 36 percent of the 517 respondents say their companies’ need to manage massive volumes of data is driving their interest in NoSQL. An equal percentage cite the need to manage unstructured data. The percentage of respondents who don’t see a role for NoSQL fell from 47 percent in October 2011 to 37 percent in October 2012.
Opower’s Hylbert says eliminating the SQL database for analysis purposes and consolidating onto a single Hadoop platform reduces operational complexity while leveraging Hadoop’s scalability. “If you have multiple systems, you end up scaling one before the other and you get into coordination efforts, so yes, I’m all for putting everything on the same data resources,” Hylbert explains.
Looking for the best of both worlds, a slew of vendors is working to bring standard SQL and SQL-like querying to Hadoop. That list includes at least five projects from Hadoop software distributors: Cloudera’s Impala project, MapR’s Apache Drill, IBM’s Big SQL, Hortonworks’ Stinger and EMC’s Pivotal HD with HAWQ SQL query capabilities. If they succeed, they’ll make it easier for companies to do analytics on Hadoop using well-established SQL-based tools and SQL-trained people.
THE MOVE TO REAL TIME
Another item on the Big Data analytics wish list is real-time performance. For 4-year-old marketing analytics software vendor Causata, real time means making decisions in less than 50 milliseconds. Customers need that kind of speed to change content, banner ads and marketing offers while their customers are still active on websites and mobile devices.
One Causata customer (the vendor declined to identify it) operates an online banking platform used by midsize banks. The platform provider uses Causata to bring together data from multiple sources: website clickstreams, mobile clickstreams, e-mail interactions with customers, banking transactions and other information about customers and banks’ interactions with them.
Causata doesn’t care what format all that data is in because it uses Hadoop’s HBase NoSQL database for storage; handling multistructured data is one of Hadoop’s general advantages. Marketing-related data might include clickstreams, campaign-response data and CRM records. HBase isn’t good at real-time querying, however, so Causata runs Java-based algorithms on its proprietary query engine to improve performance.
“The data is all stored in one place, so when a banking customer logs in, we pull up the profile, run a predictive model against it, identify the probability of interest in one of 10 products or services, and then deliver the right content through an integration with the content management system,” says Brian Stone, Causata’s VP of marketing.
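Causata’s sub-50-millisecond engine is proprietary, but the general pattern Stone describes, a keyed profile lookup followed by model scoring, can be sketched with the standard HBase Java client. The table name, column names and model coefficients below are all hypothetical:

```java
// A rough sketch of the pattern described above, using the standard HBase
// Java client (Causata's own query engine is proprietary). Table name,
// column names and model coefficients are all hypothetical; real code would
// also reuse one long-lived Connection rather than open one per lookup.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class OfferSelector {
    private static final byte[] PROFILE = Bytes.toBytes("profile");

    /** Probability (0..1) that this customer is interested in a mortgage offer. */
    public static double mortgageInterest(String customerId) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customer_profiles"))) {
            // Fetch the whole profile row in a single keyed read.
            Result row = table.get(new Get(Bytes.toBytes(customerId)));
            double clicks = readDouble(row, "mortgage_page_clicks");
            double balance = readDouble(row, "avg_balance");
            // Toy logistic model; real coefficients come from offline training.
            double z = -4.0 + 0.8 * clicks + 0.00001 * balance;
            return 1.0 / (1.0 + Math.exp(-z));
        }
    }

    private static double readDouble(Result row, String qualifier) {
        byte[] v = row.getValue(PROFILE, Bytes.toBytes(qualifier));
        return v == null ? 0.0 : Double.parseDouble(Bytes.toString(v));
    }
}
```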
HStreaming is another startup working on high-speed Big Data analysis. It uses stream-processing technology that’s conceptually similar to the event processing engines used by financial trading operations, such as those offered by SAP (Sybase Aleri), Tibco (Complex Event Processing) and Progress Software (Apama). HStreaming says its platform can handle even higher volumes and velocities of data than trading platforms, processing and analyzing some 16 million events per second.
HStreaming takes data directly from always-on sources such as video surveillance cameras, cell towers and sensors and spots patterns in that data while it’s still in flight. Insights are derived even before the data is stored on disk. When the data does get stored, it’s in Hadoop, and HStreaming’s technology offers a form of extract, transform and load for storing raw or transformed streaming data on Hadoop. This is the stored-state version of the data that can be used for historical analysis; HStreaming can also commit its data-analysis results to Hadoop.
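HStreaming’s own APIs aren’t shown here, so the sketch below uses plain Java to illustrate the core idea of in-flight analysis: a sliding window over incoming readings flags outliers before anything touches disk, and both raw and flagged events could later be bulk-loaded into Hadoop for historical analysis.

```java
// Plain-Java illustration of in-flight analysis (HStreaming's APIs aren't
// public here): keep a sliding window of recent readings and flag any value
// more than three standard deviations from the window mean, before anything
// is written to disk.
import java.util.ArrayDeque;
import java.util.Deque;

public class StreamAnomalyDetector {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;

    public StreamAnomalyDetector(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Returns true if this reading looks anomalous given recent history. */
    public boolean onReading(double value) {
        boolean anomalous = false;
        if (window.size() == windowSize) {
            double mean = window.stream()
                    .mapToDouble(Double::doubleValue).average().orElse(0.0);
            double variance = window.stream()
                    .mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0.0);
            anomalous = Math.abs(value - mean) > 3 * Math.sqrt(variance);
            window.removeFirst();  // evict the oldest reading
        }
        window.addLast(value);
        // Raw and flagged readings could later be bulk-loaded into Hadoop
        // for the historical analysis described above.
        return anomalous;
    }
}
```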
The company cites video surveillance, network optimization and mobile advertising as its top applications. In all three cases, real-time insights are the most valuable. HStreaming says national security agencies (it declined to identify them) are working to combine continuous video streams from scores, even hundreds, of cameras with real-time facial-recognition algorithms and police records to spot criminals and alert security personnel.
For network optimization, HStreaming can monitor thousands of remote devices (such as cell towers), spot anomalies and initiate actions such as preventive maintenance. In advertising, HStreaming makes up for the lack of cookies on mobile devices by analyzing behavioral patterns and then serving targeted ads. “We can develop very rich profiles because we know where you are [based on geospatial data], where you will be in half an hour, if you follow a certain pattern every day, what apps you have ... and what you like,” says Jana Uhlig, HStreaming’s CEO. (HStreaming declined to cite customers using these three scenarios.)
Causata and HStreaming are pioneers in putting analytics to work in real time on a Big Data platform, and both are working with the HBase database. In HStreaming’s case, the customer supplies the analytics, ranging from “simple rules to identifying outliers for diagnostics to advanced analytics that prescribe optimal actions to take based on real-time clustering and segmentation,” Uhlig says. MapR, a startup that promises one “Big Data platform” spanning Hadoop, HBase and streaming applications, says it has landed USD 30 million in new venture funding. IBM is going after the same market with InfoSphere Streams, and there’s little doubt Oracle, SAP and Tibco will adapt their event-processing technologies to Big Data.
MACHINE LEARNING FROM DATA
Developing analytics algorithms and predictive models demands hard-to-find, expensive talent. That scarcity is one reason Big Data, analytics and BI vendors are developing machine-learning approaches.
Today, machine learning shows up in optical character recognition, spam filtering and computer security threat detection. Learning algorithms are “trained” using real-world data to recognize the digital signatures of scanned text characters, unsolicited e-mail messages or virus bots and malware. Armed with trained models, computers can spot similar patterns in new data. Once a spam model knows what a get-rich-quick spam appeal looks like, the model can keep spotting similar appeals without human assistance.
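A toy naive Bayes filter, the textbook approach to spam classification (a sketch under our own assumptions, not any vendor’s code), shows that train-then-score loop in miniature:

```java
// A toy naive Bayes spam filter, the textbook version of the train-then-score
// loop described above (not any vendor's code). Tokenization and smoothing
// are deliberately minimal.
import java.util.HashMap;
import java.util.Map;

public class SpamFilter {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private int spamWords = 0;
    private int hamWords = 0;

    /** Training: count word frequencies in messages labeled by humans. */
    public void train(String message, boolean isSpam) {
        for (String word : message.toLowerCase().split("\\W+")) {
            if (isSpam) { spamCounts.merge(word, 1, Integer::sum); spamWords++; }
            else        { hamCounts.merge(word, 1, Integer::sum);  hamWords++;  }
        }
    }

    /** Scoring: log-odds that a new message is spam; positive means "spammy". */
    public double spamScore(String message) {
        double score = 0.0;
        for (String word : message.toLowerCase().split("\\W+")) {
            // Add-one smoothing keeps unseen words from zeroing the score.
            double pSpam = (spamCounts.getOrDefault(word, 0) + 1.0) / (spamWords + 2.0);
            double pHam  = (hamCounts.getOrDefault(word, 0) + 1.0) / (hamWords + 2.0);
            score += Math.log(pSpam / pHam);
        }
        return score;
    }
}
```

Once trained on a labeled corpus, spamScore keeps flagging new get-rich-quick variants without further human labeling; that unattended generalization is the “learning” in machine learning.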
Algorithms can also continue to learn from the data streaming in from operational systems. Amazon.com and Netflix, for example, use algorithms to spot patterns in customer transactions so they can recommend other books or movies. When a new book or movie starts racking up sales and rentals, the site can start recommending it as soon as the system discerns a preference pattern in the data.
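The simplest version of that recommendation pattern is item co-occurrence counting, sketched below in plain Java. Amazon’s and Netflix’s production systems are far more elaborate; a real recommender would also normalize for item popularity and run the counting as distributed jobs:

```java
// A bare-bones "customers who bought X also bought Y" sketch: count how often
// item pairs appear in the same transaction, then recommend the items most
// often seen alongside a given title.
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CooccurrenceRecommender {
    // item -> (other item -> number of shared transactions)
    private final Map<String, Map<String, Integer>> pairCounts = new HashMap<>();

    public void addTransaction(List<String> items) {
        for (String a : items) {
            for (String b : items) {
                if (!a.equals(b)) {
                    pairCounts.computeIfAbsent(a, k -> new HashMap<>())
                              .merge(b, 1, Integer::sum);
                }
            }
        }
    }

    /** Top-N items most frequently bought together with the given item. */
    public List<String> recommend(String item, int topN) {
        return pairCounts.getOrDefault(item, Collections.emptyMap())
                .entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(topN)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```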
The traditional, human-powered way to build such models is to have Ph.D.s or other highly trained data experts create them using R, SAS or SPSS software. Machine learning promises to take the modeler at least partly out of the process. The field dates back to the late 1950s, when computing pioneer Arthur Samuel, an IBM employee later turned Stanford professor, defined it as one in which computers learn without being explicitly programmed. Machine learning techniques are a big part of cognitive computing, a movement IBM CEO Ginni Rometty predicts will define the next wave of computing. In the first wave, computers were used to tabulate data. The second wave saw the development of programmable computers that could execute instructions.
“The third wave will be about computers that learn,” Rometty told business and government leaders in a March speech in New York. Computers have to learn by themselves, she said, “because information is too big and growing too fast, so you can’t program for it.”
The connection at IBM is Watson, the Jeopardy-playing, cognitive-computing machine now being trained to serve as a medical adviser for oncologists, among other applications. Over the last year, Watson has been trained on more than 600,000 pieces of medical evidence and 2 million pages of text from 42 medical journals and clinical trials in the field of oncology. IBM partner Memorial Sloan-Kettering added details on 1,500 lung cancer cases, including physicians’ notes, lab results and clinical research on specialized treatments based on the genetics of tumors.
Combining general knowledge of cancers and accepted treatment regimens with the 1,500-plus specific case examples, Watson can make predictions about new lung cancer cases and suggest treatments. Doctors interact with Watson through a tablet app that lets them review each patient’s case. The app serves up a prioritized list of recommended tests and treatments, along with confidence scores. For example, given a case of N-stage lung cancer in a patient of X age, Y genomic makeup and Z symptoms, treatment A is recommended with 95 percent confidence, treatment B with 75 percent confidence and treatment C with 65 percent confidence. Sloan-Kettering is testing this technology with lung-cancer patients, but it has yet to enter full production deployment.
It’s a far more profound and sophisticated use of cognitive computing than targeting ads or optimizing cell phone networks, but IBM is also training Watson for more prosaic roles in financial services and call center operations.
FROM FACEBOOK TO WALL STREET
Social networks are contributing to the scale and variability of data companies now collect and encounter. Facebook is among the pioneers using graph analysis to uncover the web of user relationships by studying nodes (representing people, companies, locations and so on) and edges (the often complex relationships among those nodes).
Graph analysis, like many of the techniques discussed here, has been kicking around for decades, but Facebook has elevated it to new heights of scale and sophistication. It uses graph analysis to uncover the relationships within its 1 billion-person social network, whether among friends, classmates, colleagues or people who share an affinity for Rihanna or Red Bull.
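Stripped to its essentials, the node-and-edge model looks like the minimal sketch below; whether the edges are friendships or shared likes, a question such as “who do these two people both know?” reduces to a set intersection:

```java
// The node-and-edge model stripped to its essentials: people (or brands) as
// nodes, relationships as undirected edges. Production graphs attach types
// and timestamps to every edge; this sketch only shows the core structure.
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SocialGraph {
    private final Map<String, Set<String>> neighbors = new HashMap<>();

    /** Add an undirected edge, e.g. a friendship or a shared "like". */
    public void connect(String a, String b) {
        neighbors.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        neighbors.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    /** Nodes connected to both a and b: mutual friends, shared interests. */
    public Set<String> mutualConnections(String a, String b) {
        Set<String> mutual = new HashSet<>(
                neighbors.getOrDefault(a, Collections.emptySet()));
        mutual.retainAll(neighbors.getOrDefault(b, Collections.emptySet()));
        return mutual;
    }
}
```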
Graph analysis isn’t a well-developed domain like the SQL relational domain because it’s not suitable for a broad range of uses, says Jay Parikh, VP of Infrastructure Engineering at Facebook. But for its sweet spot of understanding network relationships, graph analysis is compelling.
“For Facebook, it’s all about how to manage more data and keep it up to date because friendships, relationships, check-ins, photos and all of those edges [among them] are constantly changing and being created all the time,” Parikh says. “We need to derive insights and wrap a rich user experience around that.”

But if you think this technique is only relevant to vast public social networks, consider the use case of American Century Investments, the mutual fund company that includes the Livestrong family of funds. American Century uses graph analysis to predict the performance of the companies its fund managers invest in. It started experimenting with the technique about 18 months ago as part of a revamp of its analytics infrastructure.
American Century had used a variety of proprietary analytics tools and frameworks from financial services industry IT and information suppliers such as Thomson Reuters. But the company wanted the flexibility to work with more data and to develop a wider range of specialized analytics to set its research, and therefore its investments, apart from the competition’s, says Tal Sansani, a portfolio manager and quantitative analyst at the firm. To optimize its investment portfolios, for example, the company performs simulations, scenario analyses and financial stress tests that weren’t supported by any single third-party tool or framework.
“We need to calibrate our models in specific ways and not be held back by a limited list of capabilities,” Sansani says. “We wanted to build the models ourselves, so we did that in R rather than rely too much on third-party frameworks.”
American Century still buys plenty of proprietary data from Thomson Reuters and other sources. Using the open source R statistical programming language gives the company more freedom to develop broad analytics capabilities. It has essentially built its own customized framework, and Sansani says other financial services firms are taking the same approach. In American Century’s case, the software for running R-based models is from commercial software and support provider Revolution Analytics.
American Century started rolling out its R-based production deployments within the last three months, and one of the first is a graph analysis application based on the R igraph package. The application tracks revenue flows among manufacturers and their suppliers. Apple, for example, has suppliers of chips and screens just as car manufacturers have suppliers of components and parts. American Century combines public and proprietary data on those buying relationships and applies graph analyses to get a clearer understanding of the likely performance of suppliers. The resulting forecasts are more accurate than ones based on quarters-old public financial reports, Sansani says.
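American Century’s actual models are R code built on igraph; the hypothetical Java sketch below (all company names and figures invented) only shows the shape of the idea, propagating a customer’s demand shock to its suppliers through revenue-share edges:

```java
// American Century's models are R code built on the igraph package; this
// hypothetical Java sketch only shows the shape of the idea. Directed edges
// carry the share of a supplier's revenue that comes from each customer, so
// a demand shock at one company can be propagated to its suppliers. All
// company names and figures below are invented.
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class RevenueGraph {
    // supplier -> (customer -> fraction of supplier revenue from that customer)
    private final Map<String, Map<String, Double>> revenueShare = new HashMap<>();

    public void addEdge(String supplier, String customer, double share) {
        revenueShare.computeIfAbsent(supplier, k -> new HashMap<>())
                    .put(customer, share);
    }

    /** First-order revenue hit to a supplier if a customer's demand drops. */
    public double exposure(String supplier, String customer, double demandDrop) {
        return revenueShare.getOrDefault(supplier, Collections.emptyMap())
                           .getOrDefault(customer, 0.0) * demandDrop;
    }

    public static void main(String[] args) {
        RevenueGraph g = new RevenueGraph();
        g.addEdge("ScreenMaker", "PhoneCo", 0.40);  // invented relationship
        // A 10 percent drop in PhoneCo demand implies roughly a 4 percent
        // revenue hit for ScreenMaker.
        System.out.println(g.exposure("ScreenMaker", "PhoneCo", 0.10));
    }
}
```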
PUT IT ALL TOGETHER
All these analytical techniques around Hadoop and R might leave data professionals feeling betwixt and between, not sure what to do with the tried-and-true versus the promising new approaches. The way forward will require a mix of techniques, data, applications and tools.
Facebook uses myriad systems and techniques, including its own derivations of Hadoop, Cassandra and emerging real-time analytical technologies. Parikh describes graph analysis as “yet another piece of cool technology that allows enterprises to carve off a couple of applications and optimize them,” but he warns that the use cases are limited. The tough part is finding the right mix of technologies and techniques, since building Big Data systems raises the risk that you “either waste a lot of money or miss huge opportunities in your business,” Parikh says.
“Threading that needle is what every tech-driven company in the world will have to do, and most companies won’t be able to do it well.”
On the wasting-money extreme, companies might store too much information with little sense of what they’re trying to analyze. Or they might build blindingly fast analysis engines to chase insights that don’t translate into higher sales or profits. On the missing-opportunities extreme, companies might fail to capture information. Or that information may be so partitioned among business units that companies won’t be able to pull together the key insights. Threading that needle will require a blend of approaches to get at one practical analytical success at a time.