Albuquerque Journal

WRANGLING BIG DATA

Improving the speed of mining huge amounts of information


Social media, cameras, sensors and more generate huge amounts of data that can overwhelm analysts sifting through it all for meaningful, actionable information to give decision-makers such as political leaders and field commanders responding to security threats.

Sandia National Laboratories researchers are working to lessen that burden by developing the science to gather insights from data in nearly real time.

“The amount of data produced by sensors and social media is booming — every day there’s about 2.5 quintillion (or 2.5 billion billion) bytes of data generated,” said Tian Ma, a Sandia computer scientist and project co-lead. “About 90% of all data has been generated in the last two years — there’s more data than we have people to analyze. Intelligence communities are basically overwhelmed, and the problem is that you end up with a lot of data sitting on disks that could get overlooked.”

Sandia researchers worked with students at the University of Illinois Urbana-Champaign, an Academic Alliance partner, to develop analytical and decision-making algorithms for streaming data sources and integrated them into a nearly real-time distributed data processing framework using big data tools and computing resources at Sandia. The framework takes disparate data from multiple sources and generates usable information that can be acted on in nearly real time.

To test the framework, the researchers and the students used Chicago traffic data such as images, integrated sensors, tweets and streaming text to successfully measure traffic congestion and suggest faster driving routes around it for a Chicago commuter. The research team selected the Chicago traffic example because the input data has similar characteristics to data typically observed for national security purposes, said Rudy Garcia, a Sandia computer scientist and project co-lead.

Drowning in data

“We create data without even thinking about it,” said Laura Patrizi, a Sandia computer scientist and research team member, during a talk at the 2019 United States Geospatial Intelligence Foundation’s GEOINT Symposium. “When we walk around with our phone in our pocket or tweet about horrible traffic, our phone is tracking our location and can attach a geolocation to our tweet.”

To harness this data avalanche, analysts typically use big data tools and machine learning algorithms to find and highlight significant information, but the process runs on recorded data, Ma said.

“We wanted to see what can be analyzed with real-time data from multiple data sources, not what can be learned from mining historical data,” Ma said. “Actionable intelligence is the next level of data analysis, where analysis is put into use for near-real-time decision-making. Success in this research will have a strong impact on many time-critical national security applications.”

Building a data processing framework

The team stacked distributed technologies into a series of data processing pipelines that ingest, curate and index the data. The scientists wrangling the data specified how the pipelines should acquire and clean the data.

“Each type of data we ingest has its own data schema and format,” Garcia said. “In order for the data to be useful, it has to be curated first so it can be easily discovered for an event.”

Hortonworks Data Platform, running on Sandia’s computers, was used as the software infrastructure for the data processing and analytic pipelines. Within Hortonworks, the team developed and integrated Apache Storm topologies for each data pipeline. The curated data was then stored in Apache Solr, an enterprise search engine and database. PyTorch and Lucidworks’ Banana were used for vehicle object detection and data visualization.
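The ingest-curate-index flow can be sketched in a few lines of Python. The field names and curation rules below are illustrative assumptions, not Sandia’s actual schema; in the real system each stage ran as an Apache Storm topology, and the curated documents were indexed into Solr rather than yielded from a generator.

```python
# Minimal sketch of an "ingest -> curate -> index" pipeline: records from
# disparate sources are mapped onto one common event schema so they can be
# searched together. All field names here are hypothetical.

def curate_tweet(raw):
    """Map a raw tweet-like record onto the common event schema."""
    return {
        "source": "twitter",
        "text": raw["text"],
        "lat": raw.get("geo", {}).get("lat"),
        "lon": raw.get("geo", {}).get("lon"),
        "timestamp": raw["created_at"],
    }

def curate_sensor(raw):
    """Map a road-sensor reading onto the same schema."""
    return {
        "source": "road_sensor",
        "text": f"avg speed {raw['avg_speed_mph']} mph",
        "lat": raw["lat"],
        "lon": raw["lon"],
        "timestamp": raw["ts"],
    }

# One curator per input type; a real deployment would register many more.
CURATORS = {"twitter": curate_tweet, "road_sensor": curate_sensor}

def pipeline(stream):
    """Ingest mixed (kind, record) pairs and yield index-ready documents."""
    for kind, raw in stream:
        yield CURATORS[kind](raw)
```

Once every source emits the same document shape, a single search index (Solr, in the article’s stack) can answer time- and location-based queries across all of them.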

Finding the right data

“Bringing in large amounts of data is difficult, but it’s even more challenging to find the information you’re really looking for,” Garcia said. “For example, during the project we would see tweets that say something like ‘Air traffic control has kept us on the ground for the last hour at Midway.’ Traffic is in the tweet, but it’s not relevant to freeway traffic.”
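A deliberately naive filter shows why Garcia’s example is hard: matching the keyword “traffic” alone accepts the airport tweet, so context words have to veto it. The keyword lists here are invented for illustration; the project’s actual relevance filtering was learned from data, not hand-coded.

```python
# Hypothetical keyword filter illustrating the relevance problem: "traffic"
# by itself is ambiguous, so words suggesting air travel veto the match.
OFF_TOPIC_TERMS = {"air", "airport", "flight", "runway", "midway", "o'hare"}

def is_road_traffic_tweet(text: str) -> bool:
    """Return True if the tweet plausibly describes road traffic."""
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    if "traffic" not in words:
        return False
    # Veto tweets whose context points at air travel rather than roads.
    if words & OFF_TOPIC_TERMS:
        return False
    return True
```

Hand-built lists like this break down quickly at 90-million-event scale, which is exactly why the team turned to trained classifiers instead.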

To determine the level of traffic congestion on a Chicago freeway, ideally the tool could use a variety of data types, including a traffic camera showing flow in both directions, geolocated tweets about accidents, road sensors measuring average speed, satellite imagery of the areas and traffic signs estimating current travel times between mileposts, said Forest Danford, a Sandia computer scientist and research team member.

“However, we also get plenty of bad data like a web camera image that’s hard to read, and it is rare that we end up with many different data types that are very tightly co-located in time and space,” Danford said. “We needed a mechanism to learn on the 90 million-plus events (related to Chicago traffic) we’ve observed to be able to make decisions based on incomplete or imperfect information.”

The team added a traffic congestion classifier by training neural networks, computer systems modeled on the human brain, on merged features extracted from labeled images, tweets and other events that corresponded to the data in time and space. The trained classifier was able to generate predictions on traffic congestion based on operational data at any given time point and location, Danford said.
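The fuse-then-classify step can be sketched as follows. The co-location thresholds, the averaging rule and the decision threshold are all made-up stand-ins for the trained neural network described above; the point is only how signals from different modalities near one time and place get merged into a single prediction.

```python
# Illustrative fusion of multimodal events: signals that fall close to an
# anchor point in time and space are merged, then scored. The "classifier"
# here is a placeholder threshold, not the article's trained network.

def nearby(event, anchor, max_km=1.0, max_sec=300):
    """Rough co-location test; ~111 km per degree of lat/lon near Chicago."""
    d_km = ((event["lat"] - anchor["lat"]) ** 2 +
            (event["lon"] - anchor["lon"]) ** 2) ** 0.5 * 111
    return d_km <= max_km and abs(event["t"] - anchor["t"]) <= max_sec

def fuse(events, anchor):
    """Average the congestion signals of events co-located with the anchor."""
    signals = [e["signal"] for e in events if nearby(e, anchor)]
    return sum(signals) / len(signals) if signals else None

def classify(score, threshold=0.5):
    """Stand-in classifier: a high fused signal means congestion."""
    return "congested" if score is not None and score >= threshold else "free-flowing"
```

Averaging also degrades gracefully when inputs are missing or noisy, which matches Danford’s point about deciding on incomplete or imperfect information.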

Professors Minh Do and Ramavarapu Sreenivas and their students at UIUC worked on real-time object and image recognition with web-camera imaging and developed robust route planning processes based on the various data sources.
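The route-planning idea, suggesting a faster path once congestion estimates are known, amounts to a shortest-path search over travel times. The sketch below uses standard Dijkstra search; the road graph and the travel-time figures are invented, and the UIUC team’s actual process is not detailed in the article.

```python
import heapq

def fastest_route(graph, start, goal):
    """Dijkstra's shortest path over congestion-weighted travel times.

    graph: {node: [(neighbor, minutes), ...]}, where minutes already reflect
    current congestion estimates (hypothetical input). Assumes goal is
    reachable from start.
    """
    dist = {start: 0.0}
    prev = {}
    pq = [(0.0, start)]
    while pq:
        d, node = heapq.heappop(pq)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, minutes in graph.get(node, []):
            nd = d + minutes
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(pq, (nd, nbr))
    # Walk back from goal to start to recover the route.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return list(reversed(path)), dist[goal]
```

Rerunning the search as edge weights update with fresh sensor and tweet data is what turns a static planner into the nearly real-time rerouting the commuter test demonstrated.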

“We’re trying to make data discoverable, accessible and usable,” Garcia said.

The Sandia team is transferring the architecture, analytics and lessons learned in Chicago to other projects and will continue to investigate analytic tools, make improvements to the object recognition model and work to generate meaningful, actionable intelligence.

MICHAEL VITTITOW/SANDIA LABS ILLUSTRATION

COURTESY OF RANDY MONTOYA/SANDIA LABS: Sandia National Laboratories computer scientists Tian Ma, left, and Rudy Garcia led a project to deliver actionable information from streaming data in nearly real time.
