Big data
There’s data and there’s big data. Cynics will tell you the latter term is just another buzzword designed to help technology companies sell systems to large companies.
There’s a grain of truth in that idea. In 2013, Gartner, a technology industry research business, told subscribers big data had moved through the hype cycle from the “peak of inflated expectations” to the “trough of disillusionment”. That, by the way, isn’t as awful as it sounds. The next step on Gartner’s cycle is the “plateau of productivity”.
Yet when “big data” first appeared it had a specific meaning: it was the name given to dealing with amounts of data far greater than the processing capacity of everyday systems. The implication was that with these vast amounts of data, you could uncover insights impossible to find by traditional means.
If there’s too much data for conventional computers and storage devices, or if it moves too fast, or is disorganised, then your project qualifies as big data. And that means you’ll need to find ways of dealing with it that go beyond conventional database technology.
One of the best examples of the use of big data is seismic surveying for energy exploration. It’s the sort of thing New Zealand crown research institute GNS Science (previously the Institute of Geological and Nuclear Sciences) does a lot of. Basically, seismic waves (the same tool used to study earthquakes) are sent deep into the ground, from a huge truck or a boat, and the wave field reflected from each rock layer the wave hits is recorded at the surface by sensors.
When a fleet of exploration boats is firing off regular seismic pulses 24 hours a day, that’s one hell of a lot of data.
The thing is, as the GNS experts know all too well, the more accurate the image you get of all that stuff underground, the more likely you are to find some oil or gas.
So as GNS’s ability to deal with large datasets grows, its customers keep piling on more data.
One of the problems for a lot of companies is that the data flow isn’t even – sometimes you have nose-bleed high peaks in the information coming in, sometimes it’s just a drop or two. However, dealing with the peaks and troughs doesn’t necessarily mean buying expensive hardware. You can buy computer power on an as-you-need-it basis from cloud computing companies.
Still, this is often a pricey part of the big data equation: the skills to organise, analyse and interpret complex projects are rare, so the practitioners get to charge accordingly.
Three characteristics separate a true big data project from everyday data: volume, velocity and variety.

VOLUME is the amount of data, and it’s the key point. With more data, analysts can build better models of whatever they are studying. The idea is that if you’re forecasting, say, market conditions, comparing 400 data points will give you a more complete handle on where things are heading than comparing just five. That’s true up to a point, but the old “garbage in, garbage out” rule still applies.
Companies typically collect vast amounts of data that are difficult to store and move using everyday tools. Much of it is generated internally: a phone company, for instance, might hold databases of customer calling patterns. However, analysis doesn’t have to be restricted to a company’s own data. It can buy databases from external agencies, or sift through social media, pulling out publicly available tweets or Facebook posts.

VELOCITY is the rate at which data is generated and captured. Companies need timely information. Real-time or near-real-time processing means a marketing campaign can be changed if, for example, there’s a negative response to an early advertisement. An online retailer might start gathering data the moment a customer enters the website and be able to cook up compelling offers to squeeze out more dollars before they reach the checkout.
It’s also important to have up-to-date competitive information. Armed with timely information about a rival’s business-winning initiative, companies can automate, or near-automate, their responses. The player in any market with the fastest reactions to changing conditions has a solid competitive advantage.

VARIETY: Data isn’t always nice and tidy. Big data typically pulls information from structured and unstructured sources, which can be messy. Think of extracting information from tweets, Facebook updates, blog posts, online comments and video, as well as conventional relational databases. Increasingly, data is also collected from connected devices such as smartphones, smart electricity meters and embedded sensors. That will only increase as more and more devices are connected to the internet.
Some big data boffins add a fourth V: Veracity. This comes down to the trustworthiness of the incoming data. Traditional database technology works on the assumption the data is clean, precise and accurate. That’s often not the case with the material collected for big data projects. A Twitter user complaining about a product and tweeting their intention to stop doing business with a company might not be telling the truth. It’s possible for rivals to pollute data, something that’s hard to spot when you’re moving fast.