OpenSource For You

In this month’s column, we discuss how to apply data analytics to ‘data at rest’ in storage systems and the buzz around content-centric storage.


For the past few months, we have been discussing information retrieval and natural language processing (NLP), as well as the algorithms associated with them. In this month's column, let's focus on a related topic: how data analytics techniques are being applied to the data at rest in storage systems.

Given the buzz around Big Data, there has been considerable interest in mining large volumes of data to obtain actionable insights. Like a beggar sitting on a gold mine without knowing it, many organisations have a wealth of data residing in their data centres and storage archives, and in most cases those within the organisation remain unaware of the 'information content' of the data they hold. Gartner calls this dark data: the huge amounts of data that firms collect as a by-product of their normal business activities, but for which they have not found any other use. Typical examples are data centre system logs or the email archives of an organisation's internal mailing lists. This is in contrast to light data, which is actionable data generated specifically through a data analysis process for follow-up action by the organisation. Typical examples of light data are the monthly sales report or the software bug trend report generated specifically for consumption by organisational processes. Dark data is typically unstructured content and comes from multiple sources, both within the organisation and outside; it includes emails, process documents, videos, images, social network chats, tweets and so on. Light data, in contrast, is typically structured.

You may wonder what the big deal is about whether an organisation has (or does not have) huge amounts of dark data. It can either keep the data archived for future needs or throw it away if it doesn't need it. Well, the choice is not as simple as that. If the organisation chooses to archive the data so that it can later analyse what the dark data contains, it has to pay to keep huge amounts of data archived even when there is no immediate use or value for it. Given the rate at which data is generated in modern organisations, such an option is simply not cost-effective. Hence, tools and processes need to be in place for analysing dark data and deciding whether it should be retained or disposed of in a secure fashion. Data retention may be needed from a legal compliance perspective, or for mining long-term history for trends and insights. On the other hand, data also needs to be cleansed of personal identification information, so that sensitive data about individuals does not get exposed inadvertently in the case of data leakage or theft.
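To make that last point concrete, here is a minimal Python sketch of scrubbing personal identification information from free-form text before it is archived or shared. The two regular expressions (for email addresses and ten-digit phone numbers) are assumptions chosen for illustration, and are nowhere near a complete definition of PII.

import re

# Illustrative PII patterns (an assumption, not an exhaustive definition)
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b(?:\+?\d{1,3}[-\s]?)?\d{10}\b"),
}

def scrub_pii(text):
    """Replace anything matching a PII pattern with a redaction tag."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub("[REDACTED-%s]" % label.upper(), text)
    return text

if __name__ == "__main__":
    line = "Contact rahul@example.com or call 9876543210 for the report."
    print(scrub_pii(line))
    # Contact [REDACTED-EMAIL] or call [REDACTED-PHONE] for the report.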

Dark data analysis involves multiple levels of scrutiny, starting from identifying the contents, i.e., what kind of data is hidden in the archives, to the more complex task of mining the data for relevant actionable insights and identifying non-obvious relationships hidden in it. Each level of analysis may need a different set of tools. Given that dark data comes from multiple disparate sources, both within and outside the enterprise, it needs to be cleansed and formatted before any data mining or analytics can be applied to it. About 90 per cent of any data analysis task consists of cleaning up the data and getting it into a state in which it can be used in analytic pipelines.
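As an illustration of that first level of scrutiny, here is a minimal Python sketch that profiles what is sitting in an archive directory by counting files and bytes per file type. The directory path is hypothetical, and using file extensions as a proxy for content type is a simplifying assumption; a real tool would inspect the content itself.

import os
from collections import Counter

def profile_archive(root):
    """Walk an archive tree, counting files and bytes by extension."""
    counts, sizes = Counter(), Counter()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "<none>"
            path = os.path.join(dirpath, name)
            counts[ext] += 1
            sizes[ext] += os.path.getsize(path)
    return counts, sizes

if __name__ == "__main__":
    counts, sizes = profile_archive("/var/archive")   # hypothetical path
    for ext, n in counts.most_common(10):
        print("%-8s %6d files %10.1f MB" % (ext, n, sizes[ext] / 1e6))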

Data wrangling and curating data at rest are key value-added services that should be delivered on storage systems. Data curation includes the following transformations: verifying the data (to ascertain its composition), cleaning incoming data (e.g., 99999 is not a legal ZIP code), transforming the data (e.g., from the European date format to the US date format), integrating it with other data sources of interest (into a composite whole) and de-duplicating the resulting composite data set.
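The sketch below illustrates those curation steps on toy records, under the assumption that the data arrives as simple Python dicts: it rejects the bogus 99999 ZIP code, converts DD/MM/YYYY dates to the US MM/DD/YYYY format, and drops exact duplicates. A production pipeline would of course apply much richer validation and fuzzy de-duplication.

from datetime import datetime

def clean_zip(zip_code):
    """Reject obviously bogus ZIP codes such as 99999 or non-numeric values."""
    if not zip_code.isdigit() or len(zip_code) != 5 or zip_code == "99999":
        return None
    return zip_code

def eu_to_us_date(date_str):
    """Transform a DD/MM/YYYY date into the US MM/DD/YYYY format."""
    return datetime.strptime(date_str, "%d/%m/%Y").strftime("%m/%d/%Y")

def curate(records):
    """Verify, clean, transform and de-duplicate a list of record dicts."""
    seen, curated = set(), []
    for rec in records:
        rec = dict(rec, zip=clean_zip(rec["zip"]), date=eu_to_us_date(rec["date"]))
        key = (rec["name"], rec["zip"], rec["date"])
        if key not in seen:            # drop exact duplicates
            seen.add(key)
            curated.append(rec)
    return curated

if __name__ == "__main__":
    raw = [
        {"name": "Asha", "zip": "99999", "date": "31/01/2015"},
        {"name": "Asha", "zip": "99999", "date": "31/01/2015"},  # duplicate
        {"name": "Ravi", "zip": "60001", "date": "05/02/2015"},
    ]
    for rec in curate(raw):
        print(rec)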
