In this month’s column, we discuss how to apply data analytics to ‘data at rest’ in storage systems and the buzz around content-centric storage.
For the past few months, we have been discussing information retrieval and natural language processing (NLP) as well as the algorithms associated with them. In this month's column, let's focus on a related topic—how data analytics techniques are being applied to the data at rest in storage systems.
Given the buzz around Big Data, there has been considerable interest in mining large volumes of data to obtain actionable insights. Like a beggar sitting on a gold mine without knowing it, many organisations have a wealth of data residing in data centres and storage archives. And in most cases, those within the organisations remain unaware of the 'information content' of the data they have available. Gartner calls this dark data—the huge amounts of data that firms collect as a by-product of their normal business activities, but for which they have not found any other use. Typical examples are data centre system logs or the email archives of an organisation's internal mailing lists. This is in contrast to light data, which is typically actionable data generated specifically through a data analysis process for follow-up action by the organisation. A typical example of light data could be the monthly sales report or the software bug trend report generated specifically for consumption by organisational processes. Dark data is typically unstructured content and comes from multiple sources, both within the organisation and outside. It includes emails, process documents, videos, images, social network chats, tweets, etc. Light data, by contrast, is typically structured.
You may wonder what the big deal is about whether an organisation has (or does not have) huge amounts of dark data. It can either choose to keep the data archived for future needs or throw it away if it doesn't need it. Well, the choice is not as simple as that. If the organisation chooses to archive the data to later analyse what the dark data contains, it needs to pay to keep the huge amounts of data archived, even if there is no immediate use or value for it. Given the rate at which data is generated in modern organisations, such an option is simply not cost-effective. Hence, tools and processes need to be in place for analysing dark data and identifying whether it needs to be retained or disposed of in a secure fashion. Data retention may be needed from a legal compliance perspective or for mining long-term history for trends and insights. On the other hand, data also needs to be cleansed of personally identifiable information, so that sensitive data about individuals does not get exposed inadvertently in case of data leakage or theft.
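The cleansing step mentioned above is often implemented as pattern-based redaction before data is archived or shared. The sketch below is a minimal illustration, assuming only two hypothetical PII patterns (email addresses and US-style phone numbers); a real deployment would use a much broader, validated pattern set or a dedicated data loss prevention tool.

```python
import re

# Illustrative PII patterns (assumptions, not an exhaustive set):
# each pattern is paired with the placeholder token that replaces it.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "<PHONE>"),
]

def redact(text):
    """Replace every match of each PII pattern with its placeholder."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

line = "Contact alice@example.com or call 555-123-4567 for details."
print(redact(line))
# -> Contact <EMAIL> or call <PHONE> for details.
```

Redacting with placeholder tokens (rather than deleting matches outright) keeps the surrounding text readable for later mining while removing the sensitive values themselves.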
Dark data analysis involves multiple levels of scrutiny, starting from identifying the contents—what kind of data is hidden in the archives—to the more complex task of mining the data for relevant actionable insights and identifying non-obvious relationships hidden in it. Each level of analysis may need a different set of tools. Given that dark data comes from multiple disparate sources both within and outside the enterprise, it needs to be cleansed and formatted before any data mining or analytics can be applied to it. About 90 per cent of any data analysis task comprises cleaning up the data and getting it into a state that can be used in analytic pipelines.
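The first level of scrutiny—identifying what kind of data is hidden in an archive—can start with something as simple as an inventory by file type. The following is a minimal sketch of that idea, assuming the archive is accessible as a directory tree:

```python
from collections import Counter
from pathlib import Path

def inventory(root):
    """First-level scrutiny: count the files under 'root' by extension,
    giving a rough picture of what kinds of data the archive holds."""
    counts = Counter(
        p.suffix.lower() or "<none>"
        for p in Path(root).rglob("*")
        if p.is_file()
    )
    # Most common file types first, e.g. [('.log', 20415), ('.eml', 1311)]
    return counts.most_common()
```

Such an inventory is only a starting point, but it quickly tells you whether an archive is dominated by logs, email, images or documents, and hence which analysis tools to bring in next.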
Data wrangling and curating data at rest are key value-added services that should be delivered on storage systems. Data curation includes the following transformations: verifying the data (to ascertain its composition), cleaning incoming data (e.g., 99999 is not a legal ZIP code), transforming the data (e.g., from the European date format to the US date format), integrating it with other data sources of interest (into a composite whole) and de-duplicating the resulting composite data set.
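The curation steps named above can be sketched as a small pipeline. This is an illustrative sketch only: the record fields (`name`, `zip`, `date`), the 99999 ZIP check and the de-duplication key are all assumptions drawn from the examples in the text, not a prescribed schema.

```python
from datetime import datetime

def curate(records):
    """Clean, transform and de-duplicate a batch of record dicts:
    - clean: drop records with an illegal ZIP code (e.g., 99999);
    - transform: convert European DD/MM/YYYY dates to US MM/DD/YYYY;
    - de-duplicate: keep one record per (name, zip, date) key."""
    seen, curated = set(), []
    for rec in records:
        if rec.get("zip") == "99999":      # cleaning step
            continue
        d = datetime.strptime(rec["date"], "%d/%m/%Y")
        rec = dict(rec, date=d.strftime("%m/%d/%Y"))  # transform step
        key = (rec["name"], rec["zip"], rec["date"])
        if key in seen:                    # de-duplication step
            continue
        seen.add(key)
        curated.append(rec)
    return curated

batch = [
    {"name": "A. Rao", "zip": "12345", "date": "31/01/2024"},
    {"name": "A. Rao", "zip": "12345", "date": "31/01/2024"},  # duplicate
    {"name": "B. Sen", "zip": "99999", "date": "01/02/2024"},  # bad ZIP
]
print(curate(batch))
# -> [{'name': 'A. Rao', 'zip': '12345', 'date': '01/31/2024'}]
```

The verification and integration steps are omitted here; in practice they would validate each field against an expected schema and join the cleaned batch with other curated sources before the final de-duplication pass.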