Daily Mirror (Sri Lanka)

DATA LAKES – TAKE THE PLUNGE KNOWINGLY

- BY G.K. KULATILLEKE

Data and information are the lifeblood of organisations and businesses, providing vital input for operational, tactical and strategic decisions. Storage for this vital information scales from a simple spreadsheet, to traditional databases and data warehouses and, most recently, to data lakes.

A data lake is a storage repository that holds huge quantities of raw data in its native format. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. It does not impose any structure. It can be incomplete. It can contain incorrect or missing information. Data stored in a lake may not have a designated purpose at the time of storage; it is simply kept until it is eventually needed.

In this sense, a data lake can be seen as a relaxed, unimposing and unconstrained collection of an organisation's vast and highly diverse data. Data lakes are optimised for scaling to terabytes and petabytes of data, which typically comes from multiple heterogeneous sources and may be structured, semi-structured or unstructured. This relaxation of structure and diversity has many benefits.

Features and benefits

A data lake retains all data. This includes data currently seen as useful as well as data that currently serves no purpose, in the hope that it might be required in the future. Data is also kept indefinitely, so that users can go back to any point in time for analysis. This only became practical recently, when commodity, off-the-shelf servers and cheap storage made scaling to terabytes and petabytes fairly economical.

Data lakes support all data types. This includes traditional as well as non-traditional data sources such as web server logs, sensor data, social network activity, text and images. In the data lake, all data is retained regardless of its source and structure. Data is stored in its original raw form. Any required transforms are done at the point of use. This approach is known as 'schema-on-read', whereas traditional databases and data warehouses use a 'schema-on-write' approach.

With schema-on-write, designers need to think of all possible uses of the data in advance and define a schema that has something for everyone, which never gives a perfect fit for anyone. With schema-on-read, structure is not predetermined, allowing data to be retrieved in the schema that is most relevant to the task at hand. The absence of a fixed schema is also useful when large databases are being consolidated.

Finally, using a schema-on-read approach means data can simply be stored and used immediately, with no time, cost or effort spent on structural design. This is important when dealing with structured data but even more important when dealing with semi-structured, poly-structured and unstructured data, which makes up the vast majority by volume.
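The schema-on-read idea described above can be sketched in a few lines of Python. The event records and field names here are invented purely for illustration: raw records go into storage untouched, and each consumer projects them onto whatever schema its task needs at read time.

```python
import json

# Hypothetical raw events, stored exactly as they arrived -- no upfront schema.
raw_events = [
    '{"user": "alice", "amount": 120.5, "channel": "web"}',
    '{"user": "bob", "channel": "mobile"}',           # "amount" is missing
    '{"user": "carol", "amount": 95.0, "extra": 1}',  # carries an extra field
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: project each raw record onto the
    fields the current task cares about."""
    for line in lines:
        record = json.loads(line)
        # Missing fields become None; unknown fields are simply ignored.
        yield {f: record.get(f) for f in fields}

# Two consumers apply two different schemas to the same raw data.
billing_view = list(read_with_schema(raw_events, ["user", "amount"]))
channel_view = list(read_with_schema(raw_events, ["user", "channel"]))

print(billing_view[1])  # {'user': 'bob', 'amount': None}
```

Under schema-on-write, by contrast, the second record would have been rejected or padded when it was loaded; here the decision about what "bob" means for billing is deferred to the reader.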

Data lakes support all users. A typical organisation has around 80 percent 'operational' users, who are interested in reports, key performance indicators (KPIs) or slices of the same set of data every day. The next 10 percent do more analysis on the data, often drilling down into internal data and sometimes external data.

The last few percent require deep analysis. They may create totally new data sources based on research. They mash up many different types of data and come up with new insights, understandings and models. These users include data scientists, and they may use advanced analytic tools and capabilities such as statistical analysis, machine learning and predictive modelling.

A data lake is able to support all of these users efficiently. Data scientists can work with the very large and varied data sets they need, while other users make use of more structured views of the data provided for them.

A data lake readily adapts to changes and new requirements. This is a direct result of the lack of structure and the storing of data in its raw form. Users can explore data in varied ways and, if a result proves useful, a more formal schema can be applied, with automation and reusability helping to extend the results to a broader audience. If the result is not useful, it can simply be discarded, as no changes to the data structures have been made and no resources have been consumed.

Together, these advantages mean that data lakes can provide faster insights: they contain all data and data types, they let users access data before it has been transformed, cleansed and structured, in many flexible arrangements, and they thereby enable users to reach results faster than the traditional data warehouse approach.

Pitfalls and problems

However, the relaxed structure and flexibility come with added complications and challenges.

Data lakes can easily become data swamps. A swamp is a dirty lake, in which it is hard or impossible to locate the required data. Because of the large volumes involved and because data cannot be identified from its structural characteristics, it is vital to ensure adequate metadata (data about what the data represents) is available for the data in the lake. This metadata allows searching, indexing and understanding what the data in a lake actually represents.

While many technologies can address aspects of the problem, the primary challenge is making sure that a data set can be seen for what it is and that the process of finding data (through the metadata catalogue) is connected to the process of collecting information about the data.
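A metadata catalogue of the kind described above can be illustrated with a toy sketch. Everything here, from the storage paths to the tag names, is hypothetical; real lakes use dedicated catalogue services, but the principle is the same: every data set is registered with descriptive metadata at the moment it is collected, so it can later be found.

```python
# Toy metadata catalogue: path -> descriptive metadata about the data set.
catalog = {}

def register(path, owner, description, tags):
    """Record what a data set is, who owns it and how it can be found."""
    catalog[path] = {
        "owner": owner,
        "description": description,
        "tags": set(tags),
    }

def search(tag):
    """Find every registered data set carrying a given tag."""
    return [path for path, meta in catalog.items() if tag in meta["tags"]]

# Registering data sets as they land in the lake (hypothetical paths).
register("s3://lake/raw/web_logs/2024/", "web-team",
         "Raw web server access logs", ["logs", "web", "raw"])
register("s3://lake/raw/sensor_feed/", "iot-team",
         "Factory sensor readings", ["sensor", "raw"])

print(search("raw"))
```

The key design point is that `register` is called as part of collection, not as an afterthought: a data set that lands in the lake without an entry in the catalogue is, for practical purposes, already lost in the swamp.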

A wider group of users means a much wider set of skills and competencies is required. While the data scientists in the organisation may be equipped to search, filter, join, shape and prepare the data as needed, it is very unlikely that the rest of the business users can competently extract data from the lake unaided. The solution is to create simpler views and common reports that are readily accessible.

Data sensitivity is also a major issue. This includes, for example, confidential and proprietary information from a business perspective, as well as personally identifiable information (PII) from a legal perspective, which should be restricted. This is, however, a grey area.

While management wants to allow the data scientists full access, the legal perspective dictates that they should not have access to customers' full credit card numbers. Such cases require case-by-case study and custom filtering and restrictions.
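One common form such custom filtering takes is masking: sensitive fields are replaced with a redacted version before analysts see them. The sketch below (with a made-up record and function names) shows the credit card case, keeping only the last four digits, which are often all an analyst legitimately needs.

```python
def mask_card_number(number):
    """Mask all but the last four digits of a card number."""
    digits = number.replace(" ", "").replace("-", "")
    return "*" * (len(digits) - 4) + digits[-4:]

# Hypothetical customer record as it sits in the lake.
record = {"customer": "alice", "card": "4111 1111 1111 1234"}

# The view handed to data scientists carries only the masked value.
safe_record = {**record, "card": mask_card_number(record["card"])}

print(safe_record["card"])  # ************1234
```

In practice such masking would be applied inside the restricted views mentioned earlier, so the raw values never leave the governed zone of the lake.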

Notably, governance over data lakes does not diminish the free-spirited exploration of data. While it requires some effort and resources, it greatly enhances the utility of the data for the largest group of users and lowers the risk of data misuse.

Finally, it should be understood that a data lake is not a product but an approach an organisation uses to collect (and catalogue) its information for use. Machine learning and big data are at the heart of insight and knowledge discovery from the data lake. However, a data lake can become a useless data swamp if good governance policies are not applied and constantly enforced.

While the future seems to lie in data lakes, realising the benefits requires a great deal of good old-fashioned human effort and care. Organisations must tread knowledgeably and carefully to reap the full benefits and not end up with data puddles or data swamps.

(The views and opinions expressed in this article are those of G.K. Kulatilleke (BSc Eng. (Computer), MSc (Networking), MSc (Data Science), ACMA, CGMA) and do not necessarily reflect the official policy or position of any institution)

