
Data Lakes: Concept, Architecture and Benefits

Data lakes make it possible to capture, refine and explore data in its raw form. Find out how they are shaping the future of data management and analytics.

By Surabhi Dwivedi

With the advent of social media, IoT and other advancements in technology, a huge amount of data is being generated. The concept of data lakes emerged to extract the maximum benefit from this data, offering enhanced adaptability and strong data analytics. A data lake is a storage space for heterogeneous data, both structured and unstructured. It improves the capture, refinement and exploration of raw data within an enterprise. The data is kept in its original form, and its structure is defined only at the time of use, eliminating complex and costly data modelling.
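To make the 'structure defined at the time of use' (schema-on-read) idea concrete, here is a minimal sketch using PySpark. The file path, field names and schema are illustrative assumptions rather than part of any particular product.

# Schema-on-read sketch: raw JSON events sit untouched in the lake,
# and a structure is imposed only when the data is read.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The schema is declared at read time, not at write time.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("event_time", TimestampType()),
])

events = (spark.read
          .schema(event_schema)               # schema-on-read
          .json("/data/lake/raw/events/"))    # assumed raw-zone path

events.filter(events.action == "purchase").show()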

Traditional decision support systems (DSS) are incapable of handling the huge amounts of structured, unstructured and semi-structured data generated by different sources. The data warehouse (DW) is the solution used by DSS.

Here, the data is extracted, transformed and loaded (ETL processes) according to predefined schemas. However, the cost of a DW increases significantly as data size and complexity increase, and some information is lost through the ETL processes.
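The toy ETL sketch below (hypothetical field names, paths and database) illustrates why information can be lost: only the columns in the predefined warehouse schema survive the transform step, so any extra attributes present in the source records are discarded.

# Minimal ETL sketch for a warehouse with a fixed, predefined schema.
import json
import sqlite3

WAREHOUSE_SCHEMA = ("user_id", "action", "event_time")   # predefined columns

def extract(path):
    # Read newline-delimited JSON records from a source file.
    with open(path) as f:
        for line in f:
            yield json.loads(line)

def transform(record):
    # Keep only the predefined columns; anything else in the record
    # (device details, free-text comments, ...) is silently dropped.
    return tuple(record.get(col) for col in WAREHOUSE_SCHEMA)

def load(rows, db="warehouse.db"):
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS events (user_id, action, event_time)")
    con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

load(transform(r) for r in extract("events.jsonl"))   # assumed source file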

Data lake architecture

Understanding data lake architecture can lead to more efficient data storage, faster processing, and better decision-making. The data lake architecture has two versions.

1. Mono-zone: This is a flat architecture that stores data in its native format. This architecture does not process data or record any user operations. It contains five data ponds that store data according to their characteristics (a small routing sketch follows the list):

● Raw data ponds

● Analog data ponds, to store analog types of data

● Application data ponds, to store application data

● Textual data ponds, to store text data

● Archival data ponds, to store data that is no longer in use
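As a rough illustration of the pond idea, incoming data can be routed to a pond based on its characteristics. The sketch below uses hypothetical file extensions and pond directories purely for illustration.

# Toy routing of incoming files into data ponds by data type
# (extensions, pond names and paths are illustrative assumptions).
from pathlib import Path
import shutil

PONDS = {
    ".log": "analog",        # machine-generated / log-style data
    ".csv": "application",   # application data
    ".txt": "textual",       # text data
}

def route_to_pond(src: Path, lake_root: Path = Path("lake")) -> Path:
    pond = PONDS.get(src.suffix.lower(), "raw")   # everything else stays raw
    target_dir = lake_root / pond
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(src, target_dir / src.name))

# Example: route_to_pond(Path("sensor-readings.log")) lands in lake/analog/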

2. Multi-zone: The multi-zone architecture has the following zones; a small directory-level sketch follows the list.

● Ingestion: Contains the raw data.

● Storage: The ingested raw data is stored here.

● Processing: Whenever data is processed, it is stored in the processing zone.

This can be further divided into the following subzones.

• Batch processing

• Real-time processing

● Governance: This zone controls data security, data quality, metadata management, and the data life cycle.
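One simple way to picture the multi-zone layout is as a set of storage prefixes, with data copied or refined as it moves between zones. The sketch below uses assumed directory names and a toy batch job; it is an illustration of the idea, not a prescribed implementation.

# Multi-zone layout sketch: zones as file-system / object-store prefixes.
from pathlib import Path
import json
import shutil

ZONES = {
    "ingestion": Path("lake/ingestion"),           # raw data lands here
    "storage":   Path("lake/storage"),             # raw data kept long term
    "batch":     Path("lake/processing/batch"),    # batch-processed outputs
    "realtime":  Path("lake/processing/realtime"), # stream-processed outputs
}
for zone in ZONES.values():
    zone.mkdir(parents=True, exist_ok=True)

def ingest(src_file: Path) -> Path:
    # Land a raw file in the ingestion zone, then keep a copy in storage.
    landed = Path(shutil.copy(src_file, ZONES["ingestion"] / src_file.name))
    shutil.copy(landed, ZONES["storage"] / src_file.name)
    return landed

def batch_process(raw_file: Path) -> Path:
    # A toy batch job: keep only well-formed JSON lines in the batch subzone.
    out = ZONES["batch"] / raw_file.name
    with open(raw_file) as fin, open(out, "w") as fout:
        for line in fin:
            try:
                json.loads(line)
                fout.write(line)
            except json.JSONDecodeError:
                pass   # bad records could instead be sent to a quarantine area
    return out

# A governance zone would additionally track metadata, quality and access
# rules for every file that these functions touch.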

Figure 1 depicts the functional architecture of a data lake.

Data lake storage systems

Data lakes are changing the game for how businesses store and manage their data. Instead of siloed databases and spreadsheets, data lake systems allow you to store and access massive amounts of data in one place, giving you the flexibility to analyse it in real time. They use different types of storage systems to achieve this. These are as follows (a short object-store example appears after the list).

● File-based storage systems: The Hadoop Distributed File System (HDFS) and Azure Data Lake by Microsoft are file-based systems used for data lake storage.

● Single data store: These types of data stores focus on specific types of data and use a single database system for their storage.

● Cloud-based data lakes: Large-scale commercial data lakes are available on cloud infrastructure like Amazon Web Services (AWS), Azure Data Lake Store, Google Cloud Platform (GCP), Alibaba Cloud, and the Data Cloud from Snowflake.
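As a small illustration of the cloud-based option, the sketch below lands a raw event in an object store (Amazon S3 via boto3). The bucket name and key layout are hypothetical, and AWS credentials are assumed to be configured in the environment.

# Landing raw data in a cloud object store used as a data lake.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"   # hypothetical bucket name

record = {"user_id": "u42", "action": "login", "event_time": "2024-01-01T10:00:00Z"}

# Raw zone: store the event exactly as received, partitioned by date.
s3.put_object(
    Bucket=BUCKET,
    Key="raw/events/dt=2024-01-01/event-0001.json",
    Body=json.dumps(record).encode("utf-8"),
)

# Any consumer can later read it back and decide on structure at that point.
obj = s3.get_object(Bucket=BUCKET, Key="raw/events/dt=2024-01-01/event-0001.json")
print(json.loads(obj["Body"].read()))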

Advantages of data lakes

Data lakes empower organisations to gain insights and create actionable strategies. However, there is a lot more to them.

● Cost-effective: Data lakes are less expensive to deploy than traditional decision-oriented databases.

● Data fidelity: They preserve the original data to avoid any data loss that could occur from data preprocessing and transformation operations. However, data fidelity also introduces a high risk of data inconsistency in data lakes due to data integration from multiple, disparate sources without any transformation.

● Flexibility and agility: Data lakes have a schema-on-read approach, so they can read any data type and format. Thus, data lakes enable a wider range of analyses than traditional decision-oriented databases, such as data warehouses and data marts, and show better flexibility and agility.

● Real-time data ingestion: Data is ingested into the data lake without any transformation, minimising the time lag between data being extracted from its sources and its ingestion into the lake.


● High scalability: As data lakes are implemented using distributed technologies, they provide high scalability.

● Fault tolerance: The underlying technologies of data lakes provide high resilience to both hardware and software failures, resulting in excellent fault tolerance.

To sum up, data lakes empower organisations to uncover valuable insights from their data, paving the way for data-driven decision-making in the digital age.

The author is a joint director at the Centre for Development of Advanced Computing (CDAC). She has over 16 years of experience in software development, database technologies, data modelling, data mining, open source technologies, etc.

Figure 1: Architecture of a data lake
