
Data Lakes: Concept, Architecture and Benefits

Data lakes make it possible to capture, refine and explore data in its raw form. Find out how they are shaping the future of data management and analytics.

By Surabhi Dwivedi

With the advent of social media, IoT and other advancements in technology, a huge amount of data is being generated. The concept of data lakes emerged to extract the maximum benefit from this data, offering enhanced adaptability and strong data analytics. A data lake is a storage space for heterogeneous data, both structured and unstructured. It improves the capture, refinement and exploration of raw data within an enterprise. The data is kept in its original form, and its structure is defined only at the time of use, eliminating complex and costly data modelling.
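To make the 'structure defined at the time of use' (schema-on-read) idea concrete, here is a minimal sketch using PySpark. The file path, field names and schema are illustrative assumptions rather than part of any particular product.

# Schema-on-read sketch: raw JSON events sit untouched in the lake,
# and a structure is imposed only when the data is read.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The schema is declared at read time, not at write time.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("event_time", TimestampType()),
])

events = (spark.read
          .schema(event_schema)               # schema-on-read
          .json("/data/lake/raw/events/"))    # assumed raw-zone path

events.filter(events.action == "purchase").show()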

Traditional decision support systems (DSS) are incapable of handling the huge amounts of structured, unstructured and semi-structured data generated by different sources. The data warehouse (DW) is the solution used by DSS.

Here, the data is extracted, transformed and loaded (ETL processes) according to predefined schemas. However, the cost of a DW increases significantly as data size and complexity increase, and some information is lost through the ETL processes.
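The toy ETL sketch below (hypothetical field names, paths and database) illustrates why information can be lost: only the columns in the predefined warehouse schema survive the transform step, so any extra attributes present in the source records are discarded.

# Minimal ETL sketch for a warehouse with a fixed, predefined schema.
import json
import sqlite3

WAREHOUSE_SCHEMA = ("user_id", "action", "event_time")   # predefined columns

def extract(path):
    # Read newline-delimited JSON records from a source file.
    with open(path) as f:
        for line in f:
            yield json.loads(line)

def transform(record):
    # Keep only the predefined columns; anything else in the record
    # (device details, free-text comments, ...) is silently dropped.
    return tuple(record.get(col) for col in WAREHOUSE_SCHEMA)

def load(rows, db="warehouse.db"):
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS events (user_id, action, event_time)")
    con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

load(transform(r) for r in extract("events.jsonl"))   # assumed source file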

Data lake architecture

Understanding data lake architecture can lead to more efficient data storage, faster processing, and better decision-making. The data lake architecture has two versions.

1. Mono-zone: This is a flat architecture that stores data in its native format. This architecture does not process data or record any user operations. It contains five data ponds that store data according to their characteristics (a small routing sketch follows the list):

● Raw data ponds

● Analog data ponds, to store analog types of data

● Application data ponds, to store application data

● Textual data ponds, to store text data

● Archival data ponds, to store data that is no longer in use
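As a rough illustration of the pond idea, incoming data can be routed to a pond based on its characteristics. The sketch below uses hypothetical file extensions and pond directories purely for illustration.

# Toy routing of incoming files into data ponds by data type
# (extensions, pond names and paths are illustrative assumptions).
from pathlib import Path
import shutil

PONDS = {
    ".log": "analog",        # machine-generated / log-style data
    ".csv": "application",   # application data
    ".txt": "textual",       # text data
}

def route_to_pond(src: Path, lake_root: Path = Path("lake")) -> Path:
    pond = PONDS.get(src.suffix.lower(), "raw")   # everything else stays raw
    target_dir = lake_root / pond
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(src, target_dir / src.name))

# Example: route_to_pond(Path("sensor-readings.log")) lands in lake/analog/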

2. Multi-zone: The multi-zone architecture has the following zones; a small directory-level sketch follows the list.

● Ingestion: Contains the raw data.

● Storage: The ingested raw data is stored here.

● Processing: Whenever data is processed, it is stored in the processing zone.

This can be further divided into the following subzones.

• Batch processing

• Real-time processing

● Governance: This zone controls data security, data quality, metadata management, and the data life cycle.
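One simple way to picture the multi-zone layout is as a set of storage prefixes, with data copied or refined as it moves between zones. The sketch below uses assumed directory names and a toy batch job; it is an illustration of the idea, not a prescribed implementation.

# Multi-zone layout sketch: zones as file-system / object-store prefixes.
from pathlib import Path
import json
import shutil

ZONES = {
    "ingestion": Path("lake/ingestion"),           # raw data lands here
    "storage":   Path("lake/storage"),             # raw data kept long term
    "batch":     Path("lake/processing/batch"),    # batch-processed outputs
    "realtime":  Path("lake/processing/realtime"), # stream-processed outputs
}
for zone in ZONES.values():
    zone.mkdir(parents=True, exist_ok=True)

def ingest(src_file: Path) -> Path:
    # Land a raw file in the ingestion zone, then keep a copy in storage.
    landed = Path(shutil.copy(src_file, ZONES["ingestion"] / src_file.name))
    shutil.copy(landed, ZONES["storage"] / src_file.name)
    return landed

def batch_process(raw_file: Path) -> Path:
    # A toy batch job: keep only well-formed JSON lines in the batch subzone.
    out = ZONES["batch"] / raw_file.name
    with open(raw_file) as fin, open(out, "w") as fout:
        for line in fin:
            try:
                json.loads(line)
                fout.write(line)
            except json.JSONDecodeError:
                pass   # bad records could instead be sent to a quarantine area
    return out

# A governance zone would additionally track metadata, quality and access
# rules for every file that these functions touch.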

Figure 1 depicts the functional architecture of a data lake.

Data lake storage systems

Data lakes are changing the game for how businesses store and manage their data. Instead of siloed databases and spreadsheets, data lake systems allow you to store and access massive amounts of data in one place, giving you the flexibility to analyse it in real time. They use different types of storage systems to achieve this. These are as follows (a short object-store example appears after the list).

● File-based storage systems: The Hadoop Distributed File System (HDFS) and Azure Data Lake by Microsoft are file-based systems used for data lake storage.

● Single data store: These types of data stores focus on specific types of data and use a single database system for their storage.

● Cloud-based data lakes: Large-scale commercial data lakes are available on cloud infrastructure like Amazon Web Services (AWS), Azure Data Lake Store, Google Cloud Platform (GCP), Alibaba Cloud, and the Data Cloud from Snowflake.
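As a small illustration of the cloud-based option, the sketch below lands a raw event in an object store (Amazon S3 via boto3). The bucket name and key layout are hypothetical, and AWS credentials are assumed to be configured in the environment.

# Landing raw data in a cloud object store used as a data lake.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"   # hypothetical bucket name

record = {"user_id": "u42", "action": "login", "event_time": "2024-01-01T10:00:00Z"}

# Raw zone: store the event exactly as received, partitioned by date.
s3.put_object(
    Bucket=BUCKET,
    Key="raw/events/dt=2024-01-01/event-0001.json",
    Body=json.dumps(record).encode("utf-8"),
)

# Any consumer can later read it back and decide on structure at that point.
obj = s3.get_object(Bucket=BUCKET, Key="raw/events/dt=2024-01-01/event-0001.json")
print(json.loads(obj["Body"].read()))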

Advantages of data lakes

Data lakes empower organisations to gain insights and create actionable strategies. However, there is a lot more to them.

● Cost-effective: Data lakes are less expensive to deploy than traditional decision-oriented databases.

● Data fidelity: They preserve the original data to avoid any data loss that could occur from data preprocessing and transformation operations. However, data fidelity also introduces a high risk of data inconsistency in data lakes due to data integration from multiple, disparate sources without any transformation.

● Flexibility and agility: Data lakes have a schema-on-read approach, so they can read any data type and format. Thus, data lakes enable a wider range of analyses than traditional decision-oriented databases, such as data warehouses and data marts, and show better flexibility and agility.

● Real-time data ingestion: Data is ingested into the data lake without any transformation, minimising the time lag between data being extracted from its sources and its ingestion into the lake.


● High scalability: As data lakes are implemented using distributed technologies, they provide high scalability.

● Fault tolerance: The underlying technologies of data lakes provide high resilience to both hardware and software failures, resulting in excellent fault tolerance.

To sum up, data lakes empower organisations to uncover valuable insights from their data, paving the way for data-driven decision-making in the digital age.

The author is a joint director at the Centre for Development of Advanced Computing (CDAC). She has over 16 years of experience in software development, database technologies, data modelling, data mining, open source technologies, etc.

Figure 1: Architecture of a data lake
