
Using Apache Beam to Pipeline and Process Data

We live in an era where data is growing at a huge pace and will continue to do so as new technologies evolve. This data needs to be pipelined in order to be managed and analysed. Apache Beam helps with that.


Across the globe, vast amounts of data are being generated every moment, thanks to:

● Social communication networks

● E-commerce platforms

● Supply chain management

● Stock markets

● Maps/location data

● Fraud detection in banking and finance

● Log files monitoring and analysis

● Electronic health records

● E-governance data sets

To process such data effectively, there is a need to integrate data pipelines so that congestion-free communication can take place.

Data flow pipelines and batch processing

The term data pipeline refers to the effective and rapid process of transporting data from its source to its destination (including the data warehouse). Data undergoes several transformations and optimisations along the way, ultimately arriving in a form suitable for analysis and business insights.

A data pipeline for real-time applications has many stages. Data is first ingested at the start of the pipeline. After that, there is a sequence of operations, each of which produces an output that serves as the input for the subsequent operation. This procedure is repeated until the pipeline is complete. It is also possible to execute separate operations simultaneously.

The transmission of data from one system to another is fraught with the potential for errors or slowdowns due to the complexity of the networks involved. The extent and effect of these issues only grow as the breadth and depth of the data’s function expand. For this reason, data pipelines are crucial. They allow seamless, automatic transfer of data from one phase to the next, reducing the need for human intervention. They are essential for real-time analytics, which allow you to act on information more quickly.

Merging data from several silos into a single source of truth assures consistent data quality and allows rapid analysis for business insights. For example, information from a point of sale (PoS) system can be processed in real-time using a streaming data pipeline. The stream processing engine may loop the pipeline’s outputs back to the PoS system or send them to other applications like data repositories, marketing apps, and customer relationship management (CRM) software.

Each data pipeline has three main parts: a data source, one or more processing steps, and a final destination. A data pipeline’s last stop may be referred to as a ‘sink’ in this context. Data pipelines facilitate the transfer of information between systems, such as an application and a data warehouse, a data lake and an analytics database, or a data lake and a payment processing system. It is possible for a data pipeline to have a single source and a single destination, in which case the pipeline’s sole purpose is to transform the data collection. A data pipeline exists wherever information is sent from point A to point B (or B to C to D).
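As a concrete illustration, here is a minimal sketch of these three parts using Apache Beam’s Python SDK (introduced later in this article); the sample strings and the output file name are placeholders:

import apache_beam as beam

# Source -> processing step(s) -> sink: the three parts of any pipeline.
with beam.Pipeline() as pipeline:
    (pipeline
     | "Source" >> beam.Create(["alpha", "beta", "gamma"])   # data source
     | "Process" >> beam.Map(str.upper)                      # processing step
     | "Sink" >> beam.io.WriteToText("output.txt"))          # final destination (sink)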

More and more data is being moved between applications, and as organisations seek to build applications with small code bases that serve a very specific purpose (called ‘microservices’), the efficiency of data pipelines has become crucial. Numerous data pipelines may be fed by information produced by a single system or application, and each data pipeline may in turn feed multiple other pipelines or applications.

Let’s take the example of the impact of a single social media post. A real-time report that tracks social media mentions; a sentiment analysis tool that returns a positive, negative, or neutral verdict; or an app charting each mention on a global map might all be fed with data from this event. Despite sharing a common data source, these apps all rely on different data pipelines that must run without a hitch to produce the desired output for the end user.
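A hedged sketch of this fan-out pattern with Apache Beam’s Python SDK follows; the sample mentions and the naive sentiment rule are invented purely for illustration:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    # One source (PCollection) shared by several downstream branches.
    mentions = pipeline | "Source" >> beam.Create([
        {"user": "a", "text": "love this product", "lat": 48.8, "lon": 2.3},
        {"user": "b", "text": "not impressed", "lat": 51.5, "lon": -0.1},
    ])
    # Branch 1: count mentions for a real-time report.
    (mentions
     | "CountAll" >> beam.combiners.Count.Globally()
     | "PrintCount" >> beam.Map(print))
    # Branch 2: toy sentiment label standing in for a real analyser.
    (mentions
     | "Sentiment" >> beam.Map(
         lambda m: (m["user"], "positive" if "love" in m["text"] else "negative"))
     | "PrintSentiment" >> beam.Map(print))
    # Branch 3: coordinates for plotting each mention on a map.
    (mentions
     | "Coords" >> beam.Map(lambda m: (m["lat"], m["lon"]))
     | "PrintCoords" >> beam.Map(print))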

Data transformation, enhancement, filtering, grouping, aggregation, and the application of algorithms to data are all typical components of data pipelines.
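Filtering, grouping, and aggregation can each be expressed as a one-line transform in Beam; a small sketch using made-up (category, amount) pairs:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (pipeline
     | "Sales" >> beam.Create([("books", 12.0), ("toys", 5.5),
                               ("books", 7.25), ("toys", 3.0)])
     | "DropSmall" >> beam.Filter(lambda kv: kv[1] > 4.0)   # filtering
     | "SumPerKey" >> beam.CombinePerKey(sum)               # grouping + aggregation
     | "Print" >> beam.Map(print))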

Pipelines in Big Data scenarios

The term Big Data indicates that there is a large amount of data to process, and this is true because the volume, variety, and velocity of data have all increased considerably in recent years. Having access to all this information can be useful for a wide variety of purposes.

Data pipelines have progressed along with other parts of data architecture to accommodate Big Data. A compelling reason to construct streaming data pipelines for Big Data is the volume and speed of this data. These pipelines help to gather information in real-time and act upon it.

The Big Data pipeline needs to be scalable in order to process massive amounts of data concurrently, as there are likely to be numerous data events that occur at the same time or very near together. The pipeline must also be able to detect and process heterogeneous data, including structured, unstructured, and semi-structured data.

Working with Apache Beam

Apache Beam (https://beam.apache.org/) is a free and open source unified programming model for defining and executing data processing pipelines. It is used for the pipelining and real-time filtering of data for multiple applications. All data sets and data frames can be processed using a single API with the integration of Apache Beam. To ‘build and run in any place’ is a key concept of Apache Beam.

Apache Beam reduces the complexity of processing massive amounts of data. It is used by thousands of businesses across the world thanks to its innovative data processing tools, proven scalability, and robust feature set.

The key features of Apache Beam are:

● Powerful abstraction

● Unified batch and streaming programming model

● Cross-language capabilities

● Portability

● Extensibility

● Flexibility

● Ease of adoption

Apache Beam is also available in a cloud-based variant called Apache Beam Playground (https://play.beam.apache.org/), with the help of which programs can be executed online (Figure 4) using a range of programming languages including Java, Python, Go and SCIO.

Installation of Apache Beam on a dedicated server

Apache Beam can be installed on a dedicated server or in Google Colab, as follows (the leading ‘!’ is needed only in notebook environments such as Colab):

! pip install apache-beam
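Once installed, the library can be sanity-checked from Python; the version printed will depend on your environment:

import apache_beam as beam

# Confirm the installation by printing the installed Beam version.
print(beam.__version__)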

To work with data pipelines, Apache Beam uses the following segments.

● Pipeline: Reading, Processing, Storage of data

● PCollection: Data frames, Accessing data sets

● PTransform: Data transformation, Windowing, Watermarks, Timestamps

● Runner: Operation of the pipelines

Here is a scenario where a data pipeline is executed on a CSV movies data set that has the following attributes:

1. movie-id

2. type-of-show (movie/web series)

3. release-year

4. duration-of-the-movie

The following code can be applied to filter and fetch the records matching ‘Movie’:

import apache_beam as apachebeam

datapipeline = apachebeam.Pipeline()

moviesdataset = (
    datapipeline
    | apachebeam.io.ReadFromText("moviesdataset.csv", skip_header_lines=1)  # read the CSV, skipping the header row
    | apachebeam.Map(lambda line: line.split(","))                          # split each row into its fields
    | apachebeam.Filter(lambda line: line[1] == "Movie")                    # keep only records whose type is 'Movie'
    | apachebeam.Map(lambda fields: ",".join(fields))                       # rejoin fields so WriteToText receives strings
    | apachebeam.io.WriteToText("results2.txt")                             # write the filtered records
)

datapipeline.run()

Pipeline operations on a large data set can thus be implemented with a minimal amount of code using Apache Beam.
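The windowing and timestamp features mentioned under PTransform can be sketched in a similar way; the event tuples below are invented sample data, with the second element standing in for the event time in seconds:

import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as pipeline:
    (pipeline
     | "Events" >> beam.Create([("click", 5), ("click", 70), ("click", 95)])
     | "Stamp" >> beam.Map(lambda e: window.TimestampedValue(e[0], e[1]))  # attach event timestamps
     | "Window" >> beam.WindowInto(window.FixedWindows(60))                # 60-second fixed windows
     | "Count" >> beam.combiners.Count.PerElement()                        # count events per window
     | "Print" >> beam.Map(print))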

Modern data engineering scenarios make use of the Snowflake platform, which has enhanced structured formats for implementing data pipelines. Snowflake’s data pipelines are flexible, allowing for either batch or continuous processing. Data transformation and optimisation for continuous data loading are typically labour-intensive tasks; however, modern data pipelines automate many of these processes.

Figure 1: Key characteristics of a data pipeline

Figure 2: Official portal of Apache Beam

Table 1: Prominent software suites for data streaming and analytics

Figure 4: Apache Beam Playground

Figure 3: Key features and integration libraries with Apache Beam

Figure 5: Filtered results from Apache Beam
