What Is a Data Pipeline? (+ How to Build One)

Written by Coursera Staff

Learn more about data pipeline architecture, tools, and design.


A data pipeline is a method of ingesting raw data from its source and moving it to its destination. Modern data pipelines include both tools and processes. They are necessary because raw data often must be prepared before it can be used. The type of data pipeline an organisation uses depends on factors like business requirements and the volume of data involved.

Data pipeline vs ETL pipeline

Data pipeline is a broad term encompassing any process that moves data from one source to another. Extract, transform, load (ETL) pipelines are a type of data pipeline that focuses on individual batches of data for a specific purpose. Transformation may or may not be involved in other data pipelines, but it is always present in the ETL process. 
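
To make the distinction concrete, here is a minimal Python sketch of both kinds of pipeline. The file names and field names are hypothetical placeholders; the point is simply that the ETL variant always includes a transform step, while a plain data pipeline may only move records.

```python
# Minimal sketch: a generic "move the data" pipeline vs. an ETL pipeline.
# File names and field names are hypothetical, for illustration only.
import csv
import json

def move_pipeline(src_path: str, dest_path: str) -> None:
    """A plain data pipeline: copy records from source to destination as-is."""
    with open(src_path, newline="") as src, open(dest_path, "w") as dest:
        for row in csv.DictReader(src):
            dest.write(json.dumps(row) + "\n")

def etl_pipeline(src_path: str, dest_path: str) -> None:
    """An ETL pipeline: the transform step is always present."""
    with open(src_path, newline="") as src, open(dest_path, "w") as dest:
        for row in csv.DictReader(src):                # extract
            row["amount"] = float(row["amount"])       # transform: fix types
            row["currency"] = row["currency"].upper()  # transform: normalise
            dest.write(json.dumps(row) + "\n")         # load

move_pipeline("orders.csv", "orders_raw.jsonl")
etl_pipeline("orders.csv", "orders_clean.jsonl")
```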


Types of data pipelines

  • Real-time data pipeline: Real-time analytics, such as live financial insights, require this type of data pipeline. Real-time architecture is designed to process millions of events as they occur, with very low latency and high reliability. 

  • Open-source data pipeline: Open-source pipelines are free for public use, although certain features may not be available. This cost-effective approach to data pipelining is often used by small businesses and individuals who need data management.

  • Cloud data pipeline: This type of data pipeline is cloud-based. In other words, data is managed and processed via the internet rather than on local servers. 

  • Streaming data pipeline: Streaming pipelines are among the most commonly used. They process data continuously as it arrives and can ingest both structured and unstructured data from various sources. 

  • Batch data pipeline: Batch processing pipelines are common, especially among organisations that manage large volumes of data. Batch jobs run on a schedule rather than continuously, so results arrive more slowly, but they can process massive amounts of data with minimal user interaction (see the sketch after this list). 
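
The sketch below illustrates the batch-versus-streaming distinction in plain Python: the batch function processes a complete set of records in one run, while the streaming function handles each event as it arrives. The event source and record layout are hypothetical.

```python
# Minimal sketch contrasting batch and streaming ingestion.
# The event source and record fields are hypothetical.
import time
from typing import Iterable, Iterator

def batch_pipeline(records: list[dict]) -> list[dict]:
    """Batch: collect a full set of records, then process them in one run."""
    return [{**r, "amount": float(r["amount"])} for r in records]

def streaming_pipeline(events: Iterable[dict]) -> Iterator[dict]:
    """Streaming: handle each event as soon as it arrives."""
    for event in events:
        yield {**event, "amount": float(event["amount"])}

def simulated_event_source() -> Iterator[dict]:
    """Stand-in for a message queue or event stream."""
    for i in range(3):
        time.sleep(0.1)  # events trickle in over time
        yield {"order_id": i, "amount": str(10 * i)}

nightly_batch = batch_pipeline([{"order_id": 1, "amount": "10"},
                                {"order_id": 2, "amount": "20"}])
for processed in streaming_pipeline(simulated_event_source()):
    print(processed)
```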

Data pipeline example

Amazon Web Services (AWS) is a cloud platform whose services help users manage data processing and movement. It can be used with both on-premises data sources and AWS services. If you want to practice working with AWS data analytics tools, consider taking the online, beginner-friendly course Getting Started with Data Analytics on AWS. In as little as three hours, you can learn key data analytics skills from industry experts. For example, you'll have the opportunity to learn how to perform descriptive data analytics in the cloud and explain different types of data analyses.

Data pipeline architecture

One way to visualise data pipeline architecture is with a conceptual process or workflow. 

First, a data pipeline begins where the data is generated and stored. Depending on the type of pipeline, this can be a single source or multiple sources. The data can be in any format, including raw, structured, and unstructured. 

Next, data is moved to where it will undergo processing and preparation, such as an ETL tool. Processing actions depend on business objectives and analytical requirements.

Finally, the data pipeline ends with analysis. During this phase, data lands in a data management system, such as a data warehouse, where it can be queried for valuable insights, for example through business intelligence (BI) tools. 

Another way to visualise data pipeline architecture is at the platform level. Platform implementations can be customised to fit specific analytical requirements. An example of a data pipeline's platform architecture from Google Cloud documentation is:

A batch ETL pipeline in GCP: the source might be files that need to be ingested into the analytics Business Intelligence (BI) engine. Cloud Storage is the data transfer medium inside GCP, and Dataflow is then used to load the data into the target BigQuery storage.

In the above example, the data pipeline begins at the source (files) and then moves to storage in the cloud. Next, it is transferred to Dataflow for processing and preparation. Finally, it enters the target database for analysis (Google BigQuery).
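
As a rough illustration, a pipeline like the one described above could be written with Apache Beam, the open-source SDK that Dataflow executes. The bucket, project, dataset, and schema below are hypothetical placeholders, and actually running the job on Dataflow would require project-specific pipeline options.

```python
# Sketch of a batch ETL pipeline on GCP using Apache Beam (the SDK behind Dataflow).
# Bucket, table, and schema names are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_line(line: str) -> dict:
    """Transform: turn a raw CSV line into a typed record."""
    order_id, amount = line.split(",")
    return {"order_id": order_id, "amount": float(amount)}

# Add --runner=DataflowRunner plus project and region options to run on Dataflow.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read from Cloud Storage" >> beam.io.ReadFromText("gs://example-bucket/orders/*.csv")
        | "Parse and transform" >> beam.Map(parse_line)
        | "Load into BigQuery" >> beam.io.WriteToBigQuery(
            "example-project:analytics.orders",
            schema="order_id:STRING,amount:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```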

How to build a data pipeline

Before planning your data pipeline architecture, identify essential elements like purpose and scalability needs. A few things to keep in mind when planning your data pipeline include:

  • Analytical requirements: Consider what insights you want to gain from your data at the end of the pipeline. Will you use it for machine learning (ML), business intelligence (BI), or something else?

  • Volume: Consider how much data you will manage and whether that amount could change over time. 

  • Data types: Data pipeline solutions may have limitations based on data types. Identify the types of data you'll be working with (structured, streaming, raw).

1. Determine which type of data pipeline you need to use.

First, outline your needs, business goals, or target database requirements. Use the list above to determine which type of data pipeline to use. For example, if you need to manage large amounts of data, you may need to build a batch data pipeline. Organisations needing real-time processing for their insights may benefit from stream processing instead.   

2. Select your data pipeline tools.

Many different data pipeline tools are available. You can use a solution that includes end-to-end (entire process) pipeline management or combine individual tools for a hybrid, personalised solution. For example, when building a cloud data pipeline, you may need to combine cloud services (like storage) with an ETL tool that preps data for transfer to your target destination. 
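
As one hypothetical example of combining tools, an orchestrator such as Apache Airflow (also mentioned in the course recommendation below) can schedule and chain the individual extract, transform, and load steps, whichever tools perform each step. The DAG name and task bodies here are placeholders.

```python
# Sketch of an orchestrated pipeline using Apache Airflow (2.4+ syntax).
# DAG name and task functions are hypothetical placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw files from cloud storage")

def transform():
    print("clean and reshape the extracted data")

def load():
    print("write prepared data to the target warehouse")

with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # run one batch per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task  # extract, then transform, then load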

3. Implement your data pipeline design.

After implementing your design, plan for maintenance, scaling, and continued improvement, and consider information security (InfoSec) to protect sensitive data as it moves through the pipeline. Often, companies employ data engineers and architects to oversee data pipeline planning, implementation, and monitoring. 
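
As a small illustration of those operational concerns, the sketch below masks a sensitive field before the data moves downstream and logs each processed record so the pipeline can be monitored. The field names are hypothetical.

```python
# Minimal sketch of two operational concerns: masking sensitive fields (InfoSec)
# and logging for monitoring. Field names are hypothetical.
import hashlib
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

def mask_email(email: str) -> str:
    """Replace a raw email address with a one-way hash before it moves downstream."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()

def process_record(record: dict) -> dict:
    safe = dict(record)
    safe["email"] = mask_email(safe["email"])
    return safe

records = [{"order_id": 1, "email": "customer@example.com", "amount": 42.0}]
for rec in records:
    out = process_record(rec)
    logger.info("processed order_id=%s", out["order_id"])  # monitoring hook
```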

Learn more about building a data pipeline with Coursera.

Data pipelines, a combination of tools and processes, move and prepare raw data for analysis. Various types, like real-time or batch-processing pipelines, are suited for different needs. Businesses consider data volume and goals when choosing a pipeline type and its tools. Once built, data pipelines require maintenance and security measures.

You can compare methods of converting raw data into data ready for analytical use with IBM's beginner-friendly online course, ETL and Data Pipelines with Shell, Airflow, and Kafka. More advanced learners may consider constructing a data pipeline while earning the Google Business Intelligence Professional Certificate, which is 100 percent online.


This content has been made available for informational purposes only. Learners are advised to conduct additional research to ensure that courses and other credentials pursued meet their personal, professional, and financial goals.