ETL and Data Pipelines with Shell, Airflow and Kafka

This course is part of multiple programs.

Instructors: Jeff Grossman

What you'll learn

Describe and contrast Extract, Transform, Load (ETL) processes and Extract, Load, Transform (ELT) processes.
Explain batch vs concurrent modes of execution.
Implement ETL workflows using Bash and Python functions.
Describe data pipeline components, processes, tools, and technologies.

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

11 assignments

Taught in English

Build your subject-matter expertise

This course is available as part of

When you enroll in this course, you'll also be asked to select a specific program.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 5 modules in this course

Delve into the two different approaches to converting raw data into analytics-ready data. One approach is the Extract, Transform, Load (ETL) process. The other contrasting approach is the Extract, Load, and Transform (ELT) process. ETL processes apply to data warehouses and data marts. ELT processes apply to data lakes, where the data is transformed on demand by the requesting/calling application.

In this course, you will learn about the tools and techniques used with ETL and data pipelines. Both ETL and ELT extract data from source systems, move the data through the data pipeline, and store the data in destination systems. During this course, you will experience how ELT and ETL processing differ and identify use cases for both. You will identify methods and tools used for extracting data, for merging extracted data either logically or physically, and for loading data into data repositories. You will also define transformations to apply to source data to make the data credible, contextual, and accessible to data users. You will be able to outline several methods for loading data into the destination system, verifying data quality, monitoring load failures, and using recovery mechanisms in case of failure. By the end of this course, you will know how to use Apache Airflow to build data pipelines and understand the advantages of this approach. You will also learn how to use Apache Kafka to build streaming pipelines and learn about the core components of Kafka: brokers, topics, partitions, replications, producers, and consumers. Finally, you will complete a shareable final project that enables you to demonstrate the skills you acquired in each module.

ETL or Extract, Transform, and Load processes are used for cases where flexibility, speed, and scalability of data are important. You will explore some key differences between similar processes, ETL and ELT, which include the place of transformation, flexibility, Big Data support, and time-to-insight. You will learn that there is an increasing demand for access to raw data that drives the evolution from ETL to ELT. Data extraction involves advanced technologies including database querying, web scraping, and APIs. You will also learn that data transformation is about formatting data to suit the application and that data is loaded in batches or streamed continuously.
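As a rough illustration of the "place of transformation" difference, the following Python sketch contrasts the two approaches using in-memory lists. The function names, the sample rows, and the idea of modelling a warehouse and a lake as plain lists are all assumptions made for this example, not part of the course material.

```python
# Hypothetical sample data standing in for a source system.
raw_rows = [
    {"name": " Ada ", "amount": "10.5"},
    {"name": "Grace", "amount": "7"},
]


def transform(row):
    """Clean one raw row: trim the name and convert the amount to a float."""
    return {"name": row["name"].strip(), "amount": float(row["amount"])}


def etl(source):
    """ETL: transform inside the pipeline, then load the cleaned rows."""
    warehouse = [transform(row) for row in source]
    return warehouse


def elt(source):
    """ELT: load the raw rows unchanged; no transformation happens here."""
    lake = list(source)
    return lake


def query_lake(lake):
    """ELT, later step: transform on demand when an application reads the data."""
    return [transform(row) for row in lake]


warehouse = etl(raw_rows)
lake = elt(raw_rows)
on_demand = query_lake(lake)
```

In this sketch both paths end with the same cleaned rows; the difference is that the lake still holds the untouched raw data, so a different transformation could be applied later.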

What's included

7 videos, 2 readings, 2 assignments, 1 plugin

7 videos (total 32 minutes)

Course Intro video (5 minutes)
ETL Fundamentals (5 minutes)
ELT Basics (4 minutes)
Comparing ETL and ELT (4 minutes)
Data Extraction Techniques (4 minutes)
Introduction to Data Transformation Techniques (4 minutes)
Data Loading Techniques (3 minutes)

2 readings (total 7 minutes)

Course Introduction (4 minutes)
Summary & Highlights (3 minutes)

2 assignments (total 40 minutes)

ETL and ELT Processes (10 minutes)
Graded Quiz: ETL and ELT Processes (30 minutes)

1 plugin (total 5 minutes)

Interactivity: Tell the Difference between ETL and ELT (5 minutes)

Extract, transform, and load (ETL) pipelines can be created with Bash scripts and run on a schedule using cron. Data pipelines move data from one place, or form, to another. Data pipeline processes include scheduling or triggering, monitoring, maintenance, and optimization. Batch pipelines extract and operate on batches of data, whereas streaming data pipelines ingest data packets one by one in rapid succession. In this module, you will learn that streaming pipelines apply when the most current data is needed. You will explore how parallelization and I/O buffers help mitigate bottlenecks. You will also learn how to describe data pipeline performance in terms of latency and throughput.
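The batch versus streaming distinction, and the throughput measure, can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the upper-casing transformation are hypothetical, and throughput is taken here to mean records processed per second.

```python
from typing import Iterable, Iterator, List


def transform(record: str) -> str:
    """A stand-in transformation: trim whitespace and upper-case the record."""
    return record.strip().upper()


def run_batch(records: List[str]) -> List[str]:
    """Batch mode: the whole batch is processed before any result is returned."""
    return [transform(record) for record in records]


def run_stream(records: Iterable[str]) -> Iterator[str]:
    """Streaming mode: each record is transformed and emitted as it arrives."""
    for record in records:
        yield transform(record)


def throughput(records_processed: int, elapsed_seconds: float) -> float:
    """Throughput as records per second (assumed definition for this sketch)."""
    return records_processed / elapsed_seconds


batch_result = run_batch([" a ", "b"])
first_streamed = next(run_stream(iter([" a ", "b"])))
```

With the generator, the first result is available after one record has been read, which is why streaming suits cases where the most current data is needed; the batch function returns nothing until every record has been handled.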

What's included

5 videos, 4 readings, 4 assignments, 1 app item, 1 plugin

5 videos (total 25 minutes)

ETL Using Shell Scripting (4 minutes)
Introduction to Data Pipelines (4 minutes)
Key Data Pipeline Processes (4 minutes)
Batch versus Streaming Data Pipeline Use Cases (4 minutes)
Data Pipeline Tools and Technologies (6 minutes)

4 readings (total 15 minutes)

Linux Commands and Shell Scripting (2 minutes)
ETL Techniques (10 minutes)
Summary & Highlights (1 minute)
Summary & Highlights (2 minutes)

4 assignments (total 80 minutes)

Practice Quiz: ETL using Shell Scripts (10 minutes)
Practice Quiz: An Introduction to Data Pipelines (10 minutes)
Graded Quiz: ETL using Shell Scripts (30 minutes)
Graded Quiz: An Introduction to Data Pipelines (30 minutes)

1 app item (total 30 minutes)

Hands-On Lab: ETL using Shell Scripts (30 minutes)

1 plugin (total 10 minutes)

Interactivity: Differentiate between Batch Processing and Stream Processing (10 minutes)

The key advantage of Apache Airflow's approach to representing data pipelines as DAGs (directed acyclic graphs) is that they are expressed as code, which makes your data pipelines more maintainable, testable, and collaborative. Tasks, the nodes in a DAG, are created by instantiating Airflow's built-in operators. In this module, you will learn that Apache Airflow has a rich UI that simplifies working with data pipelines. You will explore how to visualize your DAG in graph or tree mode. You will also learn about the key components of a DAG definition file, and you will learn that Airflow logs are saved to local file systems and can then be sent to cloud storage, search engines, and log analyzers.
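To make the "pipeline as a DAG expressed in code" idea concrete, here is a small sketch that uses only the Python standard library. It is not Airflow's API: the task functions, the `dependencies` mapping, and the `run` helper are hypothetical stand-ins that show tasks as nodes, dependencies as edges, and execution in dependency order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Records which tasks ran, in order, so the result can be inspected.
log = []


def extract():
    log.append("extract")


def transform():
    log.append("transform")


def load():
    log.append("load")


# Nodes of the DAG: task name -> callable (hypothetical tasks).
tasks = {"extract": extract, "transform": transform, "load": load}

# Edges of the DAG: task name -> set of upstream tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run(tasks, dependencies):
    """Run every task after all of its upstream tasks have run."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


execution_order = run(tasks, dependencies)
```

Because the pipeline is ordinary code, it can be kept in version control, reviewed, and tested like any other module, which is the maintainability argument made above. In Airflow itself the nodes would be operator instances inside a DAG definition file rather than plain functions.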

What's included

5 videos, 1 reading, 2 assignments, 4 app items, 1 plugin

5 videos (total 25 minutes)

Apache Airflow Overview (6 minutes)
Advantages of Representing Data Pipelines as DAGs in Apache Airflow (6 minutes)
Apache Airflow UI (3 minutes)
Build a DAG Using Airflow (4 minutes)
Airflow Logging and Monitoring (4 minutes)

1 reading (total 3 minutes)

Summary & Highlights (3 minutes)

2 assignments (total 40 minutes)

Practice Quiz: Building Data Pipelines using Airflow (10 minutes)
Graded Quiz: Building Data Pipelines using Airflow (30 minutes)

4 app items (total 120 minutes)

Hands-on Lab: Getting Started with Apache Airflow (20 minutes)
Hands-on Lab: Create a DAG for Apache Airflow with PythonOperator (40 minutes)
Hands-on Lab: Create a DAG for Apache Airflow with BashOperator (40 minutes)
Hands-on Lab: Monitoring a DAG (20 minutes)

1 plugin (total 15 minutes)

Reading: DAG Structure and Operators (15 minutes)

Apache Kafka is a very popular open source event streaming platform. An event is a type of data that describes an entity's observable state updates over time. Popular Kafka service providers include Confluent Cloud, IBM Event Streams, and Amazon MSK. Additionally, the Kafka Streams API is a client library that supports data processing in event streaming pipelines. In this module, you will learn that the core components of Kafka are brokers, topics, partitions, replications, producers, and consumers. You will explore two special types of processors in the Kafka Streams API stream-processing topology: the source processor and the sink processor. You will also learn about building event streaming pipelines using Kafka.
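The relationship between topics, partitions, producers, and consumers can be pictured with a toy in-memory model. The sketch below is not a Kafka client and does not model brokers or replication; the `Topic` and `Consumer` classes, the topic name, and the CRC32-based partitioning are assumptions chosen for illustration (Kafka's own default partitioner uses a different hash).

```python
import zlib


class Topic:
    """A toy topic: a fixed number of partitions, each an append-only list."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stand-in hash so that the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Producer side: append an event and return (partition, offset)."""
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1


class Consumer:
    """A toy consumer that remembers its read offset for each partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        """Return all events not yet read, then advance the offsets."""
        records = []
        for partition, events in enumerate(self.topic.partitions):
            records.extend(events[self.offsets[partition]:])
            self.offsets[partition] = len(events)
        return records


topic = Topic("demo-events", num_partitions=2)  # hypothetical topic name
first = topic.append("car-1", "entered")
second = topic.append("car-1", "exited")
third = topic.append("truck-7", "entered")

consumer = Consumer(topic)
records = consumer.poll()
```

Events that share a key land in the same partition and keep their order there, and the consumer's stored offsets are what let a second `poll()` return only events it has not seen yet.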

What's included

4 videos, 1 reading, 2 assignments, 3 app items, 1 plugin

4 videos (total 26 minutes)

Distributed Event Streaming Platform Components (5 minutes)
Apache Kafka Overview (6 minutes)
Building Event Streaming Pipelines using Kafka (9 minutes)
Kafka Streaming Process (5 minutes)

1 reading

Summary & Highlights (0 minutes)

2 assignments (total 40 minutes)

Practice Quiz: Building Streaming Pipelines using Kafka (10 minutes)
Graded Quiz: Building Streaming Pipelines using Kafka (30 minutes)

3 app items (total 90 minutes)

Hands-on Lab: Working with Streaming Data using Kafka (20 minutes)
[Optional] Hands-on Lab: Kafka Message Keys and Offset (40 minutes)
[Optional] Hands-on Lab: Kafka Python Client (30 minutes)

1 plugin (total 30 minutes)

Kafka Python Client (30 minutes)

In this final assignment module, you will apply your newly gained knowledge in two hands-on labs: “Creating ETL Data Pipelines using Apache Airflow” and “Creating Streaming Data Pipelines using Kafka”. You will build these pipelines using real-world scenarios. You will extract, transform, and load data into a CSV file. You will also create a topic named “toll” in Apache Kafka, download and customize a streaming data consumer, and verify that streaming data has been collected in the database table.
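The extract, transform, and load-to-CSV step might look roughly like the following Python sketch. The sample rows, the column names (`id`, `vehicle_type`, `axles`), and the upper-casing transformation are invented for illustration and are not taken from the project instructions; an in-memory buffer stands in for the input and output files.

```python
import csv
import io

# Hypothetical raw input: id, vehicle type, number of axles.
RAW = "1,car,4\n2,truck,6\n"


def extract(text):
    """Extract: parse comma-separated text into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


def transform(rows):
    """Transform: upper-case the vehicle type, leave other fields unchanged."""
    return [[row_id, vehicle.upper(), axles] for row_id, vehicle, axles in rows]


def load(rows):
    """Load: write a header and the rows as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "vehicle_type", "axles"])
    writer.writerows(rows)
    return out.getvalue()


result = load(transform(extract(RAW)))
```

In a real pipeline, each of these three functions could become a separate task so that a failure in one stage can be retried without repeating the others.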

What's included

4 readings, 1 assignment, 1 peer review, 3 app items

4 readings (total 24 minutes)

Project Overview (10 minutes)
Graded Timed Final Exam Instructions (10 minutes)
Congrats & Next Steps (2 minutes)
Thanks from the Course Team (2 minutes)

1 assignment (total 90 minutes)

Timed Final Quiz (90 minutes)

1 peer review (total 60 minutes)

Peer Review: Project Submission and Peer Review (60 minutes)

3 app items (total 225 minutes)

Hands-on Lab: Build ETL Data Pipelines with BashOperator using Apache Airflow (90 minutes)
[Optional] Hands-on Lab: Build an ETL Pipeline using PythonOperator with Apache Airflow (90 minutes)
[Optional] Hands-on Lab: Build a Streaming ETL Pipeline using Kafka (45 minutes)