ETL and Data Pipelines with Shell, Airflow and Kafka

Name: ETL and Data Pipelines with Shell, Airflow and Kafka
Rating: 4.5 (457 reviews)

This course is part of multiple programs.

Instructors: Jeff Grossman

69,382 already enrolled

5 modules

Gain insight into a topic and learn the fundamentals.

457 reviews

Intermediate level

Recommended experience

Flexible schedule

2 weeks at 10 hours a week

Learn at your own pace

88%

Most learners liked this course

What you'll learn

Describe and contrast Extract, Transform, Load (ETL) processes and Extract, Load, Transform (ELT) processes.
Explain batch vs concurrent modes of execution.
Implement an ETL workflow using Bash and Python functions.
Describe data pipeline components, processes, tools, and technologies.

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

11 assignments, AI-graded

Taught in English

Build your subject-matter expertise

This course is available as part of

When you enroll in this course, you'll also be asked to select a specific program.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 5 modules in this course

Delve into the two different approaches to converting raw data into analytics-ready data. One approach is the Extract, Transform, Load (ETL) process. The other contrasting approach is the Extract, Load, and Transform (ELT) process. ETL processes apply to data warehouses and data marts. ELT processes apply to data lakes, where the data is transformed on demand by the requesting/calling application.
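
The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not course material: the record fields, the `clean` function, and the in-memory lists standing in for a data warehouse and a data lake are all assumptions made for the example.

```python
# Toy source records; the field names are hypothetical.
RAW_ROWS = [
    {"name": " Alice ", "amount": "10.50"},
    {"name": "BOB", "amount": "3"},
]


def clean(row):
    """Transform step: normalize the name and parse the amount."""
    return {"name": row["name"].strip().title(), "amount": float(row["amount"])}


def run_etl(source):
    """ETL: transform inside the pipeline, then load analytics-ready rows."""
    warehouse = []                      # stand-in for a data warehouse table
    for row in source:                  # extract
        warehouse.append(clean(row))    # transform, then load
    return warehouse


def run_elt(source):
    """ELT: load raw rows unchanged; transformation is deferred."""
    return [dict(row) for row in source]    # extract and load as-is


def query_lake(data_lake):
    """The requesting application transforms on demand at read time."""
    return [clean(row) for row in data_lake]


warehouse = run_etl(RAW_ROWS)
lake = run_elt(RAW_ROWS)

print(warehouse)          # cleaned rows are what gets stored
print(lake)               # raw rows are what gets stored
print(query_lake(lake))   # cleaned result, computed only when requested
```

In both cases the same cleaning logic runs; the difference is whether it runs before the data is stored or when an application asks for it.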

In this course, you will learn about the different tools and techniques used with ETL and data pipelines. Both ETL and ELT extract data from source systems, move the data through the data pipeline, and store the data in destination systems. During this course, you will experience how ELT and ETL processing differ and identify use cases for both. You will identify methods and tools used for extracting data, for merging extracted data either logically or physically, and for loading data into data repositories. You will also define transformations to apply to source data to make the data credible, contextual, and accessible to data users. You will be able to outline some of the methods for loading data into the destination system, verifying data quality, monitoring load failures, and using recovery mechanisms in case of failure. By the end of this course, you will know how to use Apache Airflow to build data pipelines and understand the advantages of this approach. You will also learn how to use Apache Kafka to build streaming pipelines and become familiar with Kafka's core components: brokers, topics, partitions, replications, producers, and consumers. Finally, you will complete a shareable final project that enables you to demonstrate the skills you acquired in each module.

ETL, or Extract, Transform, and Load, processes are used for cases where flexibility, speed, and scalability of data are important. You will explore some key differences between the similar processes ETL and ELT, including the place of transformation, flexibility, Big Data support, and time-to-insight. You will learn that an increasing demand for access to raw data is driving the evolution from ETL to ELT. Data extraction involves advanced technologies including database querying, web scraping, and APIs. You will also learn that data transformation is about formatting data to suit the application, and that data is loaded in batches or streamed continuously.

What's included

7 videos, 3 readings, 2 assignments, 1 plugin

7 videos (total 32 minutes)

Course Intro video (5 minutes)
ETL Fundamentals (5 minutes)
ELT Basics (4 minutes)
Comparing ETL and ELT (4 minutes)
Data Extraction Techniques (4 minutes)
Introduction to Data Transformation Techniques (4 minutes)
Data Loading Techniques (4 minutes)

3 readings (total 9 minutes)

IBM Product Spotlight: IBM Instana (2 minutes)
Course Introduction (4 minutes)
Summary & Highlights (3 minutes)

2 assignments (total 40 minutes)

Graded Quiz: ETL and ELT Processes (30 minutes)
ETL and ELT Processes (10 minutes)

1 plugin (total 5 minutes)

Interactivity: Tell the Difference between ETL and ELT (5 minutes)

Extract, Transform, Load (ETL) pipelines can be created with Bash scripts and run on a schedule using cron. Data pipelines move data from one place, or form, to another. Data pipeline processes include scheduling or triggering, monitoring, maintenance, and optimization. Batch pipelines extract and operate on batches of data, whereas streaming data pipelines ingest data packets one by one in rapid succession. In this module, you will learn that streaming pipelines apply when the most current data is needed. You will see how parallelization and I/O buffers help mitigate bottlenecks. You will also learn how to describe data pipeline performance in terms of latency and throughput.
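
To make the batch versus streaming distinction and the two performance terms concrete, here is a small Python sketch. The record list, the batch size, and the helper names are assumptions for illustration, and the timing figures are invented inputs rather than measurements.

```python
def process(record):
    """Hypothetical per-record transformation."""
    return record * 2


def run_batch(records, batch_size):
    """Batch mode: group records and operate on one group at a time."""
    results = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        results.append([process(r) for r in batch])
    return results


def run_streaming(record_source):
    """Streaming mode: handle each record as soon as it arrives."""
    for record in record_source:
        yield process(record)


def throughput(records_processed, elapsed_seconds):
    """Throughput: how much data passes through the pipeline per unit time."""
    return records_processed / elapsed_seconds


def latency(stage_seconds):
    """Latency: total time for one record to pass through every stage."""
    return sum(stage_seconds)


records = [1, 2, 3, 4, 5]
print(run_batch(records, batch_size=2))       # [[2, 4], [6, 8], [10]]
print(list(run_streaming(iter(records))))     # [2, 4, 6, 8, 10]
print(throughput(1000, elapsed_seconds=4))    # 250.0 records per second
print(latency([0.25, 0.5, 0.25]))             # 1.0 second end to end
```

The batch function cannot emit anything until a whole group has been collected, while the generator yields each result immediately, which is why streaming suits cases where the most current data is needed.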

What's included

5 videos, 4 readings, 4 assignments, 1 app item, 1 plugin

5 videos (total 25 minutes)

ETL Using Shell Scripting (5 minutes)
Introduction to Data Pipelines (4 minutes)
Key Data Pipeline Processes (5 minutes)
Batch versus Streaming Data Pipeline Use Cases (5 minutes)
Data Pipeline Tools and Technologies (7 minutes)

4 readings (total 15 minutes)

Linux Commands and Shell Scripting (2 minutes)
ETL Techniques (10 minutes)
Summary & Highlights (1 minute)
Summary & Highlights (2 minutes)

4 assignments (total 80 minutes)

Graded Quiz: ETL using Shell Scripts (30 minutes)
Graded Quiz: An Introduction to Data Pipelines (30 minutes)
Practice Quiz: ETL using Shell Scripts (10 minutes)
Practice Quiz: An Introduction to Data Pipelines (10 minutes)

1 app item (total 30 minutes)

Hands-On Lab: ETL using Shell Scripts (30 minutes)

1 plugin (total 10 minutes)

Interactivity: Differentiate between Batch Processing and Stream Processing (10 minutes)

The key advantage of Apache Airflow's approach of representing data pipelines as DAGs (directed acyclic graphs) is that they are expressed as code, which makes your data pipelines more maintainable, testable, and collaborative. Tasks, the nodes in a DAG, are created by instantiating Airflow's built-in operators. In this module, you will learn that Apache Airflow has a rich UI that simplifies working with data pipelines. You will explore how to visualize your DAG in graph or tree mode. You will also learn about the key components of a DAG definition file, and you will learn that Airflow logs are saved to local file systems and can then be sent to cloud storage, search engines, and log analyzers.
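
The idea of a pipeline expressed as a DAG in code can be sketched with only the Python standard library. This is deliberately not Airflow's API: the task functions and the `run_pipeline` helper are hypothetical, and `graphlib.TopologicalSorter` (Python 3.9 or later) stands in for the scheduler that derives execution order from task dependencies.

```python
from graphlib import TopologicalSorter


# Each task is a plain function here; in Airflow these would be operators.
def extract():
    return "extracted"


def transform():
    return "transformed"


def load():
    return "loaded"


TASKS = {"extract": extract, "transform": transform, "load": load}

# Edges of the DAG: each key maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_pipeline(tasks, dependencies):
    """Run every task once, only after all of its upstream tasks."""
    order = TopologicalSorter(dependencies).static_order()
    return [(name, tasks[name]()) for name in order]


for name, result in run_pipeline(TASKS, DEPENDENCIES):
    print(name, "->", result)
```

Because the tasks and their dependencies are ordinary Python objects, they can be kept in version control and checked with unit tests, which is the maintainability and testability benefit described above.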

What's included

5 videos, 1 reading, 2 assignments, 4 app items, 1 plugin

5 videos (total 25 minutes)

Apache Airflow Overview (6 minutes)
Advantages of Representing Data Pipelines as DAGs in Apache Airflow (7 minutes)
Apache Airflow UI (4 minutes)
Build a DAG Using Airflow (4 minutes)
Airflow Logging and Monitoring (4 minutes)

1 reading (total 3 minutes)

Summary & Highlights (3 minutes)

2 assignments (total 40 minutes)

Graded Quiz: Building Data Pipelines using Airflow (30 minutes)
Practice Quiz: Building Data Pipelines using Airflow (10 minutes)

4 app items (total 120 minutes)

Hands-on Lab: Getting Started with Apache Airflow (20 minutes)
Hands-on Lab: Create a DAG for Apache Airflow with PythonOperator (40 minutes)
Hands-on Lab: Create a DAG for Apache Airflow with BashOperator (40 minutes)
Hands-on Lab: Monitoring a DAG (20 minutes)

1 plugin (total 15 minutes)

Reading: DAG Structure and Operators (15 minutes)

Apache Kafka is a very popular open source event streaming platform. An event is a type of data that describes an entity's observable state updates over time. Popular Kafka service providers include Confluent Cloud, IBM Event Streams, and Amazon MSK. Additionally, the Kafka Streams API is a client library that supports data processing in event streaming pipelines. In this module, you will learn that the core components of Kafka are brokers, topics, partitions, replications, producers, and consumers. You will explore two special types of processors in the Kafka Streams API stream-processing topology: the source processor and the sink processor. You will also learn about building event streaming pipelines using Kafka.
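
How topics, partitions, producers, and consumers relate can be illustrated with a toy in-memory model in Python. None of this is the Kafka client API: the `Topic` and `Consumer` classes, their methods, and the CRC32-based key routing are assumptions chosen for the sketch, and brokers and replication are left out entirely.

```python
import zlib


class Topic:
    """A named log split into partitions; each partition is append-only."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Route by key so events with the same key share one partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def produce(self, key, value):
        """Producer side: append an event and return (partition, offset)."""
        index = self.partition_for(key)
        self.partitions[index].append((key, value))
        return index, len(self.partitions[index]) - 1


class Consumer:
    """Consumer side: reads each partition from its own saved offset."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        events = []
        for index, log in enumerate(self.topic.partitions):
            events.extend(log[self.offsets[index]:])
            self.offsets[index] = len(log)    # remember where reading stopped
        return events


topic = Topic("sensor-readings", num_partitions=3)
topic.produce("sensor-1", 20.5)
topic.produce("sensor-2", 19.0)
topic.produce("sensor-1", 21.0)

consumer = Consumer(topic)
first = consumer.poll()     # all three events, grouped by partition
second = consumer.poll()    # nothing new since the saved offsets
print(len(first), len(second))
```

Events that share a key land in the same partition, so their relative order is preserved for the consumer, while the per-partition offsets let a consumer resume without rereading old events.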

What's included

4 videos, 1 reading, 2 assignments, 3 app items, 1 plugin

4 videos (total 26 minutes)

Distributed Event Streaming Platform Components (6 minutes)
Apache Kafka Overview (6 minutes)
Building Event Streaming Pipelines using Kafka (10 minutes)
Kafka Streaming Process (5 minutes)

1 reading

Summary & Highlights (0 minutes)

2 assignments (total 40 minutes)

Graded Quiz: Building Streaming Pipelines using Kafka (30 minutes)
Practice Quiz: Building Streaming Pipelines using Kafka (10 minutes)

3 app items (total 90 minutes)

Hands-on Lab: Working with Streaming Data using Kafka (20 minutes)
[Optional] Hands-on Lab: Kafka Message Keys and Offset (40 minutes)
[Optional] Hands-on Lab: Kafka Python Client (30 minutes)

1 plugin (total 30 minutes)

Kafka Python Client (30 minutes)

In this final assignment module, you will apply your newly gained knowledge in the hands-on labs for “Creating ETL Data Pipelines using Apache Airflow”. You will build these ETL pipelines using real-world scenarios.

What's included

5 readings, 1 assignment, 1 peer review, 4 app items, 1 plugin

5 readings (total 25 minutes)

Project Overview (10 minutes)
Graded Timed Final Exam Instructions (10 minutes)
What's Next: Explore IBM Instana (1 minute)
Congrats & Next Steps (2 minutes)
Thanks from the Course Team (2 minutes)

1 assignment (total 90 minutes)

Timed Final Quiz (90 minutes)

1 peer review (total 30 minutes)

Option 2: Peer Review: Project Submission and Peer Review (30 minutes)

4 app items (total 275 minutes)

Option 1: AI-Graded - Final Submission and Evaluation (50 minutes)
Hands-on Lab: Build ETL Data Pipelines with BashOperator using Apache Airflow (90 minutes)
[Optional] Hands-on Lab: Build an ETL Pipeline using PythonOperator with Apache Airflow (90 minutes)
[Optional] Hands-on Lab: Build a Streaming ETL Pipeline using Kafka (45 minutes)

1 plugin (total 15 minutes)

Reading: Final Submission Guidelines and Deliverables (15 minutes)

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructors

Instructor ratings

(110 ratings)

Jeff Grossman

IBM

3 Courses, 744,420 learners

Yan Luo

IBM

7 Courses, 405,771 learners

Offered by

IBM

Learner reviews

5 stars: 71.55%
4 stars: 17.50%
3 stars: 6.12%
2 stars: 2.40%
1 star: 2.40%

Showing 3 of 457

Reviewed on Jul 22, 2023

Labs in this course are very helpful and to the point. It took me a while to complete this course but i learned a lot.

Reviewed on Jan 20, 2025

Relevant information in recordings, good recap of every video and hand-on lesson in the end to concrete the knowledge.

Reviewed on Sep 6, 2022

Very useful high-level overview with practical examples of the major technologies that drive modern data pipelines.

Frequently asked questions

To access the course materials, assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Certificate, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.