PySpark in Action: Hands-on Data Processing is a foundational course designed to help you begin working with PySpark and distributed data processing. You will explore the essential concepts of Big Data, Hadoop, and Apache Spark, and gain practical experience using PySpark to process and analyze large datasets. Through hands-on exercises, you will work with RDDs, DataFrames, and SQL queries in PySpark, giving you the skills to manage data at scale.
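As a small taste of the kind of code the hands-on exercises involve, here is a minimal sketch using a local SparkSession and a hypothetical toy sales dataset; it touches an RDD transformation, a DataFrame aggregation, and the same aggregation as a SQL query:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local SparkSession (entry point for the DataFrame and SQL APIs)
spark = SparkSession.builder.appName("pyspark-taste").master("local[*]").getOrCreate()

# RDD transformation and action: square a range of numbers, then sum them
squares = spark.sparkContext.parallelize(range(1, 6)).map(lambda x: x * x)
print(squares.sum())  # 55

# DataFrame API: a hypothetical sales dataset aggregated by city
sales = spark.createDataFrame(
    [("Paris", 120.0), ("Paris", 80.0), ("Berlin", 95.0)],
    ["city", "amount"],
)
sales.groupBy("city").agg(F.sum("amount").alias("total")).show()

# The same aggregation expressed with PySpark SQL
sales.createOrReplaceTempView("sales")
spark.sql("SELECT city, SUM(amount) AS total FROM sales GROUP BY city").show()

spark.stop()
```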
By the end of this course, you will be able to:
- Explore foundational concepts of Big Data and the components of the Hadoop ecosystem
- Explain the architecture and key principles underlying Apache Spark
- Utilize RDD transformations and actions to process large-scale datasets with PySpark
- Execute advanced DataFrame operations, including handling complex data types and performing aggregations
- Evaluate and enhance data processing workflows by leveraging PySpark SQL and advanced DataFrame techniques
This course is ideal for learners who are new to data engineering and want to understand how to use PySpark effectively.
Basic knowledge of Python is recommended, but no prior experience with PySpark is necessary.
Start your journey with PySpark and build a strong foundation in distributed data processing!