What types of data processing tasks will I be able to perform after completing the course?

You will be able to perform a variety of tasks, including data cleaning, transformation, aggregation, and analysis of large datasets using PySpark’s RDDs and DataFrames.

What technologies and frameworks are covered in the course?

You’ll learn PySpark in detail, along with its integration with Hadoop, RDDs, DataFrames, and SQL-based data processing.

Is prior knowledge in data engineering required?

No, prior experience is not required; the course introduces PySpark basics before moving to advanced use cases.

Does the course cover workflow automation and ETL?

Yes, you’ll learn how to design ETL workflows and automate big data processing with PySpark.

Can I preview a course before enrolling?

Yes, you can preview the first video and view the syllabus before you enroll. You must purchase the course to access content not included in the preview.

When will I have access to the lectures and assignments?

If you decide to enroll in the course before the session start date, you will have access to all of the lecture videos and readings for the course. You’ll be able to submit assignments once the session starts.

What will I get when I enroll?

Once you enroll and your session begins, you will have access to all videos and other resources, including reading items and the course discussion forum. You’ll be able to view and submit practice assessments, and complete required graded assignments to earn a grade and a Course Certificate.

When will I receive my Course Certificate?

If you complete the course successfully, your electronic Course Certificate will be added to your Accomplishments page. From there, you can print your Course Certificate or add it to your LinkedIn profile.

Why can’t I audit this course?

This course is currently available only to learners who have paid or received financial aid, when available.

Is financial aid available?

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.

PySpark in Action: Hands-On Data Processing

This course is part of the "PySpark for Data Science" Specialization

Instructor: Edureka

5 modules

Gain insight into a topic and learn the fundamentals.

Intermediate level

Recommended experience

2 weeks to complete

at 10 hours a week

Flexible schedule

Learn at your own pace

What you'll learn

Explore the fundamental concepts of Big Data and the components of the Hadoop ecosystem.
Explain the architecture and key principles of Apache Spark and its role in big data processing.
Utilize RDD transformations and actions to effectively process large-scale datasets with PySpark.
Execute advanced DataFrame operations, including data manipulation and aggregation techniques.

Skills you'll gain

Data Storage Technologies
Data Manipulation
Data Pipelines
SQL
Data Wrangling
Data Architecture
Distributed Computing
Data Storage
Data Transformation
Big Data
Data Processing
Performance Tuning
Data Integration

Tools you'll learn

Apache Hadoop
Apache Spark
PySpark

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

17 assignments

Taught in English

Build your subject-matter expertise

This course is part of the "PySpark for Data Science" Specialization

When you enroll in this course, you'll also be enrolled in this Specialization.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 5 modules in this course

PySpark in Action: Hands-On Data Processing is a practical course that equips you to work confidently with large-scale data using PySpark and distributed data processing frameworks. You’ll discover the fundamentals of Big Data, Apache Hadoop, and Apache Spark, then build on this knowledge through real-world exercises where you’ll process and analyze massive datasets.

During the course, you’ll gain hands-on experience with:
- Foundational concepts of Big Data and components of the Hadoop ecosystem such as HDFS, enabling you to understand modern data storage and processing.
- Spark architecture and critical design principles for scalable, fault-tolerant data workflows.
- RDD transformations and actions, helping you handle large-scale datasets using PySpark’s distributed processing engine.
- Advanced DataFrame techniques: manage complex data types, perform aggregations, and solve business data challenges efficiently.
- PySpark SQL for applying advanced queries, optimizing processing workflows, and enabling rapid, reliable analysis at scale.

This course is ideal for those new to data engineering or distributed computing who want a hands-on introduction to PySpark for large-scale data tasks. If you have basic Python skills but no prior experience in data engineering, you’ll find accessible explanations and step-by-step projects throughout.

By course completion, you’ll be prepared to use PySpark in real-world projects, build and monitor data pipelines, automate processing, clean and integrate diverse datasets, and confidently tackle core challenges in distributed data analytics.

This module introduces you to the fundamental concepts of Big Data and Hadoop. You will explore the Hadoop ecosystem, its components, and the Hadoop Distributed File System (HDFS), setting the foundation for understanding big data processing and storage solutions.

Included

15 videos, 5 readings, 4 assignments, 1 discussion prompt

15 videos (total 74 minutes)

Course Introduction (4 minutes)
What is Big Data? (4 minutes)
Applications of Big Data (5 minutes)
What is Hadoop? (5 minutes)
Hadoop Ecosystem (2 minutes)
Working of HDFS (5 minutes)
Introduction to Apache Spark (7 minutes)
Master-slave Architecture (7 minutes)
Spark Architecture (2 minutes)
Data Processing with Apache Spark (6 minutes)
Directed Acyclic Graph (DAG) (5 minutes)
Introduction to Spark Ecosystem (5 minutes)
What is PySpark? (5 minutes)
Key Features of PySpark (7 minutes)
Basics of Python (6 minutes)

5 readings (total 50 minutes)

Welcome to PySpark in Action: Hands-On Data Processing (10 minutes)
What is Big Data? – A Beginner’s Guide to the World of Big Data (10 minutes)
Spark SQL (10 minutes)
Features of PySpark (10 minutes)
Module Summary: Big Data Processing with PySpark (10 minutes)

4 assignments (total 38 minutes)

Knowledge Check: Big Data Processing with PySpark (20 minutes)
Practice Quiz: Big Data Essentials (6 minutes)
Practice Quiz: Apache Spark Fundamentals (6 minutes)
Practice Quiz: PySpark (6 minutes)

1 discussion prompt (total 10 minutes)

Introduce Yourself (10 minutes)

Dive into the core of PySpark by learning about Resilient Distributed Datasets (RDDs). This module covers the fundamentals of RDDs, how they work, and their key transformations and actions, enabling efficient distributed data processing in PySpark.
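To make the transformation/action split concrete, here is a minimal plain-Python sketch (not PySpark code, and not taken from the course materials). It uses Python's lazy `map` and `filter` iterators as a stand-in for RDD transformations, and a small hypothetical helper, `reduce_by_key`, to mimic what `reduceByKey` computes. The sample data and all names are assumptions for illustration only.

```python
# Hypothetical sample data: (item, quantity) pairs from a grocery store log.
sales = [("apple", 3), ("bread", 1), ("apple", 2), ("milk", 4), ("bread", 5)]

evaluated = []  # records which elements have actually been processed


def double_quantity(pair):
    """Per-element function, comparable to what you would pass to rdd.map()."""
    evaluated.append(pair[0])
    return (pair[0], pair[1] * 2)


# "Transformations": map and filter return lazy iterators, so no element
# has been touched yet. Spark defers RDD transformations in a similar way.
mapped = map(double_quantity, sales)
filtered = filter(lambda kv: kv[1] >= 4, mapped)
assert evaluated == []  # nothing has run so far


def reduce_by_key(pairs, func):
    """Hypothetical helper: merge the values of each key with func."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


# Consuming the pipeline forces evaluation. In Spark itself, reduceByKey is
# also lazy and an action such as collect() triggers the work; this helper
# consumes the pipeline eagerly only to keep the sketch short.
totals = reduce_by_key(filtered, lambda a, b: a + b)
print(totals)  # {'apple': 10, 'milk': 8, 'bread': 10}
```

With an actual `SparkContext`, the same pipeline would typically be written with `rdd.map(...)`, `rdd.filter(...)`, and `rdd.reduceByKey(...)`, followed by `collect()` to bring the results back to the driver.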

Included

25 videos, 4 readings, 4 assignments, 3 discussion prompts

25 videos (total 121 minutes)

Introduction to RDDs (6 minutes)
Working of RDDs (5 minutes)
Creating RDDs (7 minutes)
Essentials of RDD (6 minutes)
Key Concepts of RDD (6 minutes)
Understanding Lazy Evaluations (5 minutes)
Advantages of Lazy Evaluation (3 minutes)
Introduction to Transformations (5 minutes)
Narrow and Wide Transformations (6 minutes)
Transformations: Map (6 minutes)
Transformations: Filter, Reduce and groupByKey (4 minutes)
Transformations: Distinct, Sample and Join (5 minutes)
Transformations: Union and Subtract (3 minutes)
Introduction to Repartition (6 minutes)
Significance of Repartition (1 minute)
Introduction to Actions (5 minutes)
Actions: collect, reduce and reduceByKey (5 minutes)
Implementing Actions: collect, reduce and reduceByKey (3 minutes)
Actions: count, foreach and aggregate (6 minutes)
Implementing Actions: count, foreach and aggregate (3 minutes)
Actions: Coalesce, histogram and sortBy (4 minutes)
Implementing Actions: Coalesce, histogram and sortBy (3 minutes)
Working with RDD Transformations (6 minutes)
Applying Distinct, sample and join Transformations (3 minutes)
Grocery Store Data Analysis with PySpark RDDs (7 minutes)

4 readings (total 40 minutes)

PySpark RDDs in Organization (10 minutes)
Managing RDD Transformations in PySpark (10 minutes)
Optimizing RDD operations in PySpark (10 minutes)
Module Summary: Working with RDD (10 minutes)

4 assignments (total 38 minutes)

Knowledge Check: Working with RDD (20 minutes)
Introduction to RDD (6 minutes)
RDD Transformations (6 minutes)
RDD Actions (6 minutes)

3 discussion prompts (total 30 minutes)

Introduction to RDDs (10 minutes)
Transformations: Map (10 minutes)
Actions: Coalesce, histogram, and sortBy (10 minutes)

This module covers the creation and manipulation of DataFrames in PySpark. You will learn how to perform basic and advanced operations, including aggregation, grouping, and handling missing data, with a focus on optimizing large-scale data processing tasks.
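As a rough illustration of two operations named above, handling missing data and grouped aggregation, the following plain-Python sketch shows what those steps compute on a handful of rows. It is not PySpark code and does not come from the course; the rows, column names, and the choice of 0.0 as the fill value are all assumptions.

```python
from collections import defaultdict

# Hypothetical rows; None stands in for a missing (null) value.
rows = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": None},
    {"region": "north", "amount": 80.0},
    {"region": "south", "amount": 50.0},
]

# Step 1: handle missing data by replacing nulls with a default value.
# A new list is built and the input rows are left unchanged, much as
# DataFrame operations return new DataFrames rather than mutating data.
filled = [
    {**row, "amount": 0.0 if row["amount"] is None else row["amount"]}
    for row in rows
]

# Step 2: group by region, then aggregate each group.
groups = defaultdict(list)
for row in filled:
    groups[row["region"]].append(row["amount"])

summary = {
    region: {"total": sum(amounts), "count": len(amounts)}
    for region, amounts in groups.items()
}
print(summary)
# {'north': {'total': 200.0, 'count': 2}, 'south': {'total': 50.0, 'count': 2}}
```

On a real PySpark DataFrame, the comparable calls would be along the lines of `df.na.fill(...)` followed by `df.groupBy("region").agg(...)`, with Spark distributing the work across partitions instead of looping in a single process.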

Included

22 videos, 4 readings, 4 assignments, 1 discussion prompt

22 videos (total 116 minutes)

Overview of DataFrames (7 minutes)
Introduction to DataFrames API (4 minutes)
Creating DataFrames from Different Sources (7 minutes)
DataFrames from RDD (6 minutes)
Basic DataFrame Operations (6 minutes)
Implementation of DataFrame Operations (4 minutes)
Performing Aggregations and Groupings - GroupBy and Window (6 minutes)
Performing Aggregations and Groupings - Cube and Rollup (4 minutes)
Handling Missing Data - Managing Null Values (7 minutes)
Demonstration for Handling Missing Data (4 minutes)
Working with Complex Data Types - Arrays and Structs (7 minutes)
Demonstration for Working with Complex Data Types (3 minutes)
Advanced DataFrame Transformations and Actions (7 minutes)
Demonstration: Working with DataFrames (7 minutes)
Introduction to Data Visualization and Key Aspects (5 minutes)
Introduction to Data Visualization - General Visuals (4 minutes)
Libraries for Data Visualization - Matplotlib and Seaborn (4 minutes)
Libraries for Data Visualization - Plotly (4 minutes)
Implementing Data Visualization (6 minutes)
Implementing Data Visualization - Plotting Charts (6 minutes)
Customizing the Visualizations (4 minutes)
Customizing Charts and Visuals (6 minutes)

4 readings (total 40 minutes)

Importance of PySpark DataFrames (10 minutes)
Window Functions in PySpark (10 minutes)
Data Visualization Libraries in PySpark (10 minutes)
Module Summary: PySpark DataFrames (10 minutes)

4 assignments (total 38 minutes)

Knowledge Check: PySpark DataFrames (20 minutes)
Introduction to PySpark DataFrames (6 minutes)
Advanced DataFrame Operations (6 minutes)
Data Visualizations with PySpark DataFrames (6 minutes)

1 discussion prompt (total 5 minutes)

PySpark DataFrames and Traditional Pandas DataFrames (5 minutes)

In this module, you will explore the SQL capabilities of PySpark. Learn how to perform CRUD operations, execute SQL commands, and merge and aggregate data using PySpark SQL. You'll also discover best practices for using SQL with PySpark to enhance data workflows.
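The SQL command families this module lists (DDL, DML, DQL, TCL) can be tried out without a Spark cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine, so it shows generic SQL rather than PySpark SQL; the `orders` table and its rows are hypothetical and not from the course.

```python
import sqlite3

# In-memory SQLite database used as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define a table.
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)

# DML (create): insert rows, then make them permanent with a commit (TCL).
cur.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("ana", 40.0), ("ben", 15.5), ("ana", 22.0)],
)
conn.commit()

# DML (update): change existing rows and commit.
cur.execute("UPDATE orders SET total = total + 5 WHERE customer = 'ben'")
conn.commit()

# DML (delete) followed by a TCL rollback: the delete is undone.
cur.execute("DELETE FROM orders")
conn.rollback()
remaining = cur.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# DQL (read): aggregate per customer.
per_customer = cur.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(remaining, per_customer)  # 3 [('ana', 62.0), ('ben', 20.5)]
```

In PySpark, queries like the final `SELECT` are usually run through `spark.sql(...)` against a DataFrame registered as a temporary view. Which DDL, DML, and transaction statements are available there depends on the underlying table format, so treat the statements above as generic SQL practice.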

Included

28 videos, 4 readings, 4 assignments, 2 discussion prompts

28 videos (total 135 minutes)

Structured Data vs. Unstructured Data (5 minutes)
Characteristic of Structured Data (5 minutes)
Relational Database and its Components (7 minutes)
SQL in Relation with Relational Database (6 minutes)
Normalization and its Types (6 minutes)
Exploring Different Types of Normalization (4 minutes)
Data Querying and Filtering Logic (6 minutes)
DDL Commands - Creating Tables (5 minutes)
DDL Commands - Altering and Truncating Tables (4 minutes)
DQL Commands - Select Statement and Where Clause (4 minutes)
DQL Commands - Practical Implementation (4 minutes)
DML Commands - Insert, Update, and Delete (4 minutes)
DML Commands - Lock (4 minutes)
DCL Commands (7 minutes)
TCL Commands (6 minutes)
Alter - Altering a Table and Constraints (5 minutes)
Alter - Altering Indexes and Views (3 minutes)
Performing CRUD Operations (6 minutes)
Operations on PySpark SQL DataFrames (4 minutes)
Performing Operations on PySpark SQL DataFrames (7 minutes)
Data Merging and Aggregation using PySpark SQL (5 minutes)
Implementing Data Merging and Aggregation using PySpark SQL (4 minutes)
SQL Best Practices (6 minutes)
Data Integrity and Error Handling with PySpark (3 minutes)
Problem Statement: E-commerce Organization (4 minutes)
Data Analysis of an E-commerce Organization (4 minutes)
Demonstration: Spark SQL - Retail Organization (4 minutes)
Demonstration: Analyzing the Data (4 minutes)

4 readings (total 34 minutes)

Best Practices for Data Querying: Optimizing SQL Performance (8 minutes)
User-Defined Functions (UDFs) in PySpark (8 minutes)
Best Practices for Using SQL with PySpark (8 minutes)
Module Summary: PySpark SQL (10 minutes)

4 assignments (total 38 minutes)

Knowledge Check: PySpark SQL (20 minutes)
Introduction to SQL (6 minutes)
SQL Commands (6 minutes)
Working with PySpark SQL (6 minutes)

2 discussion prompts (total 10 minutes)

Why Is Normalization Crucial for Database Design? (5 minutes)
Importance of Aggregate Functions (5 minutes)

This module is meant to test how well you understand the different ideas and lessons you've learned in this course. You will undertake a project based on these PySpark concepts and complete a comprehensive quiz that will assess your confidence and proficiency in Data Processing with PySpark.

Included

1 video, 1 reading, 1 assignment, 1 discussion prompt

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructor

Instructor ratings

(5 ratings)

Edureka

178 courses, 164,471 learners

Offered by

Edureka

Frequently Asked Questions

You will need access to a computer with Python and Apache Spark installed. Detailed setup instructions will be provided at the beginning of the course.

This course is designed for individuals new to big data and PySpark, providing a solid foundation to start working with distributed data processing.

While prior SQL knowledge is beneficial, it is not mandatory. The course will introduce SQL concepts as they relate to PySpark and provide practice with SQL queries.