Spark, Hadoop, and Snowflake for Data Engineering

This course is part of the Applied Python Data Engineering Specialization

Instructors: Noah Gift

14,377 already enrolled

4 modules

Gain insight into a topic and learn the fundamentals.

67 reviews

Advanced level

3 weeks to complete at 10 hours a week

Flexible schedule

Learn at your own pace

What you'll learn

Create scalable data pipelines (Hadoop, Spark, Snowflake, Databricks) for efficient data handling.
Optimize data engineering with clustering and scaling to boost performance and resource use.
Build ML solutions (PySpark, MLflow) on Databricks for seamless model development and deployment.
Implement DataOps and DevOps practices for continuous integration and deployment (CI/CD) of data-driven applications, including automating processes.

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

21 assignments

Taught in English

Build your subject-matter expertise

This course is part of the Applied Python Data Engineering Specialization

When you enroll in this course, you'll also be enrolled in this Specialization.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 4 modules in this course

Gain the skills for building efficient and scalable data pipelines. Explore essential data engineering platforms (Hadoop, Spark, and Snowflake) and learn how to optimize and manage them. Delve into Databricks, a platform for running data analytics and machine learning tasks, while honing your Python data science skills with PySpark. Finally, discover the key concepts of MLflow, an open-source platform for managing the end-to-end machine learning lifecycle, and learn how to integrate it with Databricks.

This course is designed for learners who want to pursue or advance a career in data science or data engineering, and for software developers or engineers who want to grow their data management skill set. In addition to the technologies, you will learn methodologies that sharpen your project management and workflow skills for data engineering, including Kaizen, DevOps, and DataOps best practices. With quizzes to test your knowledge throughout, this course will help guide your learning journey to become a proficient data engineer.

In this module, you will learn how to work with different data engineering platforms, such as Hadoop and Spark, and apply their concepts to real-world scenarios. First, you will explore the fundamentals of Hadoop to store and process big data. Next, you will delve into Spark concepts, distributed computing, deferred execution, and Spark SQL. By the end of the week, you will gain hands-on experience with PySpark DataFrames, DataFrame methods, and deferred execution strategies.
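The idea of deferred execution named above can be sketched with plain Python generators. This is only an analogy, not the PySpark API and not material from the course: the function names (`read_rows`, `keep_even`, `double`) and the sample data are hypothetical. It shows that building a chain of steps does no work until something consumes the result, which is roughly how Spark treats transformations versus actions.

```python
# Analogy only: plain Python generators, NOT the PySpark API.
# All names and data below are hypothetical illustrations.
from typing import Iterable, Iterator, List

log: List[str] = []  # records when a row is actually read


def read_rows(rows: Iterable[int]) -> Iterator[int]:
    """Yield rows one at a time, logging each read."""
    for row in rows:
        log.append(f"read {row}")
        yield row


def keep_even(rows: Iterable[int]) -> Iterator[int]:
    """Lazy filter step: pass through even values only."""
    for row in rows:
        if row % 2 == 0:
            yield row


def double(rows: Iterable[int]) -> Iterator[int]:
    """Lazy map step: multiply each value by two."""
    for row in rows:
        yield row * 2


# Building the pipeline runs nothing yet (similar to chaining transformations).
pipeline = double(keep_even(read_rows([1, 2, 3, 4])))
work_before_action = len(log)

# Consuming the pipeline triggers the work (similar to calling an action).
result = list(pipeline)
work_after_action = len(log)
```

Here `work_before_action` stays at zero because no row is read until `list(pipeline)` pulls values through the chain.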

What's included

10 videos, 10 readings, 7 assignments, 1 discussion prompt, 2 ungraded labs

10 videos (total 25 minutes)

Meet your Co-Instructor: Kennedy Behrman (1 minute)
Meet your Co-Instructor: Noah Gift (1 minute)
Overview of Big Data Platforms (2 minutes)
Getting Started with Hadoop (1 minute)
Getting Started with Spark (2 minutes)
Introduction to Resilient Distributed Datasets (RDD) (2 minutes)
Resilient Distributed Datasets (RDD) Demo (4 minutes)
Introduction to Spark SQL (2 minutes)
PySpark Dataframe Demo: Part 1 (3 minutes)
PySpark Dataframe Demo: Part 2 (7 minutes)

10 readings (total 100 minutes)

Welcome to Data Engineering Platforms with Python! (10 minutes)
Report a problem with the course (10 minutes)
What is Apache Hadoop? (10 minutes)
What is Apache Spark? (10 minutes)
Use Apache Spark in Azure Databricks (optional) (10 minutes)
Choosing between Hadoop and Spark (10 minutes)
What are RDDs? (10 minutes)
Getting Started: Creating RDD's with PySpark (10 minutes)
Spark SQL, Dataframes and Datasets (10 minutes)
PySpark and Spark SQL (10 minutes)

7 assignments (total 210 minutes)

PySpark (30 minutes)
Big Data Platforms (30 minutes)
Apache Hadoop Concepts (30 minutes)
Apache Spark Concepts (30 minutes)
RDD Concepts (30 minutes)
Spark SQL Concepts (30 minutes)
PySpark Dataframe Concepts (30 minutes)

1 discussion prompt (total 10 minutes)

Meet and Greet (optional) (10 minutes)

2 ungraded labs (total 120 minutes)

Practice: Creating RDD's with PySpark (60 minutes)
Practice: Reading Data into Dataframes (60 minutes)

In this module, you will explore the Snowflake platform, gaining insights into its architecture and key concepts. Through hands-on practice in the Snowflake Web UI, you'll learn to create tables, manage warehouses, and use the Snowflake Python Connector to interact with tables. By the end of this week, you'll solidify your understanding of Snowflake's architecture and practical applications, emerging with the ability to effectively navigate and leverage the platform for data management and analysis.

What's included

8 videos, 5 readings, 6 assignments

8 videos (total 27 minutes)

What is Snowflake? (2 minutes)
Snowflake Layers (2 minutes)
Snowflake Web UI (4 minutes)
Navigating Snowflake (4 minutes)
Creating a Table in Snowflake (5 minutes)
Snowflake Warehouses (4 minutes)
Writing to Snowflake (3 minutes)
Reading from Snowflake (3 minutes)

5 readings (total 50 minutes)

Accessing Snowflake (10 minutes)
Detailed View Inside Snowflake (10 minutes)
Snowsight: The Snowflake Web Interface (10 minutes)
Working with Warehouses (10 minutes)
Python Connector Documentation (10 minutes)

6 assignments (total 180 minutes)

Snowflake (30 minutes)
Snowflake Architecture (30 minutes)
Snowflake Layers (30 minutes)
Navigating Snowflake (30 minutes)
Creating a Table (30 minutes)
Writing to Snowflake (30 minutes)

In this module, you will practice the essential skills for managing machine learning workflows using Databricks and MLflow. First, you will create a Databricks workspace and configure a cluster. Next, you will load a sample dataset into the Databricks workspace with PySpark so you can manipulate and explore the data. Finally, you will install MLflow either locally or within the Databricks environment so you can manage the machine learning lifecycle end to end. By the end of this week, you will be able to create, track, and manage reproducible machine learning experiments within Databricks.

What's included

16 videos, 7 readings, 4 assignments, 1 ungraded lab

16 videos (total 72 minutes)

Accessing Databricks (1 minute)
Spark Notebooks with Databricks (5 minutes)
Using Data with Databricks (5 minutes)
Working with Workspaces in Databricks (3 minutes)
Advanced Capabilities of Databricks (2 minutes)
PySpark Introduction on Databricks (7 minutes)
Exploring Databricks Azure Features (4 minutes)
Using the DBFS to AutoML Workflow (4 minutes)
Load, Register and Deploy ML Models (3 minutes)
Databricks Model Registry (3 minutes)
Model Serving on Databricks (2 minutes)
What is MLOps? (13 minutes)
Exploring Open-Source MLFlow Frameworks (6 minutes)
Running MLFlow with Databricks (6 minutes)
End to End Databricks MLFlow (4 minutes)
Databricks Autologging with MLFlow (4 minutes)

7 readings (total 70 minutes)

What is Azure Databricks? (10 minutes)
Introduction to Databricks Machine Learning (10 minutes)
What is the Databricks File System (DBFS)? (10 minutes)
Serverless Compute with Databricks (10 minutes)
MLOps Workflow on Azure Databricks (10 minutes)
Run MLFlow Projects on Azure Databricks (10 minutes)
Databricks Autologging (10 minutes)

4 assignments (total 120 minutes)

DataBricks (30 minutes)
PySpark SQL (30 minutes)
PySpark DataFrames (30 minutes)
MLFlow with Databricks (30 minutes)

1 ungraded lab (total 60 minutes)

ETL-Part-1: Keyword Extractor Tool to HashTag Tool (60 minutes)

In this module, you will explore the concepts of Kaizen, DevOps, and DataOps and how these methodologies work together to support efficient data engineering workflows. Through practical examples, you will learn how Kaizen's continuous improvement philosophy, DevOps' collaborative practices, and DataOps' focus on data quality and integration combine to improve the development, deployment, and management of data engineering platforms. By the end of this week, you will have the knowledge and perspective needed to optimize data engineering processes and deliver scalable, reliable, and high-quality solutions.