Packt
Preprocessing Unstructured Data for LLMs and RAG Systems

Included with Coursera Plus

Gain insight into a topic and learn the fundamentals.
Intermediate level

Recommended experience

5 hours to complete
3 weeks at 1 hour a week
Flexible schedule
Learn at your own pace

What you'll learn

  • Master techniques for preprocessing unstructured data for LLMs and RAG systems.

  • Extract and normalize data from complex document types like PDFs and HTML.

  • Implement semantic similarity and metadata extraction using vector databases.

  • Build a RAG system to dynamically interact with your preprocessed data.

Details to know

Shareable certificate

Add to your LinkedIn profile

Recently updated!

February 2025

Assessments

7 assignments

Taught in English

Coursera Career Certificate

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV

Share it on social media and in your performance review

There are 8 modules in this course

In this module, we will introduce you to the course, highlighting its goals, the skills and knowledge you'll need to succeed, and how the content is organized to guide you through the process of preparing unstructured data for large language models (LLMs) and retrieval-augmented generation (RAG) systems.

What's included

2 videos, 1 reading

In this module, we will guide you through setting up the necessary development environment, including creating and configuring API accounts, integrating the Unstructured framework, and performing a test run to ensure everything is operational before proceeding with data preprocessing tasks.

What's included

4 videos, 1 assignment

In this module, we will explore data preprocessing for LLMs, covering the challenges posed by unstructured data and the techniques required to overcome them. You'll learn the entire workflow, from cleaning and normalizing data to structuring and chunking it, and finish with an overview of the Unstructured framework.

What's included

6 videos, 1 assignment

In this module, we will dive into hands-on exercises using the Unstructured framework to preprocess different document types. You'll explore the steps involved in extracting and normalizing data from PDFs, PPTX files, and HTML, and discover how these processes improve data quality for downstream use cases in LLMs and RAG systems.

What's included

4 videos, 1 assignment

In this module, we will focus on chunking and metadata extraction, exploring how to segment document content into logical units and enrich it with metadata for advanced applications like semantic similarity and hybrid search. Through hands-on activities, you’ll learn how to optimize document processing workflows, structure document elements effectively, and integrate results into a vector database.

What's included

8 videos, 1 assignment

In this module, we will tackle the challenges of preprocessing complex documents, including PDFs and images, using techniques such as document layout detection (DLD) and vision transformers (ViT). You’ll explore hands-on methods for extracting and summarizing table content, learn how to preprocess HTML and PDF files efficiently, and evaluate the trade-offs between different preprocessing techniques.

What's included

7 videos, 1 assignment

In this module, we will synthesize the skills and techniques learned throughout the course to build a complete RAG system. From preprocessing and structuring complex documents to creating a searchable database and enabling conversational interactions with your documents, you’ll gain hands-on experience in deploying an end-to-end solution tailored for real-world applications.

What's included

6 videos, 1 assignment

In this module, we will conclude the course by revisiting the major milestones and skills acquired. You’ll receive guidance on applying your knowledge to real-world scenarios and discover resources to continue your journey in advanced data preprocessing and RAG system development.

What's included

1 video, 1 assignment

Instructor

Packt - Course Instructors
Packt
567 courses, 50,814 learners

Offered by

Packt
