Synthetic Data Sets: Data Generation for Machine Learning

Written by Coursera Staff

Synthetic data makes data more accessible and provides the training materials you need to create machine learning algorithms. Explore why synthetic data sets are important and synthetic data use cases like medical research and autonomous vehicles.

[Feature Image] A learner works with synthetic data sets on their laptop as part of their coursework.

A synthetic data set is artificially generated data, often produced with statistical models, artificial intelligence (AI), or machine learning (ML), that preserves the defining properties of an original data set without containing real records. It is a useful resource when collecting more real data would be difficult or when privacy and ethical concerns limit how many people can access a data set.

For example, imagine you want to create a machine learning algorithm that can help keep your software development project secure by identifying potentially fraudulent activities and flagging them for review. To train such a model, you would need data about what normal, non-fraudulent activities look like and examples of what fraud could look like. You could create this training material using synthetic data that has the same identifying properties as real data, without waiting for actual fraudulent activities to occur.

Explore why synthetic data sets are important and how you can use them in a variety of applications and industries. 

What is a synthetic data set, and why is it important?

A synthetic data set is artificially created data that you can use in place of real data to train machine learning models, conduct scientific research, develop software, and more. Synthetic data can help you gain insight into the properties and underlying mechanisms of data in situations where it would be challenging to create an authentic data set. For example, medical research trials rely heavily on sensitive patient data, which presents a possible privacy risk. Researchers could create a synthetic data set using the original, sensitive data and end up with a data set that many people can access and work with without putting personal information at risk. 

Synthetic data also makes access to data more equitable. Companies and organizations restrict access to their data for many reasons, chiefly privacy and the sheer value of the data. Researchers can share synthetic data more easily, which gives many more people and organizations access to it. Kalyan Veeramachaneni, principal research scientist at MIT, compared the opportunities that synthetic data can create for students and individuals early in their careers to the advances in access to computing power and resources in the last 20 years. Veeramachaneni recalled the difficulty he had in graduate school accessing the computing power that he needed for his work, which today’s graduate students can easily access through cloud computing services. “If I hadn’t had access to data sets the way I had in the last 10 years, I wouldn’t have a career,” Veeramachaneni said [1]. Synthetic data can open these opportunities to more upcoming research scientists.

Methods of generating synthetic data sets

You can generate synthetic data with traditional statistical analysis, or you can apply machine learning and deep learning to a real data set to create a valuable set of synthetic data.

  • Statistical distribution: Using this method, data scientists create statistical models using actual data, which they can then use as the basis for creating synthetic data without losing the important properties of the data. 

  • Model-based: Instead of analyzing the data manually, scientists can train machine learning models to perform this analysis. With deep learning, you can use models such as generative adversarial networks (GANs), variational autoencoders (VAEs), and large language models to first learn which characteristics define the data and then generate synthetic data that has fidelity to the original data.
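As a minimal sketch of the statistical distribution method, the Python example below fits a normal distribution to one column of values and then samples new values from the fitted model. The `real_ages` list and the `fit_and_sample` helper are hypothetical, and treating the column as roughly normal is an assumption made only for illustration. A real project would pick a distribution that suits its data.

```python
from statistics import NormalDist, mean, stdev


def fit_and_sample(real_values, n_samples, seed=0):
    """Fit a normal distribution to real_values and draw synthetic values from it."""
    # Estimate the mean and standard deviation from the real data.
    model = NormalDist.from_samples(real_values)
    # Draw new values from the fitted model instead of copying real records.
    return model.samples(n_samples, seed=seed)


# Hypothetical "real" column, for illustration only.
real_ages = [34, 45, 29, 52, 41, 38, 47, 33, 50, 44]
synthetic_ages = fit_and_sample(real_ages, n_samples=1000)

# The synthetic column should roughly match the summary statistics of the real one.
print(round(mean(real_ages), 1), round(mean(synthetic_ages), 1))
print(round(stdev(real_ages), 1), round(stdev(synthetic_ages), 1))
```

This sketch models a single column. Fitting each column of a table independently would not preserve the relationships between columns, which is one reason practitioners may turn to the model-based approach, where a trained generator takes the place of the fitted distribution.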

Synthetic data set use cases

You can use synthetic data for two primary purposes: to supplement real data in situations where it is difficult or impossible to obtain more, and to protect privacy in data sets with sensitive information. Explore different scenarios where you might use synthetic data instead of real data.

Difficult or impossible to obtain real data

You can encounter many situations where it would be difficult, impossible, or unethical to collect the amount of real data you need to accomplish your task. One example is crash data for autonomous vehicles. Training a model capable of controlling a vehicle requires data that helps it understand the complex relationships between the objects it sees and how it should react as a result. You can improve these models by giving them data about crashes and accidents so they can learn why those accidents occur and adjust their behavior to avoid them in the future.

However, scientists are limited by the amount of data they can collect through accidents in the real world. Using synthetic data, researchers can give the model training materials that have the underlying patterns and principles of real crash data without requiring actual humans to crash their cars. 

Similarly, you can apply these concepts to software testing, where you might want more data about security breaches or potentially fraudulent transactions so you can train a model to mitigate these events. Synthetic data allows you to create the needed data without risking your development project. 
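One simple way to picture this is to create extra examples of a rare event by blending pairs of known examples. The sketch below interpolates between rows, in the spirit of oversampling techniques such as SMOTE, and is not a method described by this article. The `known_fraud` rows, their three features, and the `interpolate_examples` helper are all hypothetical.

```python
import random


def interpolate_examples(examples, n_new, seed=0):
    """Create n_new synthetic rows by blending random pairs of real rows."""
    if len(examples) < 2:
        raise ValueError("need at least two examples to interpolate between")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        first, second = rng.sample(examples, 2)  # two distinct real rows
        t = rng.random()  # blend weight in [0, 1)
        # Each new row lies on the line segment between the two chosen rows.
        synthetic.append([a + t * (b - a) for a, b in zip(first, second)])
    return synthetic


# Hypothetical fraud examples: [amount_usd, hour_of_day, attempts_in_last_hour]
known_fraud = [
    [950.0, 2.0, 6.0],
    [1200.0, 3.0, 9.0],
    [700.0, 1.0, 4.0],
]
synthetic_fraud = interpolate_examples(known_fraud, n_new=50)
print(len(synthetic_fraud), synthetic_fraud[0])
```

Because every synthetic row falls between two real rows, this approach can only fill in the region the known examples already cover. It cannot invent kinds of fraud that the original data never showed.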

You can use synthetic data to train machine learning and AI models in many situations beyond computer vision and software testing. In addition to giving you access to data you couldn’t obtain before, synthetic data lets you control what you generate so you can get specific types of additional data. Returning to the autonomous vehicles example, you could use synthetic data to create more images in low lighting or darkness to help train the model for these scenarios.
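To illustrate what controlling the generated data can mean, the sketch below derives a darker variant of an image by scaling its pixel values. This is a deliberately simple stand-in: a real autonomous driving pipeline would more likely rely on a simulator or a generative model. The tiny `daytime` image and the `simulate_low_light` helper are hypothetical.

```python
def simulate_low_light(image, brightness=0.3):
    """Return a darker copy of a grayscale image given as rows of 0-255 integers."""
    if not 0.0 <= brightness <= 1.0:
        raise ValueError("brightness must be between 0 and 1")
    # Scale every pixel toward black; the original image is left unchanged.
    return [[round(pixel * brightness) for pixel in row] for row in image]


# Hypothetical 3x3 grayscale image, for illustration only.
daytime = [
    [200, 180, 160],
    [120, 100, 80],
    [255, 0, 40],
]
night = simulate_low_light(daytime, brightness=0.3)
for row in night:
    print(row)
```

Varying the `brightness` parameter yields as many lighting conditions as you need from a single source image, which is the kind of targeted control that is hard to get from real-world collection alone.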

Sensitive data with privacy or security concerns

The second main reason you may use synthetic data is to address privacy or security concerns inherent in a data set. For example, scientists and researchers often need sensitive data for health care or medical research. Researchers can gain a lot of insight by analyzing patient records, studying how patients react to medications during clinical trials, or examining medical imagery. A synthetic version of that data lets them study the same patterns and share their work without exposing any individual patient’s information.

Another example of using synthetic data in place of sensitive data is The Global Synthetic Dataset, a collaboration between The Counter-Trafficking Data Collaborative and Microsoft Research. Researchers and organizations can use this synthetic data set to study global trafficking patterns and develop evidence-based practices to fight human trafficking. By understanding the patterns within this data set, community-based organizations can gain insight into how best to approach and prevent this problem in their communities without exposing the private and sensitive information of victims of human trafficking.

Both difficult and sensitive

You can also use synthetic data for both of these reasons, such as by using it to train a machine learning algorithm to identify medical images that contain potentially cancerous tumors. In this case, you would need a lot of potentially sensitive data to train the algorithm. Synthetic data answers the problem of creating enough data to effectively train your model without putting real patient information at risk. 

Learn more about synthetic data sets on Coursera

Synthetic data is an important tool that can help you train machine learning algorithms with sensitive data safely. If you want to learn more about working with synthetic data sets or how to create your own, consider a program on Coursera to help you gain the skills you need to apply synthetic data sets in your career. For example, you could enroll in the IBM Machine Learning Professional Certificate to help you learn about working with machine learning for data analysis or IBM’s AI Engineering Professional Certificate, which can help you gain a solid foundation in machine learning, deep learning, and more. 

Article sources

  1. MIT Sloan. “What Is Synthetic Data and How Can It Help You Competitively?” https://mitsloan.mit.edu/ideas-made-to-matter/what-synthetic-data-and-how-can-it-help-you-competitively. Accessed February 27, 2025.

