Synthetic Data Explained

“Synthetic Data is information that’s artificially manufactured rather than generated by real-world events.”

Synthetic Data is created algorithmically, and it is used as a stand-in for test datasets of production or operational data, to validate mathematical models and, increasingly, to train Machine Learning models.

According to Gartner, “by 2024, 60% of the data used for the development of AI and Analytics projects will be synthetically generated.”

What Is Synthetic Data

Any data generated by a computer simulation and not collected from real-world signals is known as synthetic data. Usually, an algorithm is trained on real data and is able to reproduce the same statistical properties of that dataset in a new, synthetic sample.
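
As a minimal sketch of this idea, assume the real data is a small numeric table that a multivariate normal distribution describes reasonably well. The snippet below estimates the mean and covariance of a dataset and then samples a new, synthetic dataset from them. The function name fit_and_sample and the stand-in "real" data are hypothetical and used only for illustration; practical generators usually rely on much richer models.

```python
import numpy as np


def fit_and_sample(real, n_samples, seed=0):
    """Fit a multivariate normal to `real` and draw synthetic rows from it.

    Assumption: a multivariate normal is an adequate model of the data.
    """
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)           # per-column means of the real data
    cov = np.cov(real, rowvar=False)   # covariance between columns
    return rng.multivariate_normal(mean, cov, size=n_samples)


# Stand-in for a real dataset (itself simulated here, purely for the demo).
demo_rng = np.random.default_rng(42)
real = demo_rng.multivariate_normal(
    mean=[50.0, 100.0],
    cov=[[25.0, 15.0], [15.0, 36.0]],
    size=5000,
)

synthetic = fit_and_sample(real, n_samples=5000)

# The synthetic sample should have similar means and a similar correlation.
print("real means:     ", real.mean(axis=0))
print("synthetic means:", synthetic.mean(axis=0))
print("real corr:      ", np.corrcoef(real, rowvar=False)[0, 1])
print("synthetic corr: ", np.corrcoef(synthetic, rowvar=False)[0, 1])
```

None of the synthetic rows is copied from the real table, yet the column means and the correlation between columns should be close to the originals.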

A classic example of synthetic data can be found in flight simulator games that attempt to mimic real-life events within a controlled environment.

Importantly, synthetic data can make new advances in AI possible and can enhance the decision-making process in businesses.

Benefits Of Synthetic Data

Developing successful, cutting-edge AI and ML models requires ever larger volumes of high-quality data. Synthetic Data is used across many different industries and can be particularly useful in these cases:

Preserving privacy by generating a synthetic dataset without any sensitive information. This is particularly useful in the healthcare and financial sectors.
Building complex models that require large amounts of data that are either too expensive or too time-consuming to collect (such as with self-driving cars, or other computer vision applications).
Exploring and testing new algorithms under controlled conditions, which is especially valuable for researchers.

Challenges With Synthetic Data

Naturally, generating and using Synthetic Data comes with a number of challenges. As more research is conducted in this area, I expect these challenges to become easier to overcome.

Synthetic Data is only as good as the underlying model used to generate it. This would be a classic example of garbage in, garbage out. Any biases found in the original data will carry through into the Synthetic Data.
Generating high-quality Synthetic Data can be a complex, time-consuming process that requires highly skilled individuals to build the algorithms that generate the data.
Depending on the application of Synthetic Data, there could be some serious ramifications if the models that are built on Synthetic Data go wrong.
Business users or researchers may not be very trusting of Synthetic Data since it is still a very new area.
If data is generated using ML models, overfitting can lead to Synthetic Data that does not generalise well to real-world scenarios.

Techniques For Generating Synthetic Data

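Random sample generators: scikit-learn's sklearn.datasets module includes generators (for example, make_classification and make_regression) for producing datasets of various sizes and complexities.
Fitting to a known distribution: real data is fitted to a known distribution, and new data points are then sampled from it, for example with the Monte Carlo method.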
Decision tree ML models.
Generative Adversarial Networks (GANs) – American Express used GANs to generate Synthetic Data that helped make their fraud detection models more accurate.
Domain randomisation – altering images in a way that improves Neural Network Models (such as changing the size, lighting, or colours in an image).
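
As a rough illustration of the first technique in the list above, the sketch below uses two of scikit-learn's sample generators, make_classification and make_regression from sklearn.datasets. The sizes, feature counts, and noise level are arbitrary values chosen for the demo, not recommendations.

```python
from sklearn.datasets import make_classification, make_regression

# A labelled classification dataset: 500 rows, 8 features,
# of which 4 carry signal and 2 are linear combinations of those.
X_cls, y_cls = make_classification(
    n_samples=500,
    n_features=8,
    n_informative=4,
    n_redundant=2,
    n_classes=2,
    random_state=0,  # fixed seed so the dataset is reproducible
)

# A regression dataset: 200 rows, 5 features, with Gaussian noise on the target.
X_reg, y_reg = make_regression(
    n_samples=200,
    n_features=5,
    noise=10.0,
    random_state=0,
)

print(X_cls.shape, y_cls.shape)
print(X_reg.shape, y_reg.shape)
```

Because the generators are seeded, the same dataset can be recreated on demand, which is convenient when comparing algorithms under controlled conditions.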

Synthetic Data generation is being offered by more and more companies. The most notable example is MIT's Synthetic Data Vault, an open source software ecosystem for generating Synthetic Data.

Synthetic Data Use Cases

Listed below are the most common use cases of Synthetic Data across different industries and business units.

Data sharing: Innovation in many sectors relies on partnering with third-party organizations such as fintechs or medtechs. Synthetic Data enables enterprises to evaluate third-party vendors and share synthetic versions of private data with them, reducing security and compliance risks.
Internal data sharing: Data privacy regulations not only restrict data sharing between organizations but can also restrict the flow of data within an organization. Getting data access permissions can take weeks, which can hinder collaboration. By leveraging Synthetic Data, organizations can speed up innovation through better collaboration between teams.
Cloud migration: Cloud services offer a range of innovative products for many sectors. However, moving private data to cloud infrastructures involves security and compliance risks. In some cases, moving synthetic versions of sensitive data to the cloud can enable organizations to take advantage of the benefits of cloud services.
Data retention: Regulations also limit how long a business can store personal data. This is a problem for long-term analyses such as detecting the seasonality of data over several years. Synthetic Data provides a way to comply with data retention regulations without undermining long-term analytics capabilities.
Fraud detection: Identifying fraud is a major part of any financial service, but fraudulent transactions are rare. With synthetic fraud data, new fraud detection methods can be tested and evaluated for their effectiveness.
Customer analytics: Synthetic customer transaction data can be analyzed to understand customer behavior. This is similar to the “internal data sharing” use case; however, it applies more widely in finance, where most customer data is private.
Healthcare analytics: Synthetic Data lets healthcare data professionals make record data available for internal and external use while still maintaining patient confidentiality. This is similar to the “internal data sharing” use case; however, it applies more widely in healthcare, where most patient data is private.
Software testing and quality assurance: Artificially generated data is often the better choice as it eliminates the need to wait for ‘real’ data. In this context it is often referred to as ‘test data’. This can ultimately lead to decreased test time and increased flexibility and agility during development.
Testing content filtering systems: Social networks are fighting fake news, online harassment, and political propaganda from foreign governments. Testing with Synthetic Data helps ensure that the content filters are flexible and can deal with novel attacks.

Marketing: Synthetic Data allows marketing units to run detailed, individual-level simulations to improve their marketing spend. Under GDPR, such simulations would not be allowed on real user data without consent. However, Synthetic Data that follows the properties of real data can be used in simulation instead.
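
To make the fraud use case above more concrete, here is a small sketch, under loudly stated assumptions, of how a rare-event dataset could be simulated and used to evaluate a detection rule. The distributions, the 1% fraud rate, the threshold of 100, and the function names make_synthetic_transactions and evaluate_threshold_rule are all hypothetical and not taken from any real system.

```python
import numpy as np


def make_synthetic_transactions(n=10_000, fraud_rate=0.01, seed=0):
    """Simulate transaction amounts with a small share of fraudulent ones.

    Hypothetical assumption: legitimate and fraudulent amounts follow
    log-normal distributions with different location parameters.
    """
    rng = np.random.default_rng(seed)
    is_fraud = rng.random(n) < fraud_rate
    legit_amounts = rng.lognormal(mean=3.0, sigma=0.5, size=n)
    fraud_amounts = rng.lognormal(mean=5.5, sigma=0.5, size=n)
    amounts = np.where(is_fraud, fraud_amounts, legit_amounts)
    return amounts, is_fraud


def evaluate_threshold_rule(amounts, is_fraud, threshold):
    """Flag every amount above `threshold` and report precision and recall."""
    flagged = amounts > threshold
    true_positives = np.sum(flagged & is_fraud)
    precision = true_positives / max(flagged.sum(), 1)
    recall = true_positives / max(is_fraud.sum(), 1)
    return float(precision), float(recall)


amounts, is_fraud = make_synthetic_transactions()
precision, recall = evaluate_threshold_rule(amounts, is_fraud, threshold=100.0)

print(f"fraud share: {is_fraud.mean():.3f}")
print(f"precision:   {precision:.2f}")
print(f"recall:      {recall:.2f}")
```

Since the labels are known by construction, a candidate detection method can be scored without waiting for enough real fraud cases to accumulate. The scores only say how the method behaves on the simulated distributions, so they still need to be confirmed on real data.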

Final Thoughts

Synthetic Data is a great tool for testing and exploration, and future research might improve the generation algorithms so that the validity and accuracy of the data become less of a problem, or at least more transparent.

🅐🅚🅖