Synthetic Data Is Transforming AI

“Synthetic Data is poised to upend the entire value chain and technology stack for Artificial Intelligence, with immense economic implications.”

AI faces several critical challenges. It needs huge amounts of data to deliver accurate results, that data must be free of bias, and its use must comply with increasingly restrictive data privacy regulations. Several solutions have been proposed over the last couple of years to address these challenges, including tools designed to identify and reduce bias, tools that anonymize user data, and programs to ensure that data is collected only with user consent. But each of these solutions faces challenges of its own.

Imagine if it were possible to produce infinite amounts of the world’s most valuable resource, cheaply and quickly. What dramatic economic transformations and opportunities would result?

Now we’re seeing a new industry emerge that promises to be a saving grace: Synthetic Data.

Synthetic Data is artificial, computer-generated data that can stand in for data obtained from the real world. It is not a new idea, but it is now approaching a critical inflection point in terms of real-world impact. It is poised to upend the entire value chain and technology stack for AI, with immense economic implications.

A synthetic dataset must preserve the mathematical and statistical properties of the real-world dataset it replaces, without explicitly representing real individuals. Think of it as a digital mirror of real-world data that is statistically reflective of that world. This makes it possible to train AI systems in a completely virtual realm, and the data can be readily customized for use cases ranging from healthcare to retail, finance, transportation, and agriculture.
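To make the "digital mirror" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the "real" dataset is itself simulated (hypothetical age and income columns), and the generator is a plain two-column Gaussian fit, which captures only means, spreads, and linear correlation. Production tools use far richer models, but the principle is the same: learn the statistics, then sample new rows that belong to no one.

```python
import math
import random


def fit_two_column_gaussian(rows):
    """Estimate means, standard deviations, and correlation of (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    std_x, std_y = math.sqrt(var_x), math.sqrt(var_y)
    return mean_x, mean_y, std_x, std_y, cov / (std_x * std_y)


def sample_synthetic(params, count, rng):
    """Draw new (x, y) rows from the fitted Gaussian; no real row is copied."""
    mean_x, mean_y, std_x, std_y, rho = params
    rows = []
    for _ in range(count):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        # Mixing z1 into y reproduces the fitted correlation rho.
        y = mean_y + std_y * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((x, y))
    return rows


rng = random.Random(7)

# Stand-in for a real dataset: hypothetical (age, income) pairs.
real_rows = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    income = 20000 + 1000 * age + rng.gauss(0, 8000)
    real_rows.append((age, income))

real_params = fit_two_column_gaussian(real_rows)
synthetic_rows = sample_synthetic(real_params, 5000, rng)
synthetic_params = fit_two_column_gaussian(synthetic_rows)

print("real      mean age, correlation:", real_params[0], real_params[4])
print("synthetic mean age, correlation:", synthetic_params[0], synthetic_params[4])
```

The two printed lines should sit close together, which is the whole point: the synthetic rows carry the same aggregate signal even though none of them corresponds to a record in the original table.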

The Trouble With Real Data

Over the last few years, there has been increasing concern about how inherent biases in datasets can unwittingly lead to AI algorithms that perpetuate systemic discrimination. In fact, Gartner predicts that through 2022, 85% of AI projects will deliver erroneous outcomes due to bias in data, algorithms, or the teams responsible for managing them.

The proliferation of AI algorithms has also led to growing concerns over data privacy. In turn, this has led to stronger consumer data privacy and protection laws, with GDPR in the EU as well as laws in U.S. jurisdictions including California and Virginia.

These laws give consumers more control over their personal data. For example, the Virginia law grants consumers the right to access, correct, delete, and obtain a copy of personal data as well as to opt out of the sale of personal data and to deny algorithmic access to personal data for the purposes of targeted advertising or profiling of the consumer.

Restricting access to this information buys a certain amount of individual protection, but at the cost of the algorithm’s effectiveness. In general, the more data an AI algorithm can train on, the more accurate and effective its results will be. Without access to ample data, the upsides of AI, such as assisting with medical diagnoses and drug research, could also be limited.

One alternative often used to offset privacy concerns is anonymization. Personal data can be anonymized by masking or eliminating identifying characteristics, for example by removing names and credit card numbers from ecommerce transactions or stripping identifying content from healthcare records.
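As a rough illustration of what masking and eliminating look like in practice, consider the short Python sketch below. The record layout and field names are hypothetical, chosen only to echo the ecommerce example; this is not a complete or recommended anonymization scheme.

```python
# Hypothetical field lists for an ecommerce transaction record.
FIELDS_TO_REMOVE = {"name"}
FIELDS_TO_MASK = {"card_number"}


def anonymize(record):
    """Return a copy of the record with direct identifiers removed or masked."""
    cleaned = {}
    for field, value in record.items():
        if field in FIELDS_TO_REMOVE:
            continue  # eliminate the field entirely
        if field in FIELDS_TO_MASK:
            cleaned[field] = "*" * len(str(value))  # keep the field, hide the value
        else:
            cleaned[field] = value
    return cleaned


transaction = {
    "name": "Alice Example",
    "card_number": "4111111111111111",
    "zip": "22030",
    "birth_year": 1984,
    "item": "running shoes",
}

print(anonymize(transaction))
```

Note what survives: the ZIP code, birth year, and purchase are untouched. Fields like these are not identifiers on their own, which is why they are usually left in to keep the data useful.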

But there is growing evidence that even when data from one source has been anonymized, it can be correlated with consumer datasets exposed in security breaches. In fact, by combining data from multiple sources, it is possible to form a surprisingly clear picture of our identities even if there has been a degree of anonymization. In some instances, this can even be done by correlating data from public sources, without a nefarious security hack.
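The mechanics of this kind of re-identification are simple enough to sketch in a few lines of Python. The two tiny datasets below are invented for illustration, as is the choice of ZIP code, birth year, and gender as the linking fields; the point is only to show how an "anonymized" table can be joined back to names.

```python
from collections import defaultdict

# Hypothetical anonymized purchases: names removed, other fields left intact.
anonymized_purchases = [
    {"zip": "22030", "birth_year": 1984, "gender": "F", "item": "prenatal vitamins"},
    {"zip": "94110", "birth_year": 1990, "gender": "M", "item": "running shoes"},
    {"zip": "10001", "birth_year": 1975, "gender": "F", "item": "hiking boots"},
]

# Hypothetical public or leaked records that still carry names.
public_records = [
    {"name": "Alice Example", "zip": "22030", "birth_year": 1984, "gender": "F"},
    {"name": "Bob Sample", "zip": "94110", "birth_year": 1990, "gender": "M"},
    {"name": "Carol Placeholder", "zip": "10001", "birth_year": 1975, "gender": "F"},
    {"name": "Dana Placeholder", "zip": "10001", "birth_year": 1975, "gender": "F"},
]

LINKING_FIELDS = ("zip", "birth_year", "gender")


def link_records(anonymized, public, fields=LINKING_FIELDS):
    """Re-attach names to anonymized rows whose linking fields match one person."""
    names_by_key = defaultdict(list)
    for person in public:
        names_by_key[tuple(person[f] for f in fields)].append(person["name"])

    matches = []
    for row in anonymized:
        candidates = names_by_key.get(tuple(row[f] for f in fields), [])
        if len(candidates) == 1:  # a unique match means the row is re-identified
            matches.append((candidates[0], row["item"]))
    return matches


for name, item in link_records(anonymized_purchases, public_records):
    print(name, "bought", item)
```

In this toy data, two of the three purchases are tied back to a single named person. The third escapes only because two people happen to share the same ZIP code, birth year, and gender.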

Synthetic Data’s Solution

Synthetic Data promises to deliver the advantages of AI without the downsides. Not only does it take our real personal data out of the equation, but a general goal for synthetic data is to perform better than real-world data by correcting bias that is often ingrained in the real world.

Although ideal for applications that use personal data, synthetic information has other use cases, too. One example is complex Computer Vision modeling where many factors interact in real time. Synthetic video datasets leveraging advanced gaming engines can be created with hyper-realistic imagery to portray all the possible eventualities in an autonomous driving scenario, whereas trying to shoot photos or videos of the real world to capture all these events would be impractical, maybe impossible, and likely dangerous. These synthetic datasets can dramatically speed up and improve training of autonomous driving systems.

Perhaps ironically, one of the primary tools for building Synthetic Data is the same one used to create deepfake videos. Both make use of Generative Adversarial Networks (GANs), which pair two Neural Networks against each other. One network, the generator, produces the Synthetic Data, and the second, the discriminator, tries to tell whether each sample is real or synthetic. The two are trained in a loop, with the generator improving the quality of its data until the discriminator can no longer tell the difference between real and synthetic.
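The adversarial loop can be sketched in plain Python with a deliberately tiny example. The setup is an assumption made for readability, not how production GANs are built: the "real" data is a stream of numbers centred on 3.0, the generator has a single learnable parameter (a shift added to random noise), and the discriminator is a one-input logistic classifier. Real GANs replace both with deep networks, but the alternating updates follow the same pattern.

```python
import math
import random

REAL_MEAN = 3.0  # assumed centre of the "real" data in this toy example
REAL_STD = 1.0


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=3000, batch=32, lr=0.05, seed=0):
    rng = random.Random(seed)
    shift = 0.0      # generator parameter: G(z) = shift + z
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w * x + c)

    for _ in range(steps):
        real = [rng.gauss(REAL_MEAN, REAL_STD) for _ in range(batch)]
        fake = [shift + rng.gauss(0.0, 1.0) for _ in range(batch)]

        # Discriminator step: minimise -[log D(real) + log(1 - D(fake))].
        grad_w = grad_c = 0.0
        for x in real:
            err = sigmoid(w * x + c) - 1.0  # pushes D(real) towards 1
            grad_w += err * x
            grad_c += err
        for x in fake:
            err = sigmoid(w * x + c)        # pushes D(fake) towards 0
            grad_w += err * x
            grad_c += err
        w -= lr * grad_w / batch
        c -= lr * grad_c / batch

        # Generator step: minimise -log D(fake), i.e. try to fool the discriminator.
        fake = [shift + rng.gauss(0.0, 1.0) for _ in range(batch)]
        grad_shift = sum((sigmoid(w * x + c) - 1.0) * w for x in fake)
        shift -= lr * grad_shift / batch

    return shift, w, c


shift, w, c = train_toy_gan()
print("generator shift after training:", shift)
print("discriminator weight after training:", w)
```

The generator never sees the real data directly. It only receives the discriminator's verdicts, and that feedback alone is enough to pull its output toward the real distribution. As the two distributions come to overlap, the discriminator's weight shrinks toward zero, which is the toy version of "cannot tell the difference."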

Synthetic Data And AI

AI-generated Synthetic Data is set to revolutionize how we share, use and build datasets. The first Synthetic Data sets generated by AI were images, and today synthetic images are an important part of training computer vision algorithms. The next frontier of the Synthetic Data revolution is tabular, or structured, Synthetic Data.

Companies, governments and researchers using traditional data anonymization techniques, such as data masking, where sensitive parts of the data are simply masked or encrypted, have to live with the so-called privacy-utility trade-off: the more you anonymize, the less useful the data becomes. AI-generated Synthetic Data offers an alternative that aims to preserve privacy while giving up far less of the data's utility.
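The privacy-utility trade-off can be made tangible with a small Python sketch. The setup is assumed for illustration: a list of randomly generated ages is anonymized by rounding each age into wider and wider bins. The size of the smallest group serves as a rough privacy proxy (the intuition behind k-anonymity), and the average distance between the true and the binned age serves as a rough measure of lost utility.

```python
import random


def generalize_age(age, width):
    """Replace an integer age with the centre of its bin of the given width."""
    low = (age // width) * width
    return low + (width - 1) / 2


def privacy_utility(ages, width):
    """Return (smallest group size, mean absolute error) for one bin width."""
    group_sizes = {}
    total_error = 0.0
    for age in ages:
        binned = generalize_age(age, width)
        group_sizes[binned] = group_sizes.get(binned, 0) + 1
        total_error += abs(age - binned)
    return min(group_sizes.values()), total_error / len(ages)


rng = random.Random(42)
ages = [rng.randint(18, 89) for _ in range(1000)]  # hypothetical customer ages

for width in (1, 5, 10, 20):
    smallest_group, mean_error = privacy_utility(ages, width)
    print(f"bin width {width:>2}: smallest group = {smallest_group:>3}, "
          f"mean error = {mean_error:.2f} years")
```

Wider bins let each person hide in a larger crowd, but every step in that direction blurs the ages a model would learn from. That is the bind that synthetic tabular data is trying to escape.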

As Synthetic Data becomes increasingly pervasive in the months and years ahead, it will have a disruptive impact across industries. It will transform the economics of data.

The net effect of the rise of Synthetic Data will be to empower a whole new generation of AI upstarts and unleash a wave of AI innovation by lowering the data barriers to building AI-first products.

Final Thoughts

Synthetic data technology will reshape the world of AI in the years ahead, scrambling competitive landscapes and redefining technology stacks. It will turbocharge the spread of AI across society by democratizing access to data. It will serve as a key catalyst for our AI-driven future. Data-savvy individuals, teams and organizations should take heed.

🅐🅚🅖