AI training with synthetic data


Generative AI models like OpenAI’s GPT-4 and Stability AI’s Stable Diffusion have become incredibly adept at creating text, code, images, and videos – but what can synthetic data do to improve these services?

Training these models requires huge amounts of data, and developers are already facing shortages. This has fueled interest in synthetic data, which is cheaper, effectively limitless, and poses fewer privacy risks. But recent research suggests this approach can come with significant downsides.

A study by the Digital Signal Processing group at Rice University, presented at the International Conference on Learning Representations (ICLR) in May, highlights the potential risks of relying heavily on synthetic data for training AI. The study, titled “Self-Consuming Generative Models Go MAD,” found that repeated training on synthetic data can lead to what the authors call “Model Autophagy Disorder” (MAD), drawing an analogy to mad cow disease. Richard Baraniuk, Rice’s C. Sidney Burrus Professor of Electrical and Computer Engineering, explained that much as mad cow disease spread through cattle being fed the remains of their peers, AI models trained on the synthetic output of previous generations can become corrupted over time.

Synthetic data

The research explored three scenarios: fully synthetic loops, synthetic augmentation loops, and fresh data loops. In fully synthetic loops, models are trained exclusively on synthetic data from previous generations. Synthetic augmentation loops mix synthetic data with a fixed set of real data. Fresh data loops combine synthetic data with new real data each generation.
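
To make the three regimes concrete, here is a minimal toy simulation, assuming a 1-D Gaussian as a stand-in for a generative model. This is only an illustrative sketch – the study’s actual experiments used image models, and every name and parameter below is invented for the example:

```python
# Toy simulation of the three training loops, using a 1-D Gaussian as a
# stand-in for a generative model. Illustrative only: the study's actual
# experiments used image models, and all names here are invented.
import numpy as np

rng = np.random.default_rng(0)
REAL_MEAN, REAL_STD = 0.0, 1.0      # the fixed "real world" distribution
N = 100                             # training samples per generation
GENERATIONS = 50

def sample_real(n):
    """Draw fresh samples from the real distribution."""
    return rng.normal(REAL_MEAN, REAL_STD, n)

def fit(data):
    """'Train' the model: fit a Gaussian by maximum likelihood."""
    return data.mean(), data.std()

def generate(params, n):
    """Sample synthetic data from the trained model."""
    mu, sigma = params
    return rng.normal(mu, sigma, n)

def run_loop(kind):
    fixed_real = sample_real(N)     # reused by the augmentation loop
    params = fit(sample_real(N))    # generation 0 trains on real data
    for _ in range(GENERATIONS):
        synthetic = generate(params, N)
        if kind == "fully_synthetic":   # synthetic data only
            train = synthetic
        elif kind == "augmentation":    # synthetic + a fixed real set
            train = np.concatenate([synthetic, fixed_real])
        else:                           # "fresh": synthetic + new real data
            train = np.concatenate([synthetic, sample_real(N)])
        params = fit(train)
    return params

for kind in ("fully_synthetic", "augmentation", "fresh"):
    mu, sigma = run_loop(kind)
    print(f"{kind:>15}: mean={mu:+.3f}, std={sigma:.3f} (real std = 1.0)")
```

Running this typically shows the fully synthetic loop drifting away from the real distribution as estimation errors compound across generations, while the fresh data loop stays close to it – mirroring the qualitative pattern the study describes.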

Findings showed that without enough fresh real data, models produced increasingly distorted outputs, declining in both quality and diversity. For instance, generated images of human faces developed grid-like scarring artifacts, and images of numbers degenerated into unreadable scribbles.

The quality of internet data

Baraniuk and his team also considered “cherry picking,” in which users keep only the highest-quality synthetic samples. While this can preserve quality over more generations, it accelerates the loss of diversity.
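
A toy extension of the earlier sketch illustrates the trade-off, assuming a hypothetical quality score that favors samples near the model’s mode. This is only a caricature of the selection bias; it is not the study’s actual quality measure:

```python
# Toy illustration of "cherry picking": keeping only the highest-quality
# synthetic samples each generation. The quality score here (closeness to
# the model's mode) is a hypothetical stand-in, not the study's metric.
import numpy as np

rng = np.random.default_rng(1)
N, GENERATIONS, KEEP = 200, 10, 0.5     # keep the top 50% each generation

data = rng.normal(0.0, 1.0, N)          # generation 0: real data
for g in range(GENERATIONS):
    mu, sigma = data.mean(), data.std() # "train" the model
    synthetic = rng.normal(mu, sigma, N)
    # Cherry-pick: rank by closeness to the mode, keep the best half.
    quality = -np.abs(synthetic - mu)
    data = synthetic[np.argsort(quality)[-int(N * KEEP):]]
    print(f"gen {g+1:2d}: std = {data.std():.4f}")  # diversity shrinks fast
```

Each round keeps samples that look “good” by the score, so quality holds up while the spread of the data – a crude proxy for diversity – collapses within a handful of generations.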

The study warns of a “doomsday scenario” where, if left unchecked, MAD could degrade the quality and diversity of the internet’s data. Even in the near term, unintended consequences from AI autophagy seem inevitable.

This research, supported by the National Science Foundation and other agencies, underscores the importance of incorporating fresh real data in training AI models to avoid the pitfalls of over-reliance on synthetic data.

While synthetic data offers an appealing solution to data shortages, it carries risks of its own. Ensuring a steady supply of fresh real data will be crucial to maintaining the health and performance of future AI models.
