What Happens When You Train An AI With AI-Generated Data

In the world of artificial intelligence (AI) and large language models (LLMs), finding the right training data is a core requirement for building generative solutions. As the capabilities of generative AI models such as ChatGPT and DALL-E continue to grow, so does the temptation to use AI-generated output as training data for new AI systems. However, recent research has shown the dangerous effects of doing this, leading to a phenomenon called “model collapse.” In a study published in July 2023, researchers from Rice University and Stanford University concluded that it is not a good idea to train AI models solely on the output of generative AI. They titled their report “Self-Consuming Generative Models Go MAD.”

Whenever you train an AI model on data generated by other AI models, the model is essentially learning from a distorted reflection of itself. Much like the game of “Telephone,” the AI-generated data becomes more corrupted and disconnected from reality with each iteration. The researchers found that introducing even relatively small amounts of AI-generated content into the training data can be “detrimental” to the model, causing its output to degrade into gibberish after just a few training cycles. This happens because the errors and biases inherent in synthetic data are amplified as the model learns from its own generated output.
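To make the “Telephone” effect concrete, here is a minimal, hypothetical sketch (not code from the study): a tiny bigram text model is repeatedly retrained on nothing but its own samples, and its vocabulary, used here as a crude proxy for diversity, tends to shrink generation after generation. The corpus, sample counts, and number of generations are all invented for illustration.

```python
# Toy, illustrative only: a bigram text model retrained on its own output.
# Watch the vocabulary (a crude diversity proxy) shrink across generations.
import random
from collections import Counter, defaultdict

random.seed(0)

def fit_bigrams(sentences):
    """Count word-to-word transitions, with None as start and "</s>" as end."""
    table = defaultdict(Counter)
    for words in sentences:
        prev = None
        for w in words + ["</s>"]:
            table[prev][w] += 1
            prev = w
    return table

def sample_sentence(table, max_len=12):
    """Generate one sentence by walking the bigram table."""
    words, prev = [], None
    for _ in range(max_len):
        nxt = table.get(prev)
        if not nxt:
            break
        w = random.choices(list(nxt), weights=list(nxt.values()))[0]
        if w == "</s>":
            break
        words.append(w)
        prev = w
    return words

# A tiny stand-in for "real" human-written data.
real_corpus = [s.split() for s in [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tall tree",
    "the quick brown fox jumped over the lazy dog",
    "she read an old book by the fire",
    "rain fell softly on the quiet town",
]]

corpus = real_corpus
for generation in range(6):
    vocab = {w for sentence in corpus for w in sentence}
    print(f"generation {generation}: vocabulary size = {len(vocab)}")
    model = fit_bigrams(corpus)
    # Fully self-consuming step: the next training set is only model output.
    corpus = [sample_sentence(model) for _ in range(len(real_corpus))]
```

Rare words that fail to be sampled in one generation can never reappear in later ones, which is the same one-way loss of information that drives degradation in real generative models.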

Model collapse has been observed in many types of AI models, from language models to image generators. Larger, more powerful models may be somewhat more resistant, but there is little evidence that they are immune. As AI-generated content proliferates on the internet and within standard training datasets, future AI models are likely to be trained on a mixture of real and synthetic data. This can create a “self-eating,” or self-consuming, loop in which the quality and diversity of the model’s output steadily decline over successive generations.

Researchers at Rice University and Stanford University conducted a thorough analysis of self-consuming generative image models, in which models are trained on their own synthetic output. They identified three main types of self-consuming loops:

  • Fully synthetic loop: In these loops, the model is trained only on synthetic data generated by previous models. The researchers found that fully synthetic loops inevitably lead to model autophagy disorder (MAD), in which either the quality (precision) or the diversity (recall) of the generated images progressively declines with each generation. For example, they trained two identical facial-image generators in a fully synthetic loop, one with a “sampling” bias that increases synthesis quality at the expense of diversity, and one without. Without the bias, the generated images developed wavy artifacts and lost realism (quality). With the bias, the images remained high quality but became less and less diverse, eventually converging on a small number of nearly identical faces.
  • Synthetic augmentation loop: These loops incorporate a fixed set of real training data along with synthetic data. The researchers found that this can delay, but not prevent, the onset of MAD. The real data initially improves performance, but eventually the synthetic data dominates, leading to lower quality and diversity.
  • Fresh data loop: In these loops, each generation of the model also has access to a new set of real training data that it has never seen before. The researchers found that this prevents MAD and maintains both the quality and diversity of the generated images across generations. The key factor is whether enough fresh real-world data is available at each generation; without it, self-consuming generative models are doomed to suffer from MAD, and the quality and variety of their output gradually decline. In summary, this case study shows that, unless they have steady access to new real-world training data, self-consuming generative models can fall victim to model autophagy disorder, with their synthetic output degrading over time (a toy simulation of the three loop types is sketched after this list).
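To see how the three regimes differ, here is a minimal, hypothetical sketch, not the study’s code: a one-dimensional Gaussian stands in for the generative model and is refit each generation under each loop type, with the fitted standard deviation serving as a crude proxy for output diversity. The sample sizes, generation count, and the way synthetic data accumulates in the augmentation loop are assumptions made for illustration; in typical runs the fully synthetic loop drifts furthest from the true distribution, while the fresh-data loop stays anchored.

```python
# Toy, illustrative only: a 1-D Gaussian "generator" refit each generation
# under the three self-consuming loop types described above.
import numpy as np

rng = np.random.default_rng(0)
TRUE_MEAN, TRUE_STD = 0.0, 1.0
N = 100              # samples drawn per source per generation (illustrative)
GENERATIONS = 100    # number of retraining rounds (illustrative)

def fit(data):
    """'Train' the toy generator: estimate its mean and std from the data."""
    return float(data.mean()), float(data.std())

def run_loop(loop_type):
    real = rng.normal(TRUE_MEAN, TRUE_STD, N)   # the fixed "real" dataset
    mean, std = fit(real)
    synthetic_pool = np.empty(0)                # accumulates past synthetic data
    for _ in range(GENERATIONS):
        synthetic = rng.normal(mean, std, N)    # sample from the current model
        if loop_type == "fully synthetic":
            train = synthetic                   # only the model's own output
        elif loop_type == "synthetic augmentation":
            # fixed real set plus an ever-growing pool of synthetic data
            synthetic_pool = np.concatenate([synthetic_pool, synthetic])
            train = np.concatenate([real, synthetic_pool])
        else:  # "fresh data"
            fresh = rng.normal(TRUE_MEAN, TRUE_STD, N)  # new real data each time
            train = np.concatenate([fresh, synthetic])
        mean, std = fit(train)
    return std

for loop in ["fully synthetic", "synthetic augmentation", "fresh data"]:
    print(f"{loop:22s} -> std after {GENERATIONS} generations: {run_loop(loop):.3f}")
```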

Recently, prominent figures in the AI industry pledged at the White House to implement measures such as watermarking to distinguish synthetic data from real data. The proposed watermarking approach involves embedding technical markers within synthetic content such as deepfake images and audio. The watermark is intended to help users recognize that the content is artificially generated rather than a capture of real-world events. These efforts ultimately aim to address the negative effects of synthetic data on the internet. With respect to model autophagy disorder (MAD), watermarking could also help keep generative models from being unknowingly trained on AI-generated data. Nevertheless, the effectiveness of such approaches in addressing MADness remains to be determined and requires further investigation.
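As a purely illustrative sketch of how such markers could feed into data curation (the detector below is a stub, real watermark and provenance checks for text, images, and audio are far more involved, and all field names are invented), a training pipeline might screen out content flagged as synthetic before it ever reaches the training set:

```python
# Hypothetical sketch: screen candidate training documents for a synthetic-content
# flag before adding them to the training corpus. `detect_watermark` is a
# placeholder; real schemes embed and detect statistical or pixel-level marks.
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str
    metadata: dict

def detect_watermark(doc: Document) -> bool:
    """Placeholder detector: here we only trust an explicit provenance flag.
    A production system would run an actual watermark/provenance check."""
    return bool(doc.metadata.get("synthetic", False))

def build_training_set(candidates):
    """Keep only documents that do not appear to be AI-generated."""
    kept, dropped = [], 0
    for doc in candidates:
        if detect_watermark(doc):
            dropped += 1
            continue
        kept.append(doc)
    print(f"kept {len(kept)} documents, dropped {dropped} flagged as synthetic")
    return kept

if __name__ == "__main__":
    docs = [
        Document("a1", "A human-written news article...", {"synthetic": False}),
        Document("a2", "Text produced by a chatbot...", {"synthetic": True}),
    ]
    training_set = build_training_set(docs)
```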

The researchers also emphasize that it is critical to maintain a representative balance of real and synthetic content in the training data while adequately preserving underrepresented groups. Companies must carefully manage their datasets and monitor them for signs of deterioration. Training data should be diverse and representative of different perspectives, and special effort should be made to incorporate data sources that are typically underrepresented in digital environments. Otherwise, we risk a future in which AI systems become increasingly disconnected from reality, producing outputs that are biased, unreliable, and meaningless. This could have serious implications in many areas, from content generation to decision-making systems. It is true that we consume plenty of AI-generated content in our own lives, but humans probably have coping mechanisms that AI systems do not.
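As a small, hypothetical illustration of that kind of dataset hygiene (the record fields, group labels, and thresholds are all invented for the example), a curation pipeline might track the synthetic share of the corpus and flag groups that fall below a representation floor:

```python
# Hypothetical sketch: monitor a training corpus for two warning signs discussed
# above -- a growing share of synthetic data and under-represented groups.
from collections import Counter

MAX_SYNTHETIC_FRACTION = 0.20   # illustrative policy threshold
MIN_GROUP_FRACTION = 0.05       # flag groups below 5% of the corpus

def audit_corpus(records):
    total = len(records)
    synthetic = sum(1 for r in records if r.get("synthetic"))
    groups = Counter(r.get("group", "unknown") for r in records)

    report = {"synthetic_fraction": synthetic / total,
              "underrepresented_groups": [g for g, n in groups.items()
                                          if n / total < MIN_GROUP_FRACTION]}
    if report["synthetic_fraction"] > MAX_SYNTHETIC_FRACTION:
        print("warning: synthetic share exceeds policy threshold")
    for g in report["underrepresented_groups"]:
        print(f"warning: group '{g}' is underrepresented")
    return report

if __name__ == "__main__":
    corpus = ([{"group": "news", "synthetic": False}] * 60 +
              [{"group": "forums", "synthetic": True}] * 30 +
              [{"group": "regional_blogs", "synthetic": False}] * 2)
    print(audit_corpus(corpus))
```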

The lessons learned from this study echo past cautionary tales, such as atmospheric nuclear fallout contaminating newly produced steel. Just as we had to be careful about the purity of our materials then, we need to be equally careful about the purity of our AI training data now. Through responsible data curation and monitoring, we can hope to guide the development of AI in a way that stays grounded in reality and serves the diverse needs of all communities. The alternative is a dystopian future in which AI tools become increasingly “crazy” and no longer fit for purpose.

About the author

Ranjita Bhattacharya is a senior data scientist in the AI Hub division at BNY Mellon, the world’s largest custodian bank. My experience as a data science and technology consultant spans more than 15 years, across multifaceted technical roles including IT software developer, solution designer, technical analyst, delivery manager, and project manager, consulting for Fortune 500 companies around the world. I hold a bachelor’s degree in computer science and engineering, a master’s degree in data science, and multiple certifications and publications in these fields, reflecting my commitment to continuous learning and knowledge sharing.

Sign up for the free insideBIGDATA Newsletter.

Join us on Twitter: https://twitter.com/InsideBigData1

Join us on LinkedIn: https://www.linkedin.com/company/insidebigdata/

Join us on Facebook: https://www.facebook.com/insideBIGDATANOW