OpenAI has been accused by many parties of training its AI on copyrighted content without permission. Now, a new paper from an AI watchdog organization makes the serious accusation that the company increasingly relied on non-public books it did not license to train its more sophisticated AI models.
AI models are essentially complex prediction engines. Trained on large amounts of data, including books, films, and TV shows, they learn patterns and novel ways to extrapolate from a simple prompt. When a model "writes" an essay on Greek tragedy or "draws" a Ghibli-style image, it is simply pulling from its vast knowledge to approximate; it isn't arriving at anything new.
While many AI labs, including OpenAI, have begun using AI-generated data to train models as they exhaust real-world sources (mainly the public web), few have eschewed real data entirely. That is likely because training purely on synthetic data carries risks, such as degraded model performance.
The new paper, from the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O'Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O'Reilly Media. (O'Reilly is the CEO of O'Reilly Media.)
GPT-4o is the default model in ChatGPT. O'Reilly Media does not have a licensing agreement with OpenAI, the paper says.
"GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content compared to OpenAI's earlier model GPT-3.5 Turbo," the paper's co-authors wrote. "In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O'Reilly book samples."
The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in a language model's training data. Also known as a "membership inference attack," the method tests whether a model can reliably distinguish human-authored text from paraphrased, AI-generated versions of the same text. If it can, this suggests that the model may have prior knowledge of the text from its training data.
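For illustration only, here is a minimal sketch of what a DE-COP-style multiple-choice trial could look like using the OpenAI Python client. The prompt wording, paraphrase handling, and pass/fail scoring below are hypothetical simplifications, not the paper's actual implementation.

```python
# Hypothetical sketch of a DE-COP-style trial: show the model one verbatim book
# passage and several paraphrases, and ask it to pick the verbatim one.
# Reliable above-chance accuracy hints the passage may have appeared in the
# model's training data.
import random
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def decop_trial(model: str, verbatim: str, paraphrases: list[str]) -> bool:
    """Return True if the model picks the verbatim passage from a shuffled lineup."""
    options = paraphrases + [verbatim]
    random.shuffle(options)
    correct = chr(65 + options.index(verbatim))  # "A", "B", ...

    lineup = "\n".join(f"{chr(65 + i)}. {text}" for i, text in enumerate(options))
    prompt = (
        "One of the following passages is quoted verbatim from a published book; "
        "the others are paraphrases. Reply with the letter of the verbatim passage only.\n\n"
        + lineup
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = (response.choices[0].message.content or "").strip().upper()
    return answer.startswith(correct)
```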
The paper's co-authors, O'Reilly, Strauss, and AI researcher Sruly Rosenblat, say they probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models' knowledge of O'Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O'Reilly books to estimate the probability that a particular excerpt had been included in a model's training dataset.
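To give a sense of how many such trials might roll up into a per-model signal, here is a hedged aggregation sketch. It assumes the hypothetical `decop_trial` function from the sketch above; the paper itself reports AUROC-based scores, so plain accuracy versus chance is only an illustrative stand-in.

```python
# Illustrative aggregation over many excerpts, assuming the hypothetical
# decop_trial() defined in the previous sketch and a list of
# (verbatim_excerpt, paraphrases) pairs drawn from one set of books.
def recognition_rate(model: str,
                     excerpts: list[tuple[str, list[str]]]) -> float:
    """Fraction of excerpts for which the model identifies the verbatim passage."""
    hits = sum(decop_trial(model, verbatim, paraphrases)
               for verbatim, paraphrases in excerpts)
    return hits / len(excerpts)

# With one verbatim passage and three paraphrases per trial, chance is about 25%.
# A rate well above chance on paywalled excerpts published before a model's
# training cutoff is the kind of signal the paper treats as evidence of likely
# training-data membership.
```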
According to the paper's results, GPT-4o "recognized" far more paywalled O'Reilly book content than OpenAI's older models, including GPT-3.5 Turbo. That held even after accounting for potential confounding factors, the authors said, such as improvements in newer models' ability to figure out whether text was human-authored.
"GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O'Reilly books published prior to its training cutoff date," the co-authors wrote.
It isn't a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn't foolproof, and that OpenAI might have collected the paywalled book excerpts from users copying and pasting them into ChatGPT.
Muddying the waters further, the co-authors did not evaluate OpenAI's most recent collection of models, which includes GPT-4.5 and "reasoning" models such as o3-mini and o1. It's possible that these models were not trained on the paywalled O'Reilly book data, or were trained on less of it than GPT-4o.
That being said, it is no secret that OpenAI, which has advocated for looser restrictions around developing models on copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models' outputs. That is part of a broader industry trend: AI companies recruiting experts in domains like science and physics so these experts can effectively feed their knowledge into AI systems.
It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms, albeit imperfect ones, that let copyright holders flag content they would prefer the company not use for training purposes.
Still, as OpenAI fights several lawsuits over its training data practices and treatment of copyright law in U.S. courts, the O'Reilly paper is not the most flattering look.
OpenAI did not respond to a request for comment.