- A recent study conducted by Patronus AI, a company founded by former Meta researchers, unveiled the frequency with which prominent AI models generate copyrighted content.
- The research involved testing OpenAI’s GPT-4, Anthropic’s Claude 2, Meta’s Llama 2, and Mistral AI’s Mixtral by prompting them to produce text from well-known copyrighted books in the U.S.
- Among the models assessed, OpenAI’s GPT-4 exhibited the highest rate of generating copyrighted content, with an average of 44% of responses containing text from copyrighted books.
The findings revealed that artificial intelligence models such as OpenAI’s GPT-4 are capable of producing copyrighted content from books like “The Perks of Being a Wallflower,” “The Fault in Our Stars,” and “New Moon.” This research was presented by Patronus AI, a company established by former Meta researchers, known for their expertise in evaluating large language models, which are integral to generative AI technologies.
Accompanying the introduction of CopyrightCatcher, a novel tool, Patronus AI shared the outcomes of a rigorous test designed to assess how frequently leading AI models incorporate copyrighted text in their responses. The study scrutinized four key models: OpenAI’s GPT-4, Anthropic’s Claude 2, Meta’s Llama 2, and Mistral AI’s Mixtral.
Rebecca Qian, the CTO and co-founder of Patronus AI, who previously engaged in responsible AI research at Meta, highlighted the pervasive nature of copyrighted content across all models examined, irrespective of their source code accessibility. Notably, OpenAI’s GPT-4 stood out by producing copyrighted text in 44% of the constructed prompts.
Despite reaching out for comments, OpenAI, Mistral, Anthropic, and Meta did not respond immediately to CNBC’s request for feedback.
Patronus AI exclusively evaluated the models using copyrighted books from the U.S., selecting popular titles from the Goodreads platform. The researchers formulated 100 diverse prompts, such as requesting the initial passage of “Gone Girl” by Gillian Flynn or continuing specific text excerpts like “Before you, Bella, my life was like a moonless night…”. Additionally, they tasked the models with completing text from books like Michelle Obama’s “Becoming.”
OpenAI’s GPT-4 exhibited a notable tendency to reproduce copyrighted content, displaying a lower level of caution compared to other models. When prompted to complete text from certain books, it complied 60% of the time and provided the first passage in approximately one-fourth of the instances.
Anthropic’s Claude 2 demonstrated a more guarded approach, utilizing copyrighted content in only 16% of cases involving text completion and none when asked for the first passage of a book.
Mistral’s Mixtral model successfully completed the initial passage of a book in 38% of attempts but only managed to complete larger text segments 6% of the time. On the other hand, Meta’s Llama 2 incorporated copyrighted content in 10% of prompts, with consistent performance across first-passage and completion requests.
The study’s lead, Anand Kannappan, CEO of Patronus AI and former contributor to explainable AI at Meta Reality Labs, expressed surprise at the uniformity in generating copyrighted content observed across all language models.
This research coincides with a growing conflict between OpenAI and content creators regarding the use of copyrighted material for AI training data, exemplified by the high-profile lawsuit involving The New York Times and OpenAI, viewed as a pivotal moment for the industry. OpenAI has argued that training advanced AI models without copyrighted works is unfeasible due to the broad scope of copyright laws encompassing various forms of human expression.
As the debate intensifies, the industry faces critical questions about the ethical and legal implications of AI-generated content and its impact on intellectual property rights.