
### Frantically Collecting Data: OpenAI and Anthropic Racing to Train AI Models

AI giants like OpenAI and Anthropic are scrambling to find enough reliable training data. That could slow the development of the large models behind their chatbots.

As the Wall Street Journal reports, securing top-tier training data has become a serious challenge for companies like OpenAI and Anthropic.

  • The supply of quality training data is dwindling for OpenAI, Anthropic, and other AI companies.
  • Amid fierce competition in the booming industry, the shortage could slow AI progress.
  • To address the scarcity, companies are considering training models on synthetic data.

Reliable data is a critical asset for AI, and companies such as OpenAI and Anthropic are pursuing it aggressively. Amid escalating competition in a rapidly expanding sector, the data shortage could hold back the development of the large models that underpin their chatbot technologies.

Traditionally, OpenAI’s ChatGPT and rival chatbots are trained on vast datasets of academic papers, news articles, and Wikipedia entries scraped from the web, which is how they learn to generate human-like responses. The premise is straightforward: a model’s accuracy and usefulness depend directly on the quality and integrity of the data it is trained on.

The scarcity could hinder companies’ efforts to make their AI products smarter. Pablo Villalobos, an AI analyst at the research firm Epoch, told the Wall Street Journal that there is a better than 50% chance that demand for high-quality data will outstrip the supply of training material by 2028.

So why are tech firms struggling to source dependable data?

For one thing, only a fraction of online data is suitable for AI training. Much of the publicly accessible web consists of sentence fragments and other textual noise that can keep AI systems from producing coherent responses. The shortage is compounded by the spread of AI-generated text online, which can pollute a model with nonsensical content and lead to what researchers call “model collapse.”
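To make the filtering problem concrete, here is a minimal sketch of the kind of heuristic quality filter a training pipeline might run over raw web text. Every threshold and rule here is an illustrative assumption, not any lab’s actual pipeline; real systems are understood to layer learned quality classifiers and deduplication on top of simple heuristics like these.

```python
import re

def looks_like_quality_text(doc: str) -> bool:
    """Crude, illustrative heuristics for screening raw web text.

    All thresholds below are assumptions made for this sketch,
    not values taken from any real training pipeline.
    """
    words = doc.split()
    if len(words) < 50:                        # too short to teach a model much
        return False
    sentences = [s for s in re.split(r"[.!?]+\s*", doc) if s]
    avg_words = len(words) / max(len(sentences), 1)
    if avg_words < 4:                          # fragmented, listicle-like text
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / len(doc)
    if alpha_ratio < 0.6:                      # markup- or symbol-heavy pages
        return False
    if len(set(words)) / len(words) < 0.3:     # highly repetitive spam
        return False
    return True

good = (
    "Large language models are trained on text gathered from the web. "
    "Because raw pages often contain navigation menus, advertisements, and "
    "broken markup, pipelines typically filter documents before training. "
    "Simple heuristics remove a surprising amount of noise, although real "
    "systems also rely on learned quality classifiers and deduplication to "
    "keep the corpus clean and diverse enough for the model to learn from."
)
bad = "Buy now!! <div>ad</div> 50% OFF >>> click click click"

print(looks_like_quality_text(good))  # True
print(looks_like_quality_text(bad))   # False
```

Filters like these catch obvious junk, but fluent AI-generated text passes them easily, which is one reason the “model collapse” problem described above is hard to engineer away.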

Beyond that, prominent news outlets, social media platforms, and other public data sources have restricted access to their content for AI training, citing concerns over copyright, privacy, and fair compensation. Individuals, meanwhile, are reluctant to hand over private text, such as iMessage conversations, for training purposes.

As a result, companies are hunting for new data sources. OpenAI, for instance, has reportedly considered training GPT-5, its next flagship model, on transcripts of YouTube videos, according to people familiar with the matter. The company has also discussed creating a data marketplace where content providers could be paid for data that proves valuable for model training.

Some companies are turning to synthetic data to improve their models. Anthropic, for one, has fed internally generated data into Claude, its family of AI chatbots, under the guidance of Jared Kaplan, the startup’s chief scientist. A spokesperson confirmed that OpenAI is exploring the approach as well.
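As a rough illustration of the idea, the sketch below has a “teacher” model generate question-and-answer pairs that could be filtered and folded back into a training set. The `generate` function is a hypothetical stand-in for a real model API call, and the prompt and quality gate are assumptions made for this sketch, not Anthropic’s or OpenAI’s actual method.

```python
import json

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to an existing language model.

    A real pipeline would call a model API here; this stub returns a
    canned response so the sketch runs end to end.
    """
    return json.dumps({
        "question": "Why do training pipelines deduplicate web text?",
        "answer": "Repeated documents are overweighted during training, "
                  "which wastes capacity and can amplify memorization.",
    })

def make_synthetic_example(topic: str) -> dict | None:
    """Ask the teacher model for a Q&A pair on `topic`, then sanity-check it."""
    raw = generate(
        f"Write a factual question and answer about {topic} "
        "as JSON with keys 'question' and 'answer'."
    )
    try:
        pair = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed generations are discarded, not trained on
    # Minimal quality gate: both fields present and non-trivial in length.
    if len(pair.get("question", "")) < 10 or len(pair.get("answer", "")) < 20:
        return None
    return pair

topics = ["data curation", "model evaluation"]
examples = [e for t in topics if (e := make_synthetic_example(t)) is not None]
print(examples)
```

The quality gate matters more than the generation step: given the model-collapse concern above, synthetic text is only useful if malformed or low-quality generations are screened out before they re-enter the training corpus.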

Users have already raised concerns about chatbot quality, underscoring the data problem. Some users of GPT-4, OpenAI’s most capable released model, have struggled to get the bot to follow instructions and answer questions properly. Google, similarly, temporarily suspended Gemini’s AI image generation feature after the model produced historically inaccurate depictions of US presidents.

In response, some companies are weighing smaller AI models and exploring other ways to boost their performance. “I think we’re at the end of the era where it’s going to be these giant, giant models,” OpenAI CEO Sam Altman has said.

OpenAI and Google did not immediately respond to Business Insider’s requests for comment sent before publication. Anthropic declined to comment.

Axel Springer, the parent company of Business Insider, has a global agreement allowing OpenAI to train its models on reporting from its media brands.
