AI is running short of online content to consume. While individuals use the internet for all sorts of purposes, AI companies mine its data to train and expand their large language models (LLMs). Chatbots like ChatGPT are built on a vast repository of online text and draw on what they learned from it to formulate their responses.
The challenge is that the internet is finite. AI development demands ever more training data, yet high-quality data is becoming scarce and some organizations are restricting access to theirs. As the Wall Street Journal reports, OpenAI and Google are among the companies grappling with this predicament.
The Insatiable Appetite for Data
The volume of data these companies need, now and in the future, is hard to overstate. According to Epoch researcher Pablo Villalobos, OpenAI trained GPT-4 on roughly 12 trillion tokens, the words and word fragments an LLM ingests. Villalobos estimates that GPT-5, OpenAI’s upcoming model, would need 60 to 100 trillion tokens to sustain the current growth trajectory, which works out to roughly 45 to 75 trillion words by OpenAI’s metrics. Even after exhausting all of the high-quality data available on the internet, the company would still be short by 10 to 20 trillion tokens or more.
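For a concrete sense of the token-to-word arithmetic behind those figures, here is a minimal sketch using tiktoken, OpenAI’s open-source tokenizer. The encoding and sample sentence are illustrative assumptions, and the roughly 0.75 words-per-token ratio is only a rule of thumb, not a statement about any specific training corpus.

```python
# Minimal sketch of the token-to-word relationship behind the figures above,
# using tiktoken (OpenAI's open-source tokenizer). The encoding chosen here
# is an assumption for illustration, not a claim about any training run.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

sample = "Large language models consume text as tokens rather than whole words."
tokens = enc.encode(sample)
words = sample.split()

print(f"{len(words)} words -> {len(tokens)} tokens "
      f"({len(words) / len(tokens):.2f} words per token)")

# English prose averages roughly 0.75 words per token, which is how
# 60-100 trillion tokens maps to about 45-75 trillion words; short
# samples like this one will vary around that average.
```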
While Villalobos predicts that the data scarcity issue may not reach critical levels until around 2028, AI companies are already exploring alternative data sources for model training.
Addressing the Data Dilemma
Several problems compound this data shortage. First, volume: LLM training depends on enormous amounts of text, and there is only so much of it. Second, quality: companies are wary of feeding the entire internet into their models, because doing so imports misinformation and inaccuracies, and filtering out subpar content shrinks the pool of usable training data even further.
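To make the quality problem concrete, here is a hypothetical sketch of the kind of heuristic filter a training pipeline might apply to scraped text. The rules and thresholds are invented for illustration and are not any company’s actual criteria.

```python
# Hypothetical quality filter for scraped web text. The heuristics and
# thresholds below are illustrative assumptions, not a real lab's pipeline.
def looks_high_quality(text: str) -> bool:
    words = text.split()
    if len(words) < 20:                        # too short to add useful signal
        return False
    if len(set(words)) / len(words) < 0.3:     # highly repetitive, likely spam
        return False
    alpha = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    if alpha < 0.8:                            # mostly markup, symbols, or numbers
        return False
    return True

pages = [
    "BUY NOW LIMITED OFFER " * 50,             # spammy, repetitive page
    ("High quality training text tends to be long form prose with varied "
     "vocabulary, such as articles, books, and well written documentation, "
     "rather than navigation menus or advertising copy."),
]
kept = [p for p in pages if looks_high_quality(p)]
print(f"kept {len(kept)} of {len(pages)} pages")   # expect 1 of 2
```

Real pipelines typically combine heuristics like these with deduplication and learned quality classifiers; the point here is simply that every rejected document shrinks the usable pool.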
Moreover, scraping data from the internet raises ethical concerns about user privacy. AI companies often use scraped content with little regard for individual privacy rights, and while some organizations are pushing back against the practice, the absence of comprehensive user protections means that publicly posted data remains fair game for AI training.
As companies seek new data sources, OpenAI is pushing furthest. For GPT-5, the company is exploring the use of transcriptions of public videos, such as those posted to YouTube. It is also developing specialized models for specific niches and designing a system to compensate data providers according to the quality of their data.
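Whisper, OpenAI’s open-source speech-recognition model, illustrates how video audio could become training text. The sketch below uses the openai-whisper package; the audio file is a placeholder, and this is not a claim about OpenAI’s actual pipeline.

```python
# Minimal sketch: turning video audio into training text with the open-source
# whisper package (pip install openai-whisper; requires ffmpeg). The audio
# file is a placeholder; acquiring it, and the right to use it, is the hard
# and legally contested part.
import whisper

model = whisper.load_model("base")              # small model is enough for a demo
result = model.transcribe("example_talk.mp3")   # placeholder audio file

transcript = result["text"]
print(transcript[:200])                         # text like this could then feed a training corpus
```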
Embracing Synthetic Data
One controversial approach some companies are weighing is training on synthetic data: data generated from an existing dataset to produce a new dataset that resembles the original without duplicating it. The method can help preserve data confidentiality, but it raises concerns about “model collapse”: LLMs trained largely on machine-generated output risk stagnating, losing the diversity and fidelity that real data provides.
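The collapse worry is easiest to see in a toy setting. The sketch below is only an analogy, with every number invented for illustration: a “model” that learns word frequencies from its corpus and then generates the next generation’s corpus by sampling from those frequencies. Any word that goes unsampled in one generation can never reappear, so the vocabulary, a stand-in for output diversity, only shrinks.

```python
# Toy illustration of the model-collapse worry, with all numbers invented for
# illustration. The "model" just learns word frequencies from its corpus, then
# generates the next generation's training corpus by sampling from them. Words
# that go unsampled once can never come back, so diversity only shrinks.
from collections import Counter
import random

random.seed(0)
vocab = [f"word{i}" for i in range(200)]
corpus = [random.choice(vocab) for _ in range(2_000)]      # "real" data covering ~all 200 words

print(f"generation  0: {len(set(corpus))} distinct words")
for generation in range(1, 11):
    freqs = Counter(corpus)                                # "train": estimate word frequencies
    words, counts = zip(*freqs.items())
    corpus = random.choices(words, weights=counts, k=400)  # purely synthetic next corpus
    print(f"generation {generation:2d}: {len(set(corpus))} distinct words")
```

Real model collapse involves far more than vocabulary loss, but the mechanism is similar: each generation can only reproduce what the previous one happened to emit, and whatever it misses is gone for good.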
Despite the challenges, AI companies remain optimistic about synthetic data integration. Both Anthropic and OpenAI see potential in this technology for their training datasets, albeit with caution. Finding a balance in implementing synthetic data without hindering model growth is crucial for the future of AI development. It is essential to navigate this landscape thoughtfully to ensure data privacy and model effectiveness.