To improve their artificial intelligence systems, technology companies rely on online data.
Online data has long been a valuable commodity. Companies like Meta and Google have used it to target online advertising, while Netflix and Spotify use it to recommend movies and music. Political campaigns have also used data to identify which groups of voters to focus on.
More recently, it has become clear that digital data is also critical to the development of artificial intelligence. Here is what to know:
Importance of Data Quantity
How well an artificial intelligence system performs depends heavily on how much data it is trained on. AI models tend to become more accurate and more humanlike as their training datasets grow.
Just as a student learns by reading more books and other material, large language models, the systems that underpin chatbots, become more accurate and more capable as they are fed more data.
For instance, OpenAI’s GPT-3, released in 2020, was trained on hundreds of billions of “tokens,” which are essentially words or pieces of words. More recent large language models have been trained on more than three trillion tokens.
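To make the idea of a token concrete, here is a minimal sketch of how text might be broken into words and word fragments. It is a toy illustration only: the function name `toy_tokenize` and the fixed fragment length are assumptions made for this example, and real language models use learned subword vocabularies rather than fixed-length chunks.

```python
import re

def toy_tokenize(text, max_len=4):
    """Split text into words and punctuation marks, then cut any piece
    longer than max_len characters into fixed-length fragments.

    Hypothetical helper for illustration; not the tokenizer used by GPT-3.
    """
    tokens = []
    # \w+ matches runs of letters or digits; [^\w\s] matches one punctuation mark.
    for piece in re.findall(r"\w+|[^\w\s]", text):
        for start in range(0, len(piece), max_len):
            tokens.append(piece[start:start + max_len])
    return tokens

sample = "Chatbots learn from data."
tokens = toy_tokenize(sample)
print(tokens)       # the words and fragments produced for the sample sentence
print(len(tokens))  # how many tokens the sentence counts as
```

Counting tokens this way across every document in a dataset is, roughly, how figures like "hundreds of billions of tokens" are tallied, though the exact count depends on the tokenizer used.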
Data Sources for GPT-3
GPT-3 was trained on billions of websites, books, and Wikipedia articles collected from across the internet. OpenAI has not disclosed the data it used to train its more recent models.
Key datasets involved in training GPT-3 include:
- Common Crawl: Text from web pages collected since 2007.
- Wikipedia: 3 billion tokens from English-language Wikipedia pages.
- Books 1 and Books 2: OpenAI has not disclosed their contents; they are believed to contain text from millions of published books.
- WebText2: Web pages linked from Reddit that received three or more upvotes, a signal of user approval.
Together, these datasets made up the text on which GPT-3 was trained.
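The WebText2 selection rule described above, keeping pages linked from Reddit that received three or more upvotes, can be sketched as a simple filter. The records, field names, and URLs below are hypothetical and made up for illustration, and dropping repeated links is an added assumption; this is not OpenAI's actual pipeline.

```python
MIN_UPVOTES = 3  # threshold described in the article

def select_links(posts, min_upvotes=MIN_UPVOTES):
    """Return the URLs of posts that meet the upvote threshold.

    Repeated URLs are kept only once (an assumption for this sketch),
    and the original order of first appearance is preserved.
    """
    seen = set()
    selected = []
    for post in posts:
        url = post["url"]
        if post["upvotes"] >= min_upvotes and url not in seen:
            seen.add(url)
            selected.append(url)
    return selected

# Hypothetical sample records, invented for illustration.
posts = [
    {"url": "https://example.com/a", "upvotes": 5},
    {"url": "https://example.com/b", "upvotes": 2},
    {"url": "https://example.com/c", "upvotes": 3},
    {"url": "https://example.com/a", "upvotes": 7},
]
print(select_links(posts))
```

In this sample, the link with two upvotes falls below the threshold and is left out, while the repeated link is kept only once.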
Source: OpenAI, The New York Times