In late 2021, OpenAI faced a supply problem. The artificial intelligence lab had exhausted every available reservoir of reputable English-language text on the internet while developing its latest A.I. system, and it needed much more data to train the next version of its technology.
So OpenAI researchers created a speech recognition tool called Whisper. It could transcribe the audio from YouTube videos, yielding new conversational text that would make the A.I. system smarter.
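For context, OpenAI later released Whisper as open-source software, and the public Python package can transcribe an audio file in a few lines. The sketch below illustrates that kind of transcription under stated assumptions; it is not a description of OpenAI's internal pipeline, and the audio file name is hypothetical.

    # Illustrative sketch using the public whisper package (pip install openai-whisper).
    # This shows what automatic transcription looks like; it is not OpenAI's
    # internal data pipeline. "talk.mp3" stands in for a hypothetical local file.
    import whisper

    model = whisper.load_model("base")       # load a small pretrained checkpoint
    result = model.transcribe("talk.mp3")    # run speech recognition on the file
    print(result["text"])                    # the full transcript as plain text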
Some at OpenAI discussed whether such a move would violate YouTube’s rules, according to people with knowledge of the conversations. YouTube, which is owned by Google, prohibits the use of its videos for applications that are “independent” of the platform.
Ultimately, an OpenAI team transcribed more than one million hours of YouTube videos, and Greg Brockman, OpenAI’s president, was reportedly personally involved in collecting the videos. The transcribed text was then fed into GPT-4, widely considered one of the world’s most powerful A.I. models and the basis for the latest version of the ChatGPT chatbot.
The race to lead A.I. has become a desperate hunt for the digital data needed to advance the technology. To obtain it, tech companies including OpenAI, Google and Meta have cut corners, circumvented their own corporate policies and weighed legally ambiguous moves, according to an examination by The New York Times.
At Meta, which owns Facebook and Instagram, managers, lawyers and engineers discussed buying the publishing house Simon & Schuster to procure long written works, according to recordings of internal meetings obtained by The Times. They also discussed gathering copyrighted data from across the internet, even if that risked legal repercussions, reasoning that negotiating licenses with publishers, artists, musicians and the news industry would take too long.
Before 2020, most A.I. models were trained on relatively little data. That changed with the 2020 publication of an influential paper on so-called scaling laws, led by the researcher Jared Kaplan, and the release that year of GPT-3, a sophisticated language model that epitomized the new era. Researchers began feeding far larger volumes of data into their models.
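For reference, the paper’s central result can be written as a simple power law. The formula below is an approximate restatement of the data term reported by Kaplan and his co-authors, quoted here only as an illustration:

    % Approximate data scaling law from Kaplan et al. (2020): test loss L
    % falls as a power law in dataset size D (measured in tokens), where
    % D_c is a fitted constant.
    L(D) \approx \left( \frac{D_c}{D} \right)^{\alpha_D}, \qquad \alpha_D \approx 0.095

In plain terms, each tenfold increase in training data buys a predictable, if diminishing, drop in loss, which helps explain why researchers began seeking ever-larger datasets.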
Last year, Google broadened the privacy policy for its free consumer apps. The company says it uses publicly available information to help train its language A.I. models and to build products and features like Google Translate, Bard and Cloud AI capabilities.