### Collaborating with Businesses to Develop New AI Training Datasets: OpenAI’s Initiative

It is widely acknowledged that the datasets used to train AI models are fundamentally flawed.

Because Western images dominated the internet when these datasets were created, image collections tend to exhibit a U.S.- and Western-centric bias. Moreover, problematic language and prejudices are embedded in the data used to train large language models like Meta’s Llama 2, as a recent study by the Allen Institute for AI highlighted.

Models trained on this data can amplify these imperfections. OpenAI has said it intends to combat these issues by collaborating with external entities to develop new, ideally improved datasets.

OpenAI recently unveiled Data Partnerships, an effort to work with external organizations to build public and private datasets for AI training. According to a blog post from OpenAI, Data Partnerships will “empower more companies to steer the direction of AI” and “derive value from more actionable insights.”

The ultimate goal, as outlined by OpenAI, is for AI models to understand all subject areas, industries, cultures, and languages, so that the resulting AI is both safe and beneficial for all of society. This requires the broadest possible training datasets. OpenAI states, “Enhancing AI models’ comprehension of your domain through the inclusion of your content can render them more beneficial to you.”

Under the Data Partnerships initiative, OpenAI plans to amass “extensive” datasets that “mirror the real world” and are presently challenging to access online. While the company intends to engage with various modalities such as images, audio, and video, it specifically seeks data that captures “human intent” (e.g., long-form text or dialogues) across diverse languages, subjects, and formats.

OpenAI has indicated that it will work with companies to digitize training data where needed, using tools such as optical character recognition and automated speech recognition, and to remove sensitive or private information.

Initially, OpenAI aims to create two distinct types of datasets: a public dataset for AI model training and a set of private datasets for training proprietary AI models. OpenAI has already partnered with the Icelandic Government and Miðeind ehf to enhance GPT-4’s proficiency in the Icelandic language, and with the Free Law Project to improve its models’ comprehension of legal documents. The private datasets cater to companies that want to keep their data private while enhancing OpenAI’s models’ understanding of their specific domain.

To maximize the benefits for all stakeholders, OpenAI is seeking partners to aid in educating AI about our world.

Can OpenAI do better than the many dataset creation initiatives that preceded it? I am somewhat skeptical; mitigating dataset bias is a problem that has stumped many experts worldwide. At the very least, the company should be transparent and honest about the processes involved and the inevitable challenges of dataset creation.

Despite the lofty rhetoric in the blog post, there seems to be a distinct commercial motive: enhancing the performance of OpenAI’s models at the expense of others, without any form of compensation for the data originators. While this may be within OpenAI’s rights, it raises concerns given the grievances voiced and legal actions taken by creators who allege that OpenAI used their work without authorization.

Last modified: December 26, 2023
