Cohere for AI, the nonprofit research lab that Cohere established in 2022, today unveiled Aya, an open-source large language model (LLM). Aya supports 101 languages, more than double the number covered by existing open-source models.
The researchers have also released the Aya dataset, a corresponding collection of human annotations. The release matters because less common languages are hard to train on: little source material is available for them. Cohere claims that its engineers have also found ways to improve model performance with less training data.
Sara Hooker, VP of research at Cohere and head of Cohere for AI, described the Aya project, launched in January 2023, as a massive endeavor involving more than 3,000 partners from 119 countries.
In an interview with VentureBeat, Hooker referred to the data generated by the Aya project as “gold dust,” emphasizing its value for LLM training. She pointed to the more than 513 million fine-tuning annotations as a measure of the project’s scale and impact.
Cohere’s co-founder and CTO, Ivan Zhang, said the company is releasing human annotations across 100+ cultures to make AI more accessible beyond the English-speaking world. He called the initiative a significant achievement by the Cohere for AI team.
The new model and dataset focus on the potential of LLMs for languages and cultures that existing models often overlook. Benchmarked against popular open-source multilingual models, Aya outperforms them significantly on standard tests and extends coverage to more than 50 languages that earlier models did not cover.
Hooker stressed the rarity of the Aya dataset and its value for building models that support many languages. The ability to tailor models to specific subsets of languages is seen as crucial for serving global linguistic diversity and for optimization.
Aleksa Gordic, the developer of YugoGPT, highlighted the significance of initiatives like Aya in advancing non-English language models. He emphasized the need for extensive and high-quality data sources to create top-tier LLMs and called for support from governments worldwide to preserve language and culture in the evolving digital landscape.
The datasets and the Aya model from Cohere for AI are now available on Hugging Face, facilitating further research and development in the field of language models.