One of the domains advancing rapidly thanks to generative AI is voice cloning, the replication of a person’s speech characteristics, including pitch, tone, rhythm, mannerisms, and unique vocal nuances.
Meta Platforms, the parent company of Facebook, Instagram, WhatsApp, and Oculus VR, has introduced its own comprehensive voice cloning system, named Audiobox. Meanwhile, companies such as ElevenLabs have garnered substantial revenue by focusing on this field.
Recently, researchers from the Facebook AI Research (FAIR) lab unveiled Audiobox on Meta’s platform. This system, described as a “novel foundational research model for audio generation,” is an evolution of their prior work known as Voicebox.
Audiobox, as detailed on its website, uses a combination of voice inputs and natural language text prompts to generate voices and sound effects for a range of applications, such as music production.
Users can simply type the sentence they want spoken, or a description of the sound they want, and Audiobox generates the audio. Individuals can also record their own speech and have Audiobox clone it.
Meta has introduced a family of AI models under the Audiobox umbrella, designed for speech manipulation and for generating diverse sounds and sound effects, such as animal noises or ambient sounds. These models are all built on Audiobox’s shared self-supervised learning (SSL) foundation.
Unlike supervised learning that relies on labeled data, SSL is a deep learning technique where AI algorithms generate labels for unlabeled data. The FAIR researchers emphasized the significance of SSL in their methodology, aiming to train the model using audio data without conventional supervision like transcripts or captions.
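To make the idea of "generating labels for unlabeled data" concrete, here is a minimal Python sketch of one common form of self-supervision, masked prediction. It is an illustration only and is not claimed to be Audiobox's actual training objective; the function name `make_masked_example`, its parameters, and the sample values are all hypothetical.

```python
import random

def make_masked_example(frames, mask_ratio=0.3, mask_value=0.0, seed=None):
    """Turn an unlabeled sequence into an (input, targets) training pair.

    The "labels" are simply the original values at the hidden positions,
    so no transcript or caption is needed. (Hypothetical helper.)
    """
    if not frames:
        raise ValueError("frames must not be empty")
    rng = random.Random(seed)
    # Hide at least one position, otherwise there is nothing to predict.
    n_masked = max(1, int(len(frames) * mask_ratio))
    masked_positions = sorted(rng.sample(range(len(frames)), n_masked))

    model_input = list(frames)
    targets = {}
    for pos in masked_positions:
        targets[pos] = frames[pos]     # label derived from the data itself
        model_input[pos] = mask_value  # what the model is allowed to see
    return model_input, targets

# Made-up stand-in for a short run of unlabeled audio feature values.
unlabeled_audio = [0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.09, 0.27, -0.35, 0.16]
model_input, targets = make_masked_example(unlabeled_audio, seed=0)
```

A model trained this way would be asked to reconstruct the hidden values in `targets` from the surrounding context in `model_input`, which is how training can proceed on raw audio alone.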
Like most AI models, Audiobox still depends heavily on human-generated data for training. The FAIR researchers used a large dataset comprising 160K hours of speech, 20K hours of music, and 6K hours of sound samples.
The dataset encompasses various audio sources like audiobooks, podcasts, speeches, and real-world recordings in diverse languages and accents to ensure inclusivity and accuracy across different demographics.
While the origin of the data used remains unspecified in the research paper, concerns have been raised regarding potential copyright issues, prompting inquiries into the data’s sourcing and usage permissions.
Meta has provided interactive demonstrations showcasing Audiobox’s capabilities, allowing users to replicate their voices and experiment with creating new voices based on textual descriptions or user-recorded samples.
Meta also restricts the use of Audiobox in certain states, such as Illinois and Texas, because of regulatory constraints, which currently limits its application for commercial purposes in those regions.
As AI technologies rapidly evolve, it is anticipated that Audiobox and similar innovations will undergo further developments, potentially leading to commercial availability and expanded usability in the near future.