Researchers at Amazon have trained the largest text-to-speech model yet, which they claim exhibits “emergent” qualities that improve its ability to speak even complex sentences naturally. The advance could be what the technology needs to finally escape the uncanny valley.
It was always expected that these models would grow and improve, but what the researchers specifically hoped to see was the kind of leap in capability that language models exhibited once they passed a certain size threshold. Once large language models (LLMs) grow beyond a certain scale, they become far more robust and versatile, able to perform tasks they were not explicitly trained on.
To be clear, this does not mean the models are approaching sentience; rather, past a certain threshold, their performance on certain conversational AI tasks improves dramatically. The team at Amazon AGI, whose name leaves no doubt about its goal of artificial general intelligence, expected a similar leap as text-to-speech models scaled up, and their research suggests this is indeed what happens.
The new model is called Big Adaptive Streamable TTS with Emergent abilities, or BASE TTS for short. Its largest version was trained on 100,000 hours of publicly available speech, 90% of it in English and the remainder in German, Dutch, and Spanish.
At 980 million parameters, BASE-large appears to be the biggest model in this category. For comparison, the team also trained 400M- and 150M-parameter models on 10,000 and 1,000 hours of audio, respectively. The idea was that by comparing models of different sizes, they could pinpoint where emergent behaviors begin to appear.
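For illustration only, the three variants can be pictured as a simple scaling grid. The parameter counts and audio-hour figures below come from the description above; the names and structure of the snippet are hypothetical, since no code or configuration has been released:

```python
# Hypothetical sketch of the scaling comparison described above; only
# the parameter counts and audio-hour figures come from the paper.
BASE_TTS_VARIANTS = [
    {"name": "BASE-small",  "parameters": 150_000_000, "audio_hours": 1_000},
    {"name": "BASE-medium", "parameters": 400_000_000, "audio_hours": 10_000},
    {"name": "BASE-large",  "parameters": 980_000_000, "audio_hours": 100_000},
]

for variant in BASE_TTS_VARIANTS:
    print(f"{variant['name']}: {variant['parameters'] / 1e6:.0f}M parameters, "
          f"trained on {variant['audio_hours']:,} hours of speech")
```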
As it turned out, it was the medium-sized model that demonstrated the leap in capability the team was looking for. Ordinary speech quality improved only marginally (ratings rose just slightly), but the model exhibited a set of emergent abilities that could be observed and measured. The paper gives these examples of challenging text:
- Compound nouns: The Beckhams decided to rent a charming stone-built quaint countryside holiday cottage.
- Emotions: “Oh my gosh! Are we really going to the Maldives? That’s unbelievable!” Jennie squealed, bouncing on her toes with uncontained glee.
- Foreign words: “Mr. Henry, renowned for his mise en place, orchestrated a seven-course meal, each dish a pièce de résistance.”
- Paralinguistics: “Shh, Lucy, shhh, we mustn’t wake your baby brother,” Tom whispered, as they tiptoed past the nursery.
- Punctuations: She received an odd text from her brother: ‘Emergency @ home; call ASAP! Mom & Dad are worried…#familymatters.’
- Questions: But the Brexit question remains: After all the trials and tribulations, will the ministers find the answers in time?
- Syntactic complexities: The movie that De Moya who was recently awarded the lifetime achievement award starred in 2022 was a box-office hit, despite the mixed reviews.
As the authors note, these sentences were deliberately constructed to contain challenging tasks that text-to-speech engines typically stumble over: parsing complex sentences, conveying emotion, pronouncing foreign words, and handling unusual punctuation. BASE TTS did not handle them flawlessly, but it fared considerably better than contemporary models such as Tortoise and VALL-E.
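As a rough illustration of how such a category-based evaluation could be driven, the sketch below runs a few of the paper's test sentences through a stand-in `synthesize` function and writes one clip per category for human raters. The harness and the function are assumptions made here for illustration; the paper's actual evaluation relied on expert listener judgments, not this code:

```python
from pathlib import Path

# A few of the paper's test sentences, keyed by the category they probe.
TEST_SENTENCES = {
    "compound_nouns": ("The Beckhams decided to rent a charming stone-built "
                       "quaint countryside holiday cottage."),
    "paralinguistics": ("“Shh, Lucy, shhh, we mustn’t wake your baby brother,” "
                        "Tom whispered, as they tiptoed past the nursery."),
    "questions": ("But the Brexit question remains: After all the trials and "
                  "tribulations, will the ministers find the answers in time?"),
}

def synthesize(text: str) -> bytes:
    """Stand-in for the system under test (BASE TTS, Tortoise, VALL-E, ...).
    Returns placeholder bytes; a real harness would return encoded audio."""
    return b""

def run_suite(output_dir: str = "tts_eval") -> None:
    out = Path(output_dir)
    out.mkdir(exist_ok=True)
    for category, sentence in TEST_SENTENCES.items():
        audio = synthesize(sentence)
        # One clip per category, to be scored later by human listeners.
        (out / f"{category}.wav").write_bytes(audio)

if __name__ == "__main__":
    run_suite()
```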
The researchers have posted examples of the model handling these tricky texts on a dedicated website. The demos are surely cherry-picked, but they are impressive nonetheless, and the model's size and the volume of its training data appear to be the pivotal factors enabling it to handle such linguistic subtleties.
It is worth emphasizing that this model is still experimental and not intended for commercial use. Future research will aim to pin down the threshold at which emergent abilities appear and to make training and deployment of such models more efficient.
Notably, the model is “streamable,” as its name implies: rather than producing whole sentences at once, it generates speech moment to moment at a relatively low bitrate. The team has also tried packaging speech metadata, such as emotional cues and prosody, into a separate, low-bandwidth stream that accompanies the audio itself.
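Here is a minimal sketch of what such a two-stream interface could look like, assuming a hypothetical chunk-wise generator. The metadata fields mirror the cues described above (emotion, prosody), but the API itself is invented for illustration:

```python
from dataclasses import dataclass
from typing import Iterator, Tuple

@dataclass
class SpeechMetadata:
    """Per-chunk cues carried on the low-bandwidth side stream.
    Field names are illustrative, not taken from the paper."""
    emotion: str
    pitch: float  # relative pitch, 1.0 = neutral
    rate: float   # relative speaking rate, 1.0 = neutral

def stream_tts(text: str, chunk_ms: int = 200) -> Iterator[Tuple[bytes, SpeechMetadata]]:
    """Hypothetical streaming interface: yields (audio_chunk, metadata)
    pairs as they are generated, rather than one utterance-sized blob.
    A real implementation would emit roughly chunk_ms of encoded audio
    per step; here each chunk is an empty placeholder."""
    for _ in text.split():
        audio_chunk = b""  # placeholder for ~chunk_ms of low-bitrate audio
        yield audio_chunk, SpeechMetadata(emotion="neutral", pitch=1.0, rate=1.0)

# A client can begin playback as soon as the first chunk arrives:
for audio, meta in stream_tts("Shh, we mustn't wake the baby."):
    pass  # feed `audio` into the playback buffer; apply `meta` as needed
```

The appeal of such a split is that the audio stream stays small while the side channel carries cues a client can render or ignore independently.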
These advances suggest that 2024 could be a breakout year for text-to-speech, with implications for many applications, accessibility in particular. The researchers have declined to release the model's source code and related data, citing the risk of misuse by bad actors, though that information seems likely to surface eventually.