Alibaba is inviting comparisons between its latest AI video generator and OpenAI’s Sora. The company’s Institute for Intelligent Computing recently unveiled EMO, short for “Emote Portrait Alive,” which turns still images of faces into video of characters speaking and singing. The technology points to a future in which characters in AI-generated video worlds, such as those Sora produces, can speak and perform rather than simply appear on screen.
Alibaba has shared video demonstrations of EMO’s capabilities on GitHub, including a clip in which the Sora character sings Dua Lipa’s hit song “Don’t Start Now” with vibrant energy. The character is known from a Sora demo set in an AI-generated Tokyo just after a rainstorm, which lends the performance a surreal yet captivating quality.
One striking example has Audrey Hepburn speaking the audio from a popular clip of Riverdale’s Lili Reinhart expressing her emotions. Although Hepburn’s head stays in a fixed position in the source image, EMO animates her facial expressions to match the audio and captures the tone of the dialogue.
Unlike the face-swapping techniques that became popular in earlier years, EMO goes beyond mimicry by adding nuanced emotion and gesture to a character’s performance. This also sets it apart from tools like NVIDIA Omniverse’s “Audio2Face,” which drives 3D animated characters rather than generating lifelike video of a real face.
The demos make a persuasive case for EMO, but the real test is whether it can convey intense emotion from audio cues alone. The examples include performances in different languages, such as English and Korean, which suggests the approach is not tied to any single language.
EMO’s method relies on training with extensive audio and video datasets so that characters respond with believable emotion, without the need for an intermediate 3D model. By combining reference-attention and audio-attention mechanisms, EMO produces animated characters that follow the emotional nuances of the input audio while preserving the original characteristics of the base image.
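To make the two attention mechanisms more concrete, here is a minimal sketch in Python of how a generated frame's features might first attend to a reference image and then to an audio clip. It is not Alibaba's implementation. The function names, the toy vectors, and the choice to leave out learned projections and the rest of the generation pipeline are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention with no learned projections.

    Each query vector attends over the context vectors, which act as
    both keys and values (a simplification for this sketch).
    """
    dim = len(context[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        )
    return outputs


def add_residual(x, update):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[a + b for a, b in zip(rx, ru)] for rx, ru in zip(x, update)]


def animate_step(frame_tokens, reference_tokens, audio_tokens):
    """Hypothetical block: reference-attention followed by audio-attention."""
    # Reference-attention: pull in identity features from the still image.
    h = add_residual(frame_tokens, cross_attention(frame_tokens, reference_tokens))
    # Audio-attention: let the audio features steer expression and motion.
    return add_residual(h, cross_attention(h, audio_tokens))


# Toy inputs (assumed): 2 frame tokens, 3 reference tokens, 2 audio tokens.
frame = [[0.1, 0.0, 0.2, 0.0], [0.0, 0.3, 0.0, 0.1]]
reference = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
audio = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]

out = animate_step(frame, reference, audio)
print(len(out), len(out[0]))  # prints: 2 4
```

The ordering in `animate_step` mirrors the description above: the reference image anchors who the character is, and the audio then shapes how the face moves. A real system would use learned weights and many such layers.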
The implications of this technology are broad, hinting at a new era of AI-driven creativity and expression. Amid the excitement, however, it is important to acknowledge the ethical questions it raises, especially for professionals in the entertainment industry, as the boundary between AI-generated and human performances blurs.