How Microsoft's VASA-1 AI app brings still images to life with animation and sound

AI breakthrough by Microsoft Research Asia: converts images and audio into lifelike videos with facial expressions. VASA-1 model pioneers natural interaction with AI avatars.

20 Apr 2024 14:27 IST

Updated On 20 Apr 2024 14:38 IST

The world of technology is evolving rapidly, with AI leading the charge. In the past, AI was primarily used for generating text, but now it is venturing into video creation.

Recently, a team of AI researchers at Microsoft Research Asia developed an AI application that can transform still images of people and audio tracks into animated videos.

This groundbreaking application not only animates still images but also accurately portrays the people in the images speaking or singing, complete with appropriate facial expressions.

The demonstration video below has been widely shared on social media:

Microsoft just dropped VASA-1.
This AI can make single image sing and talk from audio reference expressively. Similar to EMO from Alibaba
10 wild examples:
1. Mona Lisa rapping Paparazzi pic.twitter.com/LSGF3mMVnD
— Min Choi (@minchoi) April 18, 2024

The AI model behind this innovation, called 'VASA,' is designed to create lifelike talking faces of virtual characters from a single image and an audio clip.

With VASA-1, the first model built on this framework, the researchers focused on lip movements that synchronise closely with the audio, along with a wide range of facial expressions and natural head movements that add authenticity and liveliness.

The key to this innovation lies in the facial dynamics and head movement generation model.

4. Out-of-distribution generalization - singing audios pic.twitter.com/h7BvTq4vAE
— Min Choi (@minchoi) April 18, 2024

The researchers have successfully developed a system that can generate high-quality videos at up to 40 frames per second with minimal latency.

The Microsoft team claims that their AI-generated videos not only synchronise lip movements with audio but also accurately convey a variety of facial expressions and natural head movements.

With VASA-1, a static image can be made to talk, sing, and express emotions in sync with a supplied audio track.

How does it work?

The development of VASA-1 involved extensive training of the AI system on a vast dataset, allowing it to learn and reproduce the nuances of human emotions and speech patterns.

Rendering these realistic animations typically takes about two minutes on a desktop-grade Nvidia RTX 4090 GPU.

Although there is no specific release date mentioned in the research paper, the team believes that VASA-1 brings them closer to a future where AI avatars can engage in natural interactions.
