Microsoft published a research paper this week that highlights a new AI model called VASA-1, which can transform a single photo and audio clip of a person into a realistic lip-sync video, complete with facial expressions, head movements and all.
The AI model was trained on AI-generated images from generators such as DALL·E-3, which the researchers then overlaid with audio clips. The results are images turned into videos of talking faces.
The researchers built on technology from competitors such as Runway and Nvidia, but they say in the paper that their method is better and more realistic than existing methods, and “significantly superior” to them.
The researchers said the model can take audio of any length and generate a talking face to match the clip.
The only non-AI image the researchers experimented with was the Mona Lisa. They made the iconic image lip-sync to Anne Hathaway's “Paparazzi,” which begins with the lines “No, I'm a paparazzi, I don't play no yahtzee.”
A screenshot from the video, mid-frame. Credit: Entrepreneur
The Mona Lisa was an example of an image that the AI model was not trained on but could still manipulate. The model could also animate artistic photos, handle singing audio, and process speech in languages other than English.
The researchers noted that the model could work in real time, with a demo video showing the model instantly animating images with head movements and facial expressions.
Deepfakes, or digitally altered media of a person that can spread misinformation or use someone's likeness without permission, are a danger posed by advanced AI that can generate digital media from relatively few reference points.
Microsoft addressed this concern generally in the paper, with the researchers stating, “We oppose any behavior to create misleading or harmful content of real people and are interested in applying our technique to advance counterfeit detection.”
The researchers stated that their technique also had potentially positive applications, such as improving accessibility and supporting educational efforts.
Google demonstrated a similar research project last month, featuring an AI capable of taking a photo and creating a video from it that the user can then control with their voice. The AI was able to add head movements, blinks and hand gestures.