Google Announces VideoPoet, its First AI Video Generation Model

Have you ever wondered what it would be like to watch a video of a dragon breathing fire, a skeleton drinking soda, or a raccoon traveling the world? Well, now you can, thanks to VideoPoet, a new artificial intelligence (AI) model developed by Google Research.

VideoPoet is a large language model (LLM), a type of AI usually used for generating text and code. However, what’s impressive and unusual is that VideoPoet can also generate high-quality videos from various inputs, such as text, images, and audio. What’s even more remarkable is that it can edit and stylize existing videos and produce matching audio for them.

How does VideoPoet work?

VideoPoet is based on a simple idea: convert any autoregressive language model (LLM) into a video generator. To do this, VideoPoet uses multiple tokenizers: algorithms that transform images, video, and audio clips into sequences of discrete tokens, and back again.

VideoPoet trains an LLM to learn across video, image, audio, and text modalities, using the MAGVIT V2 tokenizer for video and image, and the SoundStream tokenizer for audio. Once the model generates tokens conditioned on some context, these can be converted back into a viewable representation with the tokenizer decoders.
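The encode → generate → decode loop described above can be sketched in a few lines. This is a toy illustration only: MAGVIT V2 and SoundStream are learned neural tokenizers, whereas here a simple fixed codebook quantizes a 1-D signal, and the "LLM" step is a trivial placeholder.

```python
import numpy as np

# Toy stand-in for a neural tokenizer (MAGVIT V2 / SoundStream are far more
# sophisticated): quantize continuous values against a small fixed codebook.
CODEBOOK = np.linspace(-1.0, 1.0, 256)  # 256 discrete tokens

def tokenize(signal: np.ndarray) -> np.ndarray:
    """Map each continuous sample to the index of its nearest codebook entry."""
    return np.abs(signal[:, None] - CODEBOOK[None, :]).argmin(axis=1)

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Map token indices back to (approximate) continuous values."""
    return CODEBOOK[tokens]

def toy_lm_generate(context_tokens: np.ndarray, n_new: int) -> np.ndarray:
    """Placeholder for the autoregressive LLM: it just repeats the last token,
    whereas the real model predicts each next token from all prior ones."""
    out = list(context_tokens)
    for _ in range(n_new):
        out.append(out[-1])
    return np.array(out)

# Encode a short continuous "frame", generate a continuation, decode it back.
frame = np.sin(np.linspace(0, np.pi, 8))
tokens = tokenize(frame)                 # media -> discrete tokens
extended = toy_lm_generate(tokens, 4)    # tokens -> more tokens (the LLM step)
reconstruction = detokenize(extended)    # tokens -> viewable representation
```

The key point is that once everything is a token sequence, generation is ordinary next-token prediction; the tokenizer decoders turn the result back into pixels or audio samples.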

What can VideoPoet do?

VideoPoet is capable of a wide variety of video generation tasks, including:

  • Text-to-video: VideoPoet can output high-motion variable-length videos given a text prompt.
  • Image-to-video: VideoPoet can animate input images to produce motion.
  • Video stylization: given a video’s depth and optical flow maps (which capture its motion), VideoPoet can paint new content on top in a text-guided style.
  • Video inpainting and outpainting: VideoPoet can edit an (optionally cropped or masked) video, such as filling in missing regions or extending the video frame.
  • Video-to-audio: It can also output audio to match an input video without using text as guidance.
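One way a single autoregressive model can serve all of these tasks is by expressing each task as a different arrangement of condition tokens in the input sequence. The sketch below is a hypothetical illustration of that idea, not VideoPoet's actual interface; the special token names are invented.

```python
# Hypothetical task markers (invented for illustration): the model learns to
# treat whatever follows a given marker as conditioning for that task.
SPECIAL = {"<t2v>": 0, "<i2v>": 1, "<style>": 2, "<v2a>": 3, "<begin_out>": 4}

def build_sequence(task: str, condition_tokens: list[int]) -> list[int]:
    """Prefix the conditioning tokens with a task marker; the model then
    generates output tokens (video or audio) after <begin_out>."""
    return [SPECIAL[task]] + condition_tokens + [SPECIAL["<begin_out>"]]

# Text-to-video: condition on (toy) text tokens.
t2v_seq = build_sequence("<t2v>", [101, 102, 103])
# Image-to-video: condition on (toy) image tokens instead.
i2v_seq = build_sequence("<i2v>", [501, 502])
```

Because every task reduces to "predict the tokens after the marker," no separately trained per-task components are needed, which is the versatility the article highlights.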

Why is VideoPoet important?

VideoPoet is a breakthrough in video generation, as it demonstrates that LLMs can generate high-quality videos with coherent large motions, which is challenging for existing models. VideoPoet also showcases the versatility and scalability of LLMs, as it can integrate many video generation capabilities within a single model, rather than relying on separately trained components that specialize in each task.

VideoPoet opens up new possibilities for creating and editing videos, as well as for telling visual stories. For example, Google Research has produced a short movie composed of many short clips generated by VideoPoet, based on a script written by another Google LLM, Bard. The movie tells the story of a traveling raccoon named Rookie, and you can watch it on YouTube.

Where can I learn more about VideoPoet?

Want to see VideoPoet in action? Google Research’s blog showcases fascinating examples of videos created by the model.

This blog post by VideoPoet researchers sheds light on the technical details of VideoPoet.

Showcase

Here are some example outputs of VideoPoet:

From text input

“A Raccoon dancing in Times Square”

Source: Google Research

From image to video

Source: Google Research

Video stylization with text input

Source: Google Research

Audio generation

Source: Google Research

Where can I access VideoPoet?

Unfortunately, Google has so far only published a paper describing how the model works, along with some example outputs. So while VideoPoet currently has no direct rival, Google hasn’t said anything about when the model will be publicly available.

However, given the current state of AI and Google’s track record, we can make an educated guess about when the model might be released.

The current AI race is wild, to say the least. Having seen the world transformed by AI in 2023 alone, we can safely assume the pace will only accelerate. It’s only a matter of time before OpenAI, Microsoft, or a startup like Runway ML comes up with an even better model, so Google may want to get there first.

Google already has very ambitious plans for its Gemini model, which it describes as natively multimodal. Since VideoPoet is also an LLM, just like Gemini, it’s conceivable that VideoPoet could become part of it. But that’s a shot in the semi-dark.

But even if VideoPoet remains a standalone model, Google may want to release it in 2024. I think the only delays now will be fine-tuning and safety checks.
