LLMs Are a Poet and They Didn't Even Know It
Google's VideoPoet is a versatile LLM-based video generation tool that demonstrates a new path forward for producing smooth video.
Machine learning models for video generation have made significant strides in recent years, showcasing remarkable capabilities in creating realistic and diverse visual content. These models, often based on diffusion models, generative adversarial networks, or variational autoencoders, have proven successful in tasks such as video synthesis, style transfer, and even generating entirely new, plausible video sequences.
Despite these successes, one persistent challenge remains: most current models struggle to generate large motions in videos without introducing noticeable artifacts. Producing coherent, smooth movement across frames is still a complex task. The struggle is particularly evident in dynamic scenes or videos with complex interactions, where maintaining consistency and natural flow is considerably more difficult.
This limitation can lead to jittery or unrealistic transitions between frames, which degrades the overall quality and visual appeal of generated videos. Researchers are actively exploring techniques to address the problem, such as incorporating attention mechanisms, refining training methodologies, and leveraging advanced optimization strategies, all aimed at capturing and reproducing large-scale motions with higher fidelity.
Diffusion-based models have recently become the dominant approach to video generation. But a team at Google Research observed that large language models (LLMs) have an excellent ability to learn across many types of input, like language, code, and audio, and reasoned that these capabilities might be well-suited to video generation. To test that theory, they developed a video generation LLM called VideoPoet. The model is capable of text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio tasks. In a break from more common approaches, all of these abilities coexist in a single model.
VideoPoet utilizes an autoregressive language model trained on a dataset of video, image, audio, and text data. Since LLMs require their inputs to be discrete tokens, a format that raw video and audio do not naturally come in, the team leveraged preexisting video and audio tokenizers to make the translation. After the model produces a result, the tokenizers' decoders turn the output tokens back into viewable or audible content.
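To make the tokenize-generate-decode flow concrete, here is a minimal, hypothetical sketch of that pipeline. The names (VideoTokenizer, AutoregressiveLM, text_to_video) are illustrative placeholders, not VideoPoet's actual API, and the internals are stubbed out; the point is only the order of operations described above.

```python
# Hypothetical sketch: discrete tokens in, discrete tokens out, then decode.
import numpy as np


class VideoTokenizer:
    """Stand-in for a pretrained video tokenizer that maps frames to discrete tokens."""

    def encode(self, frames: np.ndarray) -> list[int]:
        # A real tokenizer compresses spatio-temporal patches into codebook indices;
        # here we just return a fake token sequence.
        return list(range(16))

    def decode(self, tokens: list[int]) -> np.ndarray:
        # A real decoder reconstructs viewable frames from the token sequence.
        return np.zeros((len(tokens), 64, 64, 3), dtype=np.uint8)


class AutoregressiveLM:
    """Stand-in for the language model that predicts the next token in a mixed stream."""

    def generate(self, prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
        # A real model samples tokens one at a time, conditioned on everything so far.
        return prompt_tokens + [0] * max_new_tokens


def text_to_video(prompt: str) -> np.ndarray:
    text_tokens = [ord(c) % 256 for c in prompt]  # placeholder text tokenization
    model = AutoregressiveLM()
    video_tokenizer = VideoTokenizer()

    # The LLM generates discrete video tokens conditioned on the text tokens...
    generated = model.generate(text_tokens, max_new_tokens=256)
    video_tokens = generated[len(text_tokens):]

    # ...and the tokenizer's decoder turns them back into viewable frames.
    return video_tokenizer.decode(video_tokens)


frames = text_to_video("a raccoon backpacking through a forest")
print(frames.shape)  # (256, 64, 64, 3) with these placeholder components
```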
The system was benchmarked against other popular models, including Phenaki, VideoCrafter, and Show-1. A cohort of evaluators was asked to rate the results of these models across a diverse array of input prompts, and they overwhelmingly preferred VideoPoet's output in categories like text fidelity and motion interestingness. This suggests that the new model has made real progress on the long-standing difficulty of producing large motions in generated videos.
To demonstrate VideoPoet's text-to-video generation capabilities, the team asked the Bard chatbot to write a detailed short story about a traveling raccoon and to turn each scene into a prompt for VideoPoet. The resulting clips were stitched together to produce the video below.
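A rough sketch of that demo workflow is shown below, assuming hypothetical helpers (write_story, generate_clip) standing in for the chatbot and text-to-video calls; neither is an actual Bard or VideoPoet API, and the "frames" are placeholders.

```python
# Hypothetical sketch: story -> scene prompts -> one clip per scene -> stitched video.

def write_story(topic: str) -> list[str]:
    """Stand-in for prompting a chatbot to write a story broken into scene descriptions."""
    return [
        f"{topic} packs a tiny backpack at sunrise",
        f"{topic} rides a leaf boat down a stream",
        f"{topic} watches city lights from a rooftop",
    ]


def generate_clip(scene_prompt: str) -> list[str]:
    """Stand-in for a text-to-video call; returns placeholder 'frames' for the scene."""
    return [f"frame of: {scene_prompt}"] * 8


def make_travel_video(character: str) -> list[str]:
    scenes = write_story(character)
    video = []
    for scene in scenes:
        video.extend(generate_clip(scene))  # stitch the clips together in story order
    return video


print(len(make_travel_video("a traveling raccoon")))  # 3 scenes x 8 frames = 24
```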
The work done by Google Research hints at the tremendous potential of LLMs to handle a wide range of video generation tasks. Hopefully other teams will continue exploring additional opportunities in this area to produce a new generation of even more powerful tools.