A New Kind of Noise-Canceling AI

Diffusion Forcing combines the best of next-token and full-sequence prediction models, making it valuable in robotics and video generation.

Experimenting with Diffusion Forcing in a robot control system (📷: Mike Grimmett / MIT CSAIL)

Many of the breakthrough artificial intelligence (AI) applications that have emerged in the past few years owe their success to a broad category of algorithms called sequence models. The algorithms that underpin popular large language models like Llama, ChatGPT, and Gemini belong to a particular category of sequence models that perform next-token (or word) prediction. Text-to-video tools, such as Sora, are also based on sequence models, but in these cases the models predict an entire sequence at once rather than one token at a time.

Traditionally, sequence models built for next-token prediction can generate sequences of variable lengths but struggle with long-term planning. Full-sequence models, on the other hand, excel at long-term planning but are limited to fixed-length inputs and outputs. Each class of model therefore comes with its own trade-offs, leaving something different to be desired.
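The trade-off can be made concrete with a stylized sketch. The two toy "models" below are illustrative assumptions (not anything from the paper): a one-step predictor whose small errors compound over a long horizon, and a one-shot generator whose output length is baked in at design time.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Next-token (autoregressive) generation: variable length ---
# A toy "model" that predicts each value from the previous one.
# You can stop whenever you like, but small per-step errors
# accumulate over long horizons, hurting long-term planning.
def next_token_generate(start, steps):
    seq = [start]
    for _ in range(steps):
        pred = 0.9 * seq[-1] + rng.normal(scale=0.1)  # noisy one-step prediction
        seq.append(pred)
    return np.array(seq)

# --- Full-sequence generation: fixed length ---
# A toy "model" that emits an entire trajectory in one shot.
# The whole plan is produced jointly, but the horizon is fixed.
FIXED_LEN = 8
def full_sequence_generate(start):
    t = np.arange(FIXED_LEN)
    return start * 0.9 ** t  # jointly planned, always FIXED_LEN values

short = next_token_generate(1.0, steps=3)   # stop early: fine
long_ = next_token_generate(1.0, steps=50)  # errors compound with horizon
plan = full_sequence_generate(1.0)          # exactly FIXED_LEN values, no more
```

The sketch is only meant to show the shape of the problem: variable length with drifting errors on one side, global coherence at a fixed horizon on the other.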

An overview of the method (📷: B. Chen et al.)

Researchers at MIT CSAIL and the Technical University of Munich want to have their cake and eat it too, so they developed a new approach called Diffusion Forcing. This technique combines the strengths of both approaches, improving the quality and adaptability of sequence models.

At its core, Diffusion Forcing builds on "Teacher Forcing," which simplifies sequence generation into smaller, manageable steps by predicting one token at a time. Diffusion Forcing introduces the concept of "fractional masking," where noise is added to the data in varying amounts, mimicking the process of partially obscuring or masking tokens. The model is then trained to remove this noise and predict the next few tokens, allowing it to simultaneously handle denoising and future predictions. This method makes the model highly adaptable to tasks involving noisy or incomplete data, enabling it to generate precise, stable outputs.
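The fractional-masking idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the cosine schedule, tensor shapes, and function names here are all invented for the example. The key point it shows is that every token is corrupted by an independently sampled amount of noise, in contrast to standard diffusion (one noise level for the whole sequence) or teacher forcing (clean history, predict the next token).

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_alpha_bar(t, T):
    # Cumulative signal-retention schedule: t = 0 leaves a token clean,
    # t = T destroys it into (nearly) pure noise.
    return np.cos(0.5 * np.pi * t / T) ** 2

def fractionally_mask(tokens, T=100):
    """Noise each token of a sequence by an independently sampled amount.

    A denoising model would then be trained to recover the clean tokens
    (or the injected noise `eps`) given `noisy` and the per-token levels.
    """
    K, D = tokens.shape
    t = rng.integers(0, T + 1, size=K)       # independent noise level per token
    ab = cosine_alpha_bar(t, T)[:, None]     # (K, 1), broadcasts over features
    eps = rng.normal(size=(K, D))
    noisy = np.sqrt(ab) * tokens + np.sqrt(1.0 - ab) * eps
    return noisy, t, eps

tokens = rng.normal(size=(5, 4))  # a toy 5-token sequence, 4 features per token
noisy, levels, eps = fractionally_mask(tokens)
```

Because the noise level varies token by token, one training objective covers the whole spectrum from "history is clean, predict the future" to "everything is partially obscured, denoise it all," which is what lets the trained model handle noisy or incomplete observations.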

The researchers validated the Diffusion Forcing technique through a series of experiments in robotics and video generation. In one experiment, the team applied the method to a robotic arm tasked with swapping two toy fruits across three circular mats. Despite visual distractions like a shopping bag obstructing its view, the robotic arm successfully completed the task, demonstrating Diffusion Forcing’s ability to filter out noisy data and make reliable decisions.

In another set of experiments, Diffusion Forcing was tested in video generation, where it was trained on gameplay footage from Minecraft and simulated environments in Google's DeepMind Lab. Compared to traditional diffusion models and next-token models, Diffusion Forcing produced higher-resolution and more stable videos from a single starting frame, even outperforming baselines that struggled to maintain coherence beyond 72 frames.

Last but not least, in a maze-solving task, the method generated faster and more accurate plans than six baseline models, demonstrating its potential for long-horizon tasks like motion planning in robotics.

Diffusion Forcing has been shown to provide a flexible framework for both long-term planning and variable-length sequence generation, making it valuable in diverse fields such as robotics, video generation, and AI planning. The technique's ability to handle uncertainty and adapt to new inputs could ultimately lead to advancements in how robots learn and perform complex tasks in unpredictable environments.

nickbild

R&D, creativity, and building the next big thing you never knew you wanted are my specialties.
