Do the Massachusetts One-Step
DMD accelerates diffusion-based text-to-image generation by up to 30 times without sacrificing quality, producing an image in a single step instead of many.
At the intersection of natural language processing and computer vision, text-to-image AI models have exhibited a remarkable ability to generate realistic images from textual descriptions. Over the years, significant advancements in AI have propelled the development of increasingly sophisticated text-to-image models, like Stable Diffusion and DALL-E, which hold enormous potential to enhance applications ranging from creative content generation to e-commerce and entertainment.
One notable advancement in this field is the rise of diffusion models, which have captured a great deal of attention for their ability to generate high-quality images. Diffusion models operate by starting from pure noise and iteratively refining it until a clear and coherent image emerges. Each refinement step requires a full pass through a large neural network, gradually adding structure and removing noise. While effective at producing realistic images, this iterative approach is inherently slow because of the many network evaluations involved.
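To make the bottleneck concrete, the snippet below is a schematic sketch, not the API of any real diffusion library, of a standard sampling loop. The toy_denoiser stand-in, the noise schedule, and the step count are all illustrative assumptions; the point is simply that the network is evaluated once per step, so dozens of steps mean dozens of expensive forward passes for every image.

```python
# Schematic sketch of iterative diffusion sampling (toy stand-ins, not a real model).
import torch

def toy_denoiser(x, sigma):
    # Stand-in for a large text-conditioned denoising network that predicts the clean image.
    return torch.zeros_like(x)

def sample(denoiser, steps=50, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                         # start from pure noise
    sigmas = torch.linspace(1.0, 0.0, steps + 1)   # noise levels, high to low
    for i in range(steps):                         # one network call per step
        pred = denoiser(x, sigmas[i])              # predict a cleaner image
        # Euler-style update toward the prediction as the noise level drops.
        x = x + (sigmas[i + 1] - sigmas[i]) * (x - pred) / sigmas[i]
    return x

image = sample(toy_denoiser)   # 50 network evaluations for a single image
```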
The time-intensive nature of this process has been a significant bottleneck, limiting the scalability and practical applicability of diffusion models in real-time or large-scale image generation tasks. To address these limitations, researchers have been exploring ways to accelerate generation while maintaining the quality of the results. One promising solution, developed by a team at MIT and Adobe Research, collapses the image generation process into a single step. Called Distribution Matching Distillation (DMD), the method leverages the knowledge contained in cutting-edge models like Stable Diffusion to train a simpler model that produces similar results in one pass.
DMD employs a teacher-student framework, where a simpler "student" model is trained to mimic the behavior of a more complex "teacher" model that generates images. In this case, the teacher model is Stable Diffusion v1.5.
The process relies on a combination of two losses: a regression loss, which stabilizes training by anchoring the mapping from noise to image, and a distribution matching loss, which pushes the probability distribution of generated images toward that of real-world images. A pair of diffusion models act as guides during training, estimating how the generated images differ from real ones and steering the updates of the single-step generator accordingly.
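To show how those pieces fit together, here is a minimal PyTorch-style sketch of one DMD-style generator update, under stated assumptions: the TinyDenoiser class, the noise levels, and the loss weight are placeholders invented for illustration, and the real system distills a text-conditioned Stable Diffusion v1.5 rather than a toy convolutional net. The structure, however, mirrors the description above: a regression term ties the student's noise-to-image mapping to precomputed teacher samples, while a distribution matching term is driven by the disagreement between the frozen teacher and an auxiliary score model trained on the generator's own outputs.

```python
# Minimal sketch of a DMD-style training step (placeholder networks, not Stable Diffusion).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Stand-in for a diffusion denoiser; a real one is a large text-conditioned U-Net."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x, sigma):
        return self.net(x)  # a real model would also condition on sigma and the text prompt

teacher    = TinyDenoiser()   # frozen "real" score model (the teacher)
fake_score = TinyDenoiser()   # trained online to model the distribution of generated images
generator  = TinyDenoiser()   # the one-step student being distilled

for p in teacher.parameters():
    p.requires_grad_(False)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_f = torch.optim.Adam(fake_score.parameters(), lr=1e-4)

def training_step(noise, teacher_sample, lambda_reg=0.25):
    # 1) One-step generation: map pure noise straight to an image.
    fake_img = generator(noise, sigma=torch.ones(noise.shape[0]))

    # 2) Regression loss: anchor the noise-to-image mapping to a sample the
    #    teacher produced from the same noise (precomputed with the full sampler).
    reg_loss = F.mse_loss(fake_img, teacher_sample)

    # 3) Distribution matching loss: re-noise the generated image and compare the
    #    denoising directions of the real-score and fake-score models; their
    #    difference nudges generated samples toward the real image distribution.
    sigma = torch.rand(noise.shape[0], 1, 1, 1) * 0.8 + 0.1
    noisy = fake_img + sigma * torch.randn_like(fake_img)
    with torch.no_grad():
        direction = fake_score(noisy, sigma) - teacher(noisy, sigma)
    dm_loss = (fake_img * direction).mean()   # gradient w.r.t. fake_img is `direction`

    loss = dm_loss + lambda_reg * reg_loss
    opt_g.zero_grad(); loss.backward(); opt_g.step()

    # 4) Keep the fake-score model in sync with the generator's current outputs
    #    using an ordinary denoising objective.
    noisy_detached = fake_img.detach() + sigma * torch.randn_like(fake_img)
    f_loss = F.mse_loss(fake_score(noisy_detached, sigma), fake_img.detach())
    opt_f.zero_grad(); f_loss.backward(); opt_f.step()
    return loss.item(), f_loss.item()

# Toy usage with random tensors standing in for real latents and teacher outputs.
noise = torch.randn(2, 3, 32, 32)
teacher_sample = torch.randn(2, 3, 32, 32)   # would come from the teacher's full sampling run
print(training_step(noise, teacher_sample))
```

Because the auxiliary score model is updated alongside the generator, the distribution matching signal keeps tracking where the generated images currently fall short of the real distribution, which is what lets a single forward pass stand in for the teacher's many refinement steps.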
In terms of performance, DMD shows promising results across various benchmarks. It accelerates diffusion models like Stable Diffusion and DALL-E 3 by 30 times while maintaining or surpassing the quality of the generated images. On ImageNet benchmarks, DMD comes within roughly 0.3 Fréchet inception distance (FID) of the original multi-step model, indicating that the one-step generator still produces high-quality and diverse images.
The researchers noted that some quality issues remain in more complex text-to-image applications. Additional limitations stem from the capabilities of the chosen teacher model itself, since the student cannot easily rise above its teacher. Looking ahead, the team is considering more advanced teacher models to overcome these issues.
Despite these limitations, the example results produced using the DMD approach are quite impressive. In the side-by-side comparisons, it is difficult to tell which images were produced by DMD and which by Stable Diffusion. But when actually generating the images, that 30 times speed-up would be unmistakable.