A Pocket Picasso
MobileDiffusion, a mobile-friendly text-to-image model with a novel architecture, reduces inference times for rapid design iteration.
Text-to-image generative AI models represent a groundbreaking advancement in artificial intelligence, offering the capability to transform textual descriptions into visually compelling images. Driven by powerful neural networks, these models have found applications across many domains. One of the primary uses is creative content generation, enabling artists, designers, and content creators to translate their written concepts into vibrant visual representations.
One notable class of text-to-image generative models is the diffusion-based models, with Stable Diffusion being among the most popular. These models generate high-quality images by starting from random noise and iteratively denoising it, step by step, into an image that matches the text prompt. The results often exhibit impressive realism and detail, making them particularly appealing for artistic endeavors, conceptual design, and storytelling.
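To make that iterative denoising concrete, here is a minimal sketch of a DDPM-style sampling loop in PyTorch. The `denoiser` network, the linear noise schedule, and the latent shape are all illustrative assumptions for clarity; they do not reflect Stable Diffusion's or MobileDiffusion's actual implementation.

```python
import torch

# Minimal DDPM-style sampling loop (illustrative only).
# `denoiser` is a hypothetical network that predicts the noise present in x_t,
# given the timestep t and a text-conditioning embedding.
def sample(denoiser, text_emb, steps=50, shape=(1, 4, 64, 64), device="cpu"):
    betas = torch.linspace(1e-4, 0.02, steps, device=device)  # simple linear schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)             # start from pure noise
    for t in reversed(range(steps)):                   # iteratively denoise
        eps = denoiser(x, t, text_emb)                 # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise        # estimate of x_{t-1}
    return x                                           # denoised latent (or image)
```

Because the network must be evaluated once per step, the total cost of generation scales directly with the number of denoising steps, which is why both the per-step cost and the step count matter for speed.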
Despite their remarkable capabilities, diffusion-based models face a significant drawback due to their sheer size and computational demands. Operating these models requires powerful and expensive computer systems, creating barriers for many creators who may lack access to such resources. The limitations become particularly evident when attempting to run these models on mobile platforms, where the computational load can be overwhelming, leading to slow performance or, in some cases, making deployment impossible.
This computational bottleneck poses challenges for the iterative nature of the creative process, hindering the quick exploration and refinement of ideas on more accessible platforms. A small group of engineers at Google Research has been working on a solution to this problem called MobileDiffusion, an efficient latent diffusion model purpose-built for mobile platforms. On higher-end smartphones, MobileDiffusion can produce high-quality 512 x 512 pixel images in about half a second.
Traditionally, diffusion models are slowed down by two primary factors: their complex architectures, and the need to evaluate the model many times during the iterative denoising process that generates the images. The Google Research team did a deep dive into Stable Diffusion’s UNet architecture to look for opportunities to reduce these sources of slowness. Informed by this analysis, they designed MobileDiffusion around three components: a text encoder, a custom diffusion UNet, and an image decoder. The model contains only 520 million parameters, making it well suited to mobile devices like smartphones.
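As a rough sketch of how those three components fit together, the snippet below wires a hypothetical text encoder, denoising UNet, and image decoder into a single generation function, reusing the `sample` loop from the earlier sketch. All names are placeholders for illustration, not MobileDiffusion's actual API.

```python
import torch

# Illustrative wiring of the three described components. The callables
# (`text_encoder`, `unet`, `decoder`) are placeholders, not real modules.
@torch.no_grad()
def generate(text_encoder, unet, decoder, prompt_tokens):
    cond = text_encoder(prompt_tokens)   # 1. encode the prompt into an embedding
    latent = sample(unet, cond)          # 2. iterative denoising (loop sketched above)
    return decoder(latent)               # 3. decode the clean latent into pixels
```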
The transformer blocks of UNets have a self-attention layer that is extremely computationally intensive. Since these transformers are typically spread throughout the entire UNet, they contribute significantly to lengthy run times. In this case, the researchers borrowed an idea from the UViT architecture and concentrated the transformer blocks at the bottleneck of the UNet. Because the feature maps at that stage have much lower spatial resolution, the attention mechanism, whose cost grows quadratically with the number of spatial tokens, becomes far less resource-intensive.
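The snippet below illustrates why this helps: the number of pairwise attention scores grows quadratically with the number of spatial tokens, so attention over a small bottleneck feature map is far cheaper than over a high-resolution one. The feature-map sizes and embedding width are arbitrary examples, not MobileDiffusion's actual dimensions.

```python
import torch
import torch.nn as nn

# Self-attention forms an n x n matrix of pairwise scores over n spatial tokens,
# so its cost grows roughly quadratically with feature-map resolution.
for res in (64, 32, 16):                               # example feature-map sizes
    n = res * res
    print(f"{res}x{res} feature map -> {n} tokens -> {n * n:,} pairwise scores")

# Attention over the small bottleneck map is cheap; sizes here are arbitrary.
attn = nn.MultiheadAttention(embed_dim=320, num_heads=8, batch_first=True)
tokens = torch.randn(1, 16 * 16, 320)                  # 16x16 bottleneck feature map
out, _ = attn(tokens, tokens, tokens)                  # self-attention over 256 tokens
print(out.shape)                                       # torch.Size([1, 256, 320])
```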
It was also discovered that the convolution blocks distributed throughout the UNet consume a large share of computational resources. These blocks are essential for feature extraction and information flow, so they must be retained, but the researchers found that the standard convolution layers could be replaced with lightweight separable convolution layers. This modification reduced computational complexity while maintaining image quality.
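For a sense of the savings, the sketch below compares a standard 3x3 convolution with a depthwise separable equivalent (a depthwise 3x3 followed by a pointwise 1x1) in PyTorch. The channel counts are arbitrary examples, not the model's actual configuration.

```python
import torch
import torch.nn as nn

in_ch, out_ch = 256, 256                                # example channel counts

# Standard 3x3 convolution.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

# Depthwise separable version: per-channel 3x3, then a 1x1 channel mixer.
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depthwise
    nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # pointwise
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, in_ch, 32, 32)
print(standard(x).shape, separable(x).shape)            # identical output shapes
print(f"standard:  {n_params(standard):,} parameters")  # ~590K
print(f"separable: {n_params(separable):,} parameters") # ~68K
```

The separable layer produces an output of the same shape with roughly an order of magnitude fewer parameters, which is the kind of trade-off that makes it attractive for mobile deployment.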
The team similarly improved the model’s image decoder and made a number of other enhancements to further improve mobile performance. The results of these optimizations proved to be very impressive. When MobileDiffusion was compared with Stable Diffusion on an iPhone 15 Pro, inference time dropped from almost eight seconds to less than one second. These speeds allow generated images to be continually updated in real time as a user types and refines their text prompt, which could be a major boon to creative content developers.