Diffusion Models, Now in 3D!
MIT researchers coaxed 2D diffusion models into generating sharp, realistic 3D objects by refining the Score Distillation algorithm.
The advancement of artificial intelligence text-to-image generators, especially diffusion models, has been a huge boon to many creative pursuits, ranging from design to marketing. But now that these tools have entered mainstream use and have had lots of users banging on them for a while, their limitations are really starting to become apparent. The early problems with extra fingers or legs being drawn on people have largely been fixed, but a persistent issue remains β existing models have been built to produce 2D images, and they fail miserably when asked for 3D objects.
This shortcoming severely limits the applicability of these algorithms for use in areas like virtual reality, engineering design, and robotics, so a number of efforts have been undertaken to remedy it. Initial efforts took perhaps the most obvious route, which involved retraining existing models. But this is an extremely costly and time-consuming approach, and since there is limited 3D training data available, it was not hugely successful. As such, researchers turned their attention to making the most of what they already had.
One particularly promising approach recently developed, called Score Distillation, makes it possible to generate 3D images from standard 2D models. But the results often end up being too cartoonish or otherwise unrealistic for real-world use. While imperfect, this technique appeared to have a lot of potential, so a team led by researchers at MIT took a closer look at it to see if they could get it into shape. Spoiler alert: they did.
The stock Score Distillation algorithm iteratively refines a 3D object by using a diffusion model to generate a 2D image of it from a random camera angle. With each 2D image that is generated, the 3D object is refined. This process is repeated until the 3D object matches what was requested.
Diffusion models work by starting with random noise, then gradually performing a denoising process until the desired result emerges. In the case of Score Distillation, the denoising process involves a calculation that is too computationally expensive to execute, so a shortcut is taken that introduces a random component instead.
The team zeroed in on this portion of the Score Distillation algorithm and found that it is the culprit causing unrealistic 3D objects to be generated. Since the actual computation is too complex to perform, they looked for a way to approximate the result of the actual calculation. It would not be perfect, but if they could find a good approximation, they knew it would be much better than random noise.
If at first you don't succeed, try, try again. That was the mantra of the researchers as they tested one approximation technique after another. Ultimately, their efforts paid off when they landed on a good solution. Using their optimized version of the Score Distillation algorithm, they found that they could produce sharp and realistic 3D images β no model retraining or fine-tuning required.
Looking ahead, the team is planning to investigate how they could solve this problem even more effectively. They also intend to explore ways that their work might improve image editing techniques in the future.