The Blind Leading the Blind
Researchers found that LLMs trained only on text can generate images to train computer vision systems without any real-world pictures.
Artificial intelligence (AI) image generators like DALL-E 3, Midjourney, and Stable Diffusion are now well known for their ability to produce creative and realistic images from text-based prompts. These tools have proven highly valuable in fields ranging from entertainment and marketing to education and scientific research. But building these advanced AI systems remains a huge challenge. They typically require vast amounts of annotated image data for training, and such datasets are hard to come by, as well as time-consuming and expensive to compile manually.
Might there be another path forward that eliminates the need for all that image data? Perhaps. Large language models (LLMs) are another red-hot area of AI research. These models have proven remarkably adept at understanding natural language and producing human-like responses to questions, capabilities acquired through training on massive amounts of text that give the models a broad understanding of the world.
That understanding often extends beyond language itself, so a team of researchers at MIT CSAIL recently asked whether an LLM's knowledge of real-world objects might be sufficient to produce images, much as existing text-to-image tools do. To test the idea, they prompted an LLM to write a computer program that renders an image matching their specifications. Somewhat surprisingly, the approach worked.
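To make the setup concrete, here is a minimal sketch of the kind of program an LLM might emit when asked to "draw a red circle on a white background." The scene, names, and rendering approach are illustrative assumptions, not taken from the paper; the image is just a grid of RGB pixels written out as a plain PPM file, with no image libraries involved.

```python
# Illustrative sketch: an image-rendering program of the sort an LLM
# trained only on text could write. The image is a HEIGHT x WIDTH grid
# of (r, g, b) tuples, saved in the simple binary PPM (P6) format.

WIDTH, HEIGHT = 64, 64

def render_scene():
    """Return a HEIGHT x WIDTH grid of RGB pixels: a red circle on white."""
    image = [[(255, 255, 255) for _ in range(WIDTH)] for _ in range(HEIGHT)]
    cx, cy, radius = 32, 32, 14  # circle centre and radius, in pixels
    for y in range(HEIGHT):
        for x in range(WIDTH):
            # Fill every pixel inside the circle with red.
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                image[y][x] = (220, 40, 40)
    return image

def save_ppm(image, path):
    """Write the pixel grid as a binary PPM (P6) file."""
    with open(path, "wb") as f:
        f.write(f"P6 {WIDTH} {HEIGHT} 255\n".encode())
        for row in image:
            for r, g, b in row:
                f.write(bytes((r, g, b)))

if __name__ == "__main__":
    save_ppm(render_scene(), "circle.ppm")
```

The key point is that the model never outputs pixels directly: it outputs code, and running the code produces the picture.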
Although the LLM was never trained on any image data, it proved capable of generating some reasonably good images. And when the user continued prompting the model for revisions, the images improved further. This suggests that LLMs form a sort of “mental picture” of real-world objects from being trained on a wide range of text that describes them in different ways.
This was an interesting finding on its own, but the researchers went on to show that it is more than a high-tech parlor trick. They used their technique to prompt an LLM to generate a wide range of images, from simple shapes to full scenes, which then served as a dataset for training a computer vision system. That system proved capable of recognizing objects in real photos, and it even outperformed computer vision systems trained on other procedurally generated image datasets.
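A toy sketch can show the shape of this second stage, scaled way down: procedurally generated images serve as labeled training data for a classifier, which is then evaluated on images it never saw. Everything here is an illustrative stand-in for the paper's much larger pipeline; the shapes, the single hand-picked feature, and the nearest-centroid "model" are all assumptions made for the sake of a runnable example.

```python
# Toy sketch: train a classifier purely on procedurally generated shape
# images, then test it on freshly generated ones. Stand-in for training
# a vision system on LLM-generated images.
import random

SIZE = 32

def draw(shape, cx, cy, r):
    """Render a filled circle or square into a SIZE x SIZE binary grid."""
    grid = [[0] * SIZE for _ in range(SIZE)]
    for y in range(SIZE):
        for x in range(SIZE):
            if shape == "circle":
                inside = (x - cx) ** 2 + (y - cy) ** 2 <= r * r
            else:  # square
                inside = abs(x - cx) <= r and abs(y - cy) <= r
            if inside:
                grid[y][x] = 1
    return grid

def fill_ratio(grid):
    """Fraction of the shape's bounding box that is filled:
    roughly pi/4 (~0.785) for a circle, exactly 1.0 for a square."""
    coords = [(y, x) for y in range(SIZE) for x in range(SIZE) if grid[y][x]]
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    box = (max(ys) - min(ys) + 1) * (max(xs) - min(xs) + 1)
    return len(coords) / box

def sample(shape, rng):
    """Generate one random image of the given shape class."""
    r = rng.randint(5, 9)
    cx = rng.randint(r, SIZE - 1 - r)
    cy = rng.randint(r, SIZE - 1 - r)
    return draw(shape, cx, cy, r)

def train(rng, n=50):
    """Nearest-centroid 'model': the mean fill ratio of each class."""
    return {
        shape: sum(fill_ratio(sample(shape, rng)) for _ in range(n)) / n
        for shape in ("circle", "square")
    }

def predict(centroids, grid):
    """Classify an image by its nearest class centroid."""
    f = fill_ratio(grid)
    return min(centroids, key=lambda s: abs(centroids[s] - f))

if __name__ == "__main__":
    rng = random.Random(0)
    model = train(rng)
    correct = sum(
        predict(model, sample(shape, rng)) == shape
        for shape in ("circle", "square")
        for _ in range(20)
    )
    print(f"held-out accuracy: {correct}/40")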
Before you switch to an LLM for your text-to-image generation tasks, note that this early work produces clipart-style drawings, a far cry from the ultra-realistic images of state-of-the-art text-to-image generators. Significant further enhancements will be needed to rival models trained on actual image data, if that ever proves possible at all.
As a next step, the team plans to explore additional tasks that LLMs may be suited to. They also hope to improve their current vision model by letting the LLM work with it directly, rather than only indirectly through generated training images.