
Filling in the Blanks

The MAGE framework unifies image generation and recognition, leading to synergies and efficiencies in these tasks not previously possible.

Nick Bild
1 year ago • Machine Learning & AI
MAGE can fill in the gaps to complete an image (📷: Alex Shipps/MIT CSAIL via Midjourney)

Image generation and image recognition are two fundamental tasks in the field of artificial intelligence (AI) that have witnessed significant advancements in recent years. These tasks are crucial for a wide range of applications and have become integral components of many AI systems.

Image generation, also known as generative modeling, involves the creation of new images by AI systems. With the advent of deep learning techniques, particularly generative adversarial networks and variational autoencoders, the ability to generate highly realistic and coherent images has reached impressive levels. The progress in image generation has led to applications such as realistic image synthesis, artwork creation, and data augmentation for training other AI models.

On the other hand, image recognition focuses on the identification and classification of objects, scenes, or patterns within images. Convolutional neural networks, in particular, have revolutionized image recognition, enabling machines to achieve human-level or even superhuman-level performance in certain tasks. Image recognition plays a crucial role in diverse applications such as autonomous vehicles, medical imaging diagnosis, security systems, and content-based image retrieval.

Although image generation and image recognition are closely related, the methods used for each have so far been very different and have remained distinct. Because the models are trained separately, they miss out on the synergies that could arise if the two tasks learned from one another. Training and maintaining separate models also adds a great deal of overhead.

For the first time, a framework has been created to unify image generation and representation learning. Called the Masked Generative Encoder (MAGE), this system, developed by researchers at MIT's Computer Science and Artificial Intelligence Laboratory, can fill in missing parts of an image by leveraging two broad functionalities: identifying images and generating new ones that are photorealistic.

Training of MAGE begins by dividing each image into semantic tokens, each representing a small area of pixels. This avoids the need to calculate a pixel-level reconstruction loss, which has previously been shown to produce blurry, low-quality results. Next, a portion of the semantic tokens is masked, effectively rendering those tokens invisible to the model. By masking a large portion of the image (e.g., 75%) and then working to recreate the original, MAGE learns to generate high-quality reconstructions. By leaving the image unmasked, the same model can learn an encoding that is useful for image recognition.
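To make the masking step concrete, here is a minimal sketch in Python. It is not the researchers' code: the token grid size, the codebook size of 1024, the MASK_ID sentinel, and the function names are all assumptions made for illustration. It only shows how a fixed fraction of token positions can be hidden and how the original token ids at those positions become the reconstruction targets.

```python
import random

MASK_ID = -1  # hypothetical sentinel for a hidden token (an assumption, not from MAGE)


def mask_tokens(tokens, mask_ratio=0.75, seed=None):
    """Hide a fraction of the tokens; return the masked copy and the hidden positions."""
    rng = random.Random(seed)
    n_mask = round(len(tokens) * mask_ratio)
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK_ID
    return masked, positions


def reconstruction_targets(tokens, positions):
    """Map each hidden position to the original token id the model should predict."""
    return {p: tokens[p] for p in positions}


# Made-up token ids standing in for a tokenized image (a flattened 16 x 16 grid).
data_rng = random.Random(0)
tokens = [data_rng.randrange(1024) for _ in range(256)]

masked, positions = mask_tokens(tokens, mask_ratio=0.75, seed=1)
targets = reconstruction_targets(tokens, positions)
print(len(positions), "of", len(tokens), "tokens masked")
```

Calling mask_tokens with mask_ratio=0.0 returns the tokens untouched, which corresponds to the unmasked case the article describes for learning a recognition-friendly encoding.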

After being trained on a large, unlabeled dataset, MAGE excelled at few-shot image classification. Tests on the ImageNet dataset revealed that the framework classified images correctly in 71.9% of cases after being shown only ten labeled examples of each class. Moreover, when MAGE was given previously unseen images that were 75% masked, its reconstructions looked very close to the original, unmasked images.
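For readers unfamiliar with few-shot classification, the sketch below shows the general idea on synthetic data. It is an illustration only and does not reproduce how the researchers evaluated MAGE: the nearest-centroid classifier, the 8-dimensional vectors, and the three made-up classes are assumptions. The vectors stand in for the embeddings a pretrained encoder would produce, and each class is learned from just ten examples.

```python
import random


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_centroids(support):
    """support maps each label to its few example embeddings (the 'shots')."""
    return {label: centroid(vecs) for label, vecs in support.items()}


def predict(centroids, vec):
    """Assign the label whose centroid is closest to the query embedding."""
    return min(centroids, key=lambda label: sq_dist(centroids[label], vec))


# Synthetic embeddings: three well-separated classes in 8 dimensions (made up).
rng = random.Random(0)
DIM, SHOTS = 8, 10
class_means = {0: [0.0] * DIM, 1: [4.0] * DIM, 2: [-4.0] * DIM}


def sample(label):
    """Draw one noisy embedding around the class mean."""
    return [m + rng.gauss(0, 0.5) for m in class_means[label]]


support = {label: [sample(label) for _ in range(SHOTS)] for label in class_means}
centroids = fit_centroids(support)

queries = [(label, sample(label)) for label in class_means for _ in range(20)]
accuracy = sum(predict(centroids, vec) == label for label, vec in queries) / len(queries)
print(f"10-shot accuracy on synthetic data: {accuracy:.2f}")
```

On real images the classes overlap far more than in this toy setup, so accuracy depends heavily on the quality of the encoder's embeddings, which is the property the ImageNet test measures.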

While MAGE has shown a lot of promise already, the researchers acknowledge that their tokenization process does lead to a loss of information that can negatively impact the performance of the system. They are presently exploring alternative means of image compression in the hopes of overcoming this limitation in the future. The team is also planning to train MAGE on larger datasets moving forward to see how much they can boost accuracy levels.

This work has the potential to lead to many new advancements in the field of computer vision. To help that process along, the researchers have made their source code available on GitHub.
