Magma, Scheduled for Release Next Week, Aims to Deliver a Foundation Model for Multi-Modal AI Agents

Developed by Microsoft Research and others, Magma can propose software UI interactions β€” and even the movement of a robot arm.

A team of researchers from Microsoft Research, the University of Maryland, the University of Wisconsin-Madison, the Korea Advanced Institute of Science and Technology (KAIST), and the University of Washington has announced what they claim to be the world's first foundation machine learning model that can formulate plans and execute actions towards its goal β€” with a view to delivering artificially intelligent agents for robotics and more.

"Magma is the first foundation model that is capable of interpreting and grounding multimodal inputs within its environment," claims co-first author and project lead Jianwei Yang of the team's work. "Given a described goal, Magma is able to formulate plans and execute actions to achieve it. Magma should not be limited to either the digital world (e.g., web navigation) or the physical world (e.g., robotics manipulation), but rather be able to work across both worlds, just like humans ourselves."

Magma, a new foundation model, is claimed to deliver multimodal agentic operation for real-world robots and more. (πŸ“Ή: Jianwei Yang)

The idea behind Magma is to deliver a foundation model that can take current artificial intelligence technology from simply describing how to do something to actually doing it β€” expanding work in vision language models (VLMs) to allow it to plan and act out a course of action in the real world, taking both visual and spatial considerations into account.

The researchers tested the model in three key scenarios. The first was multimodal understanding, or its ability to analyze text and visual inputs β€” delivering, the team claims, improved performance over existing models, including the ability to predict a subject's next actions in an ongoing video. The second was the ability to navigate the user interface of unfamiliar software to carry out a task on behalf of a user, such as booking a hotel stay. The final scenario extended the model's reach into the real world by putting it in direct control of a six degrees of freedom (6DoF) robot arm.

The model's high performance in each test comes down to two key ways of analyzing the world, reflected in its training data: set-of-mark (SoM), which assigns numeric marks within an image to clickable user interface elements, objects, and the robot arm itself; and trace-of-mark (ToM), which traces and predicts the movement of those marks across an ongoing video β€” requiring, the researchers say, fewer tokens than traditional next-frame prediction while providing a longer prediction window.
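The set-of-mark idea can be illustrated with a minimal sketch: detected interactive elements are given numeric marks, and the model is asked to answer with a mark number rather than raw pixel coordinates. Note the element list, labels, and prompt format below are illustrative assumptions, not Magma's actual API or training format.

```python
# Minimal set-of-mark (SoM) sketch: each interactive element detected in a
# screenshot gets a numeric mark; the agent model then only has to name a
# mark, not regress pixel coordinates. Element data and prompt wording are
# hypothetical, for illustration only.

def assign_marks(elements):
    """Map each detected element (label, bounding box) to a numeric mark."""
    return {i + 1: el for i, el in enumerate(elements)}

def build_som_prompt(goal, marks):
    """Build a text prompt listing marked elements for the model to choose from."""
    lines = [f"Goal: {goal}", "Interactive elements (mark: label, box):"]
    for mark, (label, box) in sorted(marks.items()):
        lines.append(f"  {mark}: {label} at {box}")
    lines.append("Answer with the mark number of the element to act on.")
    return "\n".join(lines)

# Hypothetical detector output for a hotel-booking page:
elements = [("Search field", (40, 20, 400, 48)),
            ("Book button", (420, 300, 520, 340))]
marks = assign_marks(elements)
prompt = build_som_prompt("Book a hotel stay", marks)
```

Trace-of-mark works analogously over video: the same numeric marks are tracked frame to frame, so the model predicts short sequences of mark positions instead of whole future frames.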

The model can also interpret and interact with software user interfaces, for on-device agentic operation. (πŸ“Ή: Jianwei Yang)

The team has pledged to release the model inference code, checkpoints, pre-training code, and pre-training data on February 25th, under the permissive MIT license, though the release comes with caveats: "It is important to note that the model is specifically designed for UI navigation in a controlled Web UI and Android simulator, and robotic manipulation tasks, and should not be broadly applied to other tasks," the researchers advise.

"Researchers should make sure that a human is in the loop and in control for every action the agentic system generates. Since the model cannot act by itself, the sub-module a researcher uses to actually perform the UI navigation action should ensure that no unintended consequences can occur as a result of performing the UI action proposed by the model."

More information on Magma, including a link to a preprint of the team's paper on Cornell's arXiv server, is available on the project website; the promised source code is to be published to GitHub, and the models to Hugging Face, next week.

Gareth Halfacree
Freelance journalist, technical author, hacker, tinkerer, erstwhile sysadmin. For hire: freelance@halfacree.co.uk.