Rise of the Machines
Despite advances, robots struggle in the real world due to limitations of their control systems. Gemini Robotics AI models aim to fix this.
Today’s robots have relatively few problems when it comes to agility and dexterity. Advances in actuators, sensors, manipulators, and other hardware have given them the ability to perform many of the physical feats a human can. As far as the hardware is concerned, the old stereotype of robots as big, lumbering, clumsy machines no longer applies. We can see this in countless demonstrations of humanoid robots doing gymnastics or household chores with a range of motion, balance, and fine motor skill that blurs the line between machine and human capabilities.
But when the preprogrammed demonstrations end and the same robots are thrust into the real world, everything changes. Dropped into an unfamiliar environment, they are likely to live up to the old stereotypes once again. Why the sudden change? The problem is no longer the hardware; it is the software, and in particular the control algorithms. The real world is a messy place, and adapting to diverse, unstructured environments is a problem that has not yet been cracked.
If only we had algorithms that exhibited a remarkable understanding of the world. Hey, wait a minute! What about large language models like Llama, GPT-4, and Gemini? Since many of these models now also understand visual information, they may be the perfect tools for a new generation of robot control systems, says Google DeepMind. Fresh off the release of a suite of new Gemma models designed for resource-constrained environments, the company has announced a pair of Gemini 2.0-based models that are specifically tailored to the needs of robots.
The two new AI models, Gemini Robotics and Gemini Robotics-ER, aim to bridge the gap between AI reasoning and real-world robotic control. These models build on Gemini 2.0, Google DeepMind’s latest multimodal AI system, by adding physical action as an output.
Gemini Robotics is a vision-language-action model, meaning it takes in text, images, and video as input and translates them into physical actions for robots. It allows robots to perform an expanded range of real-world tasks, even those they have not been explicitly trained for. Robots using Gemini Robotics can handle new objects, instructions, and environments without requiring extensive retraining. Furthermore, the model understands and responds to natural language commands, adapting on the fly to changes in its surroundings. And finally, Gemini Robotics enables robots to manipulate objects with greater precision, tackling complex tasks like folding origami or packing delicate items.
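The model itself is not something most of us can call today, but the basic shape of a vision-language-action system is easy to picture: camera frames and a natural language instruction go in, low-level robot commands come out, over and over in a closed loop. The sketch below illustrates that loop only; GeminiRoboticsPolicy, get_camera_frame, and send_joint_targets are hypothetical placeholders for illustration, not DeepMind’s actual interface.

```python
# Conceptual sketch of a vision-language-action (VLA) control loop.
# Every name here is a hypothetical stand-in, not DeepMind's real API;
# the point is only to show where such a policy slots into a robot's loop.

import time


class GeminiRoboticsPolicy:
    """Placeholder for a VLA model: image + instruction in, actions out."""

    def predict(self, image, instruction: str) -> list[float]:
        # A real VLA would run inference here and return low-level
        # commands (e.g., end-effector deltas or joint targets).
        raise NotImplementedError


def get_camera_frame():
    """Placeholder: grab the latest RGB frame from the robot's camera."""
    raise NotImplementedError


def send_joint_targets(action: list[float]) -> None:
    """Placeholder: forward the commanded action to the robot controller."""
    raise NotImplementedError


def run_task(policy: GeminiRoboticsPolicy, instruction: str, hz: float = 10.0) -> None:
    """Closed-loop control: re-observe and re-plan at a fixed rate so the
    policy can adapt on the fly if objects move or the scene changes."""
    period = 1.0 / hz
    while True:
        frame = get_camera_frame()                   # perceive
        action = policy.predict(frame, instruction)  # reason
        send_joint_targets(action)                   # act
        time.sleep(period)


# Example of natural-language tasking (again, purely illustrative):
# run_task(GeminiRoboticsPolicy(), "fold the paper into an origami crane")
```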
The second model, Gemini Robotics-ER (Embodied Reasoning), is designed for roboticists who want more control over their systems. This model excels at spatial reasoning, helping robots identify objects in 3D space, estimate distances, and plan complex movements. Unlike previous AI models, Gemini Robotics-ER can generate entire control sequences autonomously, reducing the need for human-coded instructions.
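Since Gemini Robotics-ER is aimed at roboticists who will call it from their own code, a spatial reasoning query might look something like the sketch below, which uses the google-generativeai Python SDK to ask a Gemini model to point at objects in a scene image. The model ID and the response format shown are assumptions made for illustration, not the documented interface of Gemini Robotics-ER.

```python
# Sketch of a spatial reasoning query using the google-generativeai SDK.
# The model ID below is a placeholder, and the requested JSON output format
# is an assumption for illustration only.

import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # replace with a real API key

model = genai.GenerativeModel("gemini-robotics-er")  # placeholder model ID

scene = Image.open("workbench.jpg")  # an RGB image of the robot's workspace

prompt = (
    "Point to the screwdriver and the blue mug in this image. "
    "Return JSON as [{'label': str, 'point': [y, x]}], with coordinates "
    "normalized to the range 0-1000."
)

response = model.generate_content([prompt, scene])
print(response.text)
# Hypothetical output: [{"label": "screwdriver", "point": [412, 655]}, ...]
```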
The models have been tested on a variety of robotic platforms, including ALOHA 2, the widely used Franka arm, and the humanoid Apollo robot by Apptronik. This flexibility suggests that Gemini Robotics could power a broad range of robots, from industrial arms to humanoid assistants.
While most of the hardware required for agile and dexterous robots is already in place, these new Gemini models could provide the missing software intelligence to make robots truly adaptive, interactive, and useful in the real world.