If Someone Told You to Jump off a Cliff, Would You Do It?
The intelligent androids of sci-fi are one step closer with this tool for building robots that respond to natural language in real time.
Science fiction has been a fairly good predictor of technologies to come over the years. Television, bionic limbs, submarines, and even the moon landing were all described in these stories long before they became realities. But one common theme is noticeably lagging behind in the real world. Intelligent robots that can converse and interact with humans in real time, like those that inhabit the worlds imagined by fiction writers, are nowhere to be found. Whether or not you think that is a good thing depends on whether you have been watching movies about the friendly and helpful C-3PO or the cyborg assassin known as The Terminator.
But seriously, what is taking so long? The Terminator was powered by a 6502 microprocessor developed in the 1970s, after all. Surely with today's technology we must be on the brink of a revolution in intelligent robotics. And yet, the most interaction a typical person has with a robot on a daily basis is dodging the little gadget that vacuums their floors. These robots are very useful, to be sure, but having a conversation with one will only get you strange looks from your family. Fiction may be getting closer to reality, however, thanks to recent work by a team at Robotics at Google. They have developed a framework for producing robots that can interact with humans in real time through natural language instructions, and they built a real-world system that implements their methods to demonstrate its capabilities.
Existing systems that seek to help robots understand and respond to natural language tend to rely on curated imitation-learning datasets, or on reinforcement learning with complicated reward functions designed specifically for each task the robot is to learn. Building and training these systems is very labor-intensive, so they offer little promise of scaling up to general-purpose systems that can understand anything a user might say. Moreover, as these systems grow larger to handle more scenarios, their processing requirements grow with them, making real-time interaction highly impractical.
Traditionally, a robot's vocabulary is defined by selecting a set of skills in advance, then collecting a dataset that represents each one. To make a very large skill set practical, the engineers instead used a hindsight language relabeling process: long videos of the robot were recorded, and annotators then watched them, identifying and describing as many behaviors as they could in natural language. In this way, a dataset of 87,000 unique natural language utterances, each linked to a visuo-linguo-motor skill of the robot, was created.
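The relabeling idea above can be sketched in a few lines of Python. This is only an illustration of the data flow, not the team's actual pipeline: the `Trajectory`, `LabeledSegment`, and `hindsight_relabel` names, and the `(start, end, text)` annotation format, are all hypothetical stand-ins for whatever tooling the annotators actually used.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    frames: list   # per-timestep observations (e.g., camera images)
    actions: list  # per-timestep motor commands

@dataclass
class LabeledSegment:
    instruction: str  # annotator's natural-language description
    frames: list
    actions: list

def hindsight_relabel(traj: Trajectory, annotations: list) -> list:
    """Turn after-the-fact annotations over time windows into training pairs.

    `annotations` is a list of (start, end, text) tuples produced by a
    human watching the logged video and describing what they saw.
    """
    return [
        LabeledSegment(
            instruction=text,
            frames=traj.frames[start:end],
            actions=traj.actions[start:end],
        )
        for start, end, text in annotations
    ]

# One long, unscripted log can yield several (instruction, behavior) pairs.
log = Trajectory(frames=list(range(100)), actions=list(range(100)))
pairs = hindsight_relabel(log, [
    (0, 30, "push the blue cube to the left"),
    (30, 75, "separate the red star from the green circle"),
])
```

The key property is that no tasks need to be scripted in advance; whatever behaviors happen to appear in the logs become labeled training data.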
The team implemented their methods using a tabletop robot arm that was provided with objects of various colors and shapes to interact with and manipulate at a user's request. The dataset was then used to train a transformer-based neural network policy designed to translate video and text into a series of continuous actions for the robot to perform. The resulting model achieved an estimated 93.5% accuracy rate in mapping the 87,000 natural language strings to robotic skills in the real world. With the system finalized, it was demonstrated in many scenarios, with a human user giving natural language instructions such as "push the red star to the top center of the board." After an instruction was given, the robot arm would immediately spring into action to carry out the request.
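To make the "video and text in, continuous actions out" idea concrete, here is a minimal NumPy sketch of a language-conditioned policy: a pooled instruction embedding attends over image-patch features, and the resulting context vector is regressed to a continuous 2D action. This is a toy with random weights standing in for learned parameters, not the team's architecture; the class name, dimensions, and single attention step are all assumptions for illustration.

```python
import numpy as np

class LanguageConditionedPolicy:
    """Toy sketch: a language query attends over visual features,
    and the attended context is mapped to a continuous action.
    All weights are random stand-ins for learned parameters."""

    def __init__(self, d=32, vocab=128, patch_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.token_emb = rng.standard_normal((vocab, d)) * 0.1
        self.img_proj = rng.standard_normal((patch_dim, d)) * 0.1
        self.w_q = rng.standard_normal((d, d)) * 0.1  # query from language
        self.w_k = rng.standard_normal((d, d)) * 0.1  # keys from vision
        self.w_v = rng.standard_normal((d, d)) * 0.1  # values from vision
        self.action_head = rng.standard_normal((d, 2)) * 0.1  # -> (dx, dy)

    def __call__(self, token_ids, patch_feats):
        text = self.token_emb[token_ids].mean(axis=0)  # pooled instruction
        patches = patch_feats @ self.img_proj          # (n_patches, d)
        q = text @ self.w_q
        scores = np.exp(patches @ self.w_k @ q)        # attention over patches
        att = scores / scores.sum()
        ctx = att @ (patches @ self.w_v)               # language-weighted context
        return ctx @ self.action_head                  # continuous (dx, dy)

# A fake tokenized instruction and a 16-patch image feature grid.
policy = LanguageConditionedPolicy()
action = policy(np.array([5, 17, 3]), np.ones((16, 64)))
```

Because the policy emits a fresh low-level action on every forward pass, a new instruction can redirect the arm mid-motion, which is what makes the real-time interaction in the demos possible.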
A tabletop robotic arm is not exactly C-3PO, to be sure, but this early work shows a lot of promise for the future development of general-purpose robots that can understand spoken natural language commands from humans. To hasten the arrival of that future, the Robotics at Google engineers have open-sourced their entire dataset, which is an order of magnitude larger than existing datasets.