HiP, HiP, Hooray

MIT's HiP system simplifies robot training using separate models for language, vision, and action, enhancing performance in household tasks.

Nick Bild
1 year ago · Robotics
HiP helps robots to carry out complex tasks (📷: Alex Shipps / MIT CSAIL)

Everyday household chores can seem so simple to us that we hardly give them a thought. But for developers of robots, even very basic activities can quickly become insurmountable problems. Consider making a pot of coffee, for example. For a human, it is as simple as grabbing a filter, measuring out some coffee and water, and pressing a button. A robot, however, must first navigate its environment, locate the coffee filter, and delicately pick it up. It must then locate the coffee maker and work out how to interact with it to insert the filter in the correct location.

Even this labored explanation of the first step does not do the problem justice. Each piece of it could be broken down into a dozen subproblems, each requiring sensory input, motor control, and real-time decision-making. When you think about it, it is a wonder that a robot can do anything at all in an unstructured environment like a typical home.

There is still a long way to go before general-purpose robots can give us a hand with arbitrary tasks around the house, but rapid progress is being made. Some of the most promising solutions leverage a foundation model trained on massive datasets of language, vision, and action data. However, collecting the data needed to train these models has proved exceedingly time-consuming and expensive, because the language, vision, and action data must all be paired to train a monolithic model, and finding existing data that checks all of those boxes is very challenging.

To address this issue, a team at MIT’s Improbable AI Lab put forth a new solution. Rather than attempting to train a monolithic model, they developed a hierarchical approach, called Compositional Foundation Models for Hierarchical Planning (HiP), in which a separate model is used for each data modality. In this way, the language model only needs to be trained on text, the vision model only on images, and so on. This makes it easy to locate existing datasets, or to assemble new ones less expensively, to train each model individually.

Execution of HiP begins with a large language model, which interprets a user request and leverages the vast body of knowledge it was trained on to break the request into sub-tasks. This rough plan is then refined by a large video diffusion model, which incorporates knowledge about the physical environment around the robot to develop an observation trajectory plan. Finally, an egocentric action model determines which specific actions the robot should take, given the request and the state of the environment it is in. Together, these models enable a robot to carry out all of the required sub-tasks in unstructured, real-world environments. A rough sketch of this three-stage flow appears below.
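To make the division of labor concrete, here is a minimal, hypothetical sketch of that three-stage pipeline in Python. The names used here (plan_subtasks, plan_observations, plan_actions, hierarchical_plan) are illustrative stand-ins with stubbed logic for each model, not the actual HiP implementation or its API.

# Hypothetical sketch of HiP-style hierarchical planning.
# Each stage is stubbed out; a real system would call a language model,
# a video diffusion model, and an egocentric action model, respectively.

from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """A single predicted observation (e.g., an image frame) along the plan."""
    description: str


def plan_subtasks(request: str) -> List[str]:
    """Stage 1 (language model, stub): break a user request into sub-tasks."""
    return [f"sub-task {i + 1} for: {request}" for i in range(3)]


def plan_observations(subtask: str) -> List[Observation]:
    """Stage 2 (video diffusion model, stub): imagine the visual trajectory
    the robot should observe while completing the sub-task."""
    return [Observation(f"{subtask} / frame {t}") for t in range(2)]


def plan_actions(obs: Observation) -> List[str]:
    """Stage 3 (egocentric action model, stub): map each predicted
    observation to low-level robot actions."""
    return [f"move toward '{obs.description}'", "grasp or place as needed"]


def hierarchical_plan(request: str) -> List[str]:
    """Compose the stages: request -> sub-tasks -> observations -> actions."""
    actions: List[str] = []
    for subtask in plan_subtasks(request):
        for obs in plan_observations(subtask):
            actions.extend(plan_actions(obs))
    return actions


if __name__ == "__main__":
    for step in hierarchical_plan("stack the blocks red, green, blue"):
        print(step)

The point of the structure, rather than the stubbed details, is that each stage can be trained and swapped independently, which is what lets HiP sidestep the need for paired language-vision-action data.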

The researchers conducted a number of experiments to assess the performance of their system. In one trial, a robot was asked to stack colored blocks in a particular order. Not all of the needed colors were available, so the robot had to dunk white blocks in paint to fill the gaps. HiP proved up to the task, accurately stacking the blocks and adding color where needed. Across all of the experiments, the team found that HiP regularly outperformed existing monolithic model-based systems.

This approach is fascinating for its ability to complete complex tasks without requiring massive data collection efforts. Moreover, the separation of models makes the reasoning process more transparent and explainable. Looking ahead, the team hopes to incorporate other modalities, like touch and sound, into the process to develop even more capable robots.

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.