Mixing It Up with PoCo
MIT's PoCo combines diverse data sources using diffusion models to create multifunctional robots that outperform today's best systems.
A great many successes have been achieved in machine learning lately by going big in terms of the training datasets used. Perhaps the best known examples of this trend are in the area of large language models (LLMs), where the algorithms are sometimes trained on the text of essentially the entire internet. This massive scale has enabled these models to generate human-like text, understand context with remarkable accuracy, and perform a variety of language-related tasks such as translation, summarization, and sentiment analysis. The breadth and depth of their training data allow them to capture nuances and subtleties of human language that smaller models might miss.
These advances have caught the attention of researchers working in the world of robotics. In robotics, it is still generally the case that algorithms are trained to learn a narrow set of tasks from relatively small datasets. But the dream of roboticists has long been to build a general-purpose robot that can do virtually anything that is asked of it. Since that sounds an awful lot like LLMs and other foundation models that can perform a wide range of tasks, these researchers are increasingly exploring how to leverage these cutting-edge models in robotics.
Unfortunately, training a robot is a more complicated matter than training many other systems. Because robots must learn to interact with the physical world in complex ways, they require many types of data to acquire that knowledge. The sources of that data may include videos, teleoperation demonstrations, and measurements from a variety of sensors. Furthermore, the data may come from simulations or from real-world experiments.
All of this data cannot simply be tossed into a blender, so to speak, to produce a multipurpose robot control system. But a team of researchers at MIT’s CSAIL is working to make something very much like that possible. Their approach, called Policy Composition (PoCo), leverages diffusion models, which are well known for their capabilities in generative AI, to combine numerous sources of disparate data types for training robots. The team demonstrated that PoCo yields substantial improvements in task performance compared to existing techniques.
The team’s approach begins by training a separate diffusion model on each type of available data. During training, each model learns to gradually remove noise from a candidate trajectory, conditioned on its inputs, until a clean action trajectory emerges. These separate diffusion models are then joined into a single, unified model via a weighted combination. This is accomplished through an iterative process that ensures the performance of each individual model is preserved in the unified one.
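To make the idea concrete, here is a minimal, hypothetical sketch of that composition step. The toy denoisers below simply pull a one-dimensional "trajectory" toward a fixed value; PoCo's actual models are learned neural denoisers over robot trajectories, and the function and parameter names here are illustrative, not from the paper. The key point the sketch shows is the weighted combination of per-domain noise predictions at every denoising step.

```python
import random

def make_denoiser(target):
    """Toy denoiser: predicted noise points away from `target`.

    Stands in for a diffusion model trained on one data type
    (e.g., simulation data or human demonstrations).
    """
    def denoise(x, noise_level):
        return (x - target) * noise_level
    return denoise

def compose_policies(denoisers, weights):
    """Combine per-domain denoisers via a weighted sum of their
    noise predictions -- the 'policy composition' idea."""
    total = sum(weights)
    norm = [w / total for w in weights]
    def composed(x, noise_level):
        return sum(w * d(x, noise_level) for w, d in zip(norm, denoisers))
    return composed

def sample(denoiser, steps=100, seed=0):
    """Start from pure noise and iteratively subtract the predicted
    noise, as in diffusion-model sampling."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)
    for t in range(steps, 0, -1):
        x -= denoiser(x, t / steps) / steps
    return x
```

With two toy denoisers pulling toward 1.0 and 3.0 and equal weights, sampling from the composed policy drifts toward a compromise between the two, illustrating how the unified model blends behavior learned from each data source.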
The researchers found that combining models trained on different data types produced a system that generalized better and outperformed each individual model in isolation. In a series of experiments involving tool use by a robotic arm, PoCo achieved 20 percent better performance than baseline methods. The researchers also noted that PoCo's architecture allows for a mix-and-match approach, in which specific types of models can be combined to meet the unique demands of challenging tasks.
Looking ahead, the team wants to train models using even larger datasets. They believe that doing so might enable robots to carry out long-horizon tasks, perhaps reasoning out how to use a sequence of tools to achieve a single complex goal.