A Lesson Learned

MIT's HPT architecture helps robots pick up new tricks by letting them learn from a diverse array of training data types.

Nick Bild
7 months ago · Machine Learning & AI
HPT teaches robots new tricks by using diverse data sources (📷: L. Wang et al.)

In theory, a properly equipped robot, paired with an appropriate learning algorithm, can do just about anything a human can do. But in practice, all sorts of challenges pop up that have thus far stymied our best efforts to build general-purpose robots capable of everything from cooking and cleaning to folding our laundry. The biggest challenge of all may not be what most people would expect. It has less to do with advances in robotics, sensing technologies, or even cutting-edge machine learning algorithms than it does with mundane tasks like data collection.

Yes, boring old data collection, of all things. Machine learning algorithms need data to learn from. And when the tasks to complete are complex and involve dynamic environments, they need mountains of it. This is practical enough when a robot only needs to do a few things, but the problem quickly gets out of hand when one starts talking about a general-purpose robot that can do anything that is asked of it. Collecting and annotating a dataset large enough to crack this problem is just not realistic.

There is no apparent path to solving this problem now or in the foreseeable future, so a different approach is clearly needed. And that is exactly what a team of researchers at MIT CSAIL and Meta has recently proposed. They have developed a new model architecture called Heterogeneous Pretrained Transformers (HPT) that can learn from many different types of data to understand what is required to complete a task. The hope is that, by not being picky about exactly what kind of data it gets, HPT can leverage the large amounts of data that have already been collected, learning lessons that data was never originally intended to teach and sidestepping the impracticality of collecting impossibly large purpose-built datasets.

The HPT architecture builds on the transformer, the same deep learning architecture that underpins large language models (LLMs) like GPT-4. The researchers adapted the transformer to process diverse robotic inputs, such as vision and proprioceptive data, by converting them into a standardized format called tokens. These tokens let HPT align data from many different sources into a single, shared language that the model can understand and build upon. The approach is also scalable: HPT's performance improves as it trains on increasing amounts of data.
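To make the tokenization idea concrete, the sketch below shows, in PyTorch, how modality-specific "stems" might map images and joint states into a shared token space that a single transformer trunk then processes. This is a minimal illustration of the concept only; the module names, dimensions, and layer counts are assumptions, not the actual HPT implementation.

```python
# Minimal sketch of the HPT idea: per-modality "stems" map each input
# type into a shared token space, and one transformer trunk processes
# the combined sequence. All names and sizes here are illustrative.
import torch
import torch.nn as nn

D_MODEL = 256  # shared token width (assumed)

class VisionStem(nn.Module):
    """Turns an image into a sequence of tokens via patch embedding."""
    def __init__(self, patch=16, d_model=D_MODEL):
        super().__init__()
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

    def forward(self, img):                  # img: (B, 3, H, W)
        x = self.proj(img)                   # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, D)

class ProprioStem(nn.Module):
    """Projects a joint-state vector into a single token."""
    def __init__(self, state_dim=7, d_model=D_MODEL):
        super().__init__()
        self.proj = nn.Linear(state_dim, d_model)

    def forward(self, state):                 # state: (B, state_dim)
        return self.proj(state).unsqueeze(1)  # (B, 1, D)

class SharedTrunk(nn.Module):
    """One transformer shared across modalities and embodiments."""
    def __init__(self, d_model=D_MODEL, n_layers=4, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens):               # tokens: (B, T, D)
        return self.encoder(tokens)

# Tokens from different sources are concatenated into one sequence.
vision, proprio, trunk = VisionStem(), ProprioStem(), SharedTrunk()
img = torch.randn(2, 3, 224, 224)            # camera frames
state = torch.randn(2, 7)                    # joint angles
tokens = torch.cat([vision(img), proprio(state)], dim=1)
features = trunk(tokens)                     # shared representation
```

The appeal of this layout is that adding a new sensor or robot means writing only a new stem; the trunk, and everything it has learned, stays the same.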

Once pretrained on a large dataset, HPT requires only a small amount of robot-specific data to learn new tasks, making it significantly more efficient than training from scratch. Testing has shown that HPT improves robotic task performance by more than 20 percent compared with training from scratch, even on tasks not represented in the pretraining data. As such, this method allows for rapid adaptation across different robots and tasks, with the potential to scale robot learning in much the same way that LLMs transformed language understanding. Future work aims to enhance HPT's capacity to handle even more diverse data, and potentially to enable robots to perform new tasks without any additional training.
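For a sense of what that adaptation step could look like in practice, here is a hedged sketch in PyTorch: a stand-in for the pretrained shared trunk is frozen, and only a small task-specific action head is trained on robot data. The ActionHead module, the 7-DoF action size, and every name below are illustrative assumptions, not details from the HPT paper.

```python
# Hedged sketch of per-robot adaptation: freeze a (stand-in) pretrained
# trunk and train only a small action head on robot-specific data.
import torch
import torch.nn as nn

D_MODEL = 256  # must match the trunk's token width

# Stand-in for the shared trunk pretrained on heterogeneous data.
layer = nn.TransformerEncoderLayer(D_MODEL, 8, batch_first=True)
trunk = nn.TransformerEncoder(layer, num_layers=4)
for p in trunk.parameters():
    p.requires_grad = False  # keep the pretrained weights frozen

class ActionHead(nn.Module):
    """Maps pooled trunk features to robot actions (e.g. 7-DoF commands)."""
    def __init__(self, d_model=D_MODEL, action_dim=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, action_dim)
        )

    def forward(self, features):               # features: (B, T, D)
        return self.mlp(features.mean(dim=1))  # pool tokens, predict action

head = ActionHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

# One toy gradient step on placeholder robot-specific data.
tokens = torch.randn(2, 197, D_MODEL)  # token sequence from the stems
expert_actions = torch.randn(2, 7)     # placeholder demonstrations
loss = nn.functional.mse_loss(head(trunk(tokens)), expert_actions)
loss.backward()
optimizer.step()
```

The point of the design is that the expensive shared trunk is trained once on heterogeneous data, while adapting to a new robot or task touches only a handful of parameters.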
