RVT-2 Is a Fast Learner

NVIDIA's RVT-2 teaches robots complex tasks with millimeter precision from just a few examples, and it also runs faster than other systems.

RVT-2 learns complex tasks from a few examples (📷: A. Goyal et al.)

We have big dreams for the robots of the future. We want them to be able to do everything from cooking and cleaning to driving us to work. But while many steps have been taken in the right direction in recent years, we are still a long way from this ultimate goal. And unless new techniques are developed, it is going to stay that way for some time to come.

Much of the difficulty stems from the fact that the sort of tasks we want our robots to do are very complex. Consider cooking a meal, for example. This requires any number of delicate and precise actions, from selecting the right ingredients to chopping vegetables, monitoring cooking times, and adjusting heat levels. Each of these tasks involves a high degree of sensory perception and fine motor control, which are areas where robots still struggle. Moreover, cooking — good cooking, anyway — also requires a level of creativity and problem-solving ability that robots currently lack.

The system architecture (📷: A. Goyal et al.)

To carry out complex tasks like these, especially across the wide range of environments found in the real world, today’s artificial intelligence algorithms need a very large number of examples to learn from. And since we want our robots to do not one thing but many things, the number of required examples quickly becomes unmanageable. Until we rethink our strategy, general-purpose robots are likely to remain out of reach.

A team of NVIDIA engineers is working to change this paradigm, and their efforts have resulted in a multitask 3D manipulation model called RVT-2. In many cases, this model can learn a task from just a few demonstrations, and it both trains and runs inference much faster than previous techniques, further enhancing its practicality for real-world applications.

Several key innovations made this possible. First, RVT-2 features a multi-stage inference pipeline that lets the robot first locate a region of interest and then zoom in on it, enabling more precise end-effector movements. Second, to reduce memory usage and speed up training, RVT-2 employs a convex upsampling technique. Finally, it improves the accuracy of end-effector rotation predictions by using location-conditioned features, which provide detailed, context-specific information rather than relying on global scene data.
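To make the second of those ideas concrete, here is a minimal sketch of convex upsampling in PyTorch, in the style popularized by optical-flow networks like RAFT. It illustrates the general technique rather than RVT-2's actual code; in a real network, the `mask` tensor would be predicted by a small convolutional head rather than sampled randomly.

```python
import torch
import torch.nn.functional as F

def convex_upsample(x, mask, factor=8):
    """Upsample (N, C, H, W) -> (N, C, H*factor, W*factor) using a learned
    convex combination of each coarse pixel's 3x3 neighborhood."""
    N, C, H, W = x.shape
    # mask: (N, 9*factor*factor, H, W), predicted by the network.
    mask = mask.view(N, 1, 9, factor, factor, H, W)
    mask = torch.softmax(mask, dim=2)  # convex weights over the 9 neighbors

    # Gather each coarse pixel's 3x3 neighborhood.
    nbrs = F.unfold(x, kernel_size=3, padding=1)   # (N, C*9, H*W)
    nbrs = nbrs.view(N, C, 9, 1, 1, H, W)

    # Weighted (convex) combination, then rearrange into the fine grid.
    up = torch.sum(mask * nbrs, dim=2)             # (N, C, factor, factor, H, W)
    up = up.permute(0, 1, 4, 2, 5, 3)              # (N, C, H, factor, W, factor)
    return up.reshape(N, C, H * factor, W * factor)

# Example: upsample a 28x28 heatmap to 224x224.
heatmap = torch.randn(1, 1, 28, 28)
mask = torch.randn(1, 9 * 8 * 8, 28, 28)  # stand-in for a conv head's output
print(convex_upsample(heatmap, mask).shape)  # torch.Size([1, 1, 224, 224])
```

Because the weights are normalized with a softmax, every fine-grid value is a convex combination of its 3x3 coarse neighborhood, which is cheaper and more memory-friendly than producing the full-resolution output directly.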

Stacking blocks

RVT-2 also benefits from a custom virtual image renderer, which replaces the generic renderer used in previous work. This specialized tool enhances both training and inference speeds while reducing memory consumption. The system also leverages cutting-edge practices in training transformer models, including the use of fast optimizers and mixed-precision training, to further improve its learning efficiency and performance.
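As a rough illustration of what that training recipe looks like, below is a generic mixed-precision loop using PyTorch's automatic mixed precision (AMP). The tiny network, synthetic data, and AdamW optimizer are stand-ins chosen for illustration, not RVT-2's actual configuration.

```python
import torch
import torch.nn as nn

# A tiny stand-in network; RVT-2's actual transformer is far larger.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 3)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # stand-in optimizer
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 underflow

for step in range(100):
    x = torch.randn(64, 512, device="cuda")        # synthetic batch
    target = torch.randn(64, 3, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():                # forward pass in mixed precision
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()                  # backward on the scaled loss
    scaler.step(optimizer)                         # unscales grads, then steps
    scaler.update()                                # adjusts the scale factor
```

Running the forward pass in half precision roughly halves activation memory and speeds up matrix multiplies on modern GPUs, while the gradient scaler keeps small fp16 gradients from underflowing to zero.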

These architectural and system-level improvements enable RVT-2 to handle tasks requiring millimeter-level precision, such as inserting a peg into a hole or plugging into a socket, with only a few demonstrations and a single third-person camera. As a result, RVT-2 sets new benchmarks in 3D manipulation, demonstrating significant advancements in training speed, inference speed, and task success rate. For those who want to dig deeper into the technical details, the source code is available on GitHub.
