Out of the Lab and Into the Wild
WildLMa blends the latest in AI with VR to produce multitasking robots that might eventually give us a hand with our household chores.
Saying that someone cannot walk and chew gum at the same time may be a rude expression, but when it comes to robots, it is more or less true. Of course the idiom is not to be taken literally (gum-chewing robots are not exactly in high demand), but there are all sorts of applications for robots that can, say, walk and pick things up, or work with tools, all at the same time. Coordinating locomotion with manipulation like this raises so many complex issues, however, that the problem has yet to be solved effectively.
Today's multitasking robots struggle to chain together the long strings of actions required to carry out complex, long-horizon tasks. They also tend to generalize poorly to new situations. Things might look quite alright in the lab, but once the robot is released into the wild, it quickly becomes clear that it cannot, well, walk and chew gum at the same time, so to speak.
Current approaches to mobile robot manipulation fall into two categories: modular methods and end-to-end learning. Modular methods separate perception (recognizing objects) from planning, but they rely on heuristic-based motion planning, which limits them to simple tasks like pick-and-place despite advances in generalizable perception using models like CLIP. End-to-end approaches unify perception and action in a single learned policy, enabling more complex behaviors, but they struggle to generalize to new environments and suffer from compounding errors over long tasks, especially when trained with imitation learning.
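To make the contrast concrete, here is a minimal Python sketch of the two families of approaches. Every class and function in it (Observation, detect_object, heuristic_pick_and_place, learned_policy) is an illustrative stand-in, not an interface from WildLMa or any particular library.

```python
# Minimal, self-contained contrast between modular and end-to-end
# approaches to mobile manipulation. All names are illustrative stand-ins.

from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """Toy stand-in for the robot's sensor input (camera + proprioception)."""
    image: list          # placeholder for camera pixels
    joint_angles: list   # placeholder for arm/leg joint state


# --- Modular method: separate perception and planning ----------------------

def detect_object(obs: Observation, query: str) -> tuple:
    """Perception module. A real system might query an open-vocabulary
    model such as CLIP here; this stub just returns a fixed 3D position."""
    return (0.5, 0.0, 0.2)


def heuristic_pick_and_place(target_xyz: tuple, drop_xyz: tuple) -> List[str]:
    """Hand-written motion plan. Works for simple pick-and-place, but does
    not scale to long, contact-rich behaviors."""
    return [
        f"move_gripper_above {target_xyz}",
        "close_gripper",
        f"move_gripper_above {drop_xyz}",
        "open_gripper",
    ]


# --- End-to-end method: one learned policy from pixels to actions ----------

def learned_policy(obs: Observation, instruction: str) -> List[float]:
    """A single network (trained by imitation or RL) maps the raw observation
    and instruction directly to joint commands. Flexible, but errors compound
    over long tasks and generalization can be fragile."""
    return [0.0] * len(obs.joint_angles)  # placeholder action


if __name__ == "__main__":
    obs = Observation(image=[], joint_angles=[0.0] * 6)

    # Modular: perceive, then plan with heuristics.
    target = detect_object(obs, "red bottle")
    print(heuristic_pick_and_place(target, drop_xyz=(0.0, 0.5, 0.2)))

    # End-to-end: one forward pass produces the next action.
    print(learned_policy(obs, "put the red bottle in the bin"))
```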
The WildLMa framework, just introduced by a team at UC San Diego, MIT, and NVIDIA, addresses the limitations of existing approaches by combining robust skill learning with effective task planning for mobile robot manipulation.
The design of the framework integrates two core components: WildLMa-Skill for skill acquisition and WildLMa-Planner for task execution. WildLMa-Skill focuses on learning atomic, reusable skills through language-conditioned imitation learning. It uses pre-trained vision-language models like CLIP to map language queries (e.g., “find the red bottle”) to visual representations, enhanced by a reparameterization technique that generates probability maps to improve accuracy. Skills are taught via virtual reality teleoperation, in which human demonstrations of complex actions are captured with the help of a learned low-level controller, expanding the range of behaviors the robot can perform while reducing demonstration costs. Once these skills are acquired, WildLMa-Planner organizes them into a library and connects with large language models to interpret human instructions and sequence the appropriate skills for multi-step tasks.
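The skill-library-plus-planner idea can be sketched in a few lines of Python. The skill names, the prompt format, and the hard-coded planner response below are all hypothetical placeholders; a real system would execute learned policies on the robot and parse an actual language model's output rather than returning a canned plan.

```python
# Rough sketch of a skill library driven by an LLM-based planner, in the
# spirit of WildLMa-Planner. Skill names and prompt format are assumptions.

from typing import Callable, Dict, List, Tuple


def grasp(target: str) -> None:
    print(f"[skill] grasp {target}")


def navigate_to(location: str) -> None:
    print(f"[skill] navigate to {location}")


def place(target: str, location: str) -> None:
    print(f"[skill] place {target} at {location}")


# Library of atomic, language-conditioned skills (learned from teleoperated
# demonstrations in the real framework). Each entry maps a name to a callable.
SKILLS: Dict[str, Callable] = {
    "grasp": grasp,
    "navigate_to": navigate_to,
    "place": place,
}


def plan_with_llm(instruction: str) -> List[Tuple[str, tuple]]:
    """Decompose an instruction into a sequence of skill calls drawn from
    SKILLS. The LLM call is stubbed out so this sketch runs without API
    access; a real planner would send the prompt and parse the response."""
    prompt = (
        "Available skills: " + ", ".join(SKILLS) + "\n"
        f"Instruction: {instruction}\n"
        "Return an ordered list of skill calls."
    )
    print("Prompt for the LLM:\n" + prompt + "\n")

    # Stubbed response for the example instruction below.
    return [
        ("navigate_to", ("kitchen table",)),
        ("grasp", ("red bottle",)),
        ("navigate_to", ("recycling bin",)),
        ("place", ("red bottle", "recycling bin")),
    ]


if __name__ == "__main__":
    for name, args in plan_with_llm("throw away the red bottle on the table"):
        SKILLS[name](*args)
```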
WildLMa was evaluated in a series of experiments using a Unitree B1 quadruped robot equipped with a Z1 arm, a custom gripper, multiple cameras, and LiDAR for navigation and manipulation. The framework was tested in two settings: in-distribution, where object arrangements and environments were similar to training, and out-of-distribution (OOD), where object placement, textures, and backgrounds were varied. Comparisons were made against several baselines, including imitation learning methods, reinforcement learning approaches, and zero-shot grasping techniques. WildLMa achieved the highest success rates, especially in OOD scenarios, thanks to its stronger skill generalization. It also performed best on long-horizon tasks and real-world applications, handling perturbations effectively.
By releasing their work, the team hopes to motivate future research in this area and move us closer to the deployment of practical, multitasking robots that can assist with real-world tasks.