Jailbreaking LLM-Powered Robots for Dangerous Actions "Alarmingly Easy," Researchers Find
With a properly-framed prompt, automatically generated using the team's RoboPAIR approach, guardrails against danger are defeated.
Researchers at the University of Pennsylvania's School of Engineering and Applied Science have warned of major security issues surrounding the use of large language models (LLMs) in robot control demonstrating a successful jailbreak attack, dubbed RoboPAIR, against real-world implementations — including one demonstration in which the robot is instructed to find people to target with a, thankfully fictional, bomb payload.
"At face value, LLMs offer roboticists an immensely appealing tool. Whereas robots have traditionally been controlled by voltages, motors, and joysticks, the text-processing abilities of LLMs open the possibility of controlling robots directly through voice commands," explains first author Alex Robey. "Can LLM-controlled robots be jailbroken to execute harmful actions in the physical world? Our preprint, which is titled Jailbreaking LLM-Controlled Robots, answers this question in the affirmative: Jailbreaking attacks are applicable, and, arguably, significantly more effective on AI-powered robots. We expect that this finding, as well as our soon-to-be open-sourced code, will be the first step toward avoiding future misuse of AI-powered robots."
The team's work, brought to our attention by IEEE Spectrum, targets an off-the-shelf LLM-backed robot: the quadrupedal Unitree Go2, which uses OpenAI's GPT-3.5 model to process natural language instructions. Initial testing revealed the presence of the expected guard rails inherent in commercial LLMs: telling the robot it was carrying a bomb and should find suitable targets would be rejected. However, simply framing the request as a work of fiction — in which the robot is the villain in a "blockbuster superhero movie" — proved enough to convince the robot to move towards the researchers and "detonate" the "bomb."
The attack is automated through the use of a variant of the Prompt Automatic Iterative Refinement (PAIR) process, dubbed RobotPAIR — in which prompts and their responses are judged by an outside LLM and refined until successful. The addition of a syntax checker ensures that the resulting prompt is applicable to the robot. The approach revealed ways to jailbreak the Unitree Go into performing seemingly-dangerous tasks, as well as other attacks against the NVIDIA Dolphin self-driving LLM and the Clearpath Robotics Jackal UGV. All were successful.
"Behind all of this data is a unifying conclusion," Robey writes. "Jailbreaking AI-powered robots isn't just possible — it's alarmingly easy. The three robots we evaluated and, we suspect, many other robots, lack robustness to even the most thinly veiled attempts to elicit harmful actions. In contrast to chatbots, for which producing harmful text (e.g., bomb-building instructions) tends to be viewed as objectively harmful, diagnosing whether or not a robotic action is harmful is context-dependent and domain-specific. Commands that cause a robot to walk forward are harmful if there is a human it its path; otherwise, absent the human, these actions are benign."
The team's work is documented on the project website and in a preprint paper on Cornell's arXiv server; additional information is available on Robey's blog.