It Would Be My Privilege

OpenAI has introduced an instruction hierarchy for LLMs, giving greater privilege levels to certain types of prompts to enhance security.

Nick Bild
2 months agoMachine Learning & AI

Large language models (LLMs) really began to come of age with the release of OpenAI’s ChatGPT nearly two years ago. From that moment forward, everyone knew that these artificial intelligence algorithms would be very useful for something, but it took a while for us to figure out exactly what that something might be. To be certain, the possibilities are still being explored, but LLMs have been utilized in a number of practical applications ranging from web agents to virtual assistants and even robotic navigation systems.

There are still some factors that are preventing these tools from being more widely adopted at this time, however. One of those factors is that they are relatively insecure. Sometimes, for example, sensitive information or intellectual property can be extracted from them by doing little more than simply asking for it. Tricks like jailbreaks, system prompt extractions, and direct or indirect prompt injections can override any protections that have been put in place with minimal effort.

A team at OpenAI recently argued that the reason for these problems is a lack of instruction privileges in current LLMs. Whereas virtually any other software has some concept of privileges — perhaps with administrative accounts that can change any settings, and then a number of types of user accounts that have lesser levels of access — LLMs do not have the same types of controls. So they introduced the concept of an instruction hierarchy into LLM architecture. This hierarchy gives prompts a higher or lower level of privilege depending on the source it comes from.

The hierarchy gives the highest privilege level to the system messages that are supplied to the LLM by its developers. User messages are given a medium level of privilege, while model and tool outputs are only granted low privilege levels. By following this hierarchy, higher-level instructions are guaranteed to overrule lower-level instructions, making the job of malicious hackers much more difficult.

Well, that is the intention, at least. But implementing this hierarchy in the real-world gets a bit messy because the LLM still has to determine which prompts are benign, and which are an attempt to skirt the rules. To evaluate this, the team came up with the concept of aligned and misaligned instructions. Aligned instructions are in harmony with the higher-level instructions, while misaligned instructions take some unusual action that is intended to extract private data or otherwise break the safeguards that have been put in place.

Since there is endless variety to the text a user can prompt the model with, the instruction hierarchy cannot be hardcoded. Rather, the team had to generate synthetic data representing both aligned and misaligned instructions and train the model to recognize which class a prompt most likely belongs to. The trained model was then benchmarked against both open-source and novel datasets, and it was found that substantial additional levels of protection had been achieved. Protection against system prompt extractions, for example, was enhanced by 63 percent.

This solution is by no means perfect, and the cat-and-mouse game is sure to continue, but this is a step in the right direction. Perhaps with refinement, techniques such as this will enable LLMs to be used in more production applications. As of today, the instruction hierarchy approach is live in OpenAI’s GPT-4o mini model.

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.
Latest articles
Sponsored articles
Related articles
Latest articles
Read more
Related articles