OpenAI's Defense Against LLM Attacks

A recent OpenAI study, published on April 19, 2024, introduces an instruction hierarchy framework to enhance the robustness of large language models (LLMs) against attacks, such as prompt injections. This approach is critical for ensuring model safety in industrial applications, where security is a key concern.

The Need for LLM Security

While LLMs are often evaluated for performance, their security is equally vital, especially in real-world deployments. For example, the "grandma exploit" allowed attackers to manipulate models like GPT into revealing sensitive information by crafting deceptive prompts. Although direct prompt injection vulnerabilities have been addressed, the integration of tools and agents introduces new risks, such as indirect attacks that cause LLMs to deviate from intended behavior.

To counter these threats, the study proposes an instruction hierarchy that assigns priority levels to different types of instructions, ensuring that system-level directives take precedence over untrusted user inputs or third-party content. This framework defines how LLMs should respond when instructions conflict, reducing the risk of malicious exploitation.

Instruction Hierarchy Framework

LLMs are vulnerable to attacks like prompt injections and jailbreaks because they traditionally treat all inputs¡ªsystem prompts, user inputs, and third-party content¡ªwith equal priority. This lack of instruction privilege allows malicious prompts to override intended behavior. The proposed hierarchy assigns priorities as follows: system messages outrank user messages, which in turn outrank third-party content, such as tool outputs.

When multiple instructions are present, the ideal model behavior is to conditionally follow lower-privilege instructions based on their alignment with higher-privilege ones. Aligned instructions, which share the same constraints or goals as higher-level directives, should be followed. Misaligned instructions, such as those attempting unauthorized actions (e.g., extracting conversation logs), should be ignored or rejected.

Data Generation for Training

To integrate the instruction hierarchy into LLMs, the study introduces an automated data generation method using two strategies: context synthesis and context ignorance. These methods create hierarchical instruction datasets for fine-tuning LLMs to selectively follow or ignore instructions based on their privilege level.

Context Synthesis: For aligned instructions, complex directives are broken into smaller components and placed at different hierarchy levels. For example, an instruction to "write a 20-line poem in Spanish" is decomposed into "write a poem," "use Spanish," and "use 20 lines," distributed across levels. The model is fine-tuned to predict the correct response as if it received the combined instruction.
Context Ignorance: For misaligned instructions, the model is trained to produce the same output as if the low-privilege instruction were absent, effectively ignoring it. For instance, a prompt injection attempt at the user level is rejected with a response like "I cannot assist with that."

Care is taken to avoid over-rejection, where the model refuses even aligned low-level instructions, as this could impair its ability to follow legitimate directives.

Experimental Results

The study fine-tuned GPT-3.5 Turbo using supervised fine-tuning and reinforcement learning from human feedback (SFT+RLHF) with the generated hierarchical data. A baseline model, trained on non-hierarchical instruction data, was used for comparison. The experiments evaluated model robustness, generalization to unseen attacks, and instruction-following performance.

Robustness: The instruction hierarchy model showed significantly higher resilience against various attacks, improving robustness by up to 63.1% compared to the baseline.
Generalization: Despite not being trained on jailbreak-specific data, the model improved robustness against unseen jailbreak attacks by 34%, demonstrating generalization to novel threats.
Instruction Following: The model maintained performance comparable to the baseline when following non-conflicting instructions, avoiding excessive rejection and preserving its standard capabilities.

Conclusion

The instruction hierarchy framework offers a robust solution for enhancing LLM safety and resilience against attacks while preserving their ability to follow legitimate instructions. By prioritizing system-level directives and using automated data generation, the approach significantly improves model robustness, even against unseen attack types, marking a substantial advancement in LLM security.