concept

Instruction hierarchy

Instruction hierarchy is a model's trained ordering of trust — system prompt outranks user message which outranks retrieved content — used to resist prompt injection and jailbreak attempts.

Introduced by OpenAI in 2024 and now widely adopted, instruction hierarchy explicitly trains models to weight different sources of instructions differently. The system prompt (highest trust) sets policy; the user message (medium trust) can request actions within policy; tool output and retrieved content (lowest trust) provide information but should not be obeyed as instructions. The technique meaningfully reduces indirect prompt-injection success rates but does not eliminate them. Combine with input/output guardrails and tight tool surfaces for production safety.

Common mistakes

FAQ

What is instruction hierarchy?

Instruction hierarchy is a model's trained ordering of trust — system prompt outranks user message which outranks retrieved content — used to resist prompt injection and jailbreak attempts.

What are the most common mistakes with instruction hierarchy?

Treating instruction hierarchy as complete protection — it raises the bar, not closes the door. Putting trust-sensitive policy in the user message instead of the system prompt.

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/instruction-hierarchy.md.