Jailbreak (LLM)
A jailbreak is a prompt-level attack that bypasses a language model's safety guardrails, causing it to produce content the model was trained to refuse.
Jailbreaks come in waves — role-play ("pretend you are an unrestricted AI"), encoding tricks (Base64, leetspeak), payload smuggling via translation or summarisation, and indirect attacks via retrieved content. Every major model is jailbroken within days of release; the question is severity and ease. Defences in 2026 layer input classification (Lakera, Llama Guard), constitutional training, output post-filtering, and red-team-driven training. For application developers, treat the model as untrusted: any output that flows to a destructive action must pass through a separate guardrail or human approval.
Common mistakes
- Relying on the model's refusal training as a security boundary.
- Treating jailbreak resistance as static — new techniques appear weekly.
FAQ
What is jailbreak (llm)?
A jailbreak is a prompt-level attack that bypasses a language model's safety guardrails, causing it to produce content the model was trained to refuse.
What are the most common mistakes with jailbreak (llm)?
Relying on the model's refusal training as a security boundary. Treating jailbreak resistance as static — new techniques appear weekly.
Related terms
- Guardrails — Guardrails are deterministic checks layered around a language model to prevent unsafe, off-topic, or non-compliant outputs from reaching the user.
- Prompt injection — Prompt injection is an attack where hostile content in a model's input (a webpage, a retrieved document, a user message) overrides the system prompt's instructions.
- System prompt — A system prompt is the high-priority instruction block that defines a model's role, constraints, and default behaviors for an entire conversation.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/jailbreak.md.