failure

Jailbreak (LLM)

A jailbreak is a prompt-level attack that bypasses a language model's safety guardrails, causing it to produce content the model was trained to refuse.

Jailbreaks come in waves — role-play ("pretend you are an unrestricted AI"), encoding tricks (Base64, leetspeak), payload smuggling via translation or summarisation, and indirect attacks via retrieved content. Every major model is jailbroken within days of release; the question is severity and ease. Defences in 2026 layer input classification (Lakera, Llama Guard), constitutional training, output post-filtering, and red-team-driven training. For application developers, treat the model as untrusted: any output that flows to a destructive action must pass through a separate guardrail or human approval.

Common mistakes

FAQ

What is jailbreak (llm)?

A jailbreak is a prompt-level attack that bypasses a language model's safety guardrails, causing it to produce content the model was trained to refuse.

What are the most common mistakes with jailbreak (llm)?

Relying on the model's refusal training as a security boundary. Treating jailbreak resistance as static — new techniques appear weekly.

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/jailbreak.md.