# Jailbreak (LLM)

**Source:** https://promtable.com/glossary/jailbreak

> A jailbreak is a prompt-level attack that bypasses a language model's safety guardrails, causing it to produce content the model was trained to refuse.

---
Jailbreak techniques arrive in waves: role-play ("pretend you are an unrestricted AI"), encoding tricks (Base64, leetspeak), payload smuggling via translation or summarisation, and indirect attacks via retrieved content. In practice, major models tend to be jailbroken within days of release; what varies is how severe the bypass is and how easy it is to reproduce. Defences in 2026 are layered: input classification (Lakera, Llama Guard), constitutional training, output post-filtering, and red-team-driven training. For application developers, treat the model as untrusted: any output that flows to a destructive action must pass through a separate guardrail or human approval.
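Encoding tricks work partly because an input classifier may only see the obfuscated text. One possible mitigation, sketched below, is to generate several decoded "views" of the input and classify each of them. Everything here is an illustrative assumption: `LEET_MAP`, `candidate_views`, and `toy_classifier` are hypothetical names, and the toy classifier merely stands in for a real input classifier; it is not the API of Lakera or Llama Guard.

```python
import base64
import binascii
import re

# Hypothetical leetspeak map; real obfuscation uses many more substitutions.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)

# Runs of Base64 alphabet long enough to be worth attempting to decode.
B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_base64_spans(text: str) -> list[str]:
    """Return printable UTF-8 decodings of Base64-looking spans in text."""
    decoded = []
    for match in B64_CANDIDATE.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            candidate = raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not text once decoded
        if candidate.isprintable():
            decoded.append(candidate)
    return decoded


def candidate_views(text: str) -> list[str]:
    """All views of the input that the classifier should inspect."""
    views = [text, text.lower().translate(LEET_MAP)]
    views.extend(decode_base64_spans(text))
    return views


def toy_classifier(view: str) -> bool:
    """Placeholder for a real input classifier; flags one demo phrase."""
    return "ignore previous instructions" in view.lower()


def is_flagged(text: str) -> bool:
    """Flag the input if any decoded view trips the classifier."""
    return any(toy_classifier(view) for view in candidate_views(text))
```

Normalisation of this kind only covers the encodings it knows about, so it complements the other layers rather than replacing them.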

## Common mistakes

- Relying on the model's refusal training as a security boundary.
- Treating jailbreak resistance as static; new techniques appear weekly, so defences need regular re-testing.
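A minimal sketch of avoiding the first mistake follows, assuming the application parses model output into a proposed action before anything runs. The action names, `ProposedAction`, and `gate` are hypothetical; the `approve` callback stands in for a human reviewer or a separate guardrail service.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action names; a real application defines its own.
DESTRUCTIVE_ACTIONS = {"delete_file", "send_payment", "drop_table"}
ALLOWED_ACTIONS = DESTRUCTIVE_ACTIONS | {"read_file", "search_docs"}


@dataclass(frozen=True)
class ProposedAction:
    """An action the model asked for, parsed from its output."""
    name: str
    argument: str


def gate(action: ProposedAction, approve: Callable[[ProposedAction], bool]) -> bool:
    """Decide whether a model-proposed action may run.

    The decision never relies on the model having refused unsafe requests:
    unknown actions are rejected outright, and destructive ones run only
    if `approve` (a human or a separate guardrail) says yes.
    """
    if action.name not in ALLOWED_ACTIONS:
        return False
    if action.name in DESTRUCTIVE_ACTIONS:
        return approve(action)
    return True
```

The point of the design is that the security boundary sits in application code, so a successful jailbreak can at most propose a destructive action, not execute one.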

## Related terms

- [guardrails](https://promtable.com/glossary/guardrails)
- [prompt-injection](https://promtable.com/glossary/prompt-injection)
- [system-prompt](https://promtable.com/glossary/system-prompt)

*Last updated: 2026-06-01*

---

Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/jailbreak".
Contact: info@vibecodingturkey.com.