Safety classifier
A safety classifier is a small specialised model that scores LLM input or output for unsafe categories — toxicity, PII, prompt injection, jailbreak, NSFW — so the application can refuse, rewrite, or escalate.
Safety classifiers sit both upstream of the LLM, where they filter incoming user prompts, and downstream, where they gate model output before it reaches the user or triggers a destructive action. Production options in 2026 include Llama Guard (Meta), Lakera Guard, Prompt Shields (Microsoft), and OpenAI Moderation. Thresholds are tunable per category; reasonable defaults block obvious abuse and let benign edge cases through. Layering is best practice: an input classifier, an output classifier, and a final safety LLM judge on high-risk outputs.
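The layering and per-category thresholds described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any vendor's API: `guarded_call`, `flagged`, `toy_scorer`, `toy_llm`, the category names, and the threshold values are all hypothetical stand-ins for a real classifier and model client.

```python
from typing import Callable, Dict, List, Optional

Scores = Dict[str, float]

# Hypothetical per-category thresholds; real values should be tuned on real traffic.
THRESHOLDS: Scores = {"toxicity": 0.8, "pii": 0.5, "prompt_injection": 0.6}


def flagged(scores: Scores, thresholds: Scores = THRESHOLDS) -> List[str]:
    """Return, sorted, the categories whose score meets or exceeds its threshold.

    Categories without a configured threshold are ignored.
    """
    return sorted(c for c, s in scores.items() if c in thresholds and s >= thresholds[c])


def guarded_call(
    prompt: str,
    score_text: Callable[[str], Scores],
    call_llm: Callable[[str], str],
    judge: Optional[Callable[[str], bool]] = None,
    high_risk: bool = False,
) -> dict:
    """Input classifier -> LLM -> output classifier -> optional judge."""
    # Layer 1: screen the incoming prompt before it reaches the model.
    hits = flagged(score_text(prompt))
    if hits:
        return {"status": "refused_input", "categories": hits}

    reply = call_llm(prompt)

    # Layer 2: gate the model output before it reaches the user or a tool.
    hits = flagged(score_text(reply))
    if hits:
        return {"status": "blocked_output", "categories": hits}

    # Layer 3: on high-risk outputs only, ask a judge (True means "safe").
    if high_risk and judge is not None and not judge(reply):
        return {"status": "escalated", "categories": []}

    return {"status": "ok", "reply": reply}


def toy_scorer(text: str) -> Scores:
    """Keyword-based stand-in for a real classifier, only so the sketch runs."""
    lowered = text.lower()
    return {
        "toxicity": 0.0,
        "pii": 0.9 if "ssn" in lowered else 0.0,
        "prompt_injection": 0.9 if "ignore previous instructions" in lowered else 0.0,
    }


def toy_llm(prompt: str) -> str:
    """Stand-in for a model call."""
    return "Echo: " + prompt
```

In a real deployment `score_text` would wrap whichever classifier is in use and `judge` would wrap a separate LLM call; the sketch only fixes the order of the checks and where each threshold applies.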
When to use a safety classifier
- Any user-facing LLM feature.
- Agents that take destructive actions based on tool output.
- Regulated industries with explicit content rules.
Common mistakes
- Relying on a single classifier, since every classifier has category-specific false negatives.
- Tuning thresholds without measuring false-positive rate on real traffic.
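Measuring the effect of a threshold before shipping it can be as simple as the sketch below. It assumes a labelled sample of real traffic is available as `(score, is_unsafe)` pairs for one category; the `SAMPLES` values here are made up purely for illustration, and the function names are hypothetical.

```python
from typing import List, Tuple

# (classifier score, human label "is this actually unsafe?") for one category.
# Made-up numbers for illustration; use a labelled sample of real traffic instead.
SAMPLES: List[Tuple[float, bool]] = [
    (0.05, False),
    (0.20, False),
    (0.55, False),
    (0.70, False),
    (0.65, True),
    (0.90, True),
]


def false_positive_rate(samples: List[Tuple[float, bool]], threshold: float) -> float:
    """Share of benign samples that would be blocked at this threshold."""
    benign = [score for score, is_unsafe in samples if not is_unsafe]
    if not benign:
        raise ValueError("need at least one benign sample")
    return sum(score >= threshold for score in benign) / len(benign)


def false_negative_rate(samples: List[Tuple[float, bool]], threshold: float) -> float:
    """Share of unsafe samples that would slip through at this threshold."""
    unsafe = [score for score, is_unsafe in samples if is_unsafe]
    if not unsafe:
        raise ValueError("need at least one unsafe sample")
    return sum(score < threshold for score in unsafe) / len(unsafe)


if __name__ == "__main__":
    # Sweep a few candidate thresholds and compare both error rates side by side.
    for t in (0.5, 0.6, 0.8):
        fpr = false_positive_rate(SAMPLES, t)
        fnr = false_negative_rate(SAMPLES, t)
        print(f"threshold={t:.1f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
```

Reporting both rates together makes the trade-off visible: raising a threshold lowers the false-positive rate but lets more unsafe content through.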
FAQ
What is a safety classifier?
A safety classifier is a small specialised model that scores LLM input or output for unsafe categories — toxicity, PII, prompt injection, jailbreak, NSFW — so the application can refuse, rewrite, or escalate.
When should I use a safety classifier?
Any user-facing LLM feature. Agents that take destructive actions based on tool output. Regulated industries with explicit content rules.
What are the most common mistakes with safety classifiers?
Relying on a single classifier, since every classifier has category-specific false negatives. Tuning thresholds without measuring false-positive rate on real traffic.
Related terms
- Guardrails — Guardrails are deterministic checks layered around a language model to prevent unsafe, off-topic, or non-compliant outputs from reaching the user.
- Prompt injection — Prompt injection is an attack where hostile content in a model's input (a webpage, a retrieved document, a user message) overrides the system prompt's instructions.
- Jailbreak (LLM) — A jailbreak is a prompt-level attack that bypasses a language model's safety guardrails, causing it to produce content the model was trained to refuse.
- Evals (LLM evaluations) — Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/safety-classifier.md.