Safety classifier

A safety classifier is a small, specialised model that scores LLM input or output against unsafe categories (toxicity, PII, prompt injection, jailbreaks, NSFW content) so the application can refuse, rewrite, or escalate.

Safety classifiers sit both upstream of the LLM, where they filter incoming user prompts, and downstream, where they gate model output before it reaches the user or triggers a destructive action. Production options in 2026 include Llama Guard (Meta), Lakera Guard, Prompt Shields (Microsoft), and the OpenAI Moderation API. Per-category thresholds are tunable; reasonable defaults block obvious abuse and let benign edge cases through. Layering is best practice: an input classifier, an output classifier, and a final safety LLM judge on high-risk outputs.
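
The input and output layers with per-category thresholds can be sketched roughly as follows. This is a minimal illustration, not any vendor's API: `score_text` is a hypothetical keyword stub standing in for a real classifier call, and the category names, threshold values, and refusal messages are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical per-category block thresholds; real values should be tuned on real traffic.
THRESHOLDS: Dict[str, float] = {
    "toxicity": 0.80,
    "pii": 0.70,
    "prompt_injection": 0.60,
}

Scorer = Callable[[str], Dict[str, float]]


@dataclass
class Verdict:
    allowed: bool
    flagged: Dict[str, float]  # categories whose score met or exceeded the threshold


def check(text: str, scorer: Scorer, thresholds: Dict[str, float] = THRESHOLDS) -> Verdict:
    """Score the text and flag every category at or above its threshold."""
    scores = scorer(text)
    # Categories without a configured threshold are only flagged at a score of 1.0.
    flagged = {c: s for c, s in scores.items() if s >= thresholds.get(c, 1.0)}
    return Verdict(allowed=not flagged, flagged=flagged)


def guarded_call(prompt: str, llm: Callable[[str], str], scorer: Scorer) -> str:
    """Input classifier -> LLM -> output classifier."""
    if not check(prompt, scorer).allowed:
        return "Request refused by input filter."
    answer = llm(prompt)
    if not check(answer, scorer).allowed:
        return "Response withheld by output filter."
    return answer


def score_text(text: str) -> Dict[str, float]:
    """Toy stand-in scorer (an assumption): keyword matching, not a real model."""
    lowered = text.lower()
    return {
        "toxicity": 0.9 if "idiot" in lowered else 0.05,
        "pii": 0.9 if "ssn" in lowered else 0.05,
        "prompt_injection": 0.9 if "ignore previous instructions" in lowered else 0.05,
    }
```

In this sketch, `guarded_call("hello", llm=my_model, scorer=score_text)` passes the prompt through only if both checks succeed. A third layer, such as an LLM judge for high-risk outputs, could be added as another step after the output check.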

When to use a safety classifier

Common mistakes

FAQ

What is a safety classifier?

A safety classifier is a small, specialised model that scores LLM input or output against unsafe categories (toxicity, PII, prompt injection, jailbreaks, NSFW content) so the application can refuse, rewrite, or escalate.

When should I use a safety classifier?

Use one for any user-facing LLM feature, for agents that take destructive actions based on tool output, and in regulated industries with explicit content rules.
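
For the agent case, one possible approach is to scan tool output before it can drive a destructive action. The sketch below is illustrative only: `injection_score`, the `DESTRUCTIVE_TOOLS` names, and the 0.5 threshold are all hypothetical, and the scorer is a keyword stub rather than a real classifier.

```python
from typing import Set

DESTRUCTIVE_TOOLS: Set[str] = {"delete_file", "send_email", "run_sql"}  # hypothetical tool names
INJECTION_THRESHOLD = 0.5  # assumed value; tune on real traffic


def injection_score(text: str) -> float:
    """Toy stand-in for a prompt-injection classifier (keyword stub)."""
    markers = ("ignore previous instructions", "disregard the system prompt")
    return 0.95 if any(m in text.lower() for m in markers) else 0.02


def decide(next_tool: str, last_tool_output: str) -> str:
    """Return 'allow' or 'escalate' for the agent's next tool call."""
    if next_tool not in DESTRUCTIVE_TOOLS:
        return "allow"  # this sketch only gates destructive tools
    if injection_score(last_tool_output) >= INJECTION_THRESHOLD:
        return "escalate"  # hand off to a human or a stricter check
    return "allow"
```

Gating only the destructive tools keeps latency low for read-only calls; a stricter deployment might scan every tool output instead.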

What are the most common mistakes with safety classifiers?

Relying on a single classifier, since each one has category-specific false negatives. Tuning thresholds without measuring the false-positive rate on real traffic.
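
Measuring the effect of a threshold before changing it might look like the sketch below. It assumes a labelled sample of real traffic is available as `(score, is_unsafe)` pairs; the function name and the sample scores here are made up for illustration.

```python
from typing import List, Tuple


def rates_at_threshold(samples: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate).

    Each sample is (classifier_score, is_actually_unsafe); text is blocked
    when its score is at or above the threshold.
    """
    fp = sum(1 for s, unsafe in samples if s >= threshold and not unsafe)
    fn = sum(1 for s, unsafe in samples if s < threshold and unsafe)
    benign = sum(1 for _, unsafe in samples if not unsafe)
    harmful = len(samples) - benign
    fpr = fp / benign if benign else 0.0
    fnr = fn / harmful if harmful else 0.0
    return fpr, fnr


# Made-up illustrative scores, not real measurements.
samples = [
    (0.10, False), (0.40, False), (0.65, False), (0.90, False),
    (0.55, True), (0.85, True), (0.95, True), (0.30, True),
]

for t in (0.5, 0.7, 0.9):
    fpr, fnr = rates_at_threshold(samples, t)
    print(f"threshold={t:.1f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
```

Running the same sweep separately per category, and per classifier, helps show where one classifier's false negatives are covered by another.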

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/safety-classifier.md.