# Safety classifier

**Source:** https://promtable.com/glossary/safety-classifier

> A safety classifier is a small specialised model that scores LLM input or output for unsafe categories — toxicity, PII, prompt injection, jailbreak, NSFW — so the application can refuse, rewrite, or escalate.

---

Safety classifiers sit both upstream of the LLM, where they filter incoming user prompts, and downstream, where they gate model output before it reaches the user or triggers a destructive action. Production options in 2026 include Llama Guard (Meta), Lakera Guard, Prompt Shields (Microsoft), and the OpenAI Moderation API. Thresholds are tunable per category; reasonable defaults block obvious abuse and let benign edge cases through. Layering is best practice: an input classifier, an output classifier, and a final safety LLM judge on high-risk outputs.
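The layered setup described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's API: the category names, the threshold values, and the `input_clf`, `output_clf`, `llm`, and `judge` callables are all hypothetical stand-ins for whichever classifier and model you actually call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Scores = Dict[str, float]  # category name -> score in [0, 1]

# Hypothetical per-category thresholds; real values should be tuned on your traffic.
THRESHOLDS: Scores = {
    "toxicity": 0.80,
    "pii": 0.60,
    "prompt_injection": 0.50,
    "jailbreak": 0.50,
    "nsfw": 0.85,
}


def flagged(scores: Scores, thresholds: Scores = THRESHOLDS) -> List[str]:
    """Return the categories whose score meets or exceeds its threshold.

    Categories without a configured threshold are ignored.
    """
    return sorted(c for c, s in scores.items() if c in thresholds and s >= thresholds[c])


@dataclass
class Verdict:
    allowed: bool
    stage: str  # "input", "output", "judge", or "passed"
    categories: List[str] = field(default_factory=list)


def guarded_call(
    prompt: str,
    llm: Callable[[str], str],
    input_clf: Callable[[str], Scores],
    output_clf: Callable[[str], Scores],
    judge: Optional[Callable[[str, str], bool]] = None,
    high_risk: bool = False,
) -> Tuple[Verdict, Optional[str]]:
    """Run input classifier -> LLM -> output classifier -> optional judge."""
    hits = flagged(input_clf(prompt))
    if hits:  # layer 1: block before the LLM ever sees the prompt
        return Verdict(False, "input", hits), None

    reply = llm(prompt)

    hits = flagged(output_clf(reply))
    if hits:  # layer 2: block unsafe model output
        return Verdict(False, "output", hits), None

    # layer 3: the (slower) judge runs only on high-risk outputs
    if high_risk and judge is not None and not judge(prompt, reply):
        return Verdict(False, "judge"), None

    return Verdict(True, "passed"), reply
```

The caller decides what a blocked verdict means (refuse, rewrite, or escalate); keeping `stage` and `categories` on the verdict makes that decision, and later threshold tuning, easier to log.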

## When to use

- Any user-facing LLM feature.
- Agents that take destructive actions based on tool output.
- Regulated industries with explicit content rules.

## Common mistakes

- Relying on a single classifier: each one has category-specific false negatives.
- Tuning thresholds without measuring the false-positive rate on real traffic.
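To make the second point concrete, here is a minimal sketch of checking a candidate threshold against labelled traffic before shipping it. It assumes you have a sample of `(score, is_unsafe)` pairs for one category, where the labels come from human review; the function names and the sample data are illustrative only.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, bool]  # (classifier score, human label: is the item really unsafe?)


def false_positive_rate(samples: Sequence[Sample], threshold: float) -> float:
    """Share of benign items that would be blocked at this threshold."""
    benign: List[float] = [score for score, is_unsafe in samples if not is_unsafe]
    if not benign:
        raise ValueError("need at least one benign sample")
    return sum(score >= threshold for score in benign) / len(benign)


def false_negative_rate(samples: Sequence[Sample], threshold: float) -> float:
    """Share of unsafe items that would slip through at this threshold."""
    unsafe: List[float] = [score for score, is_unsafe in samples if is_unsafe]
    if not unsafe:
        raise ValueError("need at least one unsafe sample")
    return sum(score < threshold for score in unsafe) / len(unsafe)


# Illustrative data only, not real measurements.
traffic: List[Sample] = [
    (0.05, False), (0.40, False), (0.72, False), (0.10, False),
    (0.91, True), (0.55, True),
]

for t in (0.5, 0.8):
    print(t, false_positive_rate(traffic, t), false_negative_rate(traffic, t))
```

On this toy sample, raising the threshold from 0.5 to 0.8 drops the false-positive rate from 0.25 to 0.0 but raises the false-negative rate from 0.0 to 0.5, which is the trade-off a threshold change should be judged against.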

## Related terms

- [guardrails](https://promtable.com/glossary/guardrails)
- [prompt-injection](https://promtable.com/glossary/prompt-injection)
- [jailbreak](https://promtable.com/glossary/jailbreak)
- [evals](https://promtable.com/glossary/evals)

*Last updated: 2026-06-01*

---

Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/safety-classifier".
Contact: info@vibecodingturkey.com.