# Evals (LLM evaluations)

**Source:** https://promtable.com/glossary/evals

> Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.

---

Evals turn prompt engineering from guesswork into engineering. The basic shape has three parts: a golden set of representative inputs paired with reference outputs (or grading rubrics), a scorer (rule-based, LLM-as-judge, or human review), and a regression alarm that fires when scores drop. In 2026, running evals on prompt changes, model upgrades, and routing decisions is standard practice for credible production LLM teams. Open-source frameworks include Inspect (from the UK AI Security Institute, formerly the AI Safety Institute), Ragas (RAG-specific), DeepEval, and OpenAI Evals. Hosted platforms include Braintrust, Langfuse, and Patronus. The shift in the field is from "vibes-based" prompt tweaking to test-driven prompting.
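The three parts described above (golden set, scorer, regression alarm) can be sketched in a few lines of Python. This is a minimal illustration, not the API of any framework named here: `Case`, `exact_match`, `run_eval`, `regression_alarm`, and `toy_model` are hypothetical names, and the exact-match scorer and the 0.02 tolerance are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One golden-set entry: an input and its reference output."""
    prompt: str
    reference: str


def exact_match(output: str, reference: str) -> float:
    """Rule-based scorer: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    golden_set: list[Case],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Return the mean score of `model` over the golden set."""
    if not golden_set:
        raise ValueError("golden set is empty")
    scores = [scorer(model(case.prompt), case.reference) for case in golden_set]
    return sum(scores) / len(scores)


def regression_alarm(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Return True when the score falls more than `tolerance` below baseline."""
    return score < baseline - tolerance


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    answers = {"Capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "unknown")


GOLDEN_SET = [
    Case("Capital of France?", "paris"),
    Case("2 + 2?", "4"),
    Case("Capital of Japan?", "Tokyo"),
]

score = run_eval(toy_model, GOLDEN_SET)
print(f"score={score:.2f} alarm={regression_alarm(score, baseline=0.90)}")
```

In practice the scorer argument is where a rubric-based or LLM-as-judge function would be swapped in, and the baseline would come from the last accepted run rather than a hard-coded number.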

## When to use

- Any production LLM feature.
- Before swapping a model provider or making a major prompt change.
- Continuous monitoring of live traffic samples.
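For the live-traffic case, one possible approach is to pick a fixed fraction of requests deterministically, so the same request is always in or out of the eval sample. The sketch below assumes each request carries a string ID; `in_eval_sample` and the 5% default rate are hypothetical choices for illustration.

```python
import hashlib


def in_eval_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of requests for evaluation."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


# Hypothetical request IDs; a real system would use its own identifiers.
sampled = [rid for rid in (f"req-{i}" for i in range(10_000)) if in_eval_sample(rid)]
print(f"sampled {len(sampled)} of 10000 requests")
```

Hash-based selection avoids storing sampling state and keeps the sample stable across restarts, at the cost of not adapting to traffic spikes.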

## Common mistakes

- Using an LLM judge that scores everything 5/5. An uncalibrated judge is useless; check it against human labels first.
- Golden sets that don't reflect the real input distribution.
- Treating a one-time eval as the whole job. Set up regression alarms so later changes are caught.
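The first mistake can be checked with a small calibration report that compares judge scores to human labels on the same items. This is a rough sketch under stated assumptions: `judge_calibration` is a hypothetical helper, scores are integers on a 1 to 5 scale, and the sample data is invented for illustration.

```python
from collections import Counter


def judge_calibration(judge_scores: list[int], human_scores: list[int]) -> dict[str, float]:
    """Compare LLM-judge scores with human labels on the same items."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and the same length")
    n = len(judge_scores)
    # Fraction of items where the judge and the human gave the same score.
    agreement = sum(j == h for j, h in zip(judge_scores, human_scores)) / n
    # Share of items that received the judge's most common score;
    # a value near 1.0 suggests the judge is not discriminating at all.
    top_score_share = Counter(judge_scores).most_common(1)[0][1] / n
    return {"agreement": agreement, "top_score_share": top_score_share}


# Invented example data: a judge that gives everything 5/5.
judge = [5, 5, 5, 5, 5, 5]
human = [5, 3, 4, 2, 5, 1]

report = judge_calibration(judge, human)
print(report)
```

Low agreement combined with a top-score share near 1.0 is the pattern the bullet warns about; exact agreement is a deliberately strict measure, and a correlation or within-one-point metric may fit some rubrics better.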

## Related terms

- [prompt-engineering](https://promtable.com/glossary/prompt-engineering)
- [guardrails](https://promtable.com/glossary/guardrails)
- [agent](https://promtable.com/glossary/agent)
- [model-router](https://promtable.com/glossary/model-router)

*Last updated: 2026-06-01*

---

Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/evals".
Contact: info@vibecodingturkey.com.