Evals (LLM evaluations)
Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.
Evals turn prompt engineering from guesswork into engineering. The basic shape has three parts: a golden set of representative inputs paired with reference outputs (or grading rubrics), a scorer (rule-based, LLM-as-judge, or human), and a regression alarm that fires when scores drop. As of 2026, running evals on prompt changes, model upgrades, and routing decisions is standard practice for production LLM teams. Open-source frameworks include Inspect and its Inspect Evals collection (UK AI Security Institute), Ragas (RAG-specific), DeepEval, and OpenAI Evals. Hosted platforms include Braintrust, Langfuse, and Patronus. The shift in the field is from "vibes-based" prompt tweaking to test-driven prompting.
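The three parts described above (golden set, scorer, regression alarm) can be sketched in a few lines of Python. This is a minimal illustration, not the API of any framework named here: `Case`, `exact_match`, `run_eval`, `regressed`, and the stand-in `fake_system` are all hypothetical names, and the 0.02 tolerance is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    """One golden-set entry: an input and its reference output."""
    prompt: str
    reference: str


def exact_match(output: str, reference: str) -> float:
    """Rule-based scorer: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0


def run_eval(
    system: Callable[[str], str],
    golden_set: list[Case],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Return the mean score of `system` over the golden set."""
    if not golden_set:
        raise ValueError("golden set is empty")
    scores = [scorer(system(case.prompt), case.reference) for case in golden_set]
    return sum(scores) / len(scores)


def regressed(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Regression alarm: True when `current` falls more than `tolerance` below `baseline`."""
    return current < baseline - tolerance


def fake_system(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


GOLDEN = [
    Case("2+2", "4"),
    Case("capital of France", "paris"),
    Case("largest planet", "Jupiter"),
]

score = run_eval(fake_system, GOLDEN)
if regressed(score, baseline=0.9):
    print(f"Regression: score {score:.2f} is below baseline 0.90")
```

In practice the scorer would be swapped for a rubric-based or LLM-as-judge function with the same `(output, reference) -> float` shape, and the baseline would come from the last accepted run rather than a hard-coded value.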
When to use evals (LLM evaluations)
- Any production LLM feature.
- Before swapping model providers or making a major prompt change.
- Continuous monitoring of live traffic samples.
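For the live-traffic case, one common approach is to pick a fixed fraction of requests deterministically, so that the same request is always either in or out of the eval sample. The sketch below assumes each request carries a string ID; `in_eval_sample` and the 5% default rate are hypothetical choices for illustration.

```python
import hashlib


def in_eval_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the request ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Example: decide which of 10,000 hypothetical request IDs get scored offline.
sampled = [f"req-{i}" for i in range(10_000) if in_eval_sample(f"req-{i}")]
print(f"sampled {len(sampled)} of 10000 requests")
```

Hashing rather than calling a random number generator keeps the decision reproducible across services and reruns, which makes it easier to trace a score back to a specific request.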
Common mistakes
- Using an uncalibrated LLM judge. A judge that scores everything 5/5 is useless until it has been checked against human labels.
- Golden sets that don't reflect the real input distribution.
- Treating a one-time eval run as the whole job. Set up regression alarms so later changes are caught.
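The first mistake above can be checked with a small calibration report: score a sample with both the LLM judge and a human, then look at how often they agree and whether the judge collapses onto a single score. This is a rough sketch under stated assumptions: `judge_calibration` is a hypothetical helper, scores are assumed to be integers on a shared 1 to 5 scale, and exact agreement is used in place of a more robust statistic such as Cohen's kappa.

```python
from collections import Counter


def judge_calibration(judge_scores: list[int], human_scores: list[int]) -> dict:
    """Compare LLM-judge scores with human scores on the same items."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    n = len(judge_scores)
    # Fraction of items where judge and human gave the identical score.
    agreement = sum(j == h for j, h in zip(judge_scores, human_scores)) / n
    # Share of items that received the judge's single most frequent score.
    top_score, top_count = Counter(judge_scores).most_common(1)[0]
    return {
        "agreement": agreement,
        "most_common_score": top_score,
        "most_common_share": top_count / n,
    }


# Made-up scores for illustration: the judge gives 5/5 to everything.
report = judge_calibration(
    judge_scores=[5, 5, 5, 5, 5, 5],
    human_scores=[5, 3, 4, 5, 2, 4],
)
print(report)
```

A `most_common_share` near 1.0 combined with low agreement suggests the judge prompt or rubric needs work before its scores are trusted in a regression alarm.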
FAQ
What are evals (LLM evaluations)?
Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.
When should I use evals (LLM evaluations)?
Use them for any production LLM feature, before swapping model providers or making a major prompt change, and for continuous monitoring of live traffic samples.
What are the most common mistakes with evals (LLM evaluations)?
Using an uncalibrated LLM judge (one that scores everything 5/5 is useless), building golden sets that don't reflect the real input distribution, and treating a one-time eval run as the whole job instead of setting up regression alarms.
Related terms
- Prompt engineering — Prompt engineering is the practice of designing input text that reliably steers a large language model toward a specific output.
- Guardrails — Guardrails are deterministic checks layered around a language model to prevent unsafe, off-topic, or non-compliant outputs from reaching the user.
- AI agent — An AI agent is a system where a language model autonomously plans and executes a sequence of tool calls to accomplish a goal.
- Model router — A model router picks which language model handles each request based on cost, latency, or task type — the standard production pattern in 2026.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/evals.md.