concept

Evals (LLM evaluations)

Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.

Evals turn prompt engineering from guesswork into engineering. The basic shape has three parts: a golden set of representative inputs paired with reference outputs (or grading rubrics), a scorer (rule-based, LLM-as-judge, or human), and a regression alarm that fires when scores drop. As of 2026, running evals on prompt changes, model upgrades, and routing decisions is standard practice for production LLM teams. Open-source frameworks include Inspect (from the UK AI Security Institute, formerly the AI Safety Institute), Ragas (RAG-specific), DeepEval, and OpenAI Evals. Hosted platforms include Braintrust, Langfuse, and Patronus. The shift in the field is from "vibes-based" prompt tweaking to test-driven prompting.
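The three parts described above (golden set, scorer, regression alarm) can be sketched in a few lines of plain Python. This is a minimal illustration, not the API of any framework named here: `Case`, `exact_match`, `run_eval`, `regression_alarm`, the `toy_model` stand-in, and the 0.02 tolerance are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One golden-set entry: an input and its reference output."""
    prompt: str
    reference: str


def exact_match(output: str, reference: str) -> float:
    """Rule-based scorer: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    golden_set: list[Case],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Return the mean score of `model` over the golden set."""
    if not golden_set:
        raise ValueError("golden set is empty")
    scores = [scorer(model(case.prompt), case.reference) for case in golden_set]
    return sum(scores) / len(scores)


def regression_alarm(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True when the score dropped by more than `tolerance` below the baseline."""
    return (baseline - current) > tolerance


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


golden_set = [
    Case("2+2", "4"),
    Case("capital of France", "paris"),
    Case("capital of Peru", "Lima"),
]

score = run_eval(toy_model, golden_set)
status = "ALARM" if regression_alarm(score, baseline=0.90) else "ok"
print(f"score={score:.2f} status={status}")
```

In practice the scorer would be swapped for a rubric grader or an LLM judge, and the baseline would come from the last accepted run rather than a hard-coded number.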

When to use evals (LLM evaluations)

Common mistakes

FAQ

What are evals (LLM evaluations)?

Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.

When should I use evals (LLM evaluations)?

Use evals for any production LLM feature, before swapping a model provider or making a major prompt change, and for continuous monitoring of sampled live traffic.

What are the most common mistakes with evals (LLM evaluations)?

Three mistakes recur. First, using an LLM judge that scores everything 5/5: an uncalibrated judge is useless. Second, building golden sets that don't reflect the real input distribution. Third, treating a one-time eval as the whole job instead of setting up regression alarms.
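One way to catch a judge that scores everything 5/5 is to compare its scores with a small set of human-labeled examples. The sketch below assumes calibration means checking agreement with human scores on the same items; `judge_calibration` and the sample scores are hypothetical and made up for illustration.

```python
from collections import Counter


def judge_calibration(judge_scores: list[int], human_scores: list[int]) -> dict:
    """Summarize how an LLM judge compares with human labels on the same items."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and the same length")
    n = len(judge_scores)
    # Fraction of items where the judge gave exactly the human score.
    agreement = sum(j == h for j, h in zip(judge_scores, human_scores)) / n
    # How concentrated the judge's scores are on a single value.
    top_score, top_count = Counter(judge_scores).most_common(1)[0]
    return {
        "agreement": agreement,
        "most_common_score": top_score,
        "most_common_share": top_count / n,
    }


# Illustrative data: a judge that always answers 5, against varied human labels.
judge = [5, 5, 5, 5, 5, 5]
human = [5, 2, 4, 1, 5, 3]

report = judge_calibration(judge, human)
print(report)
```

A `most_common_share` near 1.0 combined with low `agreement` suggests the judge is not discriminating between good and bad outputs and needs a revised rubric or prompt before its scores are trusted.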

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/evals.md.