Evals-driven development
Evals-driven development is the discipline of writing the eval suite first, then iterating prompts and models against it — borrowing test-driven development for LLM work.
Evals-driven development inverts the usual order: define what success looks like, encode it as an automated eval against a golden set, then iterate prompts, models, and orchestration until the evals pass. Serious LLM teams had widely adopted it by 2026 because it is the most dependable way to ship: prompt changes made on feel, without evals, break production. Mature implementations integrate evals into CI: every prompt change runs the suite, regressions block merges, and scores trend in a dashboard. Common tools include Braintrust, Langfuse, Ragas, and Inspect Evals.
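The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the API of any tool named here: the `Case` type, the three-item `GOLDEN_SET`, the substring scorer, the `stub_model` function, and the baseline of 1.0 are all hypothetical stand-ins. A real suite would call an actual model and usually use a richer scorer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    """One golden-set entry: an input and text the output must contain."""
    prompt_input: str
    must_contain: str


# Hypothetical golden set; a real one would be sampled from production traffic.
GOLDEN_SET = [
    Case("Refund request for order 1001", "refund"),
    Case("Where is my parcel?", "tracking"),
    Case("Cancel my subscription", "cancel"),
]


def run_suite(generate: Callable[[str], str], cases: list[Case]) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    passed = sum(
        1
        for case in cases
        if case.must_contain.lower() in generate(case.prompt_input).lower()
    )
    return passed / len(cases)


def gate(score: float, baseline: float, tolerance: float = 0.0) -> bool:
    """True if the change may merge: the score has not dropped past tolerance."""
    return score >= baseline - tolerance


def stub_model(text: str) -> str:
    """Stand-in for a real model call; returns a canned reply per keyword."""
    if "refund" in text.lower():
        return "I have started your refund."
    if "parcel" in text.lower():
        return "Here is your tracking link."
    return "Sorry, I cannot help with that."


# Assumed baseline: the score of the prompt currently in production.
score = run_suite(stub_model, GOLDEN_SET)
exit_code = 0 if gate(score, baseline=1.0) else 1
print(f"score={score:.2f} exit_code={exit_code}")
# In a CI job, pass exit_code to sys.exit() so a regression fails the build.
```

The stub fails the cancellation case, so the gate reports a regression against the assumed baseline. Writing the cases before the prompt is the point: the failing case is the specification the next prompt revision has to meet.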
When to use evals-driven development
- Any serious production LLM feature.
- Teams shipping multiple prompt changes per week.
Common mistakes
- Building evals after shipping: by then the prompt has already baked in regressions you never caught.
- Using an eval set that is too small or unrepresentative of the real input distribution.
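One way to see why a small eval set is a problem is to put an uncertainty range around the pass rate. The sketch below uses the Wilson score interval, a standard choice for proportions. Applying it to eval scores is an illustration added here, not something the tools above prescribe, and the two set sizes are hypothetical.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Same 90% observed pass rate, two hypothetical eval-set sizes.
for passes, total in [(18, 20), (1800, 2000)]:
    low, high = wilson_interval(passes, total)
    print(f"n={total}: pass rate 0.90, plausible range {low:.3f} to {high:.3f}")
```

With 20 cases the range is tens of points wide, so a change of a few points between two prompts is indistinguishable from noise. With 2,000 cases it narrows to a few points. Neither number says anything about representativeness, which has to be checked against real traffic.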
FAQ
What is evals-driven development?
Evals-driven development is the discipline of writing the eval suite first, then iterating prompts and models against it — borrowing test-driven development for LLM work.
When should I use evals-driven development?
Use it for any serious production LLM feature, and on any team shipping multiple prompt changes per week.
What are the most common mistakes with evals-driven development?
The first is building evals after shipping, by which point the prompt has already baked in regressions you never caught. The second is using an eval set that is too small or unrepresentative of the real input distribution.
Related terms
- Evals (LLM evaluations) — Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.
- A/B testing prompts — A/B testing prompts runs two prompt variants against the same input distribution and compares scored outputs, attributing quality differences to the prompt change.
- Prompt versioning — Prompt versioning is the discipline of treating prompts as source-controlled artefacts — each prompt has a versioned ID, a deploy history, and a regression-tested change log.
- Vibe eval — Vibe eval is the pejorative for unsystematic eyeball-grading of LLM output — "it feels better" rather than measurable rubric-based comparison. The opposite of proper evals.
Last updated: 2026-06-01.