Golden set
A golden set is the small curated collection of (input, expected output) pairs that defines 'correct' behavior for an LLM feature — used as the regression baseline for evals, A/B tests, prompt experiments. Quality + curation matters more than size.
Without a golden set, prompt + model changes ship blindly: maybe better, maybe worse. Golden sets fix this: 50-500 hand-curated examples with known-correct outputs (or human-rated 'good' standards), run every change against them, track win rate. Curation patterns: harvest from prod traces (real user queries + verified-correct answers), cover edge cases (empty input, adversarial, unusual formats), refresh quarterly (prod distribution drifts). Production tools: Braintrust + Langfuse + Vellum all treat datasets as first-class. Trade-offs: small set misses edge cases; large set is expensive to maintain; LLM-judged outputs can carry judge bias. The golden set is the canonical safeguard against silent regressions.
When to use golden set
- Any production AI feature with non-trivial quality bar.
Common mistakes
- Stale golden set — months-old examples don't reflect current users.
- Auto-generated synthetic golden sets — measure the synthesizer, not the model.
FAQ
What is golden set?
A golden set is the small curated collection of (input, expected output) pairs that defines 'correct' behavior for an LLM feature — used as the regression baseline for evals, A/B tests, prompt experiments. Quality + curation matters more than size.
When should I use golden set?
Any production AI feature with non-trivial quality bar.
What are the most common mistakes with golden set?
Stale golden set — months-old examples don't reflect current users. Auto-generated synthetic golden sets — measure the synthesizer, not the model.
Related terms
- Evals (LLM evaluations) — Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.
- Regression suite (LLM) — A regression suite is the standing set of evals that runs on every prompt change, model upgrade, or pipeline modification — designed to catch quality regressions on previously-working cases.
- Evals-driven development — Evals-driven development is the discipline of writing the eval suite first, then iterating prompts and models against it — borrowing test-driven development for LLM work.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/golden-set.md.