# Golden set

**Source:** https://promtable.com/glossary/golden-set

> A golden set is the small, curated collection of (input, expected output) pairs that defines 'correct' behavior for an LLM feature. It serves as the regression baseline for evals, A/B tests, and prompt experiments. Quality and curation matter more than size.

---

Without a golden set, prompt and model changes ship blind: a change may be better or worse, and nobody can tell which. A golden set fixes this. It holds 50 to 500 hand-curated examples with known-correct outputs (or human-rated 'good' reference outputs); every change is run against them and the win rate is tracked.

Curation patterns:

- Harvest from production traces: real user queries paired with verified-correct answers.
- Cover edge cases: empty input, adversarial input, unusual formats.
- Refresh quarterly, because the production distribution drifts.

Production tools: Braintrust, Langfuse, and Vellum all treat datasets as first-class objects.

Trade-offs: a small set misses edge cases; a large set is expensive to maintain; LLM-judged outputs can carry judge bias. The golden set is the canonical safeguard against silent regressions.
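The "run every change against them, track win rate" loop can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's API: `GoldenExample`, `exact_match`, `pass_rate`, `compare`, and the two stub models are hypothetical names introduced here, and the exact-match scorer is an assumption (real features often need a rubric, a human rating, or an LLM judge instead).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenExample:
    """One curated (input, expected output) pair."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    # Assumed scorer: case- and whitespace-insensitive string equality.
    return output.strip().lower() == expected.strip().lower()


def pass_rate(model: Callable[[str], str],
              golden_set: Sequence[GoldenExample],
              scorer: Callable[[str, str], bool] = exact_match) -> float:
    """Fraction of golden examples the model gets right."""
    if not golden_set:
        raise ValueError("golden set is empty")
    passed = sum(scorer(model(ex.input), ex.expected) for ex in golden_set)
    return passed / len(golden_set)


def compare(baseline: Callable[[str], str],
            candidate: Callable[[str], str],
            golden_set: Sequence[GoldenExample],
            scorer: Callable[[str, str], bool] = exact_match) -> dict:
    """Count examples where the candidate beats, loses to, or ties the baseline."""
    wins = losses = ties = 0
    for ex in golden_set:
        base_ok = scorer(baseline(ex.input), ex.expected)
        cand_ok = scorer(candidate(ex.input), ex.expected)
        if cand_ok and not base_ok:
            wins += 1
        elif base_ok and not cand_ok:
            losses += 1
        else:
            ties += 1
    return {"wins": wins, "losses": losses, "ties": ties}


# Toy golden set, including an empty-input edge case.
GOLDEN_SET = [
    GoldenExample("2+2", "4"),
    GoldenExample("capital of France", "Paris"),
    GoldenExample("", "Please provide a question."),
]


def baseline_model(prompt: str) -> str:
    # Stand-in for the current prompt + model; returns "" on empty input.
    return {"2+2": "4", "capital of France": "paris"}.get(prompt, "")


def candidate_model(prompt: str) -> str:
    # Stand-in for the proposed change; also handles empty input.
    answers = {"2+2": "4", "capital of France": "Paris",
               "": "Please provide a question."}
    return answers.get(prompt, "")


if __name__ == "__main__":
    print(pass_rate(baseline_model, GOLDEN_SET))
    print(pass_rate(candidate_model, GOLDEN_SET))
    print(compare(baseline_model, candidate_model, GOLDEN_SET))
```

Counting losses separately from wins is a deliberate choice in this sketch: a candidate with a higher overall pass rate can still break examples the baseline handled, and those are the silent regressions a golden set exists to catch.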

## When to use

- Any production AI feature with a non-trivial quality bar.

## Common mistakes

- Stale golden set: months-old examples no longer reflect current users.
- Auto-generated synthetic golden sets: these measure the synthesizer, not the model.
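Both mistakes can be flagged mechanically if each example carries a little metadata. The sketch below assumes every record stores the date it was added and where it came from; `GoldenRecord`, its `added_on` and `source` fields, the `audit` helper, and the 90-day window (a rough stand-in for "quarterly") are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Sequence


@dataclass(frozen=True)
class GoldenRecord:
    """Metadata for one golden example (fields are assumed, not standard)."""
    example_id: str
    added_on: date
    source: str  # e.g. "prod_trace", "hand_written", or "synthetic"


def audit(records: Sequence[GoldenRecord],
          today: date,
          max_age: timedelta = timedelta(days=90)) -> dict:
    """Return the ids of examples that are stale or synthetic."""
    stale = [r.example_id for r in records if today - r.added_on > max_age]
    synthetic = [r.example_id for r in records if r.source == "synthetic"]
    return {"stale": stale, "synthetic": synthetic}


RECORDS = [
    GoldenRecord("r1", date(2026, 5, 1), "prod_trace"),
    GoldenRecord("r2", date(2025, 12, 1), "hand_written"),
    GoldenRecord("r3", date(2026, 4, 15), "synthetic"),
]

if __name__ == "__main__":
    print(audit(RECORDS, today=date(2026, 6, 1)))
```

A flagged example is a prompt for human review rather than automatic deletion: an old example may still be representative, and a synthetic one may be acceptable once a person has verified its expected output.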

## Related terms

- [evals](https://promtable.com/glossary/evals)
- [regression-suite](https://promtable.com/glossary/regression-suite)
- [evals-driven-development](https://promtable.com/glossary/evals-driven-development)

*Last updated: 2026-06-01*

---

Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/golden-set".
Contact: info@vibecodingturkey.com.