
Golden set

A golden set is a small, curated collection of (input, expected output) pairs that defines 'correct' behavior for an LLM feature. It serves as the regression baseline for evals, A/B tests, and prompt experiments. Quality and curation matter more than size.

Without a golden set, prompt and model changes ship blind: a change may make outputs better or worse, and nobody can tell which. A golden set fixes this. Collect 50 to 500 hand-curated examples with known-correct outputs (or outputs that human raters have marked as 'good'), run every change against them, and track the win rate against the current baseline.

Curation patterns: harvest examples from prod traces (real user queries paired with verified-correct answers), cover edge cases (empty input, adversarial input, unusual formats), and refresh quarterly, because the prod distribution drifts. Production tools: Braintrust, Langfuse, and Vellum all treat datasets as first-class objects.

Trade-offs: a small set misses edge cases; a large set is expensive to maintain; and when outputs are scored by an LLM judge rather than by exact comparison, the scores can carry the judge's bias. The golden set is the canonical safeguard against silent regressions.
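The run-and-track loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `GoldenExample`, `normalize`, `pass_rate`, and `compare` are hypothetical names, the two "models" are stand-in lookup tables rather than real LLM calls, and scoring is a normalized exact match, which only suits features with short, deterministic answers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class GoldenExample:
    """One curated (input, expected output) pair. Hypothetical schema."""
    input: str
    expected: str


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not count as failures.
    return " ".join(text.lower().split())


def passes(model: Callable[[str], str], ex: GoldenExample) -> bool:
    return normalize(model(ex.input)) == normalize(ex.expected)


def pass_rate(model: Callable[[str], str], golden: Sequence[GoldenExample]) -> float:
    if not golden:
        raise ValueError("golden set is empty")
    return sum(passes(model, ex) for ex in golden) / len(golden)


def compare(
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    golden: Sequence[GoldenExample],
) -> Dict[str, int]:
    """Count examples the candidate fixes (wins), breaks (losses), or leaves unchanged (ties)."""
    result = {"wins": 0, "losses": 0, "ties": 0}
    for ex in golden:
        base_ok, cand_ok = passes(baseline, ex), passes(candidate, ex)
        if cand_ok and not base_ok:
            result["wins"] += 1
        elif base_ok and not cand_ok:
            result["losses"] += 1
        else:
            result["ties"] += 1
    return result


# Illustrative data only; a real set would come from verified prod traces.
GOLDEN = [
    GoldenExample("What is 2 + 2?", "4"),
    GoldenExample("", "Please enter a question."),  # edge case: empty input
    GoldenExample("Capital of France?", "Paris"),
]

# Stand-ins for "current prompt" and "changed prompt".
BASELINE_ANSWERS = {"What is 2 + 2?": "4", "": "", "Capital of France?": "paris"}
CANDIDATE_ANSWERS = {
    "What is 2 + 2?": "4",
    "": "Please enter a question.",
    "Capital of France?": "Paris",
}


def baseline_model(query: str) -> str:
    return BASELINE_ANSWERS.get(query, "")


def candidate_model(query: str) -> str:
    return CANDIDATE_ANSWERS.get(query, "")


if __name__ == "__main__":
    print("baseline pass rate:", pass_rate(baseline_model, GOLDEN))
    print("candidate pass rate:", pass_rate(candidate_model, GOLDEN))
    print("candidate vs baseline:", compare(baseline_model, candidate_model, GOLDEN))
```

Reporting losses separately from the aggregate pass rate is a deliberate choice in this sketch: a change can raise the overall rate while still breaking examples that used to pass, and those are exactly the silent regressions the golden set exists to catch.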

When to use a golden set

Common mistakes

FAQ

What is a golden set?

A golden set is a small, curated collection of (input, expected output) pairs that defines 'correct' behavior for an LLM feature. It serves as the regression baseline for evals, A/B tests, and prompt experiments. Quality and curation matter more than size.

When should I use a golden set?

For any production AI feature with a non-trivial quality bar.

What are the most common mistakes with golden sets?

Stale golden sets: months-old examples no longer reflect what current users ask. Auto-generated synthetic golden sets: these measure the synthesizer that produced them, not the model under test.
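Both mistakes can be flagged mechanically if each example carries a little provenance metadata. The sketch below assumes a hypothetical record schema (`example_id`, `source`, `verified_on`), hypothetical source labels (`"prod_trace"`, `"synthetic"`), and a 90-day age threshold; none of these come from a specific tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class GoldenRecord:
    """Provenance metadata for one golden example. Hypothetical schema."""
    example_id: str
    source: str        # assumed labels: "prod_trace" or "synthetic"
    verified_on: date  # when a human last confirmed the expected output


def audit(
    records: Sequence[GoldenRecord],
    today: date,
    max_age_days: int = 90,  # assumed threshold, not a standard
) -> Dict[str, List[str]]:
    """Return the ids of examples that are stale or synthetic."""
    cutoff = today - timedelta(days=max_age_days)
    return {
        "stale": [r.example_id for r in records if r.verified_on < cutoff],
        "synthetic": [r.example_id for r in records if r.source == "synthetic"],
    }


# Illustrative records only.
RECORDS = [
    GoldenRecord("a", "prod_trace", date(2026, 5, 1)),
    GoldenRecord("b", "prod_trace", date(2025, 12, 15)),
    GoldenRecord("c", "synthetic", date(2026, 5, 20)),
]

if __name__ == "__main__":
    print(audit(RECORDS, today=date(2026, 6, 1)))
```

An audit like this only surfaces candidates for review. Whether a flagged example should be re-verified, replaced, or dropped remains a curation decision.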

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/golden-set.md.