Regression baseline
A regression baseline is the recorded performance of the current production prompt and model against the golden set. Every change must match or beat this baseline, within an agreed tolerance, before merge.
Without a regression baseline, prompt and model changes ship on hope. The workflow has three steps: capture the current win rate against the [[golden-set]], gate merges on a rule such as "the change must not drop the win rate by more than X percentage points", and publish the diff in the PR. Implementations include Braintrust experiment diffs, Langfuse evals, and custom CI scripts that run an eval suite. In production, keep a separate baseline per route (chat, classification, extraction), set the tolerance per route (chat: 0 percentage points; bulk classification: a drop of 1-2 percentage points is acceptable), and require an explicit human override to merge a regression, so the check acts as a gate rather than a hard block. Together, the baseline, the golden set, and eval CI are what turn AI work from artisanal to engineered.
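The gate described above can be sketched in a few lines of Python. This is a minimal illustration, not the API of Braintrust, Langfuse, or any other tool: the names `TOLERANCE_PP`, `GateResult`, `win_rate_pct`, `check_route`, and `pr_diff` are hypothetical, and the tolerance values are assumptions taken from the ranges mentioned in the text.

```python
from dataclasses import dataclass

# Allowed drop per route, in percentage points. The route names and the
# choice of 2.0 (upper end of the 1-2 range) are illustrative assumptions.
TOLERANCE_PP = {"chat": 0.0, "classification": 2.0}


@dataclass(frozen=True)
class GateResult:
    route: str
    baseline_pct: float
    candidate_pct: float
    delta_pp: float   # candidate minus baseline; negative means a drop
    passed: bool      # True if within tolerance or explicitly overridden
    overridden: bool  # True only when a human waived a real regression


def win_rate_pct(outcomes: list[bool]) -> float:
    """Percentage of golden-set cases that were judged a win."""
    if not outcomes:
        raise ValueError("golden set produced no outcomes")
    return 100.0 * sum(outcomes) / len(outcomes)


def check_route(route: str, baseline_pct: float, candidate_pct: float,
                human_override: bool = False) -> GateResult:
    """Compare a candidate against the recorded baseline for one route."""
    if route not in TOLERANCE_PP:
        raise KeyError(f"no tolerance configured for route {route!r}")
    delta = round(candidate_pct - baseline_pct, 6)
    within = delta >= -TOLERANCE_PP[route]
    # An override only counts when there is a regression to waive.
    overridden = human_override and not within
    return GateResult(route, baseline_pct, candidate_pct, delta,
                      within or overridden, overridden)


def pr_diff(results: list[GateResult]) -> str:
    """Render the per-route diff as plain text for a PR comment."""
    rows = ["route | baseline | candidate | delta (pp) | status"]
    for r in results:
        status = "overridden" if r.overridden else ("pass" if r.passed else "FAIL")
        rows.append(f"{r.route} | {r.baseline_pct:.1f} | {r.candidate_pct:.1f} "
                    f"| {r.delta_pp:+.1f} | {status}")
    return "\n".join(rows)
```

In a CI script, one plausible use is to call `check_route` once per route, post the output of `pr_diff` to the PR, and exit non-zero if any result has `passed` set to `False`. Recording the override as a separate field keeps waived regressions visible in the diff instead of hiding them behind a passing status.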
When to use a regression baseline
- Production AI features with eval pipelines.
Common mistakes
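- Skipping the baseline: regressions ship silently because nothing measures them.
- Setting the tolerance too loose: each change can lower quality by a few percentage points and still pass, and those losses add up across merges.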
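To make the second mistake concrete, here is a small simulation of how a loose tolerance lets quality drift. It is a sketch under stated assumptions: the baseline is re-recorded after every merge, each change costs the same number of percentage points, and the function name `simulate_merges` and all numbers are hypothetical.

```python
def simulate_merges(start_pct: float, drops_pp: list[float],
                    tolerance_pp: float) -> tuple[float, int]:
    """Apply a series of changes, each lowering the win rate by some points.

    Assumes the baseline is re-recorded after each merge, so every change
    is only compared against the most recent baseline.
    Returns (final baseline, number of changes merged).
    """
    baseline = start_pct
    merged = 0
    for drop in drops_pp:
        if drop <= tolerance_pp:   # within tolerance, so the gate passes
            baseline -= drop       # the lower score becomes the new baseline
            merged += 1
    return baseline, merged


# Six changes that each cost 1.5 percentage points (hypothetical numbers).
loose = simulate_merges(90.0, [1.5] * 6, tolerance_pp=2.0)
strict = simulate_merges(90.0, [1.5] * 6, tolerance_pp=0.0)
print(loose)   # (81.0, 6): every change passed, nine points lost in total
print(strict)  # (90.0, 0): every change was stopped at the gate
```

No single change in the loose run looks alarming, which is why the per-route tolerance deserves the same scrutiny as the eval suite itself.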
FAQ
What is a regression baseline?
A regression baseline is the recorded performance of the current production prompt and model against the golden set. Every change must match or beat this baseline, within an agreed tolerance, before merge.
When should I use a regression baseline?
For production AI features with eval pipelines.
What are the most common mistakes with a regression baseline?
Skipping the baseline, which lets regressions ship silently, and setting the tolerance too loose, which lets each change lower quality by a few percentage points until the losses add up.
Related terms
- Golden set: A golden set is the small curated collection of (input, expected output) pairs that defines 'correct' behavior for an LLM feature. It is the reference against which regression baselines, evals, A/B tests, and prompt experiments are measured. Quality and curation matter more than size.
- Evals-driven development: Evals-driven development is the discipline of writing the eval suite first, then iterating prompts and models against it, borrowing the approach of test-driven development for LLM work.
- Regression suite (LLM): A regression suite is the standing set of evals that runs on every prompt change, model upgrade, or pipeline modification. It is designed to catch quality regressions on previously working cases.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/regression-baseline.md.