AI evals and observability in 2026: the working reference
How to actually evaluate and observe LLM systems in 2026 — golden sets, rubrics, LLM-as-judge, A/B testing in production, regression catching, the tracing stack, and the antipatterns that quietly ship broken AI.
Evals and observability are what separate teams who ship reliable AI from teams who ship vibes. By 2026 the discipline has consolidated: golden sets up front, rubrics shared across humans and LLM judges, traces on every call, sampled production grading, regression alarms on prompt and model changes. This page is the working reference for engineers shipping LLM-powered features.
Why evals + observability matter more in 2026
Three things changed by 2026:
- Models silently change. Frontier APIs version-bump constantly; behaviour drifts under stable model names. Without evals you find out from users.
- Agents have many steps. A single broken step in a 12-step agent produces a quietly wrong answer. Without traces you cannot find the broken step.
- Prompt changes have non-local effects. Tweaking one rule in a system prompt can regress dozens of cases that weren't covered when you tested. Golden sets catch this.
See the evals glossary entry for the canonical definition.
Anatomy of a usable eval
A production eval has five parts:
- A golden set of 50-500 representative inputs.
- Reference outputs — either exact answers (for extraction) or a rubric (for open-ended).
- A scorer — rule-based (regex, JSON shape), LLM-as-judge, or human.
- A runner that calls the model with the prompt under test and stores each result with a version ID.
- A regression gate that compares the new run to the baseline and blocks deploy if a threshold is breached.
The golden set is the load-bearing piece. Spend the engineering time there.
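As a rough illustration of how the first four parts fit together, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the `Case` and `Result` shapes, the `exact_match` scorer, the `prompt-v7` version ID, and especially `fake_model`, which stands in for a real model call so the snippet runs offline.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    case_id: str
    input_text: str
    expected: str  # exact reference answer (extraction-style case)


@dataclass
class Result:
    case_id: str
    output: str
    passed: bool
    version_id: str  # which prompt/model version produced this output


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial differences do not fail a case."""
    return re.sub(r"\s+", " ", text).strip().lower()


def exact_match(output: str, expected: str) -> bool:
    """Rule-based scorer: normalized exact match."""
    return _normalize(output) == _normalize(expected)


def run_eval(
    cases: List[Case],
    model_fn: Callable[[str], str],
    version_id: str,
    scorer: Callable[[str, str], bool] = exact_match,
) -> List[Result]:
    """Runner: call the model on every case and tag each result with the version ID."""
    results = []
    for case in cases:
        output = model_fn(case.input_text)
        results.append(Result(case.case_id, output, scorer(output, case.expected), version_id))
    return results


def pass_rate(results: List[Result]) -> float:
    """Fraction of cases that passed (0.0 for an empty run)."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def fake_model(text: str) -> str:
    """Hypothetical stand-in for a real LLM call: extracts an ISO date if present."""
    match = re.search(r"\d{4}-\d{2}-\d{2}", text)
    return match.group(0) if match else ""


golden_set = [
    Case("c1", "Invoice dated 2026-01-15 for ACME", "2026-01-15"),
    Case("c2", "Due on 2026-02-01, net 30", "2026-02-01"),
    Case("c3", "Payment received 3 March 2026", "2026-03-03"),  # the stub cannot parse this
]

results = run_eval(golden_set, fake_model, version_id="prompt-v7")
print(f"pass rate: {pass_rate(results):.2f}")
```

In a real setup the results would be persisted (a table keyed by `version_id` is enough) so that a later run has a baseline to compare against.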
Rubrics and LLM-as-judge
For anything beyond exact-match extraction, you need a rubric. A working rubric in 2026:
- Dimensions — usually 3-6 (accuracy, completeness, tone, format adherence, safety, faithfulness for RAG).
- Scale — 1-5 per dimension is the sweet spot. Binary collapses signal; 1-10 invites noise.
- Anchors — example outputs that score 1, 3, and 5 for each dimension. Calibrates the judge.
- Aggregation — typically a weighted sum across dimensions, often with safety as a hard gate: a failing safety score fails the item regardless of the other scores.
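One possible reading of the aggregation step, sketched in Python: a weighted average over the 1-5 dimension scores, with safety treated as a gate rather than a weighted term. The dimension names, the weights, the gate minimum of 4, and the pass threshold of 3.5 are all illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Dimension:
    name: str
    weight: float
    hard_gate_min: Optional[int] = None  # a score below this fails the whole item


# Illustrative rubric: weights and the safety gate are assumptions for this sketch.
RUBRIC: List[Dimension] = [
    Dimension("accuracy", 0.4),
    Dimension("completeness", 0.3),
    Dimension("tone", 0.1),
    Dimension("format", 0.2),
    Dimension("safety", 0.0, hard_gate_min=4),  # gate only, no weight
]


def aggregate(scores: Dict[str, int], rubric: List[Dimension] = RUBRIC,
              pass_threshold: float = 3.5) -> dict:
    """Combine per-dimension 1-5 scores into a weighted score plus a pass/fail flag."""
    for dim in rubric:
        if not 1 <= scores[dim.name] <= 5:
            raise ValueError(f"{dim.name} score must be in 1..5")

    # Dimensions whose score falls below their hard-gate minimum.
    gate_failures = [
        d.name for d in rubric
        if d.hard_gate_min is not None and scores[d.name] < d.hard_gate_min
    ]

    total_weight = sum(d.weight for d in rubric)
    weighted = sum(d.weight * scores[d.name] for d in rubric) / total_weight

    return {
        "weighted_score": round(weighted, 3),
        "passed": not gate_failures and weighted >= pass_threshold,
        "gate_failures": gate_failures,
    }


good = aggregate({"accuracy": 5, "completeness": 4, "tone": 3, "format": 5, "safety": 5})
unsafe = aggregate({"accuracy": 5, "completeness": 5, "tone": 5, "format": 5, "safety": 2})
print(good)
print(unsafe)
```

Keeping the gate separate from the weights means a strong score on accuracy can never buy back a safety failure.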
LLM-as-judge is the cheap way to automate rubric scoring. Best practice: use a strong model (Claude 4.6 Sonnet, GPT-4o, Gemini 2 Pro) as the judge, give it the same rubric text the human graders use, and hand-grade about 10% of items to confirm the judge isn't drifting. Tools: Braintrust, Ragas, DeepEval, Inspect Evals.
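The hand-grading check can be as simple as comparing judge and human scores on the same items. The sketch below assumes scores are stored as dictionaries from item ID to a 1-5 score. The function names and the choice of metrics (exact agreement, within-one agreement, mean bias) are assumptions for illustration, not a standard.

```python
import random
from typing import Dict, List


def pick_calibration_sample(item_ids: List[str], fraction: float = 0.10,
                            seed: int = 0) -> List[str]:
    """Pick a reproducible subset of items for human grading (at least one item)."""
    k = max(1, round(len(item_ids) * fraction))
    return sorted(random.Random(seed).sample(list(item_ids), k))


def judge_agreement(judge_scores: Dict[str, int], human_scores: Dict[str, int]) -> dict:
    """Compare judge and human scores on the items both have graded."""
    shared = sorted(set(judge_scores) & set(human_scores))
    if not shared:
        raise ValueError("no overlapping items to compare")

    # Positive difference means the judge scored higher than the human.
    diffs = [judge_scores[i] - human_scores[i] for i in shared]
    n = len(diffs)
    return {
        "n": n,
        "exact": sum(d == 0 for d in diffs) / n,
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        "mean_bias": sum(diffs) / n,
    }


# Toy data: the judge is harsher than the human on two of four items.
judge = {"a": 4, "b": 3, "c": 5, "d": 2}
human = {"a": 4, "b": 4, "c": 5, "d": 4}
report = judge_agreement(judge, human)
print(report)

sample = pick_calibration_sample([f"item-{i}" for i in range(50)])
print(len(sample), "items selected for hand grading")
```

Tracking these numbers per eval run makes drift visible: falling agreement, or a mean bias moving away from zero, is a signal to revisit the judge prompt or the anchors.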
Tracing the LLM call graph
Every production LLM call should land in a trace with:
- Prompt version ID (system + user templates).
- Inputs and rendered prompts.
- Model name + version.
- Token counts (input, cached, output).
- Latency per stage.
- Tool calls + their arguments + their results.
- For RAG: the retrieved documents + scores.
- User feedback (thumbs, action taken).
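The fields above map naturally onto a single record per call. The following is a sketch of such a record as plain Python dataclasses, not the schema of any particular tracing tool. The field names, the content-hash approach to prompt version IDs, the `example-model-2026-01` model string, and the `lookup_order` tool are assumptions made for the example.

```python
import hashlib
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


def prompt_version_id(system_template: str, user_template: str) -> str:
    """Derive a stable ID from template content, so any edit yields a new version."""
    payload = (system_template + "\x00" + user_template).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class ToolCall:
    name: str
    arguments: dict
    result: str


@dataclass
class RetrievedDoc:
    doc_id: str
    score: float


@dataclass
class LLMTrace:
    prompt_version: str
    inputs: dict
    rendered_prompt: str
    model: str                     # model name plus version string
    input_tokens: int
    cached_tokens: int
    output_tokens: int
    latency_ms: Dict[str, float]   # stage name -> milliseconds
    tool_calls: List[ToolCall] = field(default_factory=list)
    retrieved: List[RetrievedDoc] = field(default_factory=list)
    user_feedback: Optional[str] = None  # e.g. "thumbs_up", filled in later
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize the trace, including nested tool calls and documents."""
        return json.dumps(asdict(self), sort_keys=True)


SYSTEM_TEMPLATE = "You are a support assistant. Answer using the provided documents."
USER_TEMPLATE = "Question: {question}"

trace = LLMTrace(
    prompt_version=prompt_version_id(SYSTEM_TEMPLATE, USER_TEMPLATE),
    inputs={"question": "Where is my order?"},
    rendered_prompt=USER_TEMPLATE.format(question="Where is my order?"),
    model="example-model-2026-01",  # hypothetical model identifier
    input_tokens=412,
    cached_tokens=300,
    output_tokens=58,
    latency_ms={"retrieval": 41.0, "generation": 930.0},
    tool_calls=[ToolCall("lookup_order", {"order_id": "A-1001"}, "shipped")],
    retrieved=[RetrievedDoc("kb-17", 0.82)],
)
print(trace.to_json())
```

Hashing the templates is only one way to get version IDs. A prompt registry that assigns them serves the same purpose, as long as the ID lands in every trace.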
2026 stack: Langfuse (open source), Braintrust, Phoenix (Arize), LangSmith, OpenLLMetry on top of OpenTelemetry. Pick whichever fits your existing tooling and team workflow; every option is competent.
Production sampling and grading
Offline evals do not catch drift on real-world inputs. Sample 1-5% of production traffic, run it through the rubric (LLM judge + occasional human review), and feed scores into a dashboard. When a metric trends downward, you know before users churn.
Important: production samples become the next golden set's seed material. Curate the ambiguous and the failing cases into your offline evals.
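A simple way to sample a fixed share of traffic is to hash the trace ID and keep traces whose hash falls below the sampling rate. The sketch below assumes string trace IDs and a 2% rate. The rule in `needs_curation`, which promotes anything the judge scored 3 or lower, is only an assumed stand-in for "ambiguous or failing".

```python
import hashlib


def should_sample(trace_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of traces, based on a hash of the ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def needs_curation(judge_score: int, ambiguous_at: int = 3) -> bool:
    """Assumed rule: scores at or below `ambiguous_at` are candidates for the golden set."""
    return judge_score <= ambiguous_at


trace_ids = [f"trace-{i}" for i in range(20000)]
sampled = [t for t in trace_ids if should_sample(t, rate=0.02)]
print(f"sampled {len(sampled)} of {len(trace_ids)} traces")
```

Hash-based selection is repeatable: the same trace is always in or out of the sample, which helps when re-grading a past window with an updated rubric.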
Catching regressions on prompt + model changes
Treat prompts like code: each change runs the eval suite before deploy.
- Prompt change — run the offline golden set against the new prompt vs old. Block deploy if pass rate drops more than, say, 3 percentage points.
- Model upgrade — run the suite against both models. Sometimes the "newer, better" model regresses on your specific use case.
- Tool / RAG change — same drill, plus retrieval recall/precision metrics.
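For the retrieval metrics mentioned in the last item, a common formulation is precision and recall over the top k results. The sketch below assumes each golden-set case lists its relevant document IDs and that retrieved IDs are unique. The document IDs are made up for the example.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> Tuple[float, float]:
    """Precision and recall over the top-k retrieved document IDs.

    Precision is computed over the documents actually returned in the top k,
    so a retriever that returns fewer than k results is not penalized twice.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved_ids = ["d1", "d7", "d3", "d9", "d2"]  # ranked retriever output
relevant_ids = {"d1", "d2", "d4"}               # labeled in the golden set

p3, r3 = precision_recall_at_k(retrieved_ids, relevant_ids, k=3)
p5, r5 = precision_recall_at_k(retrieved_ids, relevant_ids, k=5)
print(f"k=3: precision={p3:.2f} recall={r3:.2f}")
print(f"k=5: precision={p5:.2f} recall={r5:.2f}")
```

Averaging these per-case values across the golden set gives one number per retrieval configuration to compare before and after a change.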
Eval CI on prompts is the simplest discipline that prevents the most production embarrassment.
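The gate itself can be a few lines. This sketch assumes each run is stored as a mapping from case ID to pass/fail and uses the 3 percentage point threshold mentioned above. The function name and the returned fields are illustrative.

```python
from typing import Dict


def regression_gate(baseline: Dict[str, bool], candidate: Dict[str, bool],
                    max_drop_pp: float = 3.0) -> dict:
    """Compare pass rates on the cases both runs share; block on a large drop."""
    shared = baseline.keys() & candidate.keys()
    if not shared:
        raise ValueError("baseline and candidate share no cases")

    base_pct = 100 * sum(baseline[c] for c in shared) / len(shared)
    cand_pct = 100 * sum(candidate[c] for c in shared) / len(shared)
    drop_pp = base_pct - cand_pct

    return {
        "baseline_pct": base_pct,
        "candidate_pct": cand_pct,
        "drop_pp": drop_pp,
        "blocked": drop_pp > max_drop_pp,
        # Cases that passed before and fail now: the first place to look.
        "newly_failing": sorted(c for c in shared if baseline[c] and not candidate[c]),
    }


# Toy runs over 100 cases: baseline passes 92, candidate passes 88.
baseline_run = {f"c{i}": i >= 8 for i in range(100)}
candidate_run = {f"c{i}": i >= 12 for i in range(100)}

report = regression_gate(baseline_run, candidate_run)
print(report["drop_pp"], "pp drop; blocked:", report["blocked"])
```

In CI, the job would exit non-zero when `blocked` is true so the deploy step never runs. Listing `newly_failing` in the job output keeps the failure actionable, and it also surfaces cases that flipped even when the aggregate drop stays under the threshold.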
The 2026 evals + observability stack
- Evals frameworks: Braintrust, Ragas (RAG-specific), DeepEval, Inspect Evals (UK AISI), OpenAI evals, Patronus.
- Tracing: Langfuse, Braintrust, Phoenix, LangSmith, OpenLLMetry / OpenTelemetry-native.
- Prompt registries: Braintrust, Vellum, Langfuse, PromptLayer.
- Production grading: Patronus, Lakera, custom.
- Dashboards: rolled into the above or built in Grafana / Metabase.
A credible 2026 stack: Langfuse for traces + prompt registry, Braintrust for evals, Lakera for safety guardrails, OpenLLMetry for OTel-native shops.
Antipatterns
- No evals. Every prompt change is a hope.
- LLM-as-judge with no human calibration. The judge silently drifts; you trust noise.
- Vibes-based "spot-check". Doesn't scale, doesn't catch regressions, doesn't survive headcount turnover.
- Tracing only failures. You need samples of successes to know what good looks like.
- Mixing dev and prod prompts in the same registry without version IDs. You won't know which version answered which user.
- Treating evals as one-time launch readiness. Drift starts immediately after launch.
FAQ
What are LLM evals in 2026?
Systematic tests against a golden set of inputs scored with a rubric. Standard for any production LLM feature. See the evals glossary entry.
Should I use LLM-as-judge or human grading?
Both. LLM-as-judge for cheap automation across the whole golden set; human grading for ~10% to calibrate the judge and catch its drift.
Best tools for LLM evals and tracing in 2026?
Braintrust + Langfuse cover most needs. Ragas for RAG-specific evals. Inspect Evals from UK AISI for safety-focused work.
How big should my golden set be?
50 to start, 200-500 for serious evaluation. Quality of cases matters more than quantity — cover edge cases that have actually broken in production.
How do I catch regressions when prompts change?
Run the golden set on every prompt change. Block deploy if pass rate drops more than ~3 percentage points. Track version IDs in tracing so you can attribute regressions.
Last updated: 2026-06-01 · Author: Onur Hüseyin Koçak.