Auto-eval (LLM)
Auto-eval is the automated grading of LLM output — usually by an LLM judge with a rubric — that replaces or supplements human grading in eval suites.
As of 2026, auto-eval is the cost-effective core of a disciplined evals practice. The pattern: a strong model (Claude 4.6 Sonnet, GPT-4o, Gemini 2 Pro) takes the rubric, the candidate output, and optionally a reference output, and produces a score for each rubric dimension. Common tools include Braintrust, Ragas, DeepEval, and Inspect Evals. Pair auto-eval with periodic human spot-checks (for example, on a 10% sample) to catch judge drift, where the judge's scores gradually diverge from human judgment. Auto-eval makes it economical to run evals on every prompt change and every production sample without scaling a human-grading team.
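The judge pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not the API of any tool listed here: `RUBRIC`, `build_judge_prompt`, `grade`, the 1 to 5 scale, and the JSON reply format are all assumptions, and `call_judge` stands in for whatever function sends a prompt to your judge model and returns its text reply.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical rubric: dimension name -> what the judge should check.
RUBRIC: Dict[str, str] = {
    "accuracy": "Claims are factually correct and consistent with the reference.",
    "completeness": "All parts of the task are addressed.",
}


def build_judge_prompt(
    rubric: Dict[str, str], candidate: str, reference: Optional[str] = None
) -> str:
    """Assemble rubric + candidate + optional reference into one judge prompt."""
    lines = [
        "Score the candidate output from 1 (poor) to 5 (excellent) on each dimension.",
        "",
        "Rubric:",
    ]
    for name, description in rubric.items():
        lines.append(f"- {name}: {description}")
    lines += ["", "Candidate output:", candidate]
    if reference is not None:
        lines += ["", "Reference output:", reference]
    lines += ["", "Reply with only a JSON object mapping each dimension to an integer score."]
    return "\n".join(lines)


def grade(
    call_judge: Callable[[str], str],
    rubric: Dict[str, str],
    candidate: str,
    reference: Optional[str] = None,
) -> Dict[str, int]:
    """Ask the judge for scores and validate one score per rubric dimension."""
    raw_reply = call_judge(build_judge_prompt(rubric, candidate, reference))
    scores = json.loads(raw_reply)  # assumes the judge replied with bare JSON
    missing = set(rubric) - set(scores)
    if missing:
        raise ValueError(f"judge omitted dimensions: {sorted(missing)}")
    result: Dict[str, int] = {}
    for name in rubric:
        value = int(scores[name])
        if not 1 <= value <= 5:
            raise ValueError(f"score for {name!r} out of range: {value}")
        result[name] = value
    return result
```

Passing the judge in as a plain callable keeps the sketch independent of any vendor SDK and makes it easy to test with a stub. Real judge replies often need more tolerant parsing than a bare `json.loads`.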
When to use auto-eval (LLM)
- Any production LLM feature with evals.
- Continuous monitoring of production samples.
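For continuous monitoring, one way to pick which production samples also go to a human reviewer is a deterministic, hash-based sample. The sketch below is an assumption about how that could be done, not a prescribed method: `needs_human_review` is a hypothetical helper, and the 10% default simply mirrors the spot-check rate mentioned earlier.

```python
import hashlib


def needs_human_review(sample_id: str, rate: float = 0.10) -> bool:
    """Return True for roughly `rate` of sample IDs, stably across runs.

    Hashing the ID (instead of calling random()) means the same sample is
    always routed the same way, so re-running the pipeline does not reshuffle
    the human review queue.
    """
    digest = hashlib.sha256(sample_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


# Usage: route production samples to the human spot-check queue.
review_queue = [sid for sid in ("req-001", "req-002", "req-003") if needs_human_review(sid)]
```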
Common mistakes
- Relying on a single auto-judge without calibrating it against human grades: drift goes unnoticed.
- Writing a rubric that is too vague: the auto-judge scores everything similarly.
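Both mistakes can be surfaced with a small check run on the samples that humans also graded. The sketch below is illustrative: `calibration_report` and its field names are hypothetical, and it assumes judge and human scores are on the same numeric scale for a single rubric dimension.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def calibration_report(
    judge_scores: Sequence[float],
    human_scores: Sequence[float],
    tolerance: float = 0,
) -> Dict[str, float]:
    """Compare judge scores with human scores on the same samples."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need equal-length, non-empty score lists")
    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    return {
        # Share of samples where the judge is within `tolerance` of the human.
        "agreement_rate": sum(d <= tolerance for d in diffs) / len(diffs),
        # Average size of the disagreement; a rise over time suggests drift.
        "mean_abs_diff": mean(diffs),
        # Spread of judge scores; near zero suggests the rubric is too vague
        # for the judge to separate good outputs from bad ones.
        "judge_score_spread": pstdev(judge_scores),
    }


report = calibration_report(judge_scores=[4, 4, 4, 4], human_scores=[4, 2, 5, 4])
# report["agreement_rate"] is 0.5, report["mean_abs_diff"] is 0.75,
# and report["judge_score_spread"] is 0.0 (the judge gave every sample a 4).
```

Tracking these numbers per spot-check batch gives a baseline, so a drop in agreement or a collapse in score spread stands out. Any alert thresholds are a team decision and are not specified here.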
FAQ
What is auto-eval (LLM)?
Auto-eval is the automated grading of LLM output — usually by an LLM judge with a rubric — that replaces or supplements human grading in eval suites.
When should I use auto-eval (LLM)?
Use it for any production LLM feature that has evals, and for continuous monitoring of production samples.
What are the most common mistakes with auto-eval (LLM)?
Relying on a single auto-judge without calibration, so drift goes unnoticed, and writing a rubric so vague that the auto-judge scores everything similarly.
Related terms
- Evals (LLM evaluations) — Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.
- LLM jury — An LLM jury is an evaluation pattern where multiple LLM judges score the same output, and their scores are aggregated to reduce single-judge bias.
- A/B testing prompts — A/B testing prompts runs two prompt variants against the same input distribution and compares scored outputs, attributing quality differences to the prompt change.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/auto-eval.md.