
Auto-eval (LLM)

Auto-eval is the automated grading of LLM output, usually by an LLM judge working from a rubric, that replaces or supplements human grading in eval suites.

Auto-eval is the cost-effective core of eval practice in 2026. The pattern: a strong judge model (for example Claude 4.6 Sonnet, GPT-4o, or Gemini 2 Pro) receives the rubric, the candidate output, and sometimes a reference output, and returns a score for each rubric dimension. Common tools include Braintrust, Ragas, DeepEval, and Inspect Evals. Pair the judge with periodic human spot-checks (sampling about 10% of graded outputs) to catch judge drift. Because grading is automated, it becomes economical to run evals on every prompt change and on every production sample without scaling a human-grading team.
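The pattern above can be sketched in a few lines of Python. This is a minimal illustration, not the API of any of the tools listed: the rubric dimensions, the 1 to 5 scale, the JSON reply format, and the function names (`build_judge_prompt`, `parse_scores`, `auto_eval`, `sample_for_human_review`) are all assumptions, and `judge` stands in for whatever call sends a prompt to the judge model and returns its text reply.

```python
import json
import random
from typing import Callable, Dict, List, Optional

# Hypothetical rubric: dimension name -> what the judge should look for.
RUBRIC: Dict[str, str] = {
    "accuracy": "Claims are factually correct and consistent with the reference.",
    "completeness": "All parts of the task are addressed.",
}


def build_judge_prompt(rubric: Dict[str, str], candidate: str,
                       reference: Optional[str] = None) -> str:
    """Assemble rubric + candidate (+ optional reference) into one prompt."""
    lines = ["Score the candidate from 1 to 5 on each dimension.", "Rubric:"]
    for name, description in rubric.items():
        lines.append(f"- {name}: {description}")
    lines.append(f"Candidate output:\n{candidate}")
    if reference is not None:
        lines.append(f"Reference output:\n{reference}")
    lines.append("Reply with a JSON object mapping each dimension to an integer.")
    return "\n".join(lines)


def parse_scores(raw: str, rubric: Dict[str, str]) -> Dict[str, int]:
    """Parse the judge's JSON reply and validate one score per dimension."""
    data = json.loads(raw)
    scores: Dict[str, int] = {}
    for name in rubric:
        value = int(data[name])  # KeyError if the judge skipped a dimension
        if not 1 <= value <= 5:
            raise ValueError(f"{name} score out of range: {value}")
        scores[name] = value
    return scores


def auto_eval(candidate: str, judge: Callable[[str], str],
              rubric: Dict[str, str] = RUBRIC,
              reference: Optional[str] = None) -> Dict[str, int]:
    """Grade one candidate output; `judge` is a placeholder for the model call."""
    prompt = build_judge_prompt(rubric, candidate, reference)
    return parse_scores(judge(prompt), rubric)


def sample_for_human_review(items: List, rate: float = 0.10, seed: int = 0) -> List:
    """Pick a reproducible random subset of graded items for human spot-checks."""
    if not items:
        return []
    k = max(1, round(len(items) * rate))
    return random.Random(seed).sample(items, k)
```

Passing the judge as a plain callable keeps the sketch independent of any vendor SDK and makes it easy to swap in a stub for tests. Validating the reply strictly (every dimension present, every score in range) is a deliberate choice: a malformed judge reply then fails loudly rather than producing a silent default score.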

When to use auto-eval (LLM)

Common mistakes

FAQ

What is auto-eval (LLM)?

Auto-eval is the automated grading of LLM output, usually by an LLM judge working from a rubric, that replaces or supplements human grading in eval suites.

When should I use auto-eval (LLM)?

Use it for any production LLM feature that has an eval suite, and for continuous monitoring of sampled production outputs.

What are the most common mistakes with auto-eval (LLM)?

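Two mistakes come up most often. The first is relying on a single auto-judge that is never calibrated against human grades, so drift goes unnoticed. The second is writing a rubric so vague that the auto-judge gives nearly the same score to everything.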
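Both mistakes can be surfaced with a small check over the human spot-check sample. The sketch below assumes each spot-checked output has one judge score and one human score on the same scale. The function name `calibration_report` and the two thresholds are hypothetical and would need tuning per rubric. A large average gap between judge and human suggests drift or miscalibration; a very small spread in judge scores suggests the rubric is not discriminating between outputs.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def calibration_report(pairs: List[Tuple[int, int]],
                       max_gap: float = 0.5,       # hypothetical threshold
                       min_spread: float = 0.5     # hypothetical threshold
                       ) -> Dict[str, object]:
    """Compare (judge_score, human_score) pairs from a spot-check sample."""
    if not pairs:
        raise ValueError("need at least one (judge, human) pair")
    judge_scores = [judge for judge, _ in pairs]
    # Mean absolute difference between the judge and the human grader.
    gap = mean(abs(judge - human) for judge, human in pairs)
    # Population standard deviation of judge scores: near zero means the
    # judge is scoring everything about the same.
    spread = pstdev(judge_scores)
    return {
        "mean_abs_gap": gap,
        "judge_spread": spread,
        "possible_drift": gap > max_gap,
        "possible_vague_rubric": spread < min_spread,
    }
```

Running this on every spot-check batch and tracking `mean_abs_gap` over time is one simple way to notice when a judge that used to agree with humans no longer does.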

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/auto-eval.md.