LLM jury
An LLM jury is an evaluation pattern where multiple LLM judges score the same output, and their scores are aggregated to reduce single-judge bias.
Single-judge LLM evaluation suffers from systematic bias: a judge model tends to favour outputs that resemble its own. An LLM jury runs 3-5 judges from different model families (for example Claude, GPT, and Gemini) and combines their scores by averaging or majority vote, so that no single model's preferences dominate the result. The technique adds cost (N judges means roughly N× the judge tokens), but it is a comparatively cheap way to make automated evals more reliable when human grading isn't feasible. By 2026 it's standard practice in serious eval pipelines: Braintrust, Ragas, and Patronus all support jury setups out of the box.
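The aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's API: the juror names (`judge_a`, `judge_b`, `judge_c`), the 1-5 score scale, and the `pass`/`fail` verdict labels are assumptions made for the example, and the calls to the judge models themselves are left out.

```python
from collections import Counter
from statistics import mean


def aggregate_mean(scores: dict[str, float]) -> float:
    """Average the numeric scores given by all jurors."""
    return mean(list(scores.values()))


def aggregate_majority(verdicts: dict[str, str]) -> str:
    """Return the verdict chosen by the most jurors.

    A tie raises an error instead of being resolved silently,
    so the caller has to decide how to handle it.
    """
    counts = Counter(verdicts.values()).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        raise ValueError(f"jurors are tied: {dict(counts)}")
    return counts[0][0]


# Hypothetical jury output for a single evaluated response.
scores = {"judge_a": 4.0, "judge_b": 3.0, "judge_c": 5.0}
verdicts = {"judge_a": "pass", "judge_b": "fail", "judge_c": "pass"}

print(aggregate_mean(scores))        # 4.0
print(aggregate_majority(verdicts))  # pass
```

Using an odd number of jurors avoids ties on binary verdicts. Averaging suits graded scores, while majority voting suits categorical labels.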
When to use an LLM jury
- Production evals where bias from a single judge would skew results.
- High-stakes A/B testing of prompts or models.
Common mistakes
- Using the same model family for all jurors — defeats the bias reduction.
- Aggregating disagreeing jurors blindly — investigate when they diverge.
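Both mistakes above can be caught with simple checks before scores are aggregated. The sketch below is illustrative only: the function names, the family labels, and the `max_spread` threshold of 1.0 on a 1-5 scale are assumptions chosen for the example, not values from any specific eval framework.

```python
def lacks_family_diversity(juror_families: dict[str, str]) -> bool:
    """True when every juror comes from the same model family."""
    return len(set(juror_families.values())) < 2


def divergent_items(
    scores_by_item: dict[str, list[float]], max_spread: float = 1.0
) -> list[str]:
    """IDs of items whose juror scores span more than max_spread (max minus min)."""
    return [
        item_id
        for item_id, scores in scores_by_item.items()
        if max(scores) - min(scores) > max_spread
    ]


# Hypothetical jury configuration and per-item juror scores.
jurors = {"judge_a": "family_x", "judge_b": "family_y", "judge_c": "family_z"}
scores_by_item = {
    "ex1": [4.0, 4.0, 5.0],  # spread 1.0: jurors broadly agree
    "ex2": [1.0, 5.0, 3.0],  # spread 4.0: worth a manual look
}

print(lacks_family_diversity(jurors))   # False
print(divergent_items(scores_by_item))  # ['ex2']
```

Items flagged this way are candidates for human review rather than automatic aggregation.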
FAQ
What is an LLM jury?
An LLM jury is an evaluation pattern where multiple LLM judges score the same output, and their scores are aggregated to reduce single-judge bias.
When should I use an LLM jury?
Production evals where bias from a single judge would skew results. High-stakes A/B testing of prompts or models.
What are the most common mistakes with an LLM jury?
Using the same model family for all jurors — defeats the bias reduction. Aggregating disagreeing jurors blindly — investigate when they diverge.
Related terms
- Evals (LLM evaluations) — Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.
- Self-consistency — Self-consistency runs the same prompt multiple times at non-zero temperature and picks the most common final answer, raising accuracy on reasoning tasks.
- Mixture of agents — Mixture of agents is an inference pattern where multiple specialised LLM agents run in parallel and an aggregator combines their outputs into a single answer, aiming for higher quality than any single agent at higher cost.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/llm-jury.md.