technique

LLM jury

An LLM jury is an evaluation pattern where multiple LLM judges score the same output, and their scores are aggregated to reduce single-judge bias.

Single-judge LLM evaluation suffers from systematic bias: a model tends to favour outputs that resemble its own. An LLM jury runs 3-5 judges from different model families (for example Claude, GPT, and Gemini) and combines their scores by averaging or majority vote, so that no single model's preferences dominate the result. The technique adds cost (roughly N× the judge tokens for N judges), but it is one of the cheapest ways to make automated evals more reliable when human grading isn't feasible. By 2026 it's standard practice in serious eval pipelines: Braintrust, Ragas, and Patronus all support jury setups out of the box.
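The aggregation step described above can be sketched in a few lines. This is a minimal illustration, not the API of any particular eval framework: the `Judge` type, the function names, and the stub judges are all hypothetical, and in a real pipeline each judge would wrap a call to a model from a different family.

```python
from statistics import mean
from typing import Callable, Dict

# Hypothetical type: a judge maps a candidate output to a score in [0, 1].
Judge = Callable[[str], float]


def jury_mean(judges: Dict[str, Judge], output: str) -> float:
    """Average the scores that all judges assign to one output."""
    return mean(judge(output) for judge in judges.values())


def jury_majority(judges: Dict[str, Judge], output: str,
                  threshold: float = 0.5) -> bool:
    """Pass/fail by strict majority: a judge votes 'pass' if score >= threshold."""
    votes = [judge(output) >= threshold for judge in judges.values()]
    return sum(votes) * 2 > len(votes)


# Stub judges standing in for three model families (fixed scores, for illustration only).
stub_judges: Dict[str, Judge] = {
    "claude": lambda text: 0.8,
    "gpt": lambda text: 0.6,
    "gemini": lambda text: 0.4,
}

average_score = jury_mean(stub_judges, "candidate answer")
passed = jury_majority(stub_judges, "candidate answer")
```

Averaging suits graded scores, while majority voting suits pass/fail checks. An odd number of judges avoids ties in the vote.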

When to use an LLM jury

Common mistakes

FAQ

What is an LLM jury?

An LLM jury is an evaluation pattern where multiple LLM judges score the same output, and their scores are aggregated to reduce single-judge bias.

When should I use an LLM jury?

Use it for production evals where bias from a single judge would skew results, and for high-stakes A/B testing of prompts or models.

What are the most common mistakes with an LLM jury?

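Using the same model family for all jurors, which defeats the bias reduction because the jurors share the same preferences. Aggregating disagreeing jurors blindly: when scores diverge, investigate the item rather than averaging the disagreement away.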
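Both mistakes can be caught with simple checks before and after scoring. The sketch below is illustrative only: the function names, the juror-to-family mapping, and the `max_spread` value of 0.3 are assumptions, and a sensible spread threshold depends on the scoring scale in use.

```python
from typing import Dict


def distinct_families(juror_families: Dict[str, str]) -> bool:
    """Return True if no two jurors come from the same model family."""
    families = list(juror_families.values())
    return len(set(families)) == len(families)


def needs_review(scores: Dict[str, float], max_spread: float = 0.3) -> bool:
    """Flag an item when the gap between the highest and lowest
    juror score exceeds max_spread (an assumed, tunable threshold)."""
    return max(scores.values()) - min(scores.values()) > max_spread


# Hypothetical jury configuration and per-item scores.
jury = {"juror_a": "claude", "juror_b": "gpt", "juror_c": "gemini"}
item_scores = {"juror_a": 0.9, "juror_b": 0.2, "juror_c": 0.8}

if not distinct_families(jury):
    raise ValueError("jurors share a model family; bias reduction is weakened")

if needs_review(item_scores):
    print("jurors disagree; inspect this item before aggregating")
```

Items flagged this way are good candidates for a human spot check, since a large spread often points to an ambiguous rubric or a hard example.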

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/llm-jury.md.