# LLM jury

**Source:** https://promtable.com/glossary/llm-jury

> An LLM jury is an evaluation pattern where multiple LLM judges score the same output, and their scores are aggregated to reduce single-judge bias.

---

Single-judge LLM evaluation suffers from systematic bias: a judge model tends to favour outputs that look like its own. An LLM jury runs 3-5 judges from different model families (for example Claude, GPT, and Gemini) and aggregates their verdicts, averaging numeric scores or taking a majority vote on categorical labels, so that no single model's preferences dominate the result. The technique adds cost (N× the judge tokens for N jurors) but is the cheapest way to make automated evals more reliable when human grading isn't feasible. By 2026 it's standard practice in serious eval pipelines: Braintrust, Ragas, and Patronus all support jury setups out of the box.
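The two aggregation strategies mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the API of any particular eval framework: the function names `aggregate_mean` and `aggregate_majority`, the juror keys, and the 1-5 score scale are all assumptions made for the example, and the judge calls themselves are left out.

```python
from collections import Counter
from statistics import mean
from typing import Mapping, Optional


def aggregate_mean(scores: Mapping[str, float]) -> float:
    """Average the numeric scores given by all jurors."""
    return mean(list(scores.values()))


def aggregate_majority(labels: Mapping[str, str]) -> Optional[str]:
    """Return the label chosen by a strict majority of jurors.

    Returns None when no label has more than half the votes,
    so the caller can treat the item as undecided.
    """
    counts = Counter(labels.values())
    label, votes = counts.most_common(1)[0]
    return label if votes > len(labels) / 2 else None


# Hypothetical juror outputs for one evaluated response (assumed 1-5 scale).
scores = {"claude_judge": 4.0, "gpt_judge": 3.0, "gemini_judge": 5.0}
labels = {"claude_judge": "pass", "gpt_judge": "pass", "gemini_judge": "fail"}

jury_score = aggregate_mean(scores)
jury_label = aggregate_majority(labels)
```

Averaging suits graded scores, while majority voting suits pass/fail or A-vs-B verdicts. Using an odd number of jurors avoids ties for binary labels.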

## When to use

- Production evals where bias from a single judge would skew results.
- High-stakes A/B testing of prompts or models.

## Common mistakes

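- Using the same model family for all jurors: jurors that share a family tend to share the same bias, which defeats the bias reduction.
- Aggregating disagreeing jurors blindly: when jurors diverge, investigate the case instead of letting the average hide the disagreement.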
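Both mistakes can be guarded against with simple checks before a jury score is reported. The sketch below is one possible approach under stated assumptions: the helper names (`has_diverse_families`, `jury_verdict`), the family labels, and the `max_spread` threshold of 1.0 are hypothetical and should be tuned to the scoring scale in use.

```python
from statistics import mean
from typing import Mapping


def has_diverse_families(juror_families: Mapping[str, str]) -> bool:
    """True when the jurors come from more than one model family."""
    return len(set(juror_families.values())) > 1


def jury_verdict(scores: Mapping[str, float], max_spread: float = 1.0) -> dict:
    """Aggregate juror scores and flag the item when jurors diverge.

    `max_spread` is an assumed threshold on the gap between the
    highest and lowest score; items above it go to manual review.
    """
    values = list(scores.values())
    spread = max(values) - min(values)
    return {
        "mean": mean(values),
        "spread": spread,
        "needs_review": spread > max_spread,
    }


# Hypothetical jury configuration and scores for two evaluated outputs.
families = {"judge_a": "claude", "judge_b": "gpt", "judge_c": "gemini"}
agreeing = jury_verdict({"judge_a": 4.0, "judge_b": 4.5, "judge_c": 4.0})
diverging = jury_verdict({"judge_a": 1.0, "judge_b": 5.0, "judge_c": 4.0})
```

Here `diverging` is flagged for review because its scores span four points, while `agreeing` is reported as is.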

## Related terms

- [evals](https://promtable.com/glossary/evals)
- [self-consistency](https://promtable.com/glossary/self-consistency)
- [mixture-of-agents](https://promtable.com/glossary/mixture-of-agents)

*Last updated: 2026-06-01*

---

Original page: https://promtable.com/glossary/llm-jury
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/llm-jury".
Contact: info@vibecodingturkey.com.