SWE-bench
SWE-bench is the standard benchmark for autonomous coding agents. Each task pairs a real GitHub issue from a popular Python repository with the commit that actually fixed it, and the agent must produce a patch that passes a test suite it never sees.
SWE-bench was introduced in 2023 by researchers at Princeton and quickly became the default leaderboard for software-engineering agents. Each task consists of an issue from a real open-source repository, a snapshot of that repository from before the fix, and a hidden test suite. The agent receives the issue and the repository and must produce a diff that makes the hidden tests pass.
Several variants exist:
- SWE-bench Lite: a subset filtered toward tasks that are more tractable to solve.
- SWE-bench Verified: a subset whose tasks were vetted by human reviewers.
- SWE-bench Multilingual: extends the task format to other programming languages.
As of 2026, top scores on Verified cluster around 60-70%, while humans reach roughly 95%. Production caveat: a high SWE-bench score is necessary but not sufficient. Agents that score well can still fail on real-world tickets because of dependency setup, multi-repo context, or ambiguous specs.
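The pass criterion described above can be sketched in a few lines. The published dataset separates tests that the fix should turn from failing to passing from tests that must keep passing, and a task counts as resolved only when both groups pass. The `TaskInstance` class, the `is_resolved` function, and the test names below are hypothetical illustrations, not the official evaluation harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """Hypothetical, trimmed-down view of one benchmark task."""
    instance_id: str
    fail_to_pass: tuple[str, ...]  # tests the fix is expected to turn green
    pass_to_pass: tuple[str, ...]  # tests that must stay green after the patch


def is_resolved(task: TaskInstance, results: dict[str, bool]) -> bool:
    """Return True only if every required test passed after applying the patch.

    `results` maps test name -> passed. A test missing from `results`
    is treated as a failure.
    """
    required = task.fail_to_pass + task.pass_to_pass
    return all(results.get(name, False) for name in required)


# Made-up task and test names, for illustration only.
task = TaskInstance(
    instance_id="example__repo-1",
    fail_to_pass=("test_bug_is_fixed",),
    pass_to_pass=("test_existing_behaviour",),
)

fixed = is_resolved(
    task, {"test_bug_is_fixed": True, "test_existing_behaviour": True}
)
regressed = is_resolved(
    task, {"test_bug_is_fixed": True, "test_existing_behaviour": False}
)
print(fixed, regressed)
```

Note that a patch which fixes the reported bug but breaks an existing test does not count as resolved, which is why the second call returns `False`.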
When to use SWE-bench
- Comparing autonomous coding agents.
- Tracking progress over time.
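As a rough sketch of how such a comparison might look, the snippet below computes a resolve rate for two agent runs and lists the tasks only one of them solved. It assumes each run has already been reduced to a mapping from task ID to a resolved flag. The function names, task IDs, and results are made up for illustration and do not come from any real leaderboard.

```python
def resolve_rate(outcomes: dict[str, bool]) -> float:
    """Percentage of tasks resolved; 0.0 for an empty run."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes.values()) / len(outcomes)


def solved_only_by(first: dict[str, bool], second: dict[str, bool]) -> list[str]:
    """Tasks attempted by both runs that `first` resolved and `second` did not."""
    shared = sorted(first.keys() & second.keys())
    return [task_id for task_id in shared if first[task_id] and not second[task_id]]


# Hypothetical per-task outcomes for two agents on the same task set.
run_a = {"t1": True, "t2": False, "t3": True, "t4": True}
run_b = {"t1": True, "t2": True, "t3": False, "t4": False}

rate_a = resolve_rate(run_a)
rate_b = resolve_rate(run_b)
a_only = solved_only_by(run_a, run_b)
b_only = solved_only_by(run_b, run_a)

print(f"agent A: {rate_a:.1f}%  agent B: {rate_b:.1f}%")
print("solved only by A:", a_only)
print("solved only by B:", b_only)
```

Comparing on the shared task set matters: rates computed on different variants or different subsets are not directly comparable. The same per-task breakdown can be kept across releases of one agent to track progress over time.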
Common mistakes
- Treating a SWE-bench score as proof of production capability; real tickets are harder.
Related terms
- Autonomous coder — An autonomous coder is an LLM agent that accepts a high-level task (a ticket, an issue, a feature request) and produces a working PR without step-by-step human guidance — Devin, OpenHands, Sweep, SWE-Agent, Claude Code's agent mode are 2026 examples.
- Evals (LLM evaluations) — Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.
- Agent tracing — Agent tracing captures the full execution graph of an agent run — every step, every tool call, every model output — so engineers can debug, audit, and improve the agent over time.
Sources
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/swe-bench.md.