
SWE-bench

SWE-bench is the standard benchmark for autonomous coding agents. Each task pairs a real GitHub issue from a popular Python repository with the commit that actually fixed it, and the agent must produce a patch that passes the hidden test suite.

SWE-bench was introduced in 2023 by researchers at Princeton and quickly became the de facto leaderboard for software-engineering agents. Each task consists of an issue from a real open-source repository, the repository state at that time, and a hidden test suite. The agent receives the issue and the repository and must produce a diff that makes the hidden tests pass.

Variants: SWE-bench Lite (a subset filtered for solvable tasks), SWE-bench Verified (human-vetted tasks), and SWE-bench Multilingual (the same format extended beyond Python).

As of 2026, top scores cluster around 60-70% on Verified, while human developers are put at roughly 95%.

Production caveat: a high SWE-bench score is necessary but not sufficient. Agents that score well can still fail on real-world tickets because of dependency setup, multi-repo context, or ambiguous specs.
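
The pass/fail rule described above can be made concrete with a small sketch. This is not the official evaluation harness: the `Task` class, its field names, and the split into tests that must start passing versus tests that must keep passing are assumptions made for illustration, and the task data is invented.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One benchmark task. Field names are hypothetical, chosen for this sketch."""
    instance_id: str
    issue_text: str
    fail_to_pass: list[str]  # hidden tests the patch must turn from failing to passing
    pass_to_pass: list[str]  # hidden tests that must keep passing after the patch


def is_resolved(task: Task, test_results: dict[str, bool]) -> bool:
    """A task counts as resolved only if every required hidden test passes."""
    required = task.fail_to_pass + task.pass_to_pass
    # A test that is missing from the results is treated as a failure.
    return all(test_results.get(name, False) for name in required)


def resolve_rate(tasks: list[Task], results_by_task: dict[str, dict[str, bool]]) -> float:
    """Fraction of tasks resolved; this is the headline score on a leaderboard."""
    if not tasks:
        return 0.0
    resolved = sum(
        is_resolved(task, results_by_task.get(task.instance_id, {})) for task in tasks
    )
    return resolved / len(tasks)


# Invented example data: two tasks, one resolved and one not.
tasks = [
    Task("demo-1", "Crash on empty input", ["test_fix"], ["test_existing"]),
    Task("demo-2", "Wrong rounding", ["test_bug"], []),
]
results = {
    "demo-1": {"test_fix": True, "test_existing": True},
    "demo-2": {"test_bug": False},
}

print(resolve_rate(tasks, results))  # 0.5
```

The all-or-nothing rule is the point of the sketch: a patch that fixes the reported bug but breaks an existing test does not count as resolved.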

When to use SWE-bench

Use it to compare autonomous coding agents on a common task set and to track their progress over time.

Common mistakes

Treating a SWE-bench score as a measure of production capability. Real tickets are harder than benchmark tasks.

FAQ

What is SWE-bench?

SWE-bench is the standard benchmark for autonomous coding agents. Each task pairs a real GitHub issue from a popular Python repository with the commit that actually fixed it, and the agent must produce a patch that passes the hidden test suite.

When should I use SWE-bench?

Use it to compare autonomous coding agents on a common task set and to track their progress over time.
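
As a rough sketch of such a comparison, the snippet below scores two agents on the same task set and lists the tasks only one of them solved. The function name `compare_agents`, the task IDs, and the result sets are all hypothetical. The sketch assumes each agent's run has already been reduced to a set of resolved task IDs.

```python
def compare_agents(
    task_ids: set[str], resolved_a: set[str], resolved_b: set[str]
) -> dict[str, object]:
    """Compare two agents on one shared task set (hypothetical helper)."""
    if not task_ids:
        raise ValueError("task_ids must not be empty")
    # Ignore any resolved IDs that are not part of the shared task set.
    a = resolved_a & task_ids
    b = resolved_b & task_ids
    total = len(task_ids)
    return {
        "rate_a": len(a) / total,
        "rate_b": len(b) / total,
        "only_a": sorted(a - b),  # tasks solved by agent A but not by agent B
        "only_b": sorted(b - a),  # tasks solved by agent B but not by agent A
    }


# Invented example data.
task_ids = {"t1", "t2", "t3", "t4"}
agent_a = {"t1", "t2", "t3"}
agent_b = {"t2", "t4"}

report = compare_agents(task_ids, agent_a, agent_b)
print(report)
```

Looking at the per-task differences, rather than only the two rates, shows whether one agent strictly dominates the other or whether they solve different kinds of tasks.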

What are the most common mistakes with SWE-bench?

Reading SWE-bench scores as a measure of production capability. Real tickets are harder: they can involve dependency setup, multi-repo context, or ambiguous specs.

Sources

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/swe-bench.md.