Claude Opus 4.5 scores 80.9% on SWE-Bench Verified and 45.9% on SWE-Bench Pro. Same model, half the score. The difference: Verified's 500 Python-only tasks are contaminated. Pro's 1,865 multi-language tasks are not.
Below are the latest rankings from Scale AI's SEAL leaderboard (standardized scaffolding), agent systems with custom scaffolding, and SWE-Bench Verified. The SEAL leaderboard is the controlled comparison. Agent system scores show what happens when scaffolding is optimized.
Chart: SWE-Bench Pro, SEAL Leaderboard (Top 10). Standardized scaffolding, 250-turn limit, 731 public tasks. Source: Scale AI SEAL Leaderboard; all models uncapped cost.
SEAL Leaderboard: SWE-Bench Pro (Standardized Scaffolding)
Scale AI runs every model through identical tooling with a 250-turn limit. This isolates raw model capability from scaffolding quality. SEAL stands for Scale's Evaluation and Assessment Lab. Scores below are from the public set (731 tasks).
The top 6 models are separated by 4.9 percentage points, and their 95% confidence intervals overlap for every adjacent pair, so the top six ranks are statistically close. The gap widens below rank 10, where models drop below 30%.
| Rank | Model | Score | CI |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 45.9% | ±3.60 |
| 2 | Claude Sonnet 4.5 | 43.6% | ±3.60 |
| 3 | Gemini 3 Pro | 43.3% | ±3.60 |
| 4 | Claude Sonnet 4 | 42.7% | ±3.59 |
| 5 | GPT-5 (High) | 41.8% | ±3.49 |
| 6 | GPT-5.2 Codex | 41.0% | ±3.57 |
| 7 | Claude Haiku 4.5 | 39.5% | ±3.55 |
| 8 | Qwen3 Coder 480B | 38.7% | ±3.55 |
| 9 | MiniMax 2.1 | 36.8% | ±3.55 |
| 10 | Gemini 3 Flash | 34.6% | ±3.55 |
| 11 | GPT-5.2 | 29.9% | ±2.15 |
| 12 | Kimi K2 Instruct | 27.7% | ±3.25 |
| 13 | Qwen3 235B | 21.4% | ±2.25 |
| 14 | GPT-OSS 120B | 16.2% | ±2.67 |
| 15 | DeepSeek V3p2 | 15.6% | ±2.63 |
| 16 | Gemma 3 27B | 11.4% | ±2.15 |
| 17 | Llama 3.1 405B | 11.2% | ±2.15 |
| 18 | GLM-4.6 | 9.7% | ±2.15 |
| 19 | Llama 4 Maverick 17B | 5.2% | ±1.24 |
| 20 | Codestral (2405) | 1.5% | ±1.51 |
Source: Scale AI SEAL Leaderboard. All models use uncapped cost with 250-turn limit unless noted. CI = 95% confidence interval.
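The "statistically close" claim is easy to verify directly from the table. This is a sketch using the scores and confidence intervals copied from the top six rows; the overlap test is the simple interval check, not a formal significance test.

```python
# Scores and 95% CIs from the SEAL leaderboard table above (top 6 models).
leaderboard = [
    ("Claude Opus 4.5", 45.9, 3.60),
    ("Claude Sonnet 4.5", 43.6, 3.60),
    ("Gemini 3 Pro", 43.3, 3.60),
    ("Claude Sonnet 4", 42.7, 3.59),
    ("GPT-5 (High)", 41.8, 3.49),
    ("GPT-5.2 Codex", 41.0, 3.57),
]

def intervals_overlap(higher, lower):
    """Two (name, score, ci) entries overlap when the lower bound of the
    higher-ranked model falls below the upper bound of the lower-ranked one."""
    return (higher[1] - higher[2]) <= (lower[1] + lower[2])

# Every adjacent pair in the top six overlaps.
for hi, lo in zip(leaderboard, leaderboard[1:]):
    print(f"{hi[0]} vs {lo[0]}: overlap = {intervals_overlap(hi, lo)}")
```

Even rank 1 vs rank 6 overlaps (45.9 − 3.60 = 42.3 is below 41.0 + 3.57 = 44.57), which is why single-run rankings in this band should be read loosely.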
Agent Systems Leaderboard (Custom Scaffolding)
Agent systems bring their own scaffolding: the framework wrapping the model, including tool access, context management, and turn limits. These scores are not directly comparable to SEAL scores because the scaffolding varies. All scores are on the SWE-Bench Pro public set (731 tasks).
The scaffolding gap is the most underappreciated finding in this data. Three different agent systems ran the same model (Opus 4.5), and their scores ranged from 49.8% to 51.8%. That 2-point spread, 15 tasks out of 731, comes entirely from how the agent manages context and tool calls.
| Agent | Base Model | Score | Source |
|---|---|---|---|
| GPT-5.3-Codex (CLI) | GPT-5.3-Codex | 57.0% | OpenAI |
| Auggie | Opus 4.5 | 51.8% | Augment Code |
| Cursor | Opus 4.5 | 50.2% | Cursor |
| Claude Code | Opus 4.5 | 49.8% | Anthropic |
Scores are self-reported by each team. Opus 4.5 scores 45.9% on SEAL but 49.8-51.8% with custom scaffolding, a 4-6 point lift driven largely by better context retrieval and tool orchestration.
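The "15 tasks out of 731" figure for the Opus 4.5 scaffolding spread is straightforward to recompute from the numbers above:

```python
# Convert the Opus 4.5 scaffolding spread (49.8% to 51.8% across three agent
# systems) into a task count on the 731-task public set.
PUBLIC_TASKS = 731
spread_pts = 51.8 - 49.8                      # best vs. worst custom scaffold
tasks = round(spread_pts / 100 * PUBLIC_TASKS)
print(tasks)  # 15 tasks decided purely by scaffolding
```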
WarpGrep Impact on SWE-Bench Pro (Morph Internal)
Self-reported data
The scores below are from Morph's internal benchmark runs, not from the SEAL leaderboard or independent third parties. They show the effect of adding WarpGrep v2 as a search subagent to existing coding agents.
Chart: SWE-Bench Pro, with vs. without WarpGrep v2. Morph internal benchmarks, public set (731 tasks).
WarpGrep v2 adds 2.1-2.2 points to every model tested.
WarpGrep v2 is an RL-trained search subagent that runs in its own context window. It issues up to 8 parallel tool calls per turn and returns only the relevant file spans. The main coding model never sees files WarpGrep rejected, so its context stays clean.
With Opus 4.6, adding WarpGrep v2 cuts cost by 15.6% and time by 28%. The expensive model spends fewer tokens on search and more on code generation.
SWE-Bench Verified Leaderboard (2026)
SWE-Bench Verified is a human-validated subset of 500 Python-only tasks from the original SWE-Bench. It remains widely cited, but OpenAI has stopped reporting Verified scores after finding that every frontier model showed training data contamination on the dataset.
| Rank | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.5 | 80.9% |
| 2 | Claude Opus 4.6 | 80.8% |
| 3 | MiniMax M2.5 (open-weight) | 80.2% |
| 4 | GPT-5.2 | 80.0% |
| 5 | Gemini 3 Flash | 78.0% |
| 6 | GLM-5 | 77.8% |
| 7 | Claude Sonnet 4.5 | 77.2% |
| 8 | Kimi K2.5 | 76.8% |
| 9 | Gemini 3 Pro | 76.2% |
| 10 | GPT-5.1 | 74.9% |
| 11 | Grok 4 | 73.5% |
| 12 | Claude Haiku 4.5 | 73.3% |
| 13 | DeepSeek V3.2 | 73.0% |
| 14 | Claude Sonnet 4 | 72.7% |
| 15 | Qwen3 Coder Next | 70.6% |
Verified scores are self-reported by model providers. Scaffold and harness differences affect results. Source: aggregated from swebench.com and provider announcements.
SWE-Bench Variants Comparison
SWE-Bench has expanded into multiple benchmark variants. Each targets different aspects of software engineering evaluation.
| Variant | Tasks | Languages | Top Score | Status |
|---|---|---|---|---|
| Original (Full) | 2,294 | Python | ~65% | Active |
| Lite | 300 | Python | ~55% | Active |
| Verified | 500 | Python | 80.9% | Contaminated |
| Pro | 1,865 | Py, Go, TS, JS | ~59% | Active (recommended) |
| Multilingual | 300 | 9 languages | ~45% | Active |
| Live | 1,565+ | Multiple | ~40% | Monthly updates |
Top scores are approximate and represent the best-known agent system result for each variant. "Contaminated" means OpenAI confirmed that frontier models have been trained on the test data.
SWE-Bench Pro vs SWE-Bench Verified
SWE-Bench Verified was the previous gold standard: a human-validated subset of 500 tasks from the original SWE-Bench. Pro was designed to fix its limitations.
| Dimension | SWE-Bench Verified | SWE-Bench Pro |
|---|---|---|
| Tasks | 500 | 1,865 |
| Repositories | 12 (all Python) | 41 (Python, Go, TS, JS) |
| Avg lines changed | 11 (median: 4) | 107.4 |
| Avg files changed | ~1 | 4.1 |
| Top score (Mar 2026) | 80.9% (Claude Opus 4.5) | ~59% (agent systems) |
| Contamination resistance | Low: all public repos | High: GPL + proprietary code |
| Task clarity | Ambiguous issues removed | Ambiguous issues clarified with human context |
The difference in task complexity is stark. 161 of SWE-Bench Verified's 500 tasks require only 1-2 lines of change. Every SWE-Bench Pro task requires at least 10 lines. Over 100 tasks require more than 100 lines. These are tasks that would take a professional engineer hours to days.
Contamination confirmed
OpenAI's audit found that every frontier model tested (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) could reproduce verbatim gold patches or problem statement specifics for certain SWE-Bench Verified tasks. They also found that 59.4% of the hardest unsolved problems had flawed test cases. OpenAI has stopped reporting Verified scores and recommends SWE-Bench Pro instead.
How SWE-Bench Pro Works
SWE-Bench Pro contains 1,865 tasks across 41 actively maintained repositories spanning Python, Go, TypeScript, and JavaScript. The tasks come from real commit histories: consecutive commits where one resolves a bug or adds a feature, paired with tests that demonstrate the fix.
Three Subsets
Public Set (731 tasks)
Tasks from 11 GPL-licensed repositories, openly available on HuggingFace. This is the primary evaluation target for all leaderboard submissions.
Commercial Set (276 tasks)
Tasks from 18 proprietary startup codebases, acquired through partnerships with Scale AI. Not publicly accessible, providing additional contamination resistance.
Held-Out Set (858 tasks)
Tasks from 12 repositories reserved for overfitting detection. Scale AI can release these to verify whether improvements on the public set generalize to unseen code.
Three-Stage Human Augmentation
Each task goes through a rigorous annotation process:
- Problem statement creation: original commit messages and issue discussions are synthesized into clear, structured descriptions
- Requirements definition: annotators create specification lists grounded in unit tests and gold patches (the reference solution code), detailing expected behavior without prescribing implementation
- Interface specification: class and function signatures are documented to prevent false negatives from naming mismatches
Evaluation methodology
Evaluation uses containerized, language-specific environments. Each task must pass fail2pass tests (tests that should fail before the fix and pass after, verifying the issue is resolved) and pass2pass tests (existing tests that must continue to pass, ensuring the fix does not break other functionality). Gold patches are validated across 3 test runs before inclusion in the benchmark.
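The acceptance rule combining the two test categories is simple to state in code. A minimal sketch, assuming each test is a zero-argument callable returning True on pass; the real harness runs containerized, language-specific test commands instead:

```python
def evaluate_patch(fail2pass: list, pass2pass: list) -> bool:
    """A patch is accepted only if every fail2pass test now passes (the issue
    is resolved) AND every pass2pass test still passes (nothing else broke)."""
    issue_fixed = all(test() for test in fail2pass)
    nothing_broken = all(test() for test in pass2pass)
    return issue_fixed and nothing_broken


def gold_patch_is_stable(run_task, runs: int = 3) -> bool:
    """Benchmark inclusion check: the gold patch must pass evaluation across
    3 runs, ruling out flaky tests."""
    return all(run_task() for _ in range(runs))
```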
Why Scores Are So Much Lower Than SWE-Bench Verified
The drop from 80.9% (Verified) to 57-59% (Pro) reflects four factors that compound on each other.
Multi-File Modifications
SWE-Bench Verified is largely a single-file benchmark. Most fixes touch one file with a few lines changed. SWE-Bench Pro tasks require coordinating changes across an average of 4.1 files. The agent needs to understand how a change in one file affects behavior in three others.
Longer Time Horizons
Pro tasks would take a professional engineer hours to days. The agent must maintain coherent plans across many steps, managing context and state throughout.
Codebase Complexity
Pro repositories are production systems: business applications, B2B services, developer tools. They have complex build systems, cross-cutting concerns, and domain-specific conventions that an agent must navigate.
Contamination Resistance
Models cannot rely on having seen the code before. The GPL licensing and proprietary repos mean agents must genuinely reason about unfamiliar codebases, not recall solutions from training data.
Failure mode analysis
Scale AI's analysis of agent trajectories reveals where models break down: semantic understanding failures (35.9% of Opus 4.1 failures), context overflow (35.6% of Sonnet 4 failures), and tool-use inefficiency (42% of smaller model failures). Context overflow is the dominant failure mode for the strongest models, which aligns with research showing coding agents spend 60%+ of their time searching for context.
Frequently Asked Questions
What is SWE-Bench Pro?
SWE-Bench Pro is a software engineering benchmark by Scale AI that evaluates AI coding agents on 1,865 long-horizon tasks from 41 real repositories across Python, Go, TypeScript, and JavaScript. Tasks require an average of 107 lines of changes across 4.1 files.
What is Claude Opus 4.5's SWE-Bench Pro score?
Claude Opus 4.5 scores 45.9% on the SEAL leaderboard with standardized scaffolding, the highest of any model. On SWE-Bench Verified (a different, contaminated benchmark), Opus 4.5 leads at 80.9%. When paired with WarpGrep v2 as a search subagent, Opus 4.6 reaches 57.5% on Pro (Morph internal benchmark).
What is GPT-5.3-Codex's SWE-Bench Pro score?
GPT-5.3-Codex scores 57% on SWE-Bench Pro according to OpenAI's published results. On the SEAL leaderboard with standardized scaffolding, GPT-5 (High) scores 41.8% and GPT-5.2 Codex scores 41.0%. The gap shows the impact of OpenAI's custom agent scaffolding vs. Scale AI's standardized environment.
What is the difference between SEAL scores and agent system scores?
SEAL scores use Scale AI's unified scaffolding (SEAL = Scale's Evaluation and Assessment Lab) with a 250-turn limit, providing a controlled comparison across models. Agent system scores use custom scaffolding (Auggie, Cursor, Claude Code, WarpGrep configurations) with specialized context retrieval and other optimizations. Agent systems consistently score higher than the same base model does on the SEAL leaderboard, by roughly 4 to 16 points in the results above.
How often is SWE-Bench Pro updated?
Scale AI adds new model evaluations to the SEAL leaderboard as they become available. The benchmark has a held-out set of 858 tasks that can be released to detect overfitting. Agent system scores are reported by individual teams and may update independently.
How does SWE-Bench Pro differ from SWE-Bench Verified?
SWE-Bench Verified has 500 Python-only tasks with small fixes (median 4 lines). SWE-Bench Pro has 1,865 multi-language tasks requiring substantial, multi-file modifications (average 107 lines across 4.1 files). Pro uses GPL licensing and proprietary codebases to resist data contamination. OpenAI has stopped reporting Verified scores due to confirmed contamination.
Is SWE-Bench Verified still useful?
SWE-Bench Verified still differentiates between weaker models and runs faster. But OpenAI's audit found that all frontier models are contaminated on it, and 59.4% of hard tasks have flawed tests. OpenAI has stopped reporting Verified scores. SWE-Bench Pro is a better measure of production readiness.
WarpGrep v2: Search Subagent for SWE-Bench Pro
WarpGrep v2 is the RL-trained search subagent that lifted every model it was paired with by 2+ points on SWE-Bench Pro. It runs in its own context window, issues up to 8 parallel tool calls per turn, and in Morph's Opus 4.6 runs made the coding agent 15.6% cheaper and 28% faster.