June 14, 2026 · 5 min read
A more accurate benchmark for coding agents - SWE-Bench Pro
SWE-Bench Pro offers a more accurate benchmark for coding agents, with notably lower success rates. But real-world coding demands tool-specific, iterative workflows beyond isolated tests.
SWE-Bench Pro is a critical step towards more realistic evaluation of coding agents, revealing significantly lower and, frankly, more believable success rates than its predecessors. This benchmark finally provides a clearer picture of agent capabilities, shifting the conversation from inflated completion percentages to the actual, often messy, reality of AI-assisted development.
The Benchmark Reality Check: SWE-Bench Pro's Harsh Truth
Previous coding benchmarks, like the original SWE-Bench, touted impressive agent completion rates, often hitting 80% or higher. Anyone actually shipping code with these tools knew that number was a fantasy. You'd feed an agent a problem, cross your fingers, and maybe, maybe, get something usable. SWE-Bench Pro cuts through the hype. It evaluates agents against a broader, more complex set of tasks drawn from real-world repositories, featuring larger codebases and a wider range of programming languages. The result? GPT-5, a top-tier model, scores around 36%. That feels right. It's a stark, honest correction to the over-optimistic figures previously thrown around. We needed this gut-check. It forces us to confront what these agents can and can't do, rather than chasing inflated metrics.
Benchmarks Miss the Point: The Iterative Workflow
Here's the rub: even a more accurate benchmark for coding agents, SWE-Bench Pro included, still fundamentally misunderstands how developers work. We don't just dump a problem on an agent and expect a perfect solution. Development is iterative. It's a back-and-forth, a refinement loop. Benchmarks typically measure a single-shot success: agent gets problem, agent solves problem, agent passes tests. That's not real life. In the trenches, you're giving partial instructions, getting partial solutions, debugging, adjusting prompts, and guiding the agent. The "doesn't use any of the AI coding tool that we actually use to benchmark" criticism against SWE-Bench still holds some water, even for its Pro version. The real value of an agent often lies in its capacity for intelligent interaction, not just isolated task completion.
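To make the contrast with single-shot scoring concrete, here is a minimal sketch of a refinement loop. Everything in it is hypothetical: the `Agent` and `Review` types, the `iterate` function, and the stub agent are illustrations only, not how any benchmark or real tool works. The point is just that feedback from one round becomes part of the next prompt.

```typescript
// Hypothetical types: an agent maps a prompt to an attempt, and a review
// returns feedback text, or null when the attempt is accepted.
type Agent = (prompt: string) => string;
type Review = (attempt: string) => string | null;

interface LoopResult {
  attempt: string;
  rounds: number;
  accepted: boolean;
}

// Run up to maxRounds, folding each round's feedback into the next prompt.
function iterate(agent: Agent, task: string, review: Review, maxRounds: number): LoopResult {
  let prompt = task;
  let attempt = "";
  for (let round = 1; round <= maxRounds; round++) {
    attempt = agent(prompt);
    const feedback = review(attempt);
    if (feedback === null) {
      return { attempt, rounds: round, accepted: true };
    }
    prompt = `${task}\nPrevious attempt:\n${attempt}\nFeedback: ${feedback}`;
  }
  return { attempt, rounds: maxRounds, accepted: false };
}

// Stub agent for illustration: it only handles the loading state once the
// prompt mentions isLoading, which happens via the reviewer's feedback.
const stubAgent: Agent = (prompt) =>
  prompt.indexOf("isLoading") >= 0 ? "<Button isLoading />" : "<Button />";

const stubReview: Review = (attempt) =>
  attempt.indexOf("isLoading") >= 0 ? null : "Button must use the isLoading prop.";

const result = iterate(stubAgent, "Update the submit button.", stubReview, 3);
```

A single-shot benchmark would score the stub agent as a failure on this task, since its first attempt is rejected. The loop accepts its second attempt, after one round of feedback.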
The Context Problem: Why "Fix This" Fails
Coding agents, especially in frontend development, live or die by the context you give them. Ask an agent to "fix the button," and you'll get a generic response. Ask it to "update the primary-cta button, which is an instance of src/components/ui/Button.tsx, currently displaying 'Submit' and located at x: 120, y: 345 on the page, to use the isLoading prop from the Form context when formState.isSubmitting is true," and you're in a different league. The problem isn't always the agent's intelligence; it's often the developer's inability to provide sufficiently precise, structured context. Benchmarks rarely account for this granular, visual, and structural context that's second nature to a human developer but alien to a text-only prompt. This is where tools designed for contextual precision become invaluable. You can't expect a 36% success rate to improve if your input is vague. You need to tell the agent exactly what element, what file, what state. This is why we built markagent. It captures that pixel-precise, DOM-aware context, turning ambiguous requests into actionable agent prompts.
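As a rough illustration of what "structured context" could look like in practice, the sketch below assembles a prompt from explicit fields. The `ElementContext` shape and `buildPrompt` function are hypothetical and are not markagent's actual API or output format. The values are taken from the example request above.

```typescript
// Hypothetical shape for the context attached to a request.
// These field names are assumptions for illustration, not a real tool's schema.
interface ElementContext {
  label: string;                      // how the team refers to the element
  componentFile: string;              // source file of the component
  visibleText: string;                // text currently rendered
  position: { x: number; y: number }; // page coordinates
}

// Turn the structured context plus an instruction into a plain-text prompt.
function buildPrompt(ctx: ElementContext, instruction: string): string {
  return [
    `Target element: ${ctx.label}`,
    `Component file: ${ctx.componentFile}`,
    `Visible text: "${ctx.visibleText}"`,
    `Page position: x=${ctx.position.x}, y=${ctx.position.y}`,
    `Task: ${instruction}`,
  ].join("\n");
}

const prompt = buildPrompt(
  {
    label: "primary-cta",
    componentFile: "src/components/ui/Button.tsx",
    visibleText: "Submit",
    position: { x: 120, y: 345 },
  },
  "Use the isLoading prop from the Form context when formState.isSubmitting is true."
);
```

Keeping the fields separate means the element, file, and state are each stated once and unambiguously, instead of being buried in a sentence the agent has to parse.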
Beyond Synthetic: The Rise of Tool-Integrated Benchmarks
The industry's catching on. Some teams recognize the limitations of purely synthetic benchmarks. The OpenCode team, for instance, is actively working on benchmarks "much more closely aligned with actual day to day software development activities." This is the right direction. We need evaluations that reflect how developers interact with agents within their IDEs, within their browsers, and within their existing workflows. GosuCoder and his gosuevals.com are another example, explicitly testing agents "in actual coding tools." This isn't just about problem-solving; it's about integration, usability, and the agent's ability to operate within the constraints of a real development environment. Terminal-Bench, notably used by Anthropic for its Haiku 4.5 model, also moves closer to this real-world interaction, though it still has its critics regarding fidelity to actual tooling. These approaches, while still evolving, offer a more pragmatic lens than isolated, academic tests.
Your Toolchain Defines Agent Performance
A model's raw benchmark score, even on SWE-Bench Pro, tells you little about its performance in your specific toolchain. A model like Codex might perform differently depending on the IDE, the extensions, or even the underlying operating system. Some users report that while a model like GPT-5 mini is "damn fast," it's "clueless for context and double checking itself." This isn't just about the model; it's about how that model integrates with your Copilot setup, your VS Code environment, or your custom scripts. The cost of API calls also becomes a critical factor that benchmarks often ignore. An open-source model might score slightly lower on a benchmark but be vastly more cost-effective and perform just as well, or better, when integrated into a tight feedback loop with a developer. You can't just pick the top model from a leaderboard; you have to test it in your context, with your code, using your existing tools.
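The cost argument can be put into rough arithmetic. The sketch below is a back-of-the-envelope model, not a measurement: it assumes each attempt succeeds independently with a fixed probability, so the expected number of attempts per solved task is one over the success rate. The function name is hypothetical, and the prices and success rates are made-up placeholders you would replace with numbers from your own toolchain.

```typescript
// Expected spend per solved task, assuming independent retries with a
// fixed per-attempt success rate (a simplifying assumption).
function costPerSolvedTask(costPerAttempt: number, successRate: number): number {
  if (successRate <= 0 || successRate > 1) {
    throw new Error("successRate must be in (0, 1]");
  }
  return costPerAttempt / successRate;
}

// Placeholder inputs for illustration only; these are not real prices or scores.
const pricierModel = costPerSolvedTask(0.40, 0.36); // higher score, higher price
const cheaperModel = costPerSolvedTask(0.05, 0.30); // lower score, lower price

const cheaperWins = cheaperModel < pricierModel;
```

With these placeholder inputs, the lower-scoring model costs less per solved task, which is why a leaderboard rank alone doesn't settle the choice.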
Prompt Engineering: The Missing Variable in Benchmarks
The quality of the prompt is the single biggest determinant of an agent's success. Benchmarks typically provide a standardized problem description. Real-world development is anything but standardized. Developers craft prompts based on their understanding of the code, the problem, and the agent's capabilities. A poorly articulated prompt will yield poor results, regardless of the agent's underlying intelligence. This is especially true for complex tasks where a single "fix this" won't cut it. Forget "fix this button." You need to specify "the Button component in src/components/ui/Button.tsx with className='primary-cta', visible at X,Y coordinates, currently displaying 'Submit' and failing to trigger onClick when clicked." That's the difference. Tools that capture this granular context, like markagent, turn vague requests into actionable instructions, fundamentally altering the agent's ability to succeed. They bridge the gap between a human's visual understanding and an AI's textual input requirements.
The Future is Interactive, Not Just Automated
The takeaway from SWE-Bench Pro's more grounded numbers and the community's push for "tool-specific" benchmarks is clear: the future of coding agents isn't about full automation, it's about powerful collaboration. We're not waiting for an agent to hit 100% on some isolated test. We're building workflows where agents extend our capabilities, performing repetitive tasks, suggesting solutions, and catching errors, all within a tight, interactive loop. The best agents won't just solve problems; they'll help us define them, explore solutions, and implement them faster. Benchmarks need to evolve to measure this human-agent synergy, not just a model's isolated problem-solving prowess.
Stop chasing benchmark scores. Build a workflow where your agents actually help you ship code.