Benchmarking AI Coding Agents on End-to-End Project Development

May 24, 2026 · 6 min read

New benchmark ProjDevBench tests AI coding agents on full project builds. See how they stack up and what it means for developers.

You've got a spec. Not a function spec, a whole damn project spec. You feed it to an agent. What happens? Does it build a mess or a masterpiece? For too long, we've been stuck evaluating AI coding assistants on tiny tasks: fix this function, write this snippet. It's like testing a chef by asking them to chop an onion. Useful, sure, but it doesn't tell you if they can cook a five-course meal. The real test is building something from the ground up. That's the gap ProjDevBench is finally trying to fill.

The Wild West of AI Code Generation: From Snippets to Full Projects

Let's be blunt. Most AI coding benchmarks are a joke for real-world development. HumanEval? APPS? They test function-level code generation. SWE-bench? It looks at fixing bugs in existing code. That's fine for learning basic syntax or finding a misplaced semicolon. But it doesn't touch the complexity of actual software development. We're talking about building entire applications, managing dependencies, structuring code across multiple files, and ensuring the whole damn thing runs.

The paradigm is shifting. We're moving towards what some call "vibe coding": feeding high-level requirements to an AI and expecting it to churn out a working system. This isn't just about generating code; it's about autonomous software development. It means an agent needs to figure out the project structure, create files, configure build tools, and deliver a runnable product. Existing benchmarks simply don't capture this. They evaluate fragments. ProjDevBench aims to evaluate the whole damn movie, not just a single scene. It's about benchmarking AI coding agents on end-to-end project development, and frankly, it's overdue.

ProjDevBench: A Reality Check for AI Coding Agents

So, what is ProjDevBench? It's a new benchmark designed specifically for evaluating AI coding agents on building complete software projects from scratch. Forget pre-existing codebases or partial repos. ProjDevBench dumps a project requirement on the agent and expects a full repository in return. This is a critical distinction. Table 1 in the paper lays it out: most benchmarks require some starting code or focus on single files. ProjDevBench demands building from zero.

The benchmark curates 20 programming problems. These aren't just abstract algorithmic puzzles; they span 8 categories, covering both fundamental coding concepts and practical, real-world application scenarios. The goal is to test an agent's ability to handle the entire lifecycle of project creation, not just isolated code snippets. The output isn't a function; it's a complete, executable repository. This is the first step towards evaluating AI's capability for genuine end-to-end project development. The paper highlights that agents averaged 138 interaction turns and a staggering 4.81 million tokens per problem in their experiments. That's a lot of back-and-forth for something that's supposed to be autonomous.

How the Agents Really Stack Up: The ProjDevBench Findings

The results? Underwhelming, to put it mildly. The ProjDevBench evaluation reports an abysmal overall acceptance rate of just 27.38%. In other words, roughly three out of every four attempts fail. This isn't surprising if you've actually tried to build something complex with AI. They can do the boilerplate, maybe some basic CRUD operations. But ask them to design a system architecture? Forget it.

The paper points out specific failure modes. Agents struggle mightily with complex system design, time complexity optimization, and crucial resource management. These aren't minor bugs; these are fundamental architectural flaws. When things go wrong, 42% of failures are due to outright wrong answers, and another 14% hit time limits. Together that's 56% of failures stemming from core functional or performance issues.
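To make that breakdown concrete, here's a minimal sketch of how you might tally judge verdicts into an acceptance rate and per-category failure shares. The verdict labels (`AC`, `WA`, `TLE`, `RE`, `CE`) are common online-judge abbreviations that I'm assuming here, and the sample list is toy data, not the paper's results.

```python
from collections import Counter


def summarize_verdicts(verdicts):
    """Return (acceptance_rate, failure_shares) for a list of verdict strings.

    failure_shares maps each non-AC verdict to its share of all failures.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    accepted = counts.get("AC", 0)
    failed = len(verdicts) - accepted
    acceptance_rate = accepted / len(verdicts)
    if failed == 0:
        return acceptance_rate, {}
    failure_shares = {v: n / failed for v, n in counts.items() if v != "AC"}
    return acceptance_rate, failure_shares


# Toy data for illustration only (hypothetical, not from the benchmark).
sample = ["AC", "WA", "WA", "TLE", "RE", "AC", "WA", "CE"]
rate, shares = summarize_verdicts(sample)
print(f"acceptance rate: {rate:.2%}")
for verdict, share in sorted(shares.items()):
    print(f"{verdict}: {share:.2%} of failures")
```

Note the denominator: the failure shares are computed over failed attempts only, which is how the 42% and 14% figures above are phrased.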

There are nuances, though. The evaluation looked at six different AI coding agents built on various LLM backends. Codex, paired with GPT-5, managed the best overall performance at 77.85%, an overall score rather than the acceptance rate quoted above. But even that's not stellar, and the gap widens significantly on those "from scratch" tasks. Interestingly, GPT-5 seemed better at execution scores, while Anthropic's Sonnet-4.5 showed stronger performance in code review and specification compliance. This suggests different models have different strengths, but none are masters of the whole game.

Perhaps the most telling finding is about extended interaction. The paper notes that prolonged debugging sessions, indicated by a high number of interaction turns, actually correlate negatively with performance. Agents get stuck. They don't learn from their mistakes efficiently; they just spin their wheels. This isn't the autonomous developer we were promised. This is a buggy intern.

The Evaluation Toolkit: OJ Testing Meets LLM Code Review

How do you even evaluate a whole repository? ProjDevBench uses a two-pronged approach. First, there's traditional Online Judge (OJ) testing. This means compiling the code, running it against a battery of test cases, and checking for functional correctness. It's the standard way to see if code actually works. It catches bugs, incorrect logic, and performance bottlenecks that cause timeouts.
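Here's a rough sketch of what that loop looks like in miniature: run a program against each test case, enforce a time limit, and hand back a verdict. This is my own illustration, not ProjDevBench's actual harness. The `judge` function, the verdict labels, and the whitespace-insensitive comparison are all assumptions, and the compile step is skipped for brevity.

```python
import subprocess
import sys


def judge(cmd, cases, time_limit_s=2.0):
    """Run cmd once per (stdin, expected_stdout) pair; return one verdict per case."""
    verdicts = []
    for stdin_text, expected in cases:
        try:
            proc = subprocess.run(
                cmd,
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit_s,
            )
        except subprocess.TimeoutExpired:
            verdicts.append("TLE")  # exceeded the time limit
            continue
        if proc.returncode != 0:
            verdicts.append("RE")  # crashed or exited with a non-zero status
        elif proc.stdout.split() == expected.split():
            verdicts.append("AC")  # tokens match, ignoring whitespace differences
        else:
            verdicts.append("WA")  # ran fine, wrong output
    return verdicts


# Hypothetical "submission": a one-liner that doubles its input.
doubler = [sys.executable, "-c", "print(int(input()) * 2)"]
print(judge(doubler, [("2\n", "4\n"), ("5\n", "11\n")], time_limit_s=5.0))
# prints ['AC', 'WA']
```

A real judge would also sandbox the process and cap memory. The point here is just the shape: one verdict per test case, with timeouts treated as their own failure class.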

But OJ testing isn't enough. An agent could generate code that passes tests but violates project requirements, uses forbidden libraries, or is a thinly veiled copy of existing code. That's where the second layer comes in: LLM-assisted code review. This combines rule-based scripts to catch explicit violations with LLM-based review for subtler issues like code style, adherence to specifications, or detecting patterns that suggest cheating. It's about ensuring the code is not just functional but also compliant and well-designed according to the given rules. This dual approach is essential for any serious benchmarking of AI coding agents on end-to-end project development.
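The rule-based half is the easy part to picture. As a sketch, here's a forbidden-library check that walks a file's syntax tree instead of grepping for strings. I'm assuming Python submissions purely for illustration; the benchmark's real rule scripts aren't shown in this post, and `find_forbidden_imports` is a hypothetical helper.

```python
import ast


def find_forbidden_imports(source, forbidden):
    """Return the sorted top-level module names in `forbidden` that `source` imports."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); they refer to the project itself.
            names = [node.module] if node.module and node.level == 0 else []
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in forbidden:
                hits.add(root)
    return sorted(hits)


submission = "import os\nimport numpy.linalg as la\nfrom collections import deque\n"
print(find_forbidden_imports(submission, {"numpy", "requests"}))
# prints ['numpy']
```

A check like this catches the explicit violations. Anything it can't express as a rule, such as whether the code actually follows the spec, is what the LLM review layer is for.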

The "Vibe Coding" Workflow: Bridging the Gap

So, AI agents can generate code, and we have a benchmark to prove they're still pretty bad at it. Where does that leave us? It leaves us with the current reality: AI agents are tools, not replacements. The "vibe coding" future isn't here yet, not fully. But the workflow is emerging. You'll prompt an agent, it'll spit out a structure, maybe a significant chunk of code. Then what?

You still need to test it, debug it, and provide precise feedback for iterative solution refinement. The AI doesn't magically understand what's wrong with your UI, or why the user flow feels clunky. It needs context. This is where tools that capture developer intent and context become critical. You click that broken button on the webpage, drop a note, grab a screenshot. Markagent does exactly this. It captures not just the visual, but the DOM context, the stable CSS selector, the viewport, the page URL. It packages all that into a structured markdown prompt, ready for your AI agent. It's the bridge between the AI's output and your feedback loop.
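To show what "structured markdown prompt" can mean in practice, here's a minimal sketch that packages a note plus page context into markdown. To be clear, this is not Markagent's actual output format. The `UiFeedback` fields, the headings, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UiFeedback:
    """Context for one piece of UI feedback (hypothetical field names)."""
    note: str
    url: str
    selector: str
    viewport: tuple  # (width, height) in CSS pixels


def to_markdown_prompt(fb):
    """Render the feedback as a small markdown document an agent can consume."""
    width, height = fb.viewport
    return "\n".join([
        "## UI feedback",
        "",
        f"- Page: {fb.url}",
        f"- Element: `{fb.selector}`",
        f"- Viewport: {width}x{height}",
        "",
        "### Note",
        "",
        fb.note.strip(),
    ])


feedback = UiFeedback(
    note="Submit button does nothing on click.",
    url="https://example.com/checkout",
    selector="#checkout > button.submit",
    viewport=(1280, 720),
)
print(to_markdown_prompt(feedback))
```

The design choice that matters is the selector and URL travelling with the note, so the agent doesn't have to guess which element "that broken button" refers to.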

This level of detail is what's missing when agents fail on system architecture design or functional correctness. They can't fix what they don't understand. Giving them precise, contextual feedback is how we nudge them towards better outcomes. Without it, you're just throwing prompts into the void and hoping for the best.

What's Next? The Long Road to True Autonomous Development

ProjDevBench is a wake-up call. It shows we're not on the cusp of fully autonomous development. Agents are good at generating boilerplate, implementing straightforward algorithms, and filling in predictable gaps. But they falter when it comes to higher-level reasoning, strategic design decisions, and robust error handling. The path to true autonomous software development is longer and more complex than many assumed.

We need better agents, certainly. But we also need better benchmarks like ProjDevBench to guide that development. And crucially, we need better tools for collaboration. The future isn't just about AI building software; it's about humans and AI building software together. Tools that simplify prompt engineering, provide rich context, and facilitate rapid iteration will be essential companions on this journey. The current performance figures aren't the end of the story; they're just the beginning of understanding what it takes to build real software with AI.

P.S. Markagent is the Chrome extension I use to ship pixel-precise UI feedback to AI coding agents. Free, local, no account.