The Benchmark That Exposes What Coding Agents Can't Do Yet
IBM Research just released ScarfBench, and it's the kind of benchmark that makes you rethink what "AI can code" actually means. The premise is simple: take real enterprise Java applications and ask AI agents to migrate them across frameworks—Spring to Jakarta EE, Quarkus to Spring, the full matrix.
The results? Even frontier agents achieve less than 10% behavioral success on whole-application migrations. Not "the code looks weird" failure. Not "it doesn't compile" failure. Complete, end-to-end failure to produce a working application.
This isn't because the agents are bad at writing Java. It's because framework migration isn't a coding problem—it's a systems problem. And that distinction matters more than almost anything else happening in AI-assisted software engineering right now.
Why This Benchmark Is Different
Most software engineering benchmarks measure whether generated code matches a reference implementation or passes unit tests. ScarfBench measures something harder: whether the migrated application actually builds, deploys, and preserves behavior.
The benchmark includes 34 applications, 102 framework implementations, and 204 migration tasks spanning approximately 151K lines of code across Spring, Jakarta EE, and Quarkus. These aren't toy examples: they exercise real enterprise patterns, including dependency injection, persistence configuration, build systems, and runtime dependencies.
The evaluation pipeline doesn't stop at "does it compile." Applications must:
- Build successfully
- Deploy correctly to a container
- Pass 1,331 expert-written behavioral tests
This three-stage gate is what separates benchmarks that measure code generation from benchmarks that measure actual software engineering capability. And it's where current agents fall apart.
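To make the gating idea concrete, here is a minimal sketch of a staged check in Java. It is not ScarfBench's actual harness: the class name `StagedGate`, the stage labels, and the stubbed stage results are all assumptions for illustration. The point it shows is that a later stage never runs once an earlier one fails, so "it compiled" can never be reported as a pass on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

public class StagedGate {
    /**
     * Runs the stages in insertion order and stops at the first failure.
     * Returns "PASSED" only if every stage succeeds.
     */
    public static String evaluate(Map<String, BooleanSupplier> stages) {
        for (Map.Entry<String, BooleanSupplier> stage : stages.entrySet()) {
            if (!stage.getValue().getAsBoolean()) {
                return "FAILED_AT_" + stage.getKey();
            }
        }
        return "PASSED";
    }

    public static void main(String[] args) {
        // Hypothetical stage results; a real harness would shell out to the
        // build tool, start a container, and run the behavioral test suite.
        Map<String, BooleanSupplier> stages = new LinkedHashMap<>();
        stages.put("BUILD", () -> true);
        stages.put("DEPLOY", () -> false);
        stages.put("TEST", () -> true);
        System.out.println(evaluate(stages)); // prints FAILED_AT_DEPLOY
    }
}
```

A `LinkedHashMap` is used so the stages run in the order they were added, which is what gives the gate its build, then deploy, then test shape.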
The Compile → Deploy → Test Cliff
Here's the pattern that shows up across every agent the ScarfBench team tested: compile success consistently exceeds deploy success, which consistently exceeds behavioral success. The drop-off at each stage is dramatic.
Claude Code, for example, reported successful builds for 29 out of 30 whole applications. Independent verification found only 22 actually built. The single application the agent classified as failed? It built correctly.
This isn't just a calibration problem—it's evidence that agents don't have reliable internal models of when a migration is actually complete. They're pattern-matching success signals (no error output, build commands exit cleanly) without understanding whether the resulting artifact works.
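One way to surface this kind of gap is to record the agent's claim next to an independently verified result and count the disagreements in both directions. The sketch below assumes Java 16 or later (for records) and uses made-up toy data; the `ReportCheck` class, the `Attempt` record, and the app names are hypothetical and are not taken from the benchmark.

```java
import java.util.List;

public class ReportCheck {
    /** One migration attempt: the agent's claim and the independently verified result. */
    public record Attempt(String app, boolean reportedBuilt, boolean verifiedBuilt) {}

    /** Counts attempts where the agent claimed a build but verification failed. */
    public static long falseSuccesses(List<Attempt> attempts) {
        return attempts.stream()
                .filter(a -> a.reportedBuilt() && !a.verifiedBuilt())
                .count();
    }

    /** Counts attempts where the agent reported failure but the build verified. */
    public static long falseFailures(List<Attempt> attempts) {
        return attempts.stream()
                .filter(a -> !a.reportedBuilt() && a.verifiedBuilt())
                .count();
    }

    public static void main(String[] args) {
        // Hypothetical toy data, NOT ScarfBench results.
        List<Attempt> attempts = List.of(
                new Attempt("app-a", true, true),
                new Attempt("app-b", true, false),
                new Attempt("app-c", true, false),
                new Attempt("app-d", false, true));
        System.out.println("false successes: " + falseSuccesses(attempts)); // 2
        System.out.println("false failures: " + falseFailures(attempts));   // 1
    }
}
```

Tracking both directions matters: an agent that over-claims success and also misreports a working build as failed is miscalibrated, not merely optimistic.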
Migration difficulty also varies wildly by target framework. Jakarta EE migrations proved particularly challenging, likely because Jakarta's XML-heavy configuration and explicit descriptor files require more precise semantic translation than annotation-based frameworks.
What Agent Traces Reveal About Migration
ScarfBench tracked which application layers agents visited during migration attempts. The most frequently visited layers were configuration, web, database, and service—with common transitions like configuration ↔ web and service ↔ database.
This tells you something important: migration isn't a linear source-to-source transformation. It's an iterative dependency-resolution process. Change the database layer, discover you need to update service configuration, realize the web layer depends on both, circle back to configuration.
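The "change one layer, revisit the others" loop can be sketched as a simple worklist over a layer dependency graph. Everything here is an illustrative assumption: the `LayerWorklist` class, the `dependents` map, and the specific edges are invented to mirror the example in the paragraph above, not extracted from ScarfBench traces.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LayerWorklist {
    /**
     * Breadth-first walk over "who depends on whom": returns every layer that
     * needs revisiting after `changed` is edited, in discovery order.
     */
    public static Set<String> layersToRevisit(Map<String, List<String>> dependents, String changed) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(changed);
        while (!queue.isEmpty()) {
            String layer = queue.poll();
            for (String next : dependents.getOrDefault(layer, List.of())) {
                if (seen.add(next)) {   // add() returns false if already queued once
                    queue.add(next);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Hypothetical edges: key = layer that changed, value = layers affected by it.
        Map<String, List<String>> dependents = Map.of(
                "database", List.of("service"),
                "service", List.of("configuration", "web"),
                "web", List.of("configuration"));
        System.out.println(layersToRevisit(dependents, "database"));
        // prints [service, configuration, web]
    }
}
```

Even in this tiny graph, a single database change pulls in three other layers, and configuration is reachable by two different paths.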
Agents repeatedly returned to configuration-related artifacts while resolving framework differences. Configuration, in other words, dominates migration effort. This makes sense if you've ever actually migrated a Java app—the hard part isn't rewriting @Inject as @Autowired, it's making sure the entire dependency graph boots correctly in the new runtime.
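For a sense of how small the mechanical part is, here is a deliberately naive sketch that rewrites Jakarta's `@Inject` to Spring's `@Autowired` with plain string replacement. The `AnnotationRewrite` class and its method are hypothetical; a real tool would work on a parsed syntax tree rather than raw text, and this sketch assumes the source uses the exact single-class import shown.

```java
public class AnnotationRewrite {
    /**
     * Swaps the jakarta.inject.Inject import and every @Inject annotation
     * for their Spring equivalents. Text-based, so it does not understand
     * comments, string literals, or wildcard imports.
     */
    public static String injectToAutowired(String source) {
        return source
                .replace("import jakarta.inject.Inject;",
                         "import org.springframework.beans.factory.annotation.Autowired;")
                .replaceAll("@Inject\\b", "@Autowired");
    }

    public static void main(String[] args) {
        String before = "import jakarta.inject.Inject;\n"
                + "class OrderService { @Inject OrderRepository repo; }";
        System.out.println(injectToAutowired(before));
    }
}
```

Two replacements cover the source-level change. Nothing in them tells you whether the bean is actually registered, whether the persistence unit is configured, or whether the application starts, which is where the migration effort goes.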
The Problems That Aren't About Code
One of the most revealing findings: agents frequently struggled with environmental issues that have nothing to do with Java source code.
Docker cache inconsistencies. Port connectivity problems. Maven wrapper and build tooling issues. These operational concerns delayed validation even when the source-code migration itself was largely complete.
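Some of these environmental problems are cheap to check before blaming the migrated code. As one example, the sketch below probes whether a TCP port accepts connections, using only `java.net.Socket` from the standard library. The `PortProbe` class, the host, the port, and the timeout are assumptions for illustration; the benchmark's own tooling may do this differently.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    /**
     * Returns true if a TCP connection to host:port succeeds within the timeout.
     * Any I/O failure (refused, unreachable, timed out) counts as "not reachable".
     */
    public static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical target: an application expected to listen on localhost:8080.
        System.out.println("reachable: " + isReachable("localhost", 8080, 1000));
    }
}
```

A check like this only separates "the container is not listening" from "the application is listening but misbehaving". It says nothing about whether the behavior is correct.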
The failure mode distribution is telling: modernization failures span build systems, deployment environments, dependency injection, databases, endpoints, assertions, and infrastructure. Only a fraction of failures are "the agent generated wrong code." Most are "the agent didn't understand how all these pieces fit together."
This is the gap between coding agents and software engineering agents. Coding agents operate in the abstraction layer where source files are the universe. Software engineering requires navigating build systems, deployment topologies, and runtime environments.
What This Means for AI-Assisted Modernization
The biggest challenge in framework modernization is not translating Java code. It's managing the web of dependencies across configuration, infrastructure, and runtime environments.
That doesn't mean agents are useless for migration work—it means we're asking them to do the wrong job. Agents that can automate the mechanical transformation of annotations and imports are useful. Agents that claim they can autonomously modernize enterprise applications are overselling.
Current agent architectures lack robust architectural reasoning. They don't build causal models of how configuration changes propagate through layers. They don't reliably detect when a migration is actually complete.
This is fixable, but it requires different designs than "throw more context at the LLM." Multi-agent architectures with specialized validators, explicit dependency graph reasoning, and better environment modeling are the obvious next research directions.
Why This Benchmark Matters Beyond Java
ScarfBench is nominally about enterprise Java, but the lesson generalizes. Any benchmark that evaluates AI agents on real software engineering tasks needs to measure end-to-end outcomes, not intermediate artifacts.
Does it build? Does it deploy? Does it preserve behavior? These questions separate agents that generate plausible-looking code from agents that actually ship.
The benchmark is fully open: dataset, evaluation infrastructure, public leaderboard, source code. Researchers can compare agent architectures. Practitioners can evaluate modernization tools before deploying them in production.
Framework migration remains one of the largest unsolved problems in AI-assisted software engineering. ScarfBench gives us a way to measure progress honestly—and right now, that measurement says we're much further from autonomous modernization than the demos suggest.
If you're building coding agents, this is the kind of benchmark you should be sweating. Not because it's unfair, but because it measures the thing customers actually care about: does the migrated application work?