The Sobering Numbers
If you've been following the agent hype cycle, ITBench-AA is going to feel like cold water. Artificial Analysis and IBM just dropped the first benchmark for agentic enterprise IT tasks, starting with Site Reliability Engineering—and the best frontier model, Claude Opus 4.7 (Adaptive Reasoning, Max Effort), tops out at 47%. That's right: the state of the art can't even break 50% on real Kubernetes incident response.
GPT-5.5 (xhigh) sits at 46%, Qwen3.7 Max at 42%. This isn't a saturated benchmark where we're splitting hairs over decimal points. This is a benchmark that shows how far we still have to go before agents can reliably do the kind of work SREs get paged for at 3 AM.
For context, frontier models score "considerably higher" on Terminal-Bench. ITBench-AA SRE is designed to be hard—and it's working.
What ITBench-AA Actually Tests
The benchmark consists of 59 SRE tasks: 40 public, 19 held-out. Each task presents a Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology. The model's job is to identify the minimal set of independent root-cause Kubernetes entities responsible for the incident.
These aren't toy problems. Faults span typical SRE failure modes:
- Resource quota exhaustion
- Rollout failures
- Connection pool exhaustion
- Network partitions
- Chaos-injected incidents
Models run in Stirrup, an open-source agentic harness, with shell access to a sandboxed filesystem containing the relevant logs and snapshots. Each task allows up to 100 turns and is repeated 3 times. Models investigate by running shell commands, then submit a structured JSON diagnosis identifying root-cause entities (Deployments, Services, Pods, etc.).
Scoring is unforgiving. If you miss any ground-truth root cause, you get 0.0 for that repeat. If you identify all of them, you're awarded a score equal to your precision: true positives divided by (true positives + false positives). The headline score is the average across all 177 attempts (59 tasks × 3 repeats).
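To make the rubric concrete, here is a minimal sketch of recall-gated precision as described above. The function names and the idea of representing entities as plain strings are assumptions for illustration; this is not the benchmark's actual scoring code.

```python
from statistics import mean


def score_repeat(predicted: set[str], ground_truth: set[str]) -> float:
    """Score one repeat: 0.0 if any root cause is missed, else precision."""
    if not ground_truth:
        # Assumption: every task has at least one ground-truth root cause.
        raise ValueError("ground_truth must not be empty")
    true_positives = len(predicted & ground_truth)
    # Recall gate: missing any ground-truth entity zeroes the repeat.
    if true_positives < len(ground_truth):
        return 0.0
    false_positives = len(predicted - ground_truth)
    return true_positives / (true_positives + false_positives)


def headline_score(repeat_scores: list[float]) -> float:
    """Average over every attempt (tasks x repeats)."""
    return mean(repeat_scores)


# Hypothetical entities, purely for illustration.
truth = {"ns/Deployment/a", "ns/Service/b"}
missed_one = score_repeat({"ns/Deployment/a"}, truth)             # 0.0
exact = score_repeat(set(truth), truth)                           # 1.0
padded = score_repeat(truth | {"ns/Pod/c", "ns/Pod/d"}, truth)    # 0.5
print(headline_score([missed_one, exact, padded]))                # 0.5
```

The gate comes first on purpose: a diagnosis earns nothing for being tidy if it is incomplete, and once it is complete every extra entity dilutes the score.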
The Turn-Count Trap
One of the most fascinating findings: more turns do not mean better answers. In fact, they often mean worse ones.
Gemini 3.1 Pro Preview averages 83 turns per task and scores 30%. GPT-5.5 (xhigh) averages 31 turns at 46%. Gemma 4 31B (Reasoning) averages 58 turns and scores 37%.
Why? Because models that over-investigate tend to surface upstream fault-injection mechanisms or co-occurring symptoms as false positives. The scoring rubric is recall-gated precision: identifying the correct root cause but adding the chaos-mesh controller that injected the fault, or listing every affected Pod instead of the misconfigured NetworkPolicy that blocked them, counts against you.
This is a feature, not a bug. Real SRE work demands precision. If your runbook says "restart these seventeen Pods" when the actual fix is "delete this one NetworkPolicy," you've wasted time and possibly made things worse.
A Worked Example
In one public task, the agent sees user-facing failures in the frontend path. It investigates the snapshot:
- Reviews alerts to identify the incident window
- Inspects traces and logs to narrow the failure to frontend traffic
- Uses topology to pin down affected services
- Examines Kubernetes manifests to find a network policy blocking the frontend
The successful diagnosis: otel-demo/NetworkPolicy/frontend-block-all-ports. One entity. Clean answer. Models that also flagged every Pod in the frontend service, or the monitoring stack that detected the issue, got penalized.
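Here is a rough sketch of how over-flagging plays out on this task. It assumes entity identifiers follow the namespace/Kind/name shape seen in the answer above, and the Pod names are hypothetical stand-ins, not entities from the actual snapshot.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    namespace: str
    kind: str
    name: str


def parse_entity(identifier: str) -> Entity:
    """Split 'namespace/Kind/name' into its parts (format is an assumption)."""
    namespace, kind, name = identifier.split("/", 2)
    return Entity(namespace, kind, name)


def recall_gated_precision(predicted: set[Entity], truth: set[Entity]) -> float:
    """0.0 if any true root cause is missing, otherwise precision."""
    hits = len(predicted & truth)
    if hits < len(truth):
        return 0.0
    return hits / len(predicted)


ROOT_CAUSE = "otel-demo/NetworkPolicy/frontend-block-all-ports"
truth = {parse_entity(ROOT_CAUSE)}

clean = {parse_entity(ROOT_CAUSE)}
# Hypothetical symptom Pods that an over-eager agent might also flag.
symptom_pods = {parse_entity(f"otel-demo/Pod/frontend-{i}") for i in range(3)}
over_flagged = clean | symptom_pods
symptoms_only = set(symptom_pods)

print(recall_gated_precision(clean, truth))          # 1.0
print(recall_gated_precision(over_flagged, truth))   # 0.25
print(recall_gated_precision(symptoms_only, truth))  # 0.0
```

Flagging three symptom Pods alongside the right answer cuts the score to a quarter, and flagging only the symptoms scores nothing.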
The Open Weights Surprise
GLM-5.1 (Reasoning) leads open weights models at 40%, effectively tied with Gemini 3.5 Flash (high). DeepSeek V4 Pro (Reasoning, Max Effort) follows at 38%, with Gemma 4 31B (Reasoning) at 37%—ahead of Gemini 3.1 Pro Preview at 30%.
More importantly, open weights models sit on the cost frontier. Gemma 4 31B (Reasoning) scores 37% at $0.14 per task, outperforming Gemini 3.1 Pro Preview ($2.23 per task, 30%) on both score and cost. GLM-5.1 (Reasoning) scores 40% at $1.23 per task, matching Gemini 3.5 Flash (high) on score while costing less ($1.23 versus $1.70 per task).
Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads the leaderboard at 47%, but it's the most expensive at $5.38 per task. If you're running hundreds of incident diagnoses a week, that cost delta matters.
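The cost-frontier claim can be checked with a small Pareto sketch. It uses only the five score and cost pairs quoted in this section, so the frontier is relative to those five models and not the full leaderboard; the type and function names are illustrative.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    score_pct: float
    cost_usd: float  # cost per task


# Figures as quoted in this article.
RESULTS = [
    Result("Claude Opus 4.7 (Adaptive Reasoning, Max Effort)", 47, 5.38),
    Result("GLM-5.1 (Reasoning)", 40, 1.23),
    Result("Gemini 3.5 Flash (high)", 40, 1.70),
    Result("Gemma 4 31B (Reasoning)", 37, 0.14),
    Result("Gemini 3.1 Pro Preview", 30, 2.23),
]


def dominates(a: Result, b: Result) -> bool:
    """a is at least as good on both axes and strictly better on one."""
    no_worse = a.score_pct >= b.score_pct and a.cost_usd <= b.cost_usd
    strictly_better = a.score_pct > b.score_pct or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def cost_frontier(results: list[Result]) -> list[Result]:
    """Models not dominated by any other, cheapest first."""
    frontier = [r for r in results if not any(dominates(o, r) for o in results)]
    return sorted(frontier, key=lambda r: r.cost_usd)


for r in cost_frontier(RESULTS):
    print(f"{r.model}: {r.score_pct}% at ${r.cost_usd:.2f}/task")
```

Among these five, the frontier is Gemma 4 31B, GLM-5.1, and Claude Opus 4.7. Both Gemini entries are dominated by a cheaper model that scores at least as well.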
This is the kind of result that changes procurement conversations. Open weights aren't just "good enough for demos"—they're competitive on real enterprise workloads, and cheaper to run at scale.
What This Tells Us About Agent Readiness
The sub-50% ceiling is the headline, but the distribution is just as revealing. The gap between Claude Opus 4.7 at 47% and Gemini 3.1 Pro Preview at 30% is enormous—that's a 57% relative improvement. We're not in the "all models are basically the same" regime yet.
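For anyone checking the arithmetic, here is a quick sketch that separates the absolute gap (percentage points) from the relative one. The helper name is made up for illustration.

```python
def gap(leader_pct: float, trailer_pct: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, relative gap as a fraction)."""
    absolute_points = leader_pct - trailer_pct
    relative = absolute_points / trailer_pct
    return absolute_points, relative


points, relative = gap(47, 30)
print(f"{points} percentage points, {relative:.0%} relative")
# prints "17 percentage points, 57% relative"
```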
But even the leader is leaving more than half of the available score on the table. And these are structured, well-scoped tasks with clean ground truth. Real incident response involves ambiguous symptoms, incomplete data, and time pressure. If frontier models can't reliably nail the controlled version, production SRE agents are still a ways off.
That said, 40%+ on hard tasks is nothing to dismiss. These models are doing real investigative work: reading logs, tracing dependencies, correlating events across a distributed system. The limiting factor isn't basic competence; it's precision under the specific constraints of enterprise operations.
The Bigger Picture
ITBench-AA is just the start. Artificial Analysis and IBM plan to expand to Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks over time. The underlying ITBench dataset comes from IBM's deep expertise in enterprise IT operations—this isn't synthetic or gamified.
For anyone building agent products aimed at enterprise IT, this benchmark is a gift. It's grounded in real failure modes, it's hard enough to show meaningful differentiation, and it's not saturated. You can actually move the needle.
For the rest of us watching the agent space, it's a useful corrective. The demos are impressive. The vibes are immaculate. But when you actually score frontier models on structured enterprise tasks with clear success criteria, you get numbers like 47%.
That's not a condemnation—it's a roadmap. We know what good looks like. We know where the gaps are. Now we build.