What ServiceNow Just Shipped
ServiceNow just released EVA-Bench Data 2.0, and it's a serious upgrade to voice agent evaluation infrastructure. The new release expands from one enterprise domain to three: Airline Customer Service Management (CSM), Enterprise IT Service Management (ITSM), and Healthcare HR Service Delivery (HRSD).
Together they cover 213 evaluation scenarios and 121 tools, roughly a 4× increase in scenario coverage over the original release. Every scenario was checked for solvability against three frontier models: GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6. That validation step matters: it keeps the benchmark challenging but fair, rather than a gotcha fest where even SOTA models fail arbitrarily.
All three datasets are MIT-licensed and available on Hugging Face. You can pull them down with a three-liner and start running evals immediately.
Why Domain Diversity Actually Matters
The insight driving this release is simple but underappreciated: voice agent failures are domain-specific. A system that flawlessly handles alphanumeric confirmation codes in flight rebooking might completely stumble on complex policy enforcement in HR workflows.
Different domains stress-test different capabilities. Vocabulary density, workflow complexity, and user expectations all vary. Airline CSM is heavy on structured entity transcription—confirmation codes, flight numbers, gate changes. ITSM tests multi-step troubleshooting and authentication elevation. Healthcare HRSD introduces real-world policy constraints: NPI numbers, FMLA regulations, insurance coverage rules.
This isn't just academic variety. If you're shipping a voice agent into production, you need eval coverage that reflects the actual distribution of failure modes you'll encounter. Single-domain benchmarks give you a narrow slice of that picture.
The Generation Pipeline: SyGra and Joint Consistency
EVA-Bench uses SyGra, a graph-based synthetic data generation pipeline, with GPT-5.4 as the backbone LLM. The interesting architectural choice here is joint generation: user goal, initial database state, and expected final database state are generated together in a single pass.
Why does that matter? Because independent generation creates silent inconsistencies that poison the evaluation signal. Imagine a user goal that references a booking ID that doesn't exist in the scenario database, or an expected outcome that's impossible given the initial state. Those bugs are hard to catch and completely corrupt your metrics.
Instead, SyGra generates all three components at once, then runs a multi-stage validation loop. Structural checks validate database schemas with Pydantic. LLM-based validators verify cross-component consistency: does the user goal match database records? Are authentication credentials correctly configured? Is there exactly one valid action sequence?
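A cross-component consistency check of this kind can be sketched in a few lines. The text names Pydantic for structural checks and LLM-based validators for consistency. The sketch below swaps in standard-library dataclasses and plain rule checks so it runs without dependencies. The `Scenario` fields, the `check_consistency` function, and the "no invented records" rule are all hypothetical illustrations, not SyGra's API or actual rules.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical container for the three jointly generated components."""
    goal_booking_id: str                       # record the user goal refers to
    initial_db: dict = field(default_factory=dict)
    expected_final_db: dict = field(default_factory=dict)


def check_consistency(s: Scenario) -> list[str]:
    """Return a list of problems; an empty list means the scenario passed."""
    problems = []
    # The user goal must point at a record that actually exists
    if s.goal_booking_id not in s.initial_db:
        problems.append(f"goal references missing booking {s.goal_booking_id}")
    # Toy rule (assumption): the outcome may modify records but not invent them
    unknown = set(s.expected_final_db) - set(s.initial_db)
    if unknown:
        problems.append(f"final state has records absent initially: {sorted(unknown)}")
    return problems


good = Scenario("BK1", {"BK1": {"flight": "UA100"}}, {"BK1": {"flight": "UA200"}})
bad = Scenario("BK9", {"BK1": {"flight": "UA100"}}, {"BK1": {"flight": "UA200"}})
print(check_consistency(good))  # []
print(check_consistency(bad))   # ['goal references missing booking BK9']
```

Returning a list of problems rather than raising on the first one makes it easier to feed every issue back into a regeneration loop in a single pass.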
That last point is critical for reproducibility. If multiple valid paths exist to complete a scenario, your user simulator will make different choices across runs, and you'll get noisy eval signals that don't reflect actual capability differences.
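To make the "exactly one valid action sequence" property concrete, here is a brute-force check on a toy rebooking scenario. This is not how the text says the check is done (it describes LLM-based validators). It is a swapped-in illustration, and the actions `pay_fee`, `waive_fee`, and `rebook` are invented for the example.

```python
from itertools import permutations


def pay_fee(state):
    # Only allowed while the change fee is still outstanding
    return None if state["fee_cleared"] else {**state, "fee_cleared": True}


def waive_fee(state):
    # A second route to the same effect, which is what creates ambiguity
    return None if state["fee_cleared"] else {**state, "fee_cleared": True}


def rebook(state):
    # Rebooking is blocked until the fee has been cleared
    return {**state, "flight": "UA200"} if state["fee_cleared"] else None


def valid_sequences(initial, expected, actions):
    """List every ordering of distinct actions that ends in the expected state."""
    found = []
    for length in range(1, len(actions) + 1):
        for seq in permutations(actions, length):
            state = initial
            for name in seq:
                state = actions[name](state)
                if state is None:      # action not allowed here; abandon sequence
                    break
            if state == expected:
                found.append(seq)
    return found


initial = {"flight": "UA100", "fee_cleared": False}
expected = {"flight": "UA200", "fee_cleared": True}

unique = valid_sequences(initial, expected, {"pay_fee": pay_fee, "rebook": rebook})
ambiguous = valid_sequences(
    initial, expected, {"pay_fee": pay_fee, "waive_fee": waive_fee, "rebook": rebook}
)
print(len(unique), len(ambiguous))  # 1 2
```

The second scenario would be rejected under the single-path rule: a simulator or agent could legitimately take either route, so run-to-run differences would not reflect capability.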
User Goals as Decision Trees
The user goal design is worth unpacking. It's not a vague statement of intent like "the user wants to change their flight." That approach forces the simulator to improvise, and improvisation means inconsistency.
Instead, EVA-Bench user goals are structured as decision trees. They specify exactly what the user should ask for, plus a full negotiation sequence: when to push back, when to accept alternatives, when to escalate. Common edge cases—like whether to accept a standby flight or an alternate airport—are handled with explicit instructions.
Resolution conditions require evidence of a completed action: a confirmation number, a case ID, something concrete. Not just a verbal commitment. This keeps the simulator on the call until the action is actually confirmed, which mirrors real user behavior much more closely than a simulator that just takes the agent's word for it.
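A decision-tree goal with an evidence-based resolution condition might look roughly like the sketch below. The `UserGoal` class, its field names, the scripted reactions, and the confirmation-number pattern are all assumptions made for illustration. The actual EVA-Bench goal format is not shown in the text.

```python
import re
from dataclasses import dataclass, field


@dataclass
class UserGoal:
    """Hypothetical decision-tree goal: a request plus scripted reactions."""
    request: str
    # offer type -> scripted reaction ("accept", "push_back", or "escalate")
    reactions: dict = field(default_factory=dict)
    # evidence the agent must state before the simulated user ends the call
    evidence_pattern: str = r"confirmation (?:number|code) is [A-Z0-9]{6}\b"

    def react(self, offer: str) -> str:
        # Unlisted offers default to pushing back, so nothing is improvised
        return self.reactions.get(offer, "push_back")

    def is_resolved(self, agent_utterance: str) -> bool:
        # A verbal promise is not enough; a concrete identifier must appear
        return re.search(self.evidence_pattern, agent_utterance) is not None


goal = UserGoal(
    request="move my flight to tomorrow morning",
    reactions={"standby_flight": "push_back", "alternate_airport": "accept"},
)
print(goal.react("alternate_airport"))                              # accept
print(goal.is_resolved("I'll take care of that rebooking for you."))  # False
print(goal.is_resolved("Your confirmation number is QX7R2M."))      # True
```

Because every branch is written down ahead of time, two runs of the simulator against the same agent behavior should make the same choices.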
Scenario Types: Single, Multi, and Adversarial
EVA-Bench samples across three scenario types to avoid the trap of scaling by repetition. Single-intent calls are the baseline: one user goal, one workflow. Multi-intent calls bundle up to four intents in a single conversation, testing the agent's ability to context-switch and maintain state.
Adversarial scenarios are where it gets spicy. These are cases where callers attempt to bypass troubleshooting steps, misclassify urgency to jump the queue, or access records they're not authorized to view. Models tend to struggle more with policy enforcement under adversarial pressure than with happy-path execution.
Both single and multi-intent scenarios also include unsatisfiable goals—cases where the user's request legitimately can't be fulfilled. Real call volume isn't all happy-path, and agents need to handle "no" gracefully. In ServiceNow's experience, models tend to struggle more with unsatisfiable goals than with successful interactions, often hallucinating workarounds or ignoring policy constraints.
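One practical consequence of this taxonomy is that results are worth slicing by scenario type and satisfiability rather than reporting a single number. The sketch below shows one way to do that. The rows are made-up placeholders, not benchmark results, and the tuple layout is an assumption.

```python
from collections import defaultdict

# Placeholder rows (NOT real results): (scenario_type, satisfiable, passed)
results = [
    ("single", True, True),
    ("single", False, False),
    ("multi", True, True),
    ("multi", False, True),
    ("adversarial", True, False),
    ("adversarial", True, True),
]


def pass_rate_by_slice(rows):
    """Group rows by (type, satisfiability) and return the pass rate per group."""
    totals = defaultdict(lambda: [0, 0])   # slice -> [passed, total]
    for scenario_type, satisfiable, passed in rows:
        key = (scenario_type, "satisfiable" if satisfiable else "unsatisfiable")
        totals[key][0] += int(passed)
        totals[key][1] += 1
    return {key: p / n for key, (p, n) in totals.items()}


for key, rate in pass_rate_by_slice(results).items():
    print(key, rate)
```

A breakdown like this makes it visible when an agent does well on happy-path calls but degrades on unsatisfiable or adversarial ones.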
Authentication as a Consistent Failure Mode
Every domain includes authentication flows, and the design is calibrated to realism. OTP-based elevation appears where a production system would actually require it, not uniformly across all scenarios just for difficulty's sake.
This mirrors findings from prior work (including the original EVA-Bench and τ-Voice): authentication is one of the most consistent failure points for voice agents. It's a combination problem. You need accurate transcription of alphanumeric codes, correct sequencing of multi-step flows, and policy compliance around when elevation is required.
Get any piece wrong and the scenario fails. That makes authentication a useful diagnostic signal for overall agent robustness.
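The "any piece wrong" logic is a conjunction of three checks, which the sketch below spells out. The `AuthAttempt` record, the tool names, and the normalization of spoken codes (dropping spaces, ignoring case) are assumptions for illustration, not the benchmark's scoring code.

```python
from dataclasses import dataclass


@dataclass
class AuthAttempt:
    """Hypothetical record of what the agent did during one authentication flow."""
    heard_code: str   # code as transcribed by the agent
    steps: list       # tool calls in the order the agent made them
    elevated: bool    # whether the agent triggered OTP elevation


def auth_failures(attempt, expected_code, expected_steps, elevation_required):
    """Return the names of the failed checks; empty means authentication passed."""
    failures = []
    # Transcription: compare after dropping spaces and case (an assumption)
    if attempt.heard_code.replace(" ", "").upper() != expected_code:
        failures.append("transcription")
    # Sequencing: the multi-step flow must run in the expected order
    if attempt.steps != expected_steps:
        failures.append("sequencing")
    # Policy: elevate exactly when the scenario requires it
    if attempt.elevated != elevation_required:
        failures.append("policy")
    return failures


flow = ["verify_identity", "send_otp", "confirm_otp"]
ok = AuthAttempt("k7 q2 m9", flow, True)
bad = AuthAttempt("K7Q2N9", ["send_otp", "verify_identity", "confirm_otp"], False)
print(auth_failures(ok, "K7Q2M9", flow, True))   # []
print(auth_failures(bad, "K7Q2M9", flow, True))  # ['transcription', 'sequencing', 'policy']
```

Reporting which check failed, not just that the scenario failed, is what turns authentication into a diagnostic signal.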
Validation: Three Frontier Models, Manual Review, and Solvability
After SyGra generation, every scenario went through multiple rounds of manual review. Reviewers checked that policies were applied consistently within each domain, that user goals admitted exactly one correct resolution, and that expected final states were internally consistent with both the user goal and the initial database.
Then came the solvability check. ServiceNow ran GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 on a text-only version of each scenario, bypassing the audio pipeline entirely and feeding in conversation transcripts directly.
For every scenario where any model scored zero on task completion, they manually investigated whether the failure reflected genuine model error or a dataset bug: ambiguous policy, under-specified user goal, tool executor issue, or database inconsistency. Records with identified dataset issues were corrected or removed.
The result: all selected samples are solvable by at least one frontier model in the text-only setting. That's a strong fairness guarantee: you're not evaluating against an unreachable standard.
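The triage rule described here (investigate any scenario where some model scored zero, and keep only scenarios at least one model can solve) can be sketched as follows. The scenario IDs, model labels, and scores are placeholders, not real results.

```python
# Placeholder task-completion scores from a hypothetical text-only run
scores = {
    "scn-001": {"model_a": 1.0, "model_b": 1.0, "model_c": 1.0},
    "scn-002": {"model_a": 0.0, "model_b": 1.0, "model_c": 1.0},
    "scn-003": {"model_a": 0.0, "model_b": 0.0, "model_c": 0.0},
}


def triage(scores):
    """Split scenarios into those needing manual review and those nobody solved."""
    # Any zero triggers manual investigation: model error or dataset bug?
    needs_review = sorted(
        s for s, by_model in scores.items() if any(v == 0 for v in by_model.values())
    )
    # Scenarios no model solved cannot stay in the set without a fix
    unsolved = sorted(
        s for s, by_model in scores.items() if all(v == 0 for v in by_model.values())
    )
    return needs_review, unsolved


needs_review, unsolved = triage(scores)
print(needs_review)  # ['scn-002', 'scn-003']
print(unsolved)      # ['scn-003']
```

In this toy run, `scn-002` would be reviewed and likely kept if the failure is a genuine model error, while `scn-003` would need to be corrected or removed.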
Domain Deep-Dive: ITSM and Healthcare HRSD
The two new domains each target a distinct failure mode axis. Enterprise ITSM (80 scenarios) is heavy on multi-step troubleshooting and authentication elevation. Think account lockouts, password resets, access requests—workflows where the agent needs to guide the user through a diagnostic tree and escalate at the right moment.
Healthcare HRSD (83 scenarios) introduces real-world policy constraints grounded in US healthcare administration: NPI numbers, FMLA regulations, insurance coverage rules. This domain tests whether the agent can navigate complex, legally binding policies while maintaining conversational fluency.
Both require accurate transcription of structured entities over voice—employee IDs, case numbers, policy codes—but the primary challenge differs. ITSM is about sequencing and elevation. HRSD is about policy compliance under ambiguity.
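The per-domain counts can be reconciled with the headline numbers with a little arithmetic. The airline count below is derived from the figures in the text, not stated in it. Comparing it to the total only approximates the "roughly 4×" claim under the assumption that the original single-domain release was about the same size as the current airline set.

```python
total_scenarios = 213      # stated in the text
itsm, hrsd = 80, 83        # stated in the text

# Derived: what remains for Airline CSM
airline = total_scenarios - itsm - hrsd
print(airline)  # 50

# Assumption: original release was roughly the size of the airline set
print(round(total_scenarios / airline, 2))  # 4.26
```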
Multilingual: Beyond English-Only Deployment
ServiceNow is previewing a multilingual extension that adapts not just conversation language but the entire evaluation pipeline to each target language and culture. That means localized names, phone numbers, email addresses, and location references.
An English scenario might reference "downtown" and "engineering center" with a user named Marcus Chen and a phone number like +1-512-555-0148. The French version replaces those with "centre-ville," "centre d'ingénierie," a user named Éric Nicolas, and a phone number like +33 6 19 41 27 70.
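The substitution described above can be pictured as swapping a locale profile into an otherwise unchanged scenario record. The values below come from the examples in the text. The `LOCALES` structure, the field names, the `localize` function, and the `scenario_id` and `intent` fields are hypothetical.

```python
# Locale profiles built from the examples in the text; the layout is hypothetical
LOCALES = {
    "en": {"user": "Marcus Chen", "phone": "+1-512-555-0148",
           "area": "downtown", "site": "engineering center"},
    "fr": {"user": "Éric Nicolas", "phone": "+33 6 19 41 27 70",
           "area": "centre-ville", "site": "centre d'ingénierie"},
}


def localize(scenario: dict, locale: str) -> dict:
    """Replace locale-sensitive fields, leaving everything else untouched."""
    return {**scenario, **LOCALES[locale]}


scenario_en = {"scenario_id": "demo-1", "intent": "badge_access_request", **LOCALES["en"]}
scenario_fr = localize(scenario_en, "fr")
print(scenario_fr["user"], scenario_fr["phone"])
print(scenario_fr["scenario_id"])  # demo-1
```

Keeping locale-sensitive fields in one profile makes it harder for a stray English name or phone format to leak into a localized scenario.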
This isn't just cosmetic. Speech recognition accuracy, transcription fidelity, and conversational fluency can degrade in language-specific ways. A high-performing voice agent in English can fail completely when deployed in French or Mandarin. English-only eval gives you limited insight into that failure mode.
Beyond the dataset itself, ServiceNow is updating metrics and judges to build trustworthy evaluation across languages. That's the hard part: you need culturally appropriate user simulation, language-specific policy interpretation, and localized ground truth validation.
What This Means for Voice Agent Builders
If you're building or buying a voice agent, EVA-Bench 2.0 gives you a realistic stress-test across multiple enterprise domains. You can run your agent against 213 scenarios spanning 35+ distinct workflows and get a signal on where it breaks.
If you're building your own eval dataset, the post describes the end-to-end generation and validation process in enough detail to serve as a practical reference. The joint-generation approach, decision-tree user goals, and multi-stage validation loop are all directly reusable patterns.
And if you're evaluating multilingual deployments, the coming extension will give you a much better picture of real-world performance than English-only proxies.
The benchmark is fully open-source under the MIT license. All three datasets are live on Hugging Face now.