Over half the world's population speaks more than one language. For many of those speakers, code-switching—fluidly mixing languages mid-sentence—is as natural as breathing. Yet we've been building voice agents as if everyone speaks one language at a time.
ServiceNow just shipped a benchmark designed specifically to test how ASR systems handle code-switched speech. They evaluated seven frontier ASR models across Spanish-English, French-English, Canadian French-English, and German-English pairs, measuring not just transcription accuracy but whether transcription errors propagate into downstream task failures.
This matters because ASR is the first domino in any voice agent pipeline. Get the transcript wrong, and everything downstream—intent classification, slot filling, routing—inherits that error. In enterprise settings like HR helpdesks or IT support, a mistranscribed case number or policy question has real operational consequences.
The Benchmark Architecture
ServiceNow built this from an internal corpus of IT and HR interactions. Their pipeline is clever: start with parallel utterances in English and a target language, filter for 12-40 word turns with at least three switchable content words, then use GPT-5 with a persona prompt to generate realistic code-switched versions.
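The length and switchability filter described above can be sketched roughly as follows. This is an illustration, not ServiceNow's code: the `ParallelTurn` type, its field names, and the choice to count words on the English side are all assumptions, and how "switchable content words" are identified is not specified in the source, so they are taken as a precomputed list here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParallelTurn:
    """One parallel utterance pair (hypothetical structure)."""
    english: str
    target: str
    switchable_words: List[str]  # assumed to be precomputed upstream


def is_candidate(turn: ParallelTurn,
                 min_words: int = 12,
                 max_words: int = 40,
                 min_switchable: int = 3) -> bool:
    """Keep turns of 12-40 words with at least three switchable content words.

    Word count is taken on the English side by whitespace splitting,
    which is an assumption; the source does not say which side is measured.
    """
    n_words = len(turn.english.split())
    long_enough = min_words <= n_words <= max_words
    switchable_enough = len(turn.switchable_words) >= min_switchable
    return long_enough and switchable_enough
```

Turns that pass this check would then go on to the code-switching generation step.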
They're explicit about avoiding the easy cases—utterances dominated by entities like emails or URLs don't count, since those are English by necessity rather than bilingual choice. The synthesized audio comes from ElevenLabs Multilingual V2, and every utterance gets reviewed by a native-speaker linguist before landing in the final dataset.
The result: 259 Spanish-English records, 298 French-English, 188 Canadian French-English, and 173 German-English. They released it through AU-Harness, their voice model evaluation harness.
Three Metrics, Three Lenses
Standard word error rate (WER) is the baseline. But ServiceNow added two semantic metrics that matter more for production systems:
Semantic WER (SWER) judges whether errors are semantically meaningful, using Gemma-4-31B as the evaluator. A typo that preserves meaning scores better than a hallucinated word.
Answer Error Rate (AER) is the functional test. For each utterance, they generate three downstream comprehension questions—can an LLM reading the transcript answer correctly? This directly measures whether transcription failures propagate into task failures. The methodology follows work from IISc/ARTPARK.
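As a rough sketch of how an AER score could be aggregated once the comprehension questions have been graded: the function below assumes AER is the pooled fraction of questions answered incorrectly. The source does not spell out the exact aggregation, so both the pooling and the input format are assumptions.

```python
from typing import List


def answer_error_rate(graded: List[List[bool]]) -> float:
    """Fraction of comprehension questions answered incorrectly.

    `graded` holds one list per utterance; each entry is True when the
    LLM answered that question correctly from the ASR transcript.
    Pooling across utterances is an assumption, not the benchmark's
    documented definition.
    """
    total = sum(len(answers) for answers in graded)
    if total == 0:
        raise ValueError("no graded answers")
    wrong = sum(1 for answers in graded for ok in answers if not ok)
    return wrong / total
```

With three questions per utterance, a transcript that loses one key detail would typically cost one of its three answers, which is why AER is sensitive to small but important errors.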
The gap between WER and AER reveals which models prioritize raw accuracy versus semantic preservation. That distinction turns out to be illuminating.
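For reference, standard WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a minimal version; the lowercasing and whitespace tokenization are simplifying assumptions, and a real harness would likely normalize punctuation and numerals as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Note that every substituted word costs the same here, whether it is a harmless spelling variant or a wrong case number. That is the limitation the semantic metrics are meant to address.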
The Results: Who Won
ElevenLabs Scribe V2 and AssemblyAI Universal-3 Pro were effectively tied on transcription accuracy, separated by at most 0.13 percentage points on any language pair. Google Gemini 3 Flash followed closely, trailing by 0.12-0.14 points.
Deepgram Nova-3, Mistral Voxtral Small 24B-2507, and Nvidia Parakeet TDT 0.6b V3 clustered in the middle tier. Each pulled ahead on at least one language pair—Parakeet closed the gap on German-English specifically.
OpenAI Whisper Large V3 Turbo landed at the bottom with WER ranging from 0.16 to 0.61. But there's a twist: when called without an explicit language parameter on code-switched audio, Whisper defaults to translating everything into English rather than transcribing. It entirely fails to preserve the matrix language, which is a known limitation but a dramatic one in this context.
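One way to avoid relying on that default is to pass the language and task explicitly. The helper below is a sketch that assumes the open-source openai-whisper package, whose `transcribe()` method accepts `language` and `task` as decoding options. The helper name is hypothetical, and whether this matches how the benchmark invoked Whisper is not stated in the source.

```python
def transcribe_in_language(model, audio_path: str, language_code: str) -> str:
    """Ask a Whisper-style model to transcribe (not translate) in a given language.

    `model` is assumed to expose openai-whisper's transcribe() interface,
    e.g. an object returned by whisper.load_model(...).
    """
    result = model.transcribe(
        audio_path,
        language=language_code,  # e.g. "es" for a Spanish matrix language
        task="transcribe",       # explicitly request transcription, not translation
    )
    return result["text"].strip()
```

Forcing the matrix language may in turn affect how the embedded English words are rendered, so the output is worth checking against both monolingual baselines.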
Semantic Metrics Tell a Different Story
Under SWER and AER, the rankings shift. Scribe V2 still dominates, but Gemini 3 Flash consistently beats AssemblyAI on AER despite trailing slightly on raw WER. As a large audio language model, Gemini is optimized for language understanding—that architectural choice pays dividends when meaning matters more than character-perfect transcription.
Whisper's underperformance narrows considerably under semantic metrics. Its tendency to translate rather than transcribe means it preserves meaning even while failing the transcription task. Not ideal, but less catastrophic than raw WER suggests.
The one outlier: Deepgram Nova-3 sits mid-tier on SWER but drops to last or second-to-last on AER across all language pairs. In other words, it does worse on the specific details that matter most (case numbers, dates, names) than its overall semantic error rate suggests. That's a debugging signal.
The Cost of Code-Switching
Here's the question I wanted answered: do these errors come from transcription difficulty generally, or from code-switching specifically?
ServiceNow isolated this by running every utterance through three conditions: code-switched, monolingual matrix-language, and monolingual English. The delta between code-switched and monolingual reveals the switching penalty.
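The delta computation itself is simple. Here is a minimal sketch, with a hypothetical function name and made-up numbers in the usage note rather than benchmark results:

```python
from typing import Dict


def switching_penalty(wer_code_switched: float,
                      wer_matrix: float,
                      wer_english: float) -> Dict[str, float]:
    """Error-rate deltas of code-switched speech against each monolingual baseline.

    Positive values mean the model does worse on code-switched audio;
    negative values mean it does better than on that baseline.
    """
    return {
        "vs_matrix": wer_code_switched - wer_matrix,
        "vs_english": wer_code_switched - wer_english,
    }
```

For illustrative inputs of 0.30 (code-switched), 0.25 (monolingual matrix language), and 0.10 (monolingual English), the penalty is about 0.05 against the matrix language and about 0.20 against English. The same function applies to SWER or AER in place of WER.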
Scribe V2, Gemini 3 Flash, and AssemblyAI showed the smallest deltas. Scribe notably outperformed its own monolingual matrix-language (L2) baseline in some cases, pointing to genuine robustness against bilingual input rather than just strong monolingual baselines.
A structural pattern emerged: degradation relative to English was almost always larger than degradation relative to the matrix language. That makes sense: the L2 baseline is already harder for most models, so the net switching penalty shrinks when measured against it.
Whisper showed the largest degradation relative to English, peaking at +0.85 on German-English. It's also the only model that performed better on code-switched speech than on monolingual L2—a direct artifact of its translation default.
What Breaks ASR on Code-Switching
ServiceNow fit a two-part model to understand failure modes: first, logistic regression to identify conditions associated with any error; second, analysis of what types of errors occur when the model does fail.
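To make the first stage concrete, here is a small logistic regression fitted by plain gradient descent, using only the standard library. It is a sketch of the general technique, not ServiceNow's model: the feature used in the usage note (number of switch points per utterance) is hypothetical, and the second stage (error-type analysis) is not shown.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_error_probability(weights: Sequence[float], bias: float,
                              x: Sequence[float]) -> float:
    """Predicted probability that an utterance with features x has any error."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(rows: List[Sequence[float]], labels: List[int],
                 lr: float = 0.1, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss.

    rows:   one feature vector per utterance
    labels: 1 if the transcript contained any error, else 0
    """
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_error_probability(weights, bias, x) - y
            for k in range(n_features):
                grad_w[k] += err * x[k]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias
```

Fitted on toy data such as `rows = [[0], [1], [1], [2], [3], [3], [4], [5]]` with `labels = [0, 0, 0, 0, 1, 1, 1, 1]`, the single weight comes out positive, meaning more switch points are associated with a higher chance of error in that made-up sample. In practice a library implementation with regularization and significance tests would be the more sensible choice.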
The source content cuts off before revealing full results, but the methodology is sound, and this is the kind of error analysis that matters for production debugging. Knowing whether models fail at language boundaries, on low-frequency words in the embedded language, or when switching happens mid-phrase gives you actionable paths for data augmentation or prompting strategies.
Why This Benchmark Matters
Code-switching isn't an edge case. It's how billions of people actually talk. Contact centers, customer support, and enterprise helpdesks serve bilingual populations who don't wait for a language-selection menu before asking their question.
Until now, evaluating voice agents on code-switched speech meant rolling your own dataset or hoping your production errors surfaced the gaps. ServiceNow shipped a repeatable benchmark with semantic metrics that mirror real task performance.
The practical takeaway: if you're deploying voice agents for bilingual populations, ElevenLabs Scribe V2, Gemini 3 Flash, and AssemblyAI Universal-3 Pro are your top choices today. If you care more about semantic correctness than character-perfect transcription (and in most enterprise contexts you should), Gemini's large audio language model (LALM) architecture gives it an edge despite slightly higher raw WER.
And if you're using Whisper in a code-switching context without explicit language hints, you're effectively running a translation model. That might be fine for some use cases, but know what you're getting.
The data and harness are open. Go benchmark your own stack.