EVA-Bench 2.0: Three Domains, 213 Scenarios, and the Real Cost of Voice AI Eval
ServiceNow's new voice-agent benchmark spans airlines, IT, and healthcare—with joint-generation pipelines, adversarial scenarios, and a coming multilingual expansion.
4 posts
The Open ASR Leaderboard is fighting back against benchmaxxing with a simple but effective strategy: private evaluation datasets that no one can train on.
TII launches QIMMA, a rigorous quality-focused leaderboard for Arabic LLMs that goes beyond translation metrics to measure genuine language understanding and cultural nuance.
Hugging Face just dropped Ecom-RLVE, a reinforcement learning framework that trains e-commerce agents in realistic but controllable environments. This is how we move from chatbots to actually useful shopping assistants.