$\tau$-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains
Soham Ray ⋅ Keshav Dhandhania ⋅ Victor Barres ⋅ Karthik Narasimhan
Abstract
Full-duplex voice agents—systems that listen and speak simultaneously—are rapidly moving from research to production. However, existing evaluations address conversational dynamics and task completion in isolation. We introduce $\tau$-Voice, a benchmark for evaluating voice agents on grounded tasks with real-world complexity: agents must navigate complex multi-turn conversations, adhere to domain policies, and interact with the environment. The framework extends $\tau^2$-bench into a novel voice agent benchmark combining verifiable completion of complex grounded tasks, full-duplex interaction, and realistic audio—enabling direct comparison between voice and text performance. A controllable and realistic voice user simulator provides diverse accents, realistic audio environments, and rich turn-taking dynamics; by decoupling simulation from wall-clock time, the user simulator can use the most capable LLM without real-time constraints. We evaluate task completion (pass@1) and voice interaction quality across 278 tasks: while GPT-5 (reasoning) achieves 80\%, voice agents reach only 29--42\% under clean conditions and 19--30\% under realistic conditions with noise and diverse accents—a 50--61pp gap. Qualitative analysis attributes 75--90\% of failures to agent behavior rather than to the simulator or environment, indicating that the gap primarily reflects agent limitations under our evaluation setup. $\tau$-Voice provides a reproducible testbed for measuring progress toward voice agents that are natural, conversational, and reliable.