$\tau$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge
Quan Shi ⋅ Alexandra Zytek ⋅ Pedram Razavi ⋅ Karthik Narasimhan ⋅ Victor Barres
Abstract
Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on acquiring and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use in isolation, and rarely test whether agents can operationalize non-parametric knowledge to drive outcomes over long-horizon conversations. To remedy this, we introduce $\tau$-Knowledge, an extension of $\tau$-Bench that evaluates agents in environments where task success requires retrieving, reasoning over, and applying knowledge from a natural-language corpus. Our new domain, $\tau$-Banking, models realistic fintech customer support workflows in which agents must coordinate external knowledge with tool outputs to deliver verifiable, policy-compliant state changes over long-horizon conversations. $\tau$-Knowledge is substantially challenging: frontier models with high reasoning budgets reach only $\sim$21\% pass$^1$, with reliability degrading sharply over repeated trials. We hope $\tau$-Knowledge provides a realistic testbed for developing conversational agents that integrate non-parametric knowledge in human-facing deployments.