AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Conditions
Ruipeng Wang ⋅ Yuxin Chen ⋅ Yukai Wang ⋅ Chang Wu ⋅ Junfeng Fang ⋅ Xiaodong Cai ⋅ Qi GU ⋅ Hui Su ⋅ An Zhang ⋅ Xiang Wang ⋅ Xunliang Cai ⋅ Tat-Seng Chua
Abstract
As LLM-based agents are increasingly deployed in real-world workflows, existing agent benchmarks---often built on idealized, noise-free assumptions---fall short of characterizing agents' robustness under imperfect user instructions and unreliable tool feedback. To address this gap, we introduce **AgentNoiseBench**, a systematic evaluation framework for *interactive noise robustness* in LLM agents. AgentNoiseBench models two primary noise sources: *user-side instruction noise* arising from ambiguity and variability in human requests, and *tool-side result noise* caused by failures, partial outputs, and erroneous or distracting tool responses. The benchmark covers two representative agentic settings: (i) *multi-step tool use* with DeepSearch-style retrieval agents on multi-hop QA tasks, and (ii) *multi-turn user--agent interaction* via adaptations of $\tau^{2}$-Bench and VitaBench to support controlled noise injection. We further provide a modular noise injection pipeline with controllable location and intensity, together with multi-dimensional metrics that go beyond final success to capture degradation trends, decision instability, and compute overhead. Evaluating 25 tool-using models across reasoning and non-reasoning families, we find that tool-side noise generally induces substantially larger performance degradation and trajectory drift than user-side noise, and that some strong reasoning models exhibit a "reasoning trap", spending markedly more tokens and steps under corrupted tool feedback while still making confident errors. Overall, AgentNoiseBench provides a practical testbed for diagnosing failure modes and advancing robust agent design for real deployments.
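To make the noise injection pipeline concrete, the sketch below illustrates one plausible shape for a controllable injector over string-valued messages. It is a minimal illustration, not the paper's implementation: the names `NoiseSpec` and `inject_noise`, the noise kinds, and the string-based interface are all assumptions; only the controllable *location* and *intensity* dimensions come from the abstract.

```python
import random
from dataclasses import dataclass

# Hypothetical spec: "location" and "intensity" mirror the controllable
# dimensions named in the abstract; the field names are illustrative.
@dataclass
class NoiseSpec:
    location: str      # "user" (instruction noise) or "tool" (result noise)
    kind: str          # e.g. "truncate", "error", "distract"
    intensity: float   # probability of corrupting a given message

def inject_noise(message: str, spec: NoiseSpec, rng: random.Random) -> str:
    """Corrupt a user instruction or tool result according to the spec."""
    if rng.random() >= spec.intensity:
        return message  # leave this message clean
    if spec.kind == "truncate":
        return message[: max(1, len(message) // 2)]           # partial output
    if spec.kind == "error":
        return "ERROR: tool call failed (simulated timeout)"  # hard failure
    if spec.kind == "distract":
        return message + "\n[NOTE] Unrelated retrieved snippet appended."
    return message

# Example: corrupt roughly half of all tool results with distracting content.
rng = random.Random(0)
spec = NoiseSpec(location="tool", kind="distract", intensity=0.5)
noisy = inject_noise("Paris is the capital of France.", spec, rng)
```

Keeping the injector stateless and seeded, as above, makes corrupted trajectories reproducible across models, which is what allows degradation trends and decision instability to be compared fairly.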