How Far Can LLM Agents Reason with Tables? Benchmarking Multi-Turn Agentic Table Question Answering in the Wild
Abstract
Recent advances in large language models (LLMs) have substantially expanded the scope of Table Question Answering (TableQA). However, existing benchmarks primarily treat TableQA as a passive, single-turn natural language understanding task and thus cannot evaluate the autonomous reasoning and tool-call trajectories that realistic, multi-turn scenarios demand. To bridge this gap, we introduce TableAgent-Bench, a large-scale bilingual benchmark that reformulates TableQA as proactive, agentic interaction over structurally complex, multi-table environments. Using a topology-aware construction strategy, TableAgent-Bench captures dynamic intent evolution through 1,310 multi-turn dialogues grounded in 2,275 industrial tables. We further propose the Table-centric Agent Evaluation Framework (TAEF) to assess how agents interact with complex table structures. Specifically, TAEF integrates a specialized agent toolset with four metric categories that systematically diagnose intermediate failure modes, measuring performance on dimensions including table localization, tool-invocation rationality, and trajectory-level pass rate. Extensive experiments with 25 state-of-the-art LLM agents reveal a substantial capability gap: even the strongest model, Gemini-3-Pro-Preview, achieves only 53.4% information coverage. We expect TableAgent-Bench to serve as a rigorous testbed for developing and evaluating agents capable of robust table-centric reasoning.