How Far Can LLM Agents Reason with Tables? Benchmarking Multi-Turn Agentic Table Question Answering in the Wild
Abstract
Recent advances in large language models (LLMs) have substantially expanded the scope of Table Question Answering (TableQA). However, existing benchmarks primarily treat TableQA as a passive, single-turn natural language understanding task and thus cannot evaluate the autonomous reasoning and tool-call trajectories that realistic, multi-turn scenarios demand. To bridge this gap, we introduce TableAgent-Bench, a large-scale bilingual benchmark that reformulates TableQA as proactive, agentic interaction over structurally complex, multi-table environments. Using a topology-aware construction strategy, TableAgent-Bench captures dynamic intent evolution through 1,310 multi-turn dialogues grounded in 2,275 industrial tables. We further propose the Table-centric Agent Evaluation Framework (TAEF) to assess how agents interact with complex table structures. Specifically, TAEF integrates a specialized agent toolset with four metric categories that systematically diagnose intermediate failure modes, measuring performance on dimensions including table localization, tool-invocation rationality, and trajectory-level pass rate. Extensive experiments with 25 state-of-the-art LLM agents reveal a substantial capability gap: even the strongest model, Gemini-3-Pro-Preview, achieves only 53.4% information coverage. We expect TableAgent-Bench to serve as a rigorous testbed for developing and evaluating agents capable of robust table-centric reasoning.