FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights
Abstract
Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a trade-off: they either rely on LLM-as-judge evaluations of automatically generated papers or reduce discovery to isolated performance metrics that serve only as coarse proxies for scientific insight. To address this, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents on their ability to rediscover established findings from recent, high-impact machine learning research. Agents are given only a high-level research question from a published study and must autonomously design experiments, implement code, execute their plans, and derive conclusions supported by empirical evidence. We evaluate a range of state-of-the-art agents with frontier model backbones, such as gpt-5, on FIRE-Bench. Our results show that full-cycle scientific research remains challenging for current agent systems: even the strongest agents achieve limited rediscovery success, exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning. Overall, FIRE-Bench provides a rigorous and diagnostic framework for measuring progress toward reliable agent-driven scientific discovery.