FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights
Abstract
Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a trade-off: they either rely on LLM-as-judge evaluations of automatically generated papers or reduce discovery to isolated performance metrics that serve only as coarse proxies for scientific insight. To address this, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents on their ability to rediscover established findings from recent, high-impact machine learning research. Agents are given only a high-level research question from a published study and must autonomously design experiments, implement code, execute their plans, and derive conclusions supported by empirical evidence. We evaluate a range of state-of-the-art agents with frontier model backbones, such as gpt-5, on FIRE-Bench. Our results show that full-cycle scientific research remains challenging for current agent systems: even the strongest agents achieve limited rediscovery success, exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning. Overall, FIRE-Bench provides a rigorous and diagnostic framework for measuring progress toward reliable agent-driven scientific discovery.