SimulCost: A Cost-Aware Benchmark and Toolkit for Automating Physics Simulations with LLMs
Yadi Cao ⋅ Sicheng Lai ⋅ Jiahe Huang ⋅ Yang Zhang ⋅ Zach Lawrence ⋅ Rohan Bhakta ⋅ Izzy Thomas ⋅ Mingyun Cao ⋅ Chung-Hao Tsai ⋅ Zihao Zhou ⋅ Yidong Zhao ⋅ Hao Liu ⋅ Alessandro Marinoni ⋅ Alexey Arefiev ⋅ Rose Yu
Abstract
Evaluation of LLM agents for scientific tasks has focused on token costs while ignoring tool-use costs such as simulation time and experimental resources. As a result, metrics like pass@k become impractical under realistic budget constraints. To address this gap, we introduce SimulCost, the first benchmark targeting cost-sensitive parameter tuning in scientific simulations. SimulCost compares LLMs tuning cost-sensitive parameters against traditional parameter-scanning approaches in both accuracy and computational cost, spanning 2,916 single-round (initial guess) and 1,900 multi-round (trial-and-error adjustment) tasks across 12 simulators from fluid dynamics, solid mechanics, and plasma physics. Each simulator's cost is analytically defined and platform-independent. Frontier LLMs achieve 46--64\% success rates in single-round mode, dropping to 35--54\% under high accuracy requirements, which makes their initial guesses unreliable, especially for high-accuracy tasks. Multi-round mode improves success rates to 71--80\%, but LLMs are 1.5--2.5$\times$ slower than traditional scanning, making them an uneconomical choice. We also investigate correlations between parameter groups to assess knowledge-transfer potential, as well as the impact of in-context examples and reasoning effort, yielding practical implications for deployment and fine-tuning. We open-source SimulCost as both a static benchmark and an extensible toolkit to facilitate research on cost-aware agentic designs for scientific simulations and on adding new simulation environments.