InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation
Abstract
While Large Language Models (LLMs) hold promise for automating science and education, generating interactive scientific demonstrations demands a complex synthesis of deep domain knowledge and precise reactive coding. Current benchmarks fail to capture this synergy, largely bifurcating into either static code generation or text-only reasoning. To address this gap, we introduce \textsc{InteractScience}, the first benchmark dedicated to evaluating the holistic creation of interactive scientific applications. We propose a novel hybrid framework that integrates programmatic functional testing, which verifies application logic, with visually-grounded qualitative assessment, which judges rendering fidelity. Our evaluation of 30 leading models across five disciplines reveals critical gaps in grounding scientific reasoning within interactive interfaces. By standardizing the evaluation of this combined capability, \textsc{InteractScience} establishes a crucial foundation for reliable AI-driven tools in science and education.