HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark?
Abstract
Recently, the physics reasoning capabilities of (M)LLMs have attracted growing attention. However, existing physics benchmarks suffer from two major gaps: they neither provide systematic and up-to-date coverage of physics Olympiads nor enable direct performance comparison with humans. To bridge these gaps, we present HiPhO, the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. HiPhO features three key innovations. (1) Comprehensive data: it compiles the 13 latest Olympiads from 2024-2025, covering both international and regional competitions and spanning mixed modalities from text-only to diagram-based problems. (2) Professional evaluation: it adopts official rubrics to perform fine-grained grading at both the answer and step levels, ensuring alignment with human examiners. (3) Human-level comparison: models are awarded gold, silver, and bronze medals based on official score thresholds, enabling direct comparison with human contestants. Our large-scale evaluation of 30 state-of-the-art (M)LLMs shows that, across the 13 exams, most open-source MLLMs remain at or below the bronze level, open-source LLMs demonstrate notable progress with multiple gold medals, and closed-source MLLMs achieve 6-13 gold medals, while most models still fall well short of full marks. These results underscore the substantial gap between current (M)LLMs and top human contestants, as well as the considerable room for further improvement.