Expo Talk Panel
Advancing Frontier Scientific Capabilities, Today and Tomorrow
Caitlin Oriel
HALL C
Scientific benchmarks are saturating faster than the field can replace them. Scores on SciCode rose from 4.6% to 59% and on HLE from 8% to 50% within a year, yet these gains show limited correlation with improvements in actual scientific workflows. The core issue is structural: current evaluations test isolated subtasks, while the highest-cost steps in scientific work (simulation setup, debugging, literature synthesis, replication) remain largely unmeasured.
This talk presents a framework for frontier scientific evaluation grounded in real-world deployment signals and workflow productivity. It rests on three components: deliverable-based task design, where the unit of evaluation is a work product rather than a final answer; productivity-oriented scoring, measuring time to reviewable draft, iteration efficiency, and salvageability under expert oversight; and substep decomposition, where tasks mirror real workflow stages and each substep is independently gradable for diagnostic signal and scalable verification.
We map O*NET task-importance data against benchmark coverage, showing that the most time-intensive workflow steps have minimal representation. We describe how real workflows in high-stakes enterprise domains such as semiconductor design and manufacturing can be captured and decomposed into structured evaluation tasks through domain collaboration, and present pilot findings comparing scientists completing multi-step tasks manually, with SOTA models, and with models fine-tuned on targeted scientific data.
Attendees will leave with the Signal-to-Value Ladder, a set of six criteria for assessing whether an evaluation predicts real-world scientific and economic productivity. We close with a forward look: as sample-level data delivery approaches its economic ceiling, the field will need to shift from samples and datasets toward modular, composable capability infrastructure, much as software delivery evolved from project-level code to libraries and APIs.