From Forecast Scores to Auditable Benchmarks: WorldFork for LLM Forecasting Evaluation
Hanson Wen ⋅ Shing C Gui
Abstract
Foundation-model forecasting benchmarks often report aggregate scores without specifying how uncertainty, leakage, endpoint semantics, or extraction choices affect whether a result should generalize. We introduce WorldFork as a benchmark-design case study for LLM forecasting agents: a public event card is converted into branching timelines with actor state, endpoint ledgers, path mass, unresolved mass, provenance, and a scoring-rule-compatible extraction rule. The central object is therefore not only a forecast probability, but an auditable record of how uncertainty moves through decomposition, branch policy, endpoint settlement, and report generation. On 24 masked retrospective resolved-event cards, unconditional branching reduces WorldFork Brier score from 0.282 to 0.214 and log score from 0.725 to 0.581; a fixed 50/50 blend with a direct JSON forecast reaches Brier 0.205. We treat these numbers as descriptive stress-test evidence, not a guarantee: retrospective masking only partially controls leakage, the exact sign test is suggestive but not significant ($p=0.064$), the paired bootstrap interval includes zero, and multiple comparisons were explored. The contribution is a guarantee-oriented benchmark protocol that makes pre-registration, leakage audit, uncertainty composition, and trace-level failure analysis explicit for future locked evaluations.
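For readers unfamiliar with the metrics named above, the following is a minimal sketch of how a Brier score, a log score, a fixed 50/50 blend, and an exact two-sided sign test can be computed for binary-outcome forecasts. It is not the WorldFork implementation. The function names, the natural-log negative-log-likelihood convention for the log score, the clipping constant, and the toy inputs are all assumptions made for illustration.

```python
import math
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_score(probs: Sequence[float], outcomes: Sequence[int],
              eps: float = 1e-12) -> float:
    """Mean negative log-likelihood (natural log); lower is better.

    Assumption: the source's "log score" follows this convention.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def blend(a: Sequence[float], b: Sequence[float],
          weight: float = 0.5) -> list:
    """Fixed linear blend of two forecast vectors (weight applies to `a`)."""
    return [weight * x + (1.0 - weight) * y for x, y in zip(a, b)]


def exact_sign_test(wins: int, losses: int) -> float:
    """Two-sided exact sign test p-value under H0: P(win) = 0.5.

    Ties are assumed to have been dropped before counting.
    """
    n = wins + losses
    k = min(wins, losses)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Toy values only; these are not data from the benchmark.
    branching = [0.7, 0.2, 0.6]
    direct = [0.9, 0.4, 0.5]
    outcomes = [1, 0, 1]
    mixed = blend(branching, direct)
    print("Brier:", brier_score(mixed, outcomes))
    print("Log score:", log_score(mixed, outcomes))
    print("Sign test p:", exact_sign_test(17, 7))
```

As an arithmetic check, `exact_sign_test(17, 7)` evaluates to roughly 0.064 for 24 untied pairs. Whether that is the split behind the reported $p=0.064$ is not stated in the abstract.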