Poster
in
Workshop: Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance Thu, Jul 9, 2026 • 7:00 PM – 8:00 PM PDT

Certifiable Evaluation: A Low-Rank Framework for Foundation Model Benchmarking with Formal Performance Guarantees

Siddharth Karuturi ⋅ Kaustubh Bukkapatnam ⋅ Laksh Patel ⋅ Tanush A Shastry ⋅ Akshath Sharma ⋅ Mithil Shah ⋅ Matthew Park

Project Page

Abstract

Benchmark scores for foundation models are point estimates that carry no formal guarantees about performance on unseen tasks. We address this gap by developing a rigorous theory of certifiable evaluation grounded in a Latent Capability Model (LCM): the $n \times m$ matrix of model--task performance scores is posited to have rank $r \ll \min(n, m)$, with each score decomposing as an inner product of a model capability vector and a task requirement vector, perturbed by sub-Gaussian noise. Within this framework we prove (i) an information-theoretic lower bound showing that any benchmark with fewer than $r$ tasks cannot produce non-trivial performance certificates for unseen tasks; (ii) an oracle PAC certificate that translates benchmark scores into prediction intervals for unseen tasks, with width scaling as $\mathcal{O}\left(\sigma \sqrt{k}/\sigma_{\min}(V_B)\right)$; (iii) a D-optimal benchmark selection criterion---the Minimum Sufficient Evaluation Set (MSES)---that maximises certificate quality for a given budget, with a greedy algorithm achieving a $(1 - 1/e)$-approximation. We validate all theoretical claims on a cross-benchmark performance matrix comprising 157 publicly available language models evaluated across a unified 86-task cross-benchmark matrix sampled from six benchmark suites whose full task counts total 376. Empirically, the effective rank is $r_{\mathrm{eff}} = 8$ despite 86 observed tasks, implying that the full benchmark suites carry substantial per-suite redundancy (82--95\%, consistent with Table 1). MSES achieves nominal certificate coverage to within 0.8 percentage points at $1 - \delta = 0.95$, while halving certificate width relative to random task selection.