Certifying Capabilities from Finite Tests: When Is It Possible?
Changlong Wu ⋅ Jin Sima ⋅ Wojciech Szpankowski
Abstract
Modern foundation models are evaluated through broad capabilities such as arithmetic, reasoning, safety, and robustness, yet it remains unclear, in any principled sense, when *finite tests* can meaningfully certify such claims. We develop a rigorous theory of capability evaluation by formalizing evaluation as inference over a task family and asking when guarantees over the full family can be inferred from a strict subset of tests. We analyze two canonical regimes. In stochastic multi-environment evaluation, we characterize when uniform certification is possible across multiple environments and show that the sample complexity is governed by a $\chi^2$-radius of the environment family, yielding near-optimal evaluation protocols with matching lower bounds under a natural overlap condition. In contrast, for worst-case, rule-like capabilities, we establish fundamental impossibility results: even for structured model classes such as Boolean circuits of bounded size, black-box evaluation cannot, in general, certify global properties. Together, these results provide a principled framework for understanding when finite evaluation can, and cannot, certify capabilities.
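The abstract does not define the $\chi^2$-radius itself; as context, the standard formalization of a divergence radius (an assumption here, following common usage, and not necessarily the exact definition used in the body of the paper) for an environment family $\mathcal{P}$ is the minimax $\chi^2$-divergence to a single reference distribution:

```latex
% Sketch of the standard divergence-radius definition (assumed, not quoted from the paper):
% the chi-squared radius of a family \mathcal{P} is the best achievable worst-case
% chi-squared divergence from a single reference measure Q to members of the family.
\[
  R_{\chi^2}(\mathcal{P})
    \;=\; \inf_{Q} \, \sup_{P \in \mathcal{P}} \chi^2(P \,\|\, Q),
  \qquad
  \chi^2(P \,\|\, Q) \;=\; \int \frac{(\mathrm{d}P - \mathrm{d}Q)^2}{\mathrm{d}Q}.
\]
```

Under this reading, a small radius means a single reference environment overlaps well with the whole family, which is consistent with the abstract's "natural overlap condition" governing sample complexity.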