Estimating Pass@$k$ from Fewer Samples with Hierarchical Bayesian Priors
Alexandre Verine ⋅ Florian Le Bronnec ⋅ Benjamin Negrevergne ⋅ Alexandre Allauzen
Abstract
Large language models are commonly evaluated on coding tasks using sampling-based metrics such as Pass@$k$, the probability of generating at least one correct solution in $k$ independent generations. Estimating Pass@$k$ curves from limited evaluation samples is important for benchmark design and stress testing, but it can require many generations per task when per-sample success probabilities are small. We study this low-evaluation-budget regime using standard empirical-Bayes hierarchical priors over task-level success probabilities. The resulting posterior-predictive estimators pool information across tasks to estimate dataset-level Pass@$k$ curves and to diagnose when additional sampling is likely to help. We also study a Beta--Binomial improvability diagnostic, $\Delta\mathrm{Pass}$, whose interpretation is tied to the fitted-prior approximation. Across CodeContests, MBPP, and HumanEval, the experiments show complementary regimes: low-Pass@1 tradeoffs, high-Pass@1 Pareto frontiers, and a near-zero boundary-mass setting where explicit zero inflation is particularly informative.
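To make the two kinds of estimator mentioned above concrete, the following is a minimal sketch and not the authors' implementation. It contrasts the widely used combinatorial per-task estimate $1 - \binom{n-c}{k}/\binom{n}{k}$ with a Beta--Binomial posterior-predictive estimate, for a task with $c$ correct solutions among $n$ generations. The function names and the prior parameters `a` and `b` are hypothetical; in an empirical-Bayes setting the prior would be fitted across tasks, a step that is assumed away here by fixing `a` and `b` by hand.

```python
from math import comb


def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Combinatorial estimate 1 - C(n-c, k) / C(n, k); only defined for k <= n."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures were observed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_beta_posterior(n: int, c: int, k: int,
                             a: float = 1.0, b: float = 1.0) -> float:
    """Posterior-predictive Pass@k under a Beta(a, b) prior on the task's
    success probability p (a and b are illustrative, not fitted).

    The posterior is Beta(a + c, b + n - c), and
    E[(1 - p)^k] = prod_{j=0}^{k-1} (b' + j) / (a' + b' + j),
    so k may exceed n.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    a_post = a + c
    b_post = b + (n - c)
    miss = 1.0  # posterior-predictive probability that all k draws fail
    for j in range(k):
        miss *= (b_post + j) / (a_post + b_post + j)
    return 1.0 - miss


def dataset_pass_at_k(counts, k, a=1.0, b=1.0):
    """Average the posterior-predictive Pass@k over (n, c) pairs, one per task."""
    values = [pass_at_k_beta_posterior(n, c, k, a, b) for n, c in counts]
    return sum(values) / len(values)
```

A design note under the same assumptions: the combinatorial estimate returns exactly 0 for a task with no observed successes, and it cannot be evaluated for $k > n$. The posterior-predictive form assigns such a task a small positive value that depends on the prior and can be evaluated at any $k$, which is one way pooled prior information can matter when the sampling budget is small.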