ShiftBench: A Benchmark for Per-Cohort Certify-or-Abstain Decisions on Positive Predictive Value Under Covariate Shift
Abstract
Many machine learning benchmarks emphasise aggregate accuracy on a single dataset, while formal statistical confidence in each cohort's performance under deployment shift is rarely the primary unit of reporting. Standard evaluation tools typically assume independently and identically distributed sampling, an assumption that has been observed to fail in a range of real deployment settings. Which shift-aware tools are appropriate for per-cohort hypothesis testing on aggregate metrics does not appear to have been fully characterised in the literature we surveyed. We present ShiftBench, a benchmarking protocol that takes a trained classifier, a calibration set, and a target distribution and returns a per-cohort certify-or-abstain decision on positive predictive value (PPV) with control over the family-wise false-certification rate. Across the 38 real datasets and 9 candidate methods we evaluated, we observed no false certifications in 19,600 boundary-null trials, and the results provide a tentative ranking of methods by reliability and statistical power. A methodological audit conducted as part of this work suggests that weighted conformal prediction may be anti-conservative if its lower-bound formula is used directly as a hypothesis test on PPV, and identifies alternative bounds that appear to retain validity under the protocol.
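
To make the shape of a per-cohort certify-or-abstain decision concrete, the following is a minimal illustrative sketch, not the ShiftBench implementation. Everything in it is an assumption introduced for illustration: the function name `certify_or_abstain`, the PPV threshold `tau`, the use of precomputed importance weights (target-to-calibration density ratios) to account for covariate shift, a Hoeffding-style margin computed from the effective sample size, and a Bonferroni split of `alpha` across cohorts as one simple way to control a family-wise error rate. The effective-sample-size margin in particular is a heuristic simplification and is not claimed to be a valid bound under arbitrary weights.

```python
import math
from collections import defaultdict


def certify_or_abstain(labels, weights, cohorts, tau, alpha=0.05):
    """Hypothetical sketch of a per-cohort certify-or-abstain rule on PPV.

    labels  : 1 if a predicted-positive calibration point is truly positive, else 0
    weights : assumed importance weights (target / calibration density ratio), > 0
    cohorts : cohort identifier for each point
    tau     : assumed PPV threshold a cohort must provably exceed
    alpha   : family-wise budget for false certifications
    """
    groups = defaultdict(list)
    for y, w, c in zip(labels, weights, cohorts):
        groups[c].append((y, w))
    if not groups:
        return {}

    # Bonferroni: split the family-wise budget evenly across cohorts.
    alpha_each = alpha / len(groups)

    decisions = {}
    for c, pts in groups.items():
        sum_w = sum(w for _, w in pts)
        sum_w2 = sum(w * w for _, w in pts)
        # Self-normalised importance-weighted PPV estimate.
        ppv_hat = sum(w * y for y, w in pts) / sum_w
        # Effective sample size shrinks as the weights become more uneven.
        n_eff = sum_w * sum_w / sum_w2
        # Heuristic one-sided Hoeffding-style margin (assumption, see lead-in).
        margin = math.sqrt(math.log(1.0 / alpha_each) / (2.0 * n_eff))
        lower = ppv_hat - margin
        decisions[c] = "CERTIFY" if lower >= tau else "ABSTAIN"
    return decisions


# Toy data: a large clean cohort and a small, less certain one.
labels = [1] * 200 + [1] * 9 + [0]
weights = [1.0] * 210
cohorts = ["a"] * 200 + ["b"] * 10
decisions = certify_or_abstain(labels, weights, cohorts, tau=0.8)
```

In this toy call, cohort `a` has enough evidence for its lower bound to clear `tau`, while cohort `b` has too few points and the rule abstains. Abstaining does not mean the cohort's PPV is low, only that the available evidence does not support certification at the chosen level.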