SafetyRepro: Configuration-Conditional Rank Instability on Alignment Benchmarks
Abstract
Pairwise model comparisons drawn from foundation-model benchmarks are often read as quantitative verdicts, yet they can hinge on under-specified harness choices such as prompt templates, decoding settings, few-shot levels, scoring rules, and quantization. We close one theory–benchmark loop on this comparison primitive by introducing a finite-envelope proposition that links a measurable pairwise-disagreement rate to the existence of strict reversals between configuration pairs. We pair this proposition with a commit-stamped evaluation protocol over widely cited alignment-related benchmarks. Across TruthfulQA, BBQ, ToxiGen, CrowS-Pairs, and XSTest, configuration choice alone can flip pairwise model verdicts within the tested envelope. The resulting operator-controllable rank-flip metric isolates a concrete failure mode, strict reversal, and shows that claims such as “model A is safer than model B” are properties of the model–harness pair, not of the model alone.