Selective Perturbations as a Diagnostic for Benchmark-Based LLM Comparisons
Abstract
Benchmark accuracy is a useful summary of model performance, but it does not show how sensitive a model comparison is to question wording. We study this sensitivity with selective perturbations: small edits to multiple-choice questions that change the answer of one target model while preserving other models' answers. We implement this idea with a reference-preserving search constraint and evaluate the resulting perturbations both on the reference models used during search and on unseen models held out from it. On the full MMLU dev split, unconstrained perturbations often degrade several models at once. With the selectivity constraint, a large target-specific component remains: across Gemma-3-12B, Llama-3.1-8B, and Qwen3.5-9B, target accuracy drops by 0.38 to 0.44, while reference-model drops are at most 0.04 and unseen-model drops at most 0.10. Smaller supporting experiments (on GPQA Diamond, within the Gemma family, with Gemini-2.5-Flash as the target, and with selective improvement) show the same qualitative pattern. Manual inspection suggests that the target-specific component is structured: Qwen3.5-9B is more often affected by coarse substitutions that corrupt domain anchors, while Gemma-3-12B is affected by milder edits such as near-synonyms, register shifts, and casing changes. These results suggest that aggregate benchmark scores can hide not only how often models fail, but also which local changes expose their failures.
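As an illustration only, the selectivity criterion described above can be sketched in Python. The sketch assumes each model is a function from question text to an answer label, that candidate edits are supplied by some external generator, and that an accuracy drop is accuracy on the original questions minus accuracy on the perturbed ones. The names `is_selective`, `first_selective`, and `accuracy_drop` are hypothetical and are not taken from the paper's implementation, and the actual search procedure may differ.

```python
from typing import Callable, Iterable, Optional, Sequence

# Assumption: a model maps a question string to an answer label such as "A".
Model = Callable[[str], str]


def is_selective(original: str, perturbed: str,
                 target: Model, references: Sequence[Model]) -> bool:
    """True if the edit changes the target's answer but no reference's answer."""
    target_changed = target(perturbed) != target(original)
    references_preserved = all(
        ref(perturbed) == ref(original) for ref in references
    )
    return target_changed and references_preserved


def first_selective(original: str, candidates: Iterable[str],
                    target: Model, references: Sequence[Model]) -> Optional[str]:
    """Return the first candidate edit that satisfies the constraint, if any."""
    for candidate in candidates:
        if is_selective(original, candidate, target, references):
            return candidate
    return None


def accuracy_drop(model: Model, originals: Sequence[str],
                  perturbed: Sequence[str], gold: Sequence[str]) -> float:
    """Accuracy on original questions minus accuracy on perturbed questions."""
    if not (len(originals) == len(perturbed) == len(gold)) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    acc_original = sum(model(q) == g for q, g in zip(originals, gold)) / len(gold)
    acc_perturbed = sum(model(q) == g for q, g in zip(perturbed, gold)) / len(gold)
    return acc_original - acc_perturbed
```

Under these assumptions, calling `accuracy_drop` once for the target, once for each reference model, and once for each held-out model gives the three kinds of drops that the abstract compares.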