Atomic Chess as a Counterfactual Benchmark for Quantifying Rule-Conditioned Generalization
Ryan Co ⋅ Karthik R Konuganti
Abstract
Quantifying when LLMs apply a stated rule rather than recite training-distribution patterns requires benchmarks with controllable hardness and a decomposable error surface. We introduce a benchmark construction for *rule-conditioned generalization* built on atomic chess, a variant that preserves the board, pieces, and text notation, but in which a capture explodes adjacent pieces. The construction has three properties relevant to quantitative evaluation: (i) a tunable hardness parameter, *variant divergence*, that filters to positions where the standard-rule prior conflicts with the counterfactual rule, shifting evaluation from average performance to worst-case prior-conflict performance; (ii) a paired oracle structure that permits comparing the same position under both rule sets on a calibrated Win% scale; and (iii) decomposability of failures for fine-grained attribution. On 200 source-balanced variant-divergent positions, Claude Opus 4.6 and GPT-5.4 incur $2.1$--$4.6\times$ higher mean Win% loss under atomic rules than under standard rules on identical FENs, with uncertainty reported as stratified bootstrap intervals. Failure attribution by trace inspection localizes the dominant composition-level mode as *unpropagated refutation*: the model applies the rule correctly at the local level, but that application fails to control action selection. We position the construction as a template for benchmarks with well-defined diagnostic properties for compositional generalization under prior conflict.
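As an illustration of the headline statistic, the following is a minimal sketch of a ratio of mean Win% losses with a stratified percentile bootstrap interval. It is not the authors' implementation: the function name `stratified_bootstrap_ratio`, the record layout `(source, loss_atomic, loss_standard)`, the choice of position source as the stratum, and the percentile method are all assumptions made for illustration, and the numbers in the usage lines are invented.

```python
import random
from statistics import mean


def stratified_bootstrap_ratio(records, n_boot=2000, alpha=0.05, seed=0):
    """Ratio of mean atomic Win% loss to mean standard Win% loss.

    records: list of (source, loss_atomic, loss_standard), one per position,
    where both losses are measured on the same FEN (hypothetical layout).
    Returns (point_estimate, lower, upper) using a percentile bootstrap that
    resamples positions with replacement within each source stratum.
    """
    rng = random.Random(seed)

    # Group the paired losses by stratum so each resample keeps the
    # original number of positions per source.
    strata = {}
    for source, loss_atomic, loss_standard in records:
        strata.setdefault(source, []).append((loss_atomic, loss_standard))

    def ratio(sample):
        # Assumes the mean standard-rule loss is strictly positive.
        return mean(a for a, _ in sample) / mean(s for _, s in sample)

    point = ratio([pair for group in strata.values() for pair in group])

    stats = []
    for _ in range(n_boot):
        resample = []
        for group in strata.values():
            resample.extend(rng.choices(group, k=len(group)))
        stats.append(ratio(resample))

    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper


# Usage with made-up Win% losses for four positions from two sources.
demo = [
    ("source_a", 10.0, 5.0),
    ("source_a", 20.0, 5.0),
    ("source_b", 30.0, 10.0),
    ("source_b", 20.0, 10.0),
]
point, lower, upper = stratified_bootstrap_ratio(demo, n_boot=200)
```

Resampling within each stratum keeps the resampled set source-balanced, and resampling the two losses as a pair preserves the identical-FEN pairing that the comparison relies on.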