Learning Rewrite-Invariant Reasoning with Targeted Alternation Training
Abstract
Large language models (LLMs) often fail in systematic, model-specific ways under meaning-preserving question rewrites (paraphrases, format changes, benign distractors). We address this instability by identifying where a model's reasoning diverges across semantically equivalent inputs. For each target LLM, we sample multiple solution traces under rewrites and aggregate them into a graph of recurring intermediate steps, which pinpoints where incorrect traces diverge from correct ones. We then generate a small set of semantics-preserving examples that mirror the rewrite patterns most responsible for these divergences, and use them to steer the model (\emph{targeted alternation training}), either via fine-tuning or via in-context learning. Across MMLU-Pro, Big-MATH, and DROP, targeted alternation training yields consistent gains and generalizes across datasets. On Humanity’s Last Exam, using 200 in-context examples, it improves GPT-5.2 (xhigh) from 35.4\% to 38.1\%, demonstrating that targeted alternation training can materially improve a frontier closed model, accessible only via API, under realistic access constraints.
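To make the trace-aggregation step described above concrete, the following is a minimal, purely illustrative sketch of how sampled solution traces might be merged into a graph of recurring intermediate steps and scored for divergence points. The function names (`build_step_graph`, `divergence_points`, `normalize`), the step-canonicalization rule, and the incorrect-fraction scoring heuristic are all assumptions for illustration, not the paper's actual implementation.

```python
from collections import defaultdict

def normalize(step: str) -> str:
    """Canonicalize an intermediate step so recurring steps across
    traces map to the same node (placeholder: lowercase + collapse spaces)."""
    return " ".join(step.lower().split())

def build_step_graph(traces):
    """Aggregate solution traces into a graph of recurring steps.

    `traces` is a list of (steps, is_correct) pairs, where `steps` is the
    ordered list of intermediate steps produced for one rewrite of a question.
    Returns per-edge counts split by whether the trace ended correctly.
    """
    edges = defaultdict(lambda: {"correct": 0, "incorrect": 0})
    for steps, is_correct in traces:
        nodes = [normalize(s) for s in steps]
        label = "correct" if is_correct else "incorrect"
        for a, b in zip(nodes, nodes[1:]):
            edges[(a, b)][label] += 1
    return edges

def divergence_points(edges, min_support=3):
    """Rank nodes whose outgoing edges are dominated by incorrect traces:
    candidate points where reasoning breaks under rewrites."""
    by_source = defaultdict(lambda: {"correct": 0, "incorrect": 0})
    for (src, _dst), counts in edges.items():
        by_source[src]["correct"] += counts["correct"]
        by_source[src]["incorrect"] += counts["incorrect"]
    scored = []
    for node, c in by_source.items():
        total = c["correct"] + c["incorrect"]
        if total >= min_support:
            scored.append((c["incorrect"] / total, node))
    return sorted(scored, reverse=True)

if __name__ == "__main__":
    # Toy example: three traces of the same question under rewrites;
    # one goes wrong after the shared step "isolate x".
    traces = [
        (["read question", "isolate x", "divide by 2", "answer 4"], True),
        (["read question", "isolate x", "divide by 2", "answer 4"], True),
        (["read question", "isolate x", "subtract 2", "answer 3"], False),
    ]
    graph = build_step_graph(traces)
    for score, node in divergence_points(graph, min_support=2):
        print(f"{score:.2f}  {node}")
```

In a full pipeline, the top-ranked divergence nodes would then guide which rewrite patterns to target when generating the semantics-preserving examples used for fine-tuning or in-context steering.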