Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective at enhancing LLM reasoning, yet recent evidence shows that models such as Qwen2.5 achieve significant gains even when trained with spurious rewards. We investigate this phenomenon and identify a ``Perplexity Paradox'': spurious RLVR triggers a divergence in which answer-token perplexity drops while prompt-side coherence degrades, suggesting that the model bypasses reasoning in favor of memorization. Using a suite of mechanistic interpretability tools, including Path Patching and the Logit Lens, we identify a previously unknown Anchor–Adapter circuit that enables the model to sidestep reasoning and directly retrieve memorized solutions under spurious RLVR. We localize a Functional Anchor in the middle layers (L18–20) that triggers retrieval of memorized solutions, followed by Structural Adapters in the later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit allows bidirectional causal steering, i.e., artificially amplifying or suppressing contamination-driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR-tuned models.
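As a concrete illustration of the steering intervention summarized above, the sketch below rescales the activations of selected MLP neurons (the ``keys'') in the anchor layers using PyTorch forward pre-hooks. The anchor layer indices (L18–20) come from the abstract; the neuron indices and the scale factor are hypothetical placeholders, since in practice the causal keys would be identified via path patching. This is a minimal sketch under those assumptions, not the paper's exact implementation.

```python
# Hedged sketch: bidirectional steering by rescaling candidate MLP "keys".
# Assumptions: Llama-style module layout (model.model.layers[i].mlp.down_proj),
# hypothetical neuron indices, and an illustrative scale factor ALPHA.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B"          # model family named in the abstract
ANCHOR_LAYERS = [18, 19, 20]            # Functional Anchor layers (from abstract)
CANDIDATE_NEURONS = [1024, 2048, 4095]  # hypothetical key indices; in practice
                                        # these would come from path patching
ALPHA = 0.0                             # 0.0 suppresses the circuit, >1.0 amplifies it


def make_key_scaling_hook(neuron_ids, alpha):
    """Pre-hook on down_proj: rescales the activations of selected MLP
    neurons, which scales the contribution of their value vectors."""
    def hook(module, inputs):
        (hidden,) = inputs
        hidden = hidden.clone()
        hidden[..., neuron_ids] = hidden[..., neuron_ids] * alpha
        return (hidden,)
    return hook


model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Attach the scaling hooks to the down-projection of each anchor-layer MLP.
handles = []
for layer_idx in ANCHOR_LAYERS:
    mlp = model.model.layers[layer_idx].mlp
    handles.append(
        mlp.down_proj.register_forward_pre_hook(
            make_key_scaling_hook(CANDIDATE_NEURONS, ALPHA)
        )
    )

prompt = "Solve: what is 17 * 23?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))

# Remove the hooks to restore the unmodified model.
for h in handles:
    h.remove()
```

Comparing generations (or answer-token perplexity) at ALPHA = 0.0 versus ALPHA > 1.0 on suspected-contaminated prompts would correspond to the suppression and amplification directions of the steering experiment.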