Attention Illuminates LLM Reasoning: The Uncovered Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization
Abstract
The reasoning patterns of large language models (LLMs) remain opaque, and reinforcement learning (RL) typically assigns uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work treats attention as a natural substrate for interpreting LLM reasoning and as a window for aligning optimization with the model's internal dynamics. We first distinguish attention heads by whether they perform locally or globally focused information processing, and reveal that locally focused heads produce a sawtooth pattern near the diagonal that marks phrasal chunks, while globally focused heads expose tokens that exert broad influence over future tokens. We quantify these behaviors with two metrics: the extent of backward attention within a clipped window, and the average attention a token receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism: the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by, or coincides with, a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three RL strategies that dynamically assign targeted credit to critical tokens (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across diverse reasoning tasks.
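To make the two metrics concrete, the following is a minimal sketch of how they could be computed from a single head's causal attention matrix. The function and variable names (attention_metrics, local_focus, receive), the window size, and the exact normalization are illustrative assumptions, not the paper's precise definitions.

```python
import numpy as np

def attention_metrics(attn: np.ndarray, window: int = 4):
    """Illustrative per-token scores from one head's causal attention matrix.

    Assumes attn[i, j] is the attention query token i pays to key token j,
    with attn[i, j] = 0 for j > i (causal mask) and each row summing to 1.
    """
    T = attn.shape[0]

    # Local-focus score: backward attention mass inside a clipped window.
    # For locally focused heads, this traces the sawtooth pattern near the
    # diagonal that marks phrasal chunks.
    local_focus = np.zeros(T)
    for i in range(T):
        lo = max(0, i - window)
        local_focus[i] = attn[i, lo:i + 1].sum()

    # Receive score: average attention token j receives from later tokens.
    # High values flag tokens with broad influence over future generation
    # (candidate preplan/anchor tokens).
    receive = np.zeros(T)
    for j in range(T - 1):
        receive[j] = attn[j + 1:, j].mean()

    return local_focus, receive
```

Under this reading, spikes in receive immediately following (or coinciding with) long-range lookups would be the candidate preplan-and-anchor pairs to which the proposed RL strategies assign extra credit.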