RAT+: Train Dense, Infer Sparse - Recurrence Augmented Attention for Dilated Inference
Xiuying Wei ⋅ Caglar Gulcehre
Abstract
Structured dilated attention has an appealing inference-time efficiency knob: it reduces attention FLOPs and KV-cache size by a factor of the dilation size $\mathtt{D}$, while preserving long-range connectivity. However, we identify a persistent failure mode: sparsifying a pretrained attention model to a dilated pattern leads to severe accuracy degradation. We introduce **RAT+**, a dense-pretraining architecture that augments attention with *full-sequence recurrence* and *active recurrence learning*. A single RAT+ model is pretrained densely once, then flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at $\mathtt{D=16}$ and drops by about 2 and 3 points at $\mathtt{D=64}$ on commonsense-reasoning and LongBench tasks, respectively. Moreover, RAT+ outperforms standard attention when sparsified to top-k block attention. We further scale to 2.6B parameters and 200B tokens and observe the same trend.
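To make the dilation knob concrete, below is a minimal sketch of one common structured dilated pattern, in which position $t$ attends only to earlier positions sharing its residue modulo $\mathtt{D}$; each query then scores $T/\mathtt{D}$ keys instead of $T$, giving the factor-$\mathtt{D}$ reduction in attention FLOPs the abstract refers to. This is an illustration under that assumption, not the paper's implementation (RAT+ additionally uses recurrence and optional local windows), and the function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, dilation: int):
    # q, k, v: (batch, heads, seq_len, head_dim); seq_len divisible by `dilation`.
    B, H, T, Dh = q.shape
    assert T % dilation == 0

    # Fold the sequence into `dilation` interleaved streams: position t goes to
    # stream t % dilation at within-stream index t // dilation.
    def fold(x):
        return x.view(B, H, T // dilation, dilation, Dh).transpose(2, 3)

    qs, ks, vs = fold(q), fold(k), fold(v)
    # Causal attention inside each stream: query t sees only keys t' <= t with
    # t' ≡ t (mod dilation), i.e. T / dilation keys instead of T.
    out = F.scaled_dot_product_attention(qs, ks, vs, is_causal=True)
    # Unfold the streams back to the original token order.
    return out.transpose(2, 3).reshape(B, H, T, Dh)
```

Within-stream order matches temporal order, so `is_causal=True` on the folded tensors is exactly causal dilated attention; setting `dilation=1` recovers dense causal attention.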