Head-in-Head in Linear Attention
Shijie Mei ⋅ Man Yao ⋅ Jiabo Tong ⋅ Bo XU ⋅ Guoqi Li
Abstract
The state-transition (decay) matrix governs how fixed-size memory is updated and used, making it a core design element in linear attention models. Prior work exploits rank-1 approximations to reduce the cost of constructing decay matrices, but this low-rank constraint also limits expressive capacity. We therefore formulate decay-matrix design as an open optimization problem: maximizing expressiveness while introducing minimal additional cost. Inspired by the multi-head mechanism, we propose Head-in-Head, which introduces an additional mask matrix to structure memory partitioning and interactions within a single linear-attention head. This simple, generic, and efficient design: (i) enables a rank-$r$ approximation of the decay matrix with only a few extra parameters and (ii) strengthens intra-head information interaction. We further develop mask normalization and a chunk-wise parallelization scheme to support efficient parallel training. Extensive experiments on synthetic benchmarks and language modeling tasks, together with visual analyses, show that Head-in-Head consistently improves baseline performance by enriching information diversity and strengthening intra-head interactions.
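A minimal NumPy sketch of the idea as described above: a linear-attention state update whose elementwise decay mask is built from a rank-$r$ factorization, so that $r = 1$ recovers the rank-1 baseline. All names, shapes, and the sigmoid parameterization are illustrative assumptions, not the authors' implementation.

import numpy as np

d_k, d_v, r = 8, 8, 2          # key/value dims; r = assumed mask rank
rng = np.random.default_rng(0)

# Rank-r decay mask D = sigmoid(A @ B): a few extra parameters (A, B)
# instead of a full d_k x d_v matrix; r = 1 recovers a rank-1 decay.
A = rng.normal(size=(d_k, r))
B = rng.normal(size=(r, d_v))
D = 1.0 / (1.0 + np.exp(-(A @ B)))   # entries in (0, 1), acting as decay

S = np.zeros((d_k, d_v))             # fixed-size memory (state)
for t in range(16):                  # toy sequence, purely illustrative
    k = rng.normal(size=(d_k,))
    v = rng.normal(size=(d_v,))
    q = rng.normal(size=(d_k,))
    S = D * S + np.outer(k, v)       # elementwise-decayed state update
    y = q @ S                        # read-out at step t

Here $D$ plays the role of the decay matrix; structuring it (e.g., in blocks within a head) would correspond to the memory partitioning the abstract describes, while the recurrent loop is what the paper's chunk-wise scheme would parallelize for training.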