FiX: Introducing Fine-grained Forget Gates into Softmax Attention
Abstract
Causal softmax attention is the algorithmic foundation of modern large language models. Inspired by linear attention, recent work has sought to enhance it by incorporating forget gates. However, these efforts, such as FoX, have been limited to coarse, scalar gates. While fine-grained, element-wise gates have been shown to be more effective than scalar ones in linear attention, their direct integration into softmax attention is non-trivial due to algebraic constraints. In this work, we introduce the Fine-grained Forgetting Transformer (FiX), a novel architecture that enables element-wise forget gates in softmax attention. Our core insight is that the softmax denominator becomes mathematically redundant under a subsequent RMSNorm layer, allowing us to reformulate the forgetting mechanism as a direct element-wise multiplication on the value vectors. This formulation makes FiX the first architecture to apply positional encoding to value-output (VO) pairs, designed to be complementary to existing query-key (QK) encodings such as RoPE. We systematically address implementation challenges, including numerical precision, computational efficiency, and inference memory consumption. Extensive experiments show that FiX achieves lower training loss and superior performance on both short-context commonsense benchmarks and long-context tasks, opening a new path for building more powerful Transformers.
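To make the core insight concrete, the following is a minimal sketch of the argument in our own notation; the exact gating parameterization and placement of the RMSNorm are illustrative assumptions rather than the paper's precise formulation. RMSNorm is invariant to positive rescaling of its input, so a per-position positive scalar such as the softmax denominator is absorbed by a subsequent RMSNorm, which in turn permits an element-wise decay to act directly on the value vectors:
$$
\mathrm{RMSNorm}(c\,x) = \mathrm{RMSNorm}(x) \quad \text{for any } c > 0,
$$
so the standard causal softmax output
$$
o_t = \frac{\sum_{i \le t} \exp(q_t^\top k_i)\, v_i}{\sum_{j \le t} \exp(q_t^\top k_j)}
$$
can, up to the subsequent RMSNorm, be replaced by an unnormalized sum with a fine-grained forget gate applied element-wise to the values, e.g.
$$
\tilde{o}_t = \sum_{i \le t} \exp(q_t^\top k_i)\,\Big(\textstyle\prod_{j=i+1}^{t} f_j\Big) \odot v_i,
$$
where $f_j \in (0,1)^d$ is an element-wise forget gate and $\odot$ denotes element-wise multiplication.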