Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models
Bo Gao ⋅ Michael Spratling ⋅ Letizia Gionfrida
Abstract
Large language models have achieved remarkable success in recent years, primarily due to self-attention. However, traditional Softmax attention suffers from numerical instability and reduced performance as the number of inference tokens increases. This work addresses these issues by proposing a new design principle for attention, viewing it as a two-stage process. The first stage (normalisation) refines standard attention by replacing Softmax with the more numerically stable Softplus followed by $l_{1}$-normalisation. Furthermore, we introduce a dynamic scale factor based on invariance entropy. We show that this novel attention mechanism outperforms conventional Softmax attention, and state-of-the-art Softmax-free alternatives. Our second proposal is to introduce a second processing stage (sharpening) which consists of a re-weighting mechanism that amplifies significant attentional weights while diminishing weaker ones. This enables the model to concentrate more effectively on relevant tokens, mitigating the attention sink phenomenon, and fundamentally improving length extrapolation. This novel, two-stage, replacement for self-attention is shown to ensure numerical stability and dramatically improve length extrapolation, maintaining a nearly constant validation loss at 16$\times$ the training length while achieving superior results on challenging long-context retrieval tasks and standard downstream benchmarks. Furthermore, symbolic regression experiments demonstrate that our method enables models to recover Newton's gravitational law from orbital trajectory sequences, providing evidence that appropriate attention mechanisms are crucial in order for foundation models to develop genuine physical world models.
Successful Page Load