Linear Complexity Randomized Self-attention Mechanism

Lin Zheng · Chong Wang · Lingpeng Kong

Hall E #408

Keywords: [ PM: Monte Carlo and Sampling Methods ] [ T: Probabilistic Methods ] [ DL: Theory ] [ DL: Algorithms ] [ DL: Attention Mechanisms ]


Recently, random feature attentions (RFAs) have been proposed to approximate softmax attention in linear time and space by linearizing the exponential kernel. In this paper, we first propose a novel perspective for understanding the bias of such approximations by recasting RFAs as self-normalized importance samplers. This perspective further sheds light on an \emph{unbiased} estimator for the whole softmax attention, called randomized attention (RA). RA constructs positive random features via query-specific distributions and enjoys greatly improved approximation fidelity, albeit at quadratic complexity. By combining the expressiveness of RA with the efficiency of RFA, we develop a novel linear-complexity self-attention mechanism called linear randomized attention (LARA). Extensive experiments across various domains demonstrate that RA and LARA improve over RFAs by a substantial margin.
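As background for the linearization the abstract refers to, the following is a minimal sketch (not the authors' code) of how positive random features approximate the exponential kernel and yield linear-time attention: `phi(x) = exp(w·x - ||x||²/2)` satisfies `E[phi(q)·phi(k)] = exp(q·k)`, so reassociating the matrix products avoids forming the n×n attention matrix. The self-normalization in the denominator is what the paper reinterprets as a self-normalized importance sampler.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Exact softmax attention: O(n^2) time and space in sequence length n.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

def rfa_attention(Q, K, V, num_features=256, seed=0):
    # Random-feature approximation: phi(x) = exp(w.x - |x|^2 / 2) with
    # w ~ N(0, I) gives E[phi(q) . phi(k)] = exp(q . k), so attention is
    # computed as phi(Q) (phi(K)^T V) / phi(Q) (phi(K)^T 1) in O(n) time.
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, d))
    scale = d ** -0.25  # split the 1/sqrt(d) temperature between Q and K

    def phi(X):
        Xs = X * scale
        return np.exp(Xs @ W.T - 0.5 * (Xs ** 2).sum(-1, keepdims=True))

    Qf, Kf = phi(Q), phi(K)
    num = Qf @ (Kf.T @ V)        # numerator, computed right-to-left
    den = Qf @ Kf.sum(axis=0)    # self-normalizing denominator
    return num / den[:, None]
```

The estimator is biased because the same random samples appear in both numerator and denominator, which is exactly the self-normalized importance sampling structure the paper analyzes; RA removes this bias by drawing features from query-specific distributions, at the cost of quadratic complexity.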
