

Poster in Workshop: Next Generation of Sequence Modeling Architectures

xLSTM: Extended Long Short-Term Memory

Maximilian Beck · Korbinian Pöppel · Markus Spanring · Andreas Auer · Oleksandra Prudnikova · Michael Kopp · Günter Klambauer · Johannes Brandstetter · Sepp Hochreiter


Abstract:

In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories; in particular, they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.
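
To make the stabilized exponential gating and the covariance-style matrix-memory update concrete, here is a minimal NumPy sketch of one mLSTM-like recurrent step. The state names (C, n, m), the log-domain stabilizer, and the normalized readout are assumptions based on the abstract's description, not the authors' released code.

```python
# Hypothetical sketch of one mLSTM-style step: exponential gating with a
# running-max stabilizer and a matrix (outer-product / covariance) memory.
import numpy as np

def mlstm_step(C, n, m, q, k, v, i_pre, f_pre, o_pre):
    """One recurrent step of a matrix-memory cell (illustrative only).

    C : (d, d) matrix memory        n : (d,) normalizer state
    m : scalar stabilizer state     q, k, v : (d,) query/key/value
    i_pre, f_pre : scalar pre-activations of the exponential input/forget gates
    o_pre : (d,) pre-activation of the sigmoid output gate
    """
    # Exponential gating, stabilized by tracking a running maximum m so that
    # exp() never overflows (log-domain trick).
    m_new = max(f_pre + m, i_pre)
    i_gate = np.exp(i_pre - m_new)
    f_gate = np.exp(f_pre + m - m_new)

    # Covariance-style update: write the value/key outer product into the
    # matrix memory; keep a normalizer vector of accumulated keys.
    C_new = f_gate * C + i_gate * np.outer(v, k)
    n_new = f_gate * n + i_gate * k

    # Query the memory and normalize the readout to keep it bounded.
    h_tilde = C_new @ q / max(abs(n_new @ q), 1.0)
    h = 1.0 / (1.0 + np.exp(-o_pre)) * h_tilde  # sigmoid output gate
    return C_new, n_new, m_new, h
```

Because each step depends on the previous state only through elementwise decays and additions, the update can also be unrolled in parallel over a sequence, which is what makes the matrix-memory variant attractive at scale.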
