While transformers have shown remarkable success in natural language processing, the large memory requirements of their attention mechanism have limited their ability to handle longer contexts. Prior approaches, such as recurrent memory or retrieval-based augmentation, have either compromised the random-access flexibility of attention (i.e., the capability to select any token in the entire context) or relied on separate mechanisms for retrieving relevant context, which may not be compatible with the model's attention. In this paper, we present a novel approach that allows access to the complete context while retaining random-access flexibility, closely resembling running attention on the entire context. Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks, enabling retrieval of blocks directly through the attention mechanism rather than through a separate mechanism. Our approach integrates seamlessly with specialized data structures and the system's memory hierarchy, enabling processing of arbitrarily long contexts. To demonstrate the capabilities of our method, we show that fine-tuning LLaMA 7B with our method extends its context length beyond 32k tokens, allowing for inference at the context lengths of GPT-4.
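The block-selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. In particular, the function name `landmark_block_attention`, the parameters `block_size` and `top_k`, and the use of the mean of a block's keys as its landmark key are assumptions made for illustration; in the paper the landmark is a token whose representation is learned during training.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()


def landmark_block_attention(query, keys, values, block_size, top_k):
    """Attend only to the top_k blocks whose landmark key best matches the query.

    Assumption: the landmark key of a block is the mean of that block's keys.
    This is a stand-in for a trained landmark token, used here only so the
    sketch is self-contained.
    """
    n, d = keys.shape
    assert n % block_size == 0, "context length must be a multiple of block_size"
    n_blocks = n // block_size
    assert 1 <= top_k <= n_blocks

    key_blocks = keys.reshape(n_blocks, block_size, d)
    value_blocks = values.reshape(n_blocks, block_size, values.shape[1])

    # One landmark key per block, scored against the query like any other key.
    landmark_keys = key_blocks.mean(axis=1)
    block_scores = landmark_keys @ query / np.sqrt(d)

    # Keep the top_k highest-scoring blocks, in their original order.
    selected = np.sort(np.argsort(block_scores)[-top_k:])

    # Ordinary softmax attention, restricted to tokens in the selected blocks.
    sel_keys = key_blocks[selected].reshape(-1, d)
    sel_values = value_blocks[selected].reshape(-1, values.shape[1])
    weights = softmax(sel_keys @ query / np.sqrt(d))
    return weights @ sel_values, selected
```

With `top_k` equal to the number of blocks, this sketch reduces to ordinary attention over the whole context. With a smaller `top_k`, only the selected blocks need to be loaded, which is the property that lets the key-value cache live in slower memory.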
Author Information
Amirkeivan Mohtashami (Swiss Federal Institute of Technology Lausanne)
Martin Jaggi (EPFL)
More from the Same Authors
- 2021: iFedAvg – Interpretable Data-Interoperability for Federated Learning
  David Roschewitz · Mary-Anne Hartley · Luca Corinzia · Martin Jaggi
- 2022: The Gap Between Continuous and Discrete Gradient Descent
  Amirkeivan Mohtashami · Martin Jaggi · Sebastian Stich
- 2023: Layerwise Linear Mode Connectivity
  Linara Adilova · Asja Fischer · Martin Jaggi
- 2023: 🎤 Fast Causal Attention with Dynamic Sparsity
  Daniele Paliotta · Matteo Pagliardini · Martin Jaggi · François Fleuret
- 2023 Oral: Second-Order Optimization with Lazy Hessians
  Nikita Doikov · El Mahdi Chayti · Martin Jaggi
- 2023 Poster: Second-Order Optimization with Lazy Hessians
  Nikita Doikov · El Mahdi Chayti · Martin Jaggi
- 2023 Poster: Special Properties of Gradient Descent with Large Learning Rates
  Amirkeivan Mohtashami · Martin Jaggi · Sebastian Stich
- 2021: Exact Optimization of Conformal Predictors via Incremental and Decremental Learning (Spotlight #13)
  Giovanni Cherubin · Konstantinos Chatzikokolakis · Martin Jaggi
- 2021 Poster: Exact Optimization of Conformal Predictors via Incremental and Decremental Learning
  Giovanni Cherubin · Konstantinos Chatzikokolakis · Martin Jaggi
- 2021 Poster: Consensus Control for Decentralized Deep Learning
  Lingjing Kong · Tao Lin · Anastasiia Koloskova · Martin Jaggi · Sebastian Stich
- 2021 Poster: Quasi-global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data
  Tao Lin · Sai Praneeth Reddy Karimireddy · Sebastian Stich · Martin Jaggi
- 2021 Spotlight: Exact Optimization of Conformal Predictors via Incremental and Decremental Learning
  Giovanni Cherubin · Konstantinos Chatzikokolakis · Martin Jaggi
- 2021 Spotlight: Quasi-global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data
  Tao Lin · Sai Praneeth Reddy Karimireddy · Sebastian Stich · Martin Jaggi
- 2021 Spotlight: Consensus Control for Decentralized Deep Learning
  Lingjing Kong · Tao Lin · Anastasiia Koloskova · Martin Jaggi · Sebastian Stich
- 2021 Poster: Learning from History for Byzantine Robust Optimization
  Sai Praneeth Reddy Karimireddy · Lie He · Martin Jaggi
- 2021 Spotlight: Learning from History for Byzantine Robust Optimization
  Sai Praneeth Reddy Karimireddy · Lie He · Martin Jaggi
- 2020 Poster: Extrapolation for Large-batch Training in Deep Learning
  Tao Lin · Lingjing Kong · Sebastian Stich · Martin Jaggi
- 2020 Poster: Optimizer Benchmarking Needs to Account for Hyperparameter Tuning
  Prabhu Teja Sivaprasad · Florian Mai · Thijs Vogels · Martin Jaggi · François Fleuret
- 2020 Poster: A Unified Theory of Decentralized SGD with Changing Topology and Local Updates
  Anastasiia Koloskova · Nicolas Loizou · Sadra Boreiri · Martin Jaggi · Sebastian Stich
- 2019 Poster: Overcoming Multi-model Forgetting
  Yassine Benyahia · Kaicheng Yu · Kamil Bennani-Smires · Martin Jaggi · Anthony C. Davison · Mathieu Salzmann · Claudiu Musat
- 2019 Oral: Overcoming Multi-model Forgetting
  Yassine Benyahia · Kaicheng Yu · Kamil Bennani-Smires · Martin Jaggi · Anthony C. Davison · Mathieu Salzmann · Claudiu Musat
- 2019 Poster: Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
  Anastasiia Koloskova · Sebastian Stich · Martin Jaggi
- 2019 Poster: Error Feedback Fixes SignSGD and other Gradient Compression Schemes
  Sai Praneeth Reddy Karimireddy · Quentin Rebjock · Sebastian Stich · Martin Jaggi
- 2019 Oral: Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
  Anastasiia Koloskova · Sebastian Stich · Martin Jaggi
- 2019 Oral: Error Feedback Fixes SignSGD and other Gradient Compression Schemes
  Sai Praneeth Reddy Karimireddy · Quentin Rebjock · Sebastian Stich · Martin Jaggi
- 2018 Poster: On Matching Pursuit and Coordinate Descent
  Francesco Locatello · Anant Raj · Sai Praneeth Reddy Karimireddy · Gunnar Ratsch · Bernhard Schölkopf · Sebastian Stich · Martin Jaggi
- 2018 Oral: On Matching Pursuit and Coordinate Descent
  Francesco Locatello · Anant Raj · Sai Praneeth Reddy Karimireddy · Gunnar Ratsch · Bernhard Schölkopf · Sebastian Stich · Martin Jaggi
- 2018 Poster: A Distributed Second-Order Algorithm You Can Trust
  Celestine Mendler-Dünner · Aurelien Lucchi · Matilde Gargiani · Yatao Bian · Thomas Hofmann · Martin Jaggi
- 2018 Oral: A Distributed Second-Order Algorithm You Can Trust
  Celestine Mendler-Dünner · Aurelien Lucchi · Matilde Gargiani · Yatao Bian · Thomas Hofmann · Martin Jaggi
- 2017 Poster: Approximate Steepest Coordinate Descent
  Sebastian Stich · Anant Raj · Martin Jaggi
- 2017 Talk: Approximate Steepest Coordinate Descent
  Sebastian Stich · Anant Raj · Martin Jaggi