Learning Dynamic Segmentation and Compression of Sequences in Transformer LLMs

Adrian Łańcucki

2025 Invited Talk
in
Workshop: Tokenization Workshop (TokShop)

Abstract

Transformer-based LLMs excel at language tasks, but their efficiency hinges on input sequence length. Typically, input resolution—imposed by a tokenizer—remains unchanged across all layers. In this talk, we introduce methods that enable end-to-end learning to dynamically pool, compress, or sparsify input or key-value token sequences. By effectively tracking down and removing redundancies, these methods deliver performance gains during training or inference. We arrive at a surprisingly practical method—Dynamic Memory Sparsification—that allows a model to achieve 8x KV cache compression within just a few hundred training steps. The resulting savings can be used not only to improve throughput and latency, but also to boost accuracy, as demonstrated across several reasoning tasks.
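To give a sense of scale for the 8x KV cache compression mentioned above, here is a minimal back-of-the-envelope sketch. It is not taken from the talk: the helper `kv_cache_bytes` and the model configuration (32 layers, 8 KV heads, head dimension 128, 16-bit elements, a 32,768-token context) are hypothetical values chosen only for illustration, and the calculation assumes a standard cache that stores one key and one value vector per token, per layer, per KV head.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # The factor 2 accounts for storing both a key and a value vector per token.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model configuration, for illustration only.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)

# Apply the 8x compression ratio quoted in the abstract.
compressed = full // 8

print(f"full cache:    {full / 2**30:.2f} GiB")
print(f"8x compressed: {compressed / 2**30:.2f} GiB")
```

Under these assumed numbers, a single 32,768-token sequence occupies 4 GiB of cache uncompressed and 0.5 GiB at 8x compression. The freed memory could in principle hold a larger batch or a longer context in the same budget.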

Speaker

Adrian Łańcucki

I'm a senior research engineer at NVIDIA, currently working on LLM optimization for inference. This includes teaching models to compress the KV cache, finding and pruning redundancies, and architecture search. My previous research focused on representation learning and generative modeling for text and speech. I hold a Ph.D. in machine learning and remain in active collaboration with academia.
