Sparser, Faster, Lighter Transformer Language Models
Abstract
Scaling autoregressive large language models (LLMs) has had an unprecedented impact, but comes at vast computational cost. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, which account for the majority of its parameters and execution FLOPs. To achieve this, we rework how computation is performed on modern GPUs in the presence of sparsity, introducing a set of new CUDA and Triton kernels that minimize computation and memory overheads during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy-efficiency, and memory-usage benefits that grow with model scale. The code and kernels accompanying this submission will be released under an open-source license to promote adoption and future research, turning sparsity into a practical new axis for the efficiency and scalability of modern foundation models.
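To make the two ingredients summarized above concrete, the sketch below illustrates, in PyTorch-style code, (i) an L1 penalty on a feedforward block's intermediate activations to encourage unstructured sparsity, and (ii) the arithmetic that sparsity-aware kernels exploit by skipping inactive hidden units. This is a minimal sketch under assumed conventions; the names (`FeedForward`, `loss_with_l1`, `l1_coeff`, `sparse_ffn_out`) are illustrative and do not correspond to the exact regularization recipe or the CUDA/Triton kernels described in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """Standard transformer feedforward block that also exposes its
    intermediate activations so a sparsity penalty can be applied."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor):
        h = F.relu(self.w_in(x))  # intermediate activations; ReLU yields exact zeros
        return self.w_out(h), h


def loss_with_l1(logits, targets, hidden, l1_coeff=1e-4):
    # Language-modeling cross-entropy plus an L1 term that pushes the
    # intermediate activations toward exact zeros (unstructured sparsity).
    # `l1_coeff` is an assumed hyperparameter, not a value from this work.
    ce = F.cross_entropy(logits, targets)
    return ce + l1_coeff * hidden.abs().mean()


def sparse_ffn_out(h: torch.Tensor, w_out: torch.Tensor, b_out: torch.Tensor):
    # Toy illustration of the arithmetic that sparsity-aware kernels exploit:
    # when most of `h` is exactly zero, only the rows of `w_out` matching the
    # active hidden units contribute to the output. Real speedups require
    # fused GPU kernels; this pure-PyTorch version only shows the idea.
    # `w_out` is assumed to have shape (d_ff, d_model) and `b_out` shape (d_model,).
    idx = h.nonzero(as_tuple=True)[-1].unique()  # union of active units in the batch
    return h[..., idx] @ w_out[idx, :] + b_out
```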