Poster
in
Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models

GRASS: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients

Aashiq Muhamed · Oscar Li · David Woodruff · Mona Diab · Virginia Smith


Abstract: Large language model (LLM) training and finetuning are often severely constrained by limited GPU memory. While parameter-efficient finetuning techniques like LoRA address this by learning low-rank weight updates, they frequently underperform compared to full-rank training, especially during pretraining. We propose GRASS (GRAdient Structured Sparsification), a novel approach that slashes LLM training memory and compute requirements without compromising performance. GRASS leverages sparse projections to transform gradients into structurally sparse gradients, significantly lowering memory usage for both optimizer states and gradient communication. This compression, in turn, unlocks substantial throughput improvements. Extensive experiments on pretraining and finetuning tasks demonstrate that GRASS achieves comparable performance to existing projection-based optimizers and full-rank training. Notably, GRASS enables pretraining a 13B parameter LLaMA model on a single 40GB A100 GPU---a feat infeasible for previous methods---and yields up to a $2\times$ throughput improvement on an 8-GPU system.
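To make the core idea concrete, the following is a minimal sketch of a structured (row-sparse) gradient projection and its memory payoff: only the selected rows of the gradient, and hence only the corresponding slices of the optimizer state, need to be stored. The row-selection rule shown here (largest row norms) is an illustrative assumption, not necessarily the selection scheme GRASS uses.

```python
import numpy as np

def sparse_project(grad: np.ndarray, k: int):
    """Compress an (m x n) gradient to its k highest-norm rows.

    Returns the selected row indices and the (k x n) dense block.
    Row-norm selection is an assumption for illustration; the
    paper's projection may choose rows differently.
    """
    norms = np.linalg.norm(grad, axis=1)
    idx = np.argsort(norms)[-k:]      # indices of the k largest-norm rows
    return idx, grad[idx].copy()      # structurally sparse: k rows, stored densely

def expand(idx: np.ndarray, compressed: np.ndarray, shape) -> np.ndarray:
    """Scatter the (k x n) compressed update back to the full (m x n) shape."""
    full = np.zeros(shape, dtype=compressed.dtype)
    full[idx] = compressed
    return full

# Toy example: a 4x3 gradient compressed to k=2 rows.
g = np.arange(12.0).reshape(4, 3)
idx, c = sparse_project(g, k=2)
update = expand(idx, c, g.shape)      # zero outside the selected rows
```

Because Adam-style optimizer moments only need to track the k x n compressed block rather than the full m x n gradient, optimizer-state memory shrinks by roughly a factor of m / k, which is the mechanism behind the memory savings the abstract describes.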
