

Invited Talk
in
Workshop: 2nd Workshop on Advancing Neural Network Training : Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024)

Structured matrices for memory-efficient training and finetuning

Beidi Chen

Sat 27 Jul 5 a.m. PDT — 5:30 a.m. PDT

Abstract:

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing number of parameters, optimizer states, and context lengths. In this talk, I will introduce two approaches for reducing memory overhead in the pretraining and fine-tuning stages. I will start with GaLore (Gradient Low-Rank Projection), a pretraining strategy that maintains full-parameter learning accuracy while being more memory-efficient than common low-rank adaptation methods such as LoRA. It reduces total training memory usage by up to 63.3%, making it possible to pretrain a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallelism, checkpointing, or offloading strategies. Then I will talk about S²FT (Structured Sparse Fine-tuning), which concurrently achieves state-of-the-art fine-tuning performance, efficiency, and inference scalability by "selecting sparsely and computing densely". S²FT prevents overfitting and forgetting, delivers state-of-the-art performance on established benchmarks with improvements of up to 4.1%, and outperforms full fine-tuning by 7.1% on generalization tasks. S²FT reduces fine-tuning memory by up to 3x and improves throughput by 1.5-2.7x compared to full fine-tuning. Finally, I will conclude by briefly introducing our new line of work, MST (Mini-Sequence Transformer) and MEGALODON, which tackles the activation and attention memory bottlenecks posed by long sequence inputs.
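To make the gradient low-rank projection idea concrete, below is a minimal sketch (not from the talk) of one Adam-style update in a compressed gradient space, assuming PyTorch; the function name galore_style_step, the rank, the matrix sizes, and the hyperparameters are all illustrative, and a real implementation would refresh the projection periodically and handle each weight matrix of the model.

```python
import torch

def galore_style_step(weight, grad, proj, opt_state,
                      lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative update: compress the gradient, run Adam in the
    low-rank space, then project back before touching the full weight."""
    # Compress the (m x n) gradient into an (r x n) representation.
    low_rank_grad = proj.T @ grad
    # Adam moments are kept only at the compressed size, which is where
    # the optimizer-state memory saving comes from.
    m, v, t = opt_state
    t += 1
    m = beta1 * m + (1 - beta1) * low_rank_grad
    v = beta2 * v + (1 - beta2) * low_rank_grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Project the low-rank update back to the full weight shape.
    update = proj @ (m_hat / (v_hat.sqrt() + eps))
    weight -= lr * update
    return weight, (m, v, t)

# Toy usage with made-up sizes; the projection comes from an SVD of the gradient.
m_dim, n_dim, rank = 1024, 1024, 64
W = torch.randn(m_dim, n_dim)
G = torch.randn(m_dim, n_dim)                 # stand-in for a real gradient
U, _, _ = torch.linalg.svd(G, full_matrices=False)
P = U[:, :rank]                               # (m x r) projection matrix
state = (torch.zeros(rank, n_dim), torch.zeros(rank, n_dim), 0)
W, state = galore_style_step(W, G, P, state)
```

The memory saving in this sketch comes from storing the two Adam moment tensors at shape (r, n) instead of (m, n); with r much smaller than m, optimizer-state memory shrinks roughly by a factor of m/r.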
