Invited Talk
in
Workshop: 2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024)
Structured matrices for memory-efficient training and finetuning
Beidi Chen
Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing number of parameters, optimizer states, and context lengths. In today's talk, I will introduce two approaches for reducing memory overhead in the pretraining and fine-tuning stages. I will start with GaLore (Gradient Low-Rank Projection), a pretraining strategy that maintains full-parameter learning accuracy while being more memory-efficient than common low-rank adaptation methods such as LoRA. It reduces total training memory usage by up to 63.3%, unlocking the possibility of pretraining a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallelism, checkpointing, or offloading strategies. I will then turn to S²FT (Structured Sparse Fine-Tuning), which concurrently achieves state-of-the-art fine-tuning performance, efficiency, and inference scalability by "selecting sparsely and computing densely". S²FT prevents overfitting and forgetting, delivers state-of-the-art results on established benchmarks with improvements of up to 4.1%, and outperforms full fine-tuning by 7.1% on generalization tasks. It also reduces fine-tuning memory usage by up to 3x and improves throughput by 1.5-2.7x compared to full fine-tuning. Finally, I will conclude by briefly introducing our new line of work, MST (Mini-Sequence Transformer) and MEGALODON, which tackles the activation and attention memory bottlenecks posed by long sequence inputs.
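To make the gradient low-rank projection idea behind GaLore concrete, the following is a minimal, illustrative PyTorch sketch rather than the authors' implementation: the `LowRankGradState` class, the rank of 128, the 200-step projector refresh interval, and the plain scaled-gradient inner update are all simplifying assumptions for exposition.

```python
# Minimal sketch of gradient low-rank projection (illustrative only, not GaLore's code).
import torch


class LowRankGradState:
    """Projects a weight matrix's gradient into a low-rank subspace and back."""

    def __init__(self, rank: int, update_every: int = 200):
        self.rank = rank
        self.update_every = update_every
        self.step = 0
        self.P = None  # left projection matrix, shape (m, rank)

    def project(self, grad: torch.Tensor) -> torch.Tensor:
        # Periodically refresh the projector from the SVD of the current gradient.
        if self.P is None or self.step % self.update_every == 0:
            U, _, _ = torch.linalg.svd(grad, full_matrices=False)
            self.P = U[:, : self.rank]
        self.step += 1
        return self.P.T @ grad  # low-rank gradient, shape (rank, n)

    def project_back(self, low_rank_update: torch.Tensor) -> torch.Tensor:
        return self.P @ low_rank_update  # back to the full shape (m, n)


# Optimizer states would live only in the (rank x n) space, which is where
# the memory saving comes from; a scaled gradient stands in for an Adam step here.
W = torch.randn(4096, 4096, requires_grad=True)
state = LowRankGradState(rank=128)
loss = (W @ torch.randn(4096, 16)).pow(2).mean()
loss.backward()
low_rank_grad = state.project(W.grad)      # (128, 4096)
update = 1e-3 * low_rank_grad              # stand-in for a low-rank optimizer step
with torch.no_grad():
    W -= state.project_back(update)
```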
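The "select sparsely, compute densely" principle can likewise be sketched with a toy layer that keeps a pretrained weight frozen and trains only a small, dense slice of its output channels. The `PartiallyTrainableLinear` class, the random channel choice, and the sizes below are hypothetical illustrations and not the S²FT selection rule; in practice the memory and throughput gains come from confining optimizer state and backward computation to the selected slice.

```python
# Minimal sketch of "select sparsely, compute densely" (illustrative only, not S2FT's code).
import torch
import torch.nn as nn


class PartiallyTrainableLinear(nn.Module):
    """Freezes a pretrained linear weight; only a sparse set of rows is trainable."""

    def __init__(self, base: nn.Linear, num_trainable_rows: int):
        super().__init__()
        # Frozen pretrained weight plus a randomly selected set of output channels.
        self.register_buffer("frozen_weight", base.weight.detach().clone())
        idx = torch.randperm(base.out_features)[:num_trainable_rows]
        self.register_buffer("rows", idx)
        # Only this dense slice receives gradients and optimizer state.
        self.trainable_rows = nn.Parameter(self.frozen_weight[idx].clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.frozen_weight.clone()
        weight[self.rows] = self.trainable_rows  # dense compute over the selected slice
        return x @ weight.T


base = nn.Linear(1024, 1024, bias=False)
layer = PartiallyTrainableLinear(base, num_trainable_rows=64)
out = layer(torch.randn(8, 1024))
out.sum().backward()
print(layer.trainable_rows.grad.shape)  # torch.Size([64, 1024])
```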