Low-cost Full Fine-tuning: Learning What to Update for LLMs
Abstract
While large language models (LLMs) have strong general abilities, they typically rely on fine-tuning to acquire downstream task-specific knowledge. Due to the prohibitive memory overhead of full fine-tuning (FT), existing parameter-efficient fine-tuning techniques, e.g., LoRA and Adapters, update parameters only in low-rank or otherwise restricted subspaces. However, they fail to approximate FT, the best-performing fine-tuner, and risk performance degradation on challenging tasks. We therefore raise the Low-cost Full Fine-tuning question: can we approach standard full fine-tuning in theory, yet at a much lower cost in practice? Our key insight is that performing selective updates at each step can, in theory, recover FT asymptotically, while remaining cost-effective and ignoring no parameter direction. This motivates a new general fine-tuning paradigm, called Think-Touch: at each step, we first predict the potential of each parameter group (think) and then update only the selected groups (touch). Theoretically, we show that under a very weak sufficient condition, namely the divergence of the cumulative coverage of the expected gradient norm, any selection strategy converges in the full-parameter space to a stationary point at which FT admits no further first-order improvement. We further derive a general convergence rate for our paradigm and identify a post-hoc greedy strategy that is rate-optimal. Unfortunately, this strategy cannot be applied directly in practice because it relies on full and accurate gradient information. We therefore propose a bandit-based method that approximates this ideal strategy online in the long run, with a rigorous regret guarantee. Extensive experiments on various tasks demonstrate the potential of our paradigm, including much lower memory overhead than FT and better performance than LoRA variants.
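To make the sufficient condition concrete, the following is one plausible formalization, not the paper's precise statement: here $\mathcal{L}$ is assumed to denote the training loss, $\theta_t$ the iterate at step $t$, and $\mathcal{S}_t$ the set of parameter groups touched at step $t$.

```latex
% A hedged, assumed reading of "divergence of the cumulative coverage of the
% expected gradient norm": the coverage c_t is the expected fraction of the
% squared gradient norm captured by the touched groups, and the condition
% asks that this coverage accumulate without bound.
\[
  c_t = \mathbb{E}\!\left[
          \frac{\lVert \nabla_{\mathcal{S}_t} \mathcal{L}(\theta_t) \rVert^2}
               {\lVert \nabla \mathcal{L}(\theta_t) \rVert^2}
        \right],
  \qquad
  \sum_{t=1}^{\infty} c_t = \infty .
\]
```

Under this reading the condition is indeed weak: for instance, touching one of $G$ groups uniformly at random gives $c_t = 1/G$, so the cumulative coverage diverges.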
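The think-touch loop itself can be sketched in a few lines. The following is a minimal, illustrative sketch, not the paper's algorithm: it assumes a UCB-style bandit over per-layer parameter groups, uses the observed gradient norm of the touched group as the bandit reward, and plain SGD for the touch step; a toy MLP and random data stand in for an LLM and a downstream task.

```python
# A minimal, illustrative sketch of the think-touch loop, assuming a UCB-style
# bandit over per-layer parameter groups and plain SGD for the touch step.
# All names and hyperparameters here are hypothetical stand-ins.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
groups = [list(m.parameters()) for m in model if list(m.parameters())]

counts = [0] * len(groups)   # how often each group has been touched
means = [0.0] * len(groups)  # running mean of observed gradient norms (reward)
lr, c = 0.1, 1.0             # step size and exploration weight (assumed)
loss_fn = nn.CrossEntropyLoss()

for t in range(1, 201):
    x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))

    # Think: score every group with a UCB index; untried groups go first.
    scores = [float("inf") if counts[i] == 0
              else means[i] + c * math.sqrt(math.log(t) / counts[i])
              for i in range(len(groups))]
    g = max(range(len(groups)), key=scores.__getitem__)

    # Touch: materialize gradients only for the selected group, then update it.
    for i, params in enumerate(groups):
        for p in params:
            p.requires_grad_(i == g)
    model.zero_grad(set_to_none=True)
    loss_fn(model(x), y).backward()
    grad_norm = math.sqrt(sum(p.grad.pow(2).sum().item() for p in groups[g]))
    with torch.no_grad():
        for p in groups[g]:
            p -= lr * p.grad

    # Bandit feedback: the observed gradient norm rewards the chosen arm.
    counts[g] += 1
    means[g] += (grad_norm - means[g]) / counts[g]
```

In this sketch, freezing the untouched groups via requires_grad_ means their parameter gradients are never allocated; in a full implementation, optimizer state would likewise be kept only for touched groups, which is where the memory saving over FT would come from.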