Towards Understanding the Dynamics of Low-Rank Adaptation
Shu Ding ⋅ Yang Peng ⋅ Hangan Zhou ⋅ Xinyu Lu ⋅ Shangwei Chen ⋅ Junhua Huang ⋅ Mingxuan Yuan ⋅ Wei Wang
Abstract
Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning technique. Previous works have studied the update dynamics of LoRA, showing that updating via the low-rank matrix $\mathbf{A}$ can be viewed as compressing the gradient $\nabla f\left(\mathbf{W}\right)$ into the subspace defined by $\mathbf{A}^{\top} \mathbf{A}$. However, few works analyze how the properties of the low-rank matrices affect the performance of LoRA, since existing methods heuristically initialize them as Gaussian matrices. In this paper, we provide a theoretical understanding of the update dynamics of LoRA. We reveal that the update dynamics can be viewed as a process within the subspace projected by $\mathbf{A}^{\top} (\mathbf{A} \mathbf{A}^{\top})^{\dagger} \mathbf{A}$, and prove that when the gradient $\nabla f\left(\mathbf{W}\right)$ is unavailable, choosing $\mathbf{A}$ to be an Equiangular Tight Frame (ETF) allows $\mathbf{A}^{\top} \mathbf{A}$ and $\mathbf{A}^{\top} (\mathbf{A} \mathbf{A}^{\top})^{\dagger} \mathbf{A}$ to preserve the maximum amount of information in the gradient; initializing $\mathbf{A}$ as an ETF is therefore the optimal choice in this setting. Furthermore, we establish an $\mathcal{O}\left(\frac{1}{T}\right)$ convergence rate for low-rank adaptation when $\mathbf{A}$ is an ETF. Extensive experiments show that initializing the low-rank matrices as ETFs significantly outperforms the commonly used Gaussian initialization across mainstream LoRA variants.
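As an illustrative aside (not part of the paper), the two objects in the abstract can be checked numerically: for an ETF $\mathbf{A} \in \mathbb{R}^{r \times n}$, the tightness condition is $\mathbf{A}\mathbf{A}^{\top} = \frac{n}{r}\mathbf{I}_r$, the columns are equiangular, and $\mathbf{A}^{\top} (\mathbf{A} \mathbf{A}^{\top})^{\dagger} \mathbf{A}$ is the orthogonal projector onto the row space of $\mathbf{A}$. The NumPy sketch below uses the classic "Mercedes-Benz" frame ($r=2$, $n=3$) as a concrete ETF; `G_full` is a hypothetical stand-in for the gradient $\nabla f\left(\mathbf{W}\right)$.

```python
import numpy as np

# Illustrative example: the "Mercedes-Benz" frame, a classic ETF of
# n = 3 unit vectors in R^r with r = 2, taken as the columns of A.
r, n = 2, 3
A = np.array([[0.0, -np.sqrt(3) / 2, np.sqrt(3) / 2],
              [1.0, -0.5,            -0.5          ]])

# Tightness: A A^T = (n / r) I_r.
assert np.allclose(A @ A.T, (n / r) * np.eye(r))

# Equiangularity: |<a_i, a_j>| is the same constant for all i != j.
Gram = A.T @ A
off_diag = np.abs(Gram[~np.eye(n, dtype=bool)])
assert np.allclose(off_diag, off_diag[0])

# P = A^T (A A^T)^† A is the orthogonal projector onto the row space
# of A: symmetric, idempotent, and of rank r.
P = A.T @ np.linalg.pinv(A @ A.T) @ A
assert np.allclose(P, P.T) and np.allclose(P @ P, P)
assert np.linalg.matrix_rank(P) == r

# The update seen by W is the gradient projected through P (G_full is
# a random stand-in for the true gradient ∇f(W), here with m = 4 rows).
G_full = np.random.randn(4, n)
G_proj = G_full @ P
```

In this toy setting the projector discards the component of each gradient row orthogonal to the row space of $\mathbf{A}$, which is the sense in which the choice of $\mathbf{A}$ determines how much gradient information the LoRA update can retain.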