LAVA: A Unified Framework for Finetuning Language and Vision Models
Daorui Ding ⋅ Fanhua Shang ⋅ Tiancan Feng ⋅ Junkang Liu ⋅ Hongying Liu
Abstract
LoRA and its variants have attracted considerable attention for their ability to tune a negligible number of parameters while achieving comparable downstream performance. This success is largely attributed to the intrinsic low-rank structure of model parameter spaces, which allows LoRA to train two projection matrices that project weights into a low-dimensional subspace and map them back. However, LoRA does not consider how to explore this low-rank subspace sufficiently and may consequently lose expressive power. Moreover, when LoRA is used to tune convolution layers, a flattening operation is required to convert tensors into matrices; we argue that this degrades the model's performance. In this paper, we address these issues from a general parameter-subspace perspective: we present a unified **L**anguage **A**nd **V**ision **A**daption finetuning framework (called **LAVA**). Specifically, we empirically verify the existence of low-rank subspaces in convolution layers and propose to parameterize the increment of both convolution kernels and weight matrices as a sum of learnable rank-1 components. To improve training stability, we analyze the optimization dynamics of LoRA and incorporate orthogonal regularization into our parameterization, proving theoretically that it reduces the variance of the gradient. We conduct extensive experiments on diverse downstream tasks to validate LAVA's superiority. For example, when tuning LLaMA2-7b on commonsense tasks, LAVA outperforms LoRA by **+1.9\%**. For metric depth estimation, LAVA tunes only $\sim$1.5\% of Depth-Anything (335.3M) and achieves **+3.5\%** higher $\delta_1$ accuracy than LoRA and **+5.6\%** higher than SVDiff.
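As a rough illustration of the parameterization the abstract describes, the sketch below writes a weight increment as a sum of learnable rank-1 components and adds a Frobenius-norm orthogonality penalty on the left factors. This is a minimal NumPy sketch under our own assumptions: the function names and the exact form of the regularizer are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def rank1_sum_delta(U, V):
    """Weight increment as a sum of r rank-1 components:
    delta_W = sum_i U[:, i] (outer) V[i, :], which equals U @ V
    for U of shape (d_out, r) and V of shape (r, d_in)."""
    return U @ V

def orthogonal_penalty(U):
    """Regularizer encouraging the rank-1 factors to be mutually
    orthonormal: || U^T U - I ||_F^2 (an assumed, standard form)."""
    r = U.shape[1]
    gram = U.T @ U
    return float(np.sum((gram - np.eye(r)) ** 2))

# Small usage example with hypothetical dimensions.
rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 3
U = 0.1 * rng.standard_normal((d_out, r))
V = 0.1 * rng.standard_normal((r, d_in))

delta_W = rank1_sum_delta(U, V)
assert delta_W.shape == (d_out, d_in)

# An orthonormal U incurs (numerically) zero penalty.
Q, _ = np.linalg.qr(rng.standard_normal((d_out, r)))
assert orthogonal_penalty(Q) < 1e-10
```

In a training loop, the penalty would simply be scaled by a coefficient and added to the task loss; the frozen base weight `W` is used as `W + rank1_sum_delta(U, V)` in the forward pass.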