Adaptive Token Refinement for Long-Tailed Fine-Tuning of Large Vision-Language Models
Abstract
While large vision-language models (LVLMs) have shown remarkable adaptability to downstream applications, their fine-tuning process remains susceptible to bias under long-tailed data. Compared to zero-shot scenarios, fine-tuning LVLMs on imbalanced datasets often yields limited performance improvements on tail data. This is because LVLMs tend to rapidly overfit the head data at an early fine-tuning stage, thereby impairing the learning of the tail data while failing to fully exploit the quantitative advantage of the head data. Furthermore, in many downstream LVLM scenarios, quantified prior knowledge of the long-tailed data distribution is often unavailable, which significantly limits the applicability of traditional long-tailed techniques that rely heavily on such information. To address these issues, we propose Adaptive Token Refinement (ATR), a novel framework that adaptively refines the learning process of LVLMs under long-tailed data. Specifically, ATR consists of two token-level operations applied to output and input tokens, respectively: 1) a bounded adaptive loss that dynamically filters and reweights output tokens to mitigate overfitting on head data, and 2) a visual token mask strategy that augments the probability paths of input tokens to enhance long-tailed performance. Extensive experiments across multiple benchmarks demonstrate that ATR consistently improves both performance and generalization in long-tailed LVLM fine-tuning in a distribution-agnostic manner.
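To make the two token-level operations more concrete, the following is a minimal, illustrative sketch assuming a PyTorch-style fine-tuning loop. The function names (bounded_adaptive_loss, apply_visual_token_mask), the hyperparameters (loss_bound, mask_ratio), and the specific reweighting and masking rules are assumptions for illustration, not the paper's exact formulation of ATR.

```python
import torch
import torch.nn.functional as F


def bounded_adaptive_loss(logits, targets, loss_bound=2.0, ignore_index=-100):
    """Reweight per-token losses so that low-loss tokens (likely head data the
    model has already fit) contribute less, with weights bounded for stability."""
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        reduction="none",
        ignore_index=ignore_index,
    )
    valid = (targets.view(-1) != ignore_index).float()
    # Assumed weighting rule: clamp each token's loss into [0, loss_bound] and
    # normalize, so already well-fit output tokens are filtered/down-weighted.
    weights = per_token.detach().clamp(max=loss_bound) / loss_bound
    weighted = per_token * weights * valid
    return weighted.sum() / valid.sum().clamp(min=1.0)


def apply_visual_token_mask(visual_tokens, mask_ratio=0.15):
    """Randomly mask a fraction of input visual tokens (zeroed out here),
    varying the input paths the model sees across fine-tuning steps."""
    batch, num_tokens, _ = visual_tokens.shape
    keep = torch.rand(batch, num_tokens, device=visual_tokens.device) > mask_ratio
    return visual_tokens * keep.unsqueeze(-1)
```

In an actual fine-tuning step, the masked visual tokens would be fed to the LVLM together with the text tokens, and the bounded loss would be applied to the resulting output logits; the paper's precise filtering rule, bound, and masking scheme may differ from this sketch.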