Hyper-LLaVA: Hyperbolic Uncertainty-aware Modality-Balanced Routing for Multimodal Continual Instruction Tuning
Abstract
Multimodal Continual Instruction Tuning (MCIT) aims to exploit incrementally accumulated knowledge to process multimodal inputs from diverse tasks, and parameter routing is a key technique for it. Existing advanced methods typically route by the similarity between a sample and each task center, and fuse the modalities with equal weights. However, such solutions have two fundamental flaws: (1) Within each modality, sample-to-task-center distance is sub-optimal for routing because it underuses the abundant intra-task diversity information. (2) Different modalities exhibit varying reliability across tasks, so a modality with inter-task ambiguity can easily mislead the routing result. To address these problems, we propose Hyperbolic Uncertainty-aware Modality-Balanced Routing (Hyper-LLaVA), which improves parameter routing by modeling task feature uncertainty across modalities. Specifically, to improve intra-modality task matching, Hyper-LLaVA assesses sample-to-task-distribution similarity in hyperbolic space. In addition, to alleviate the degradation caused by an unreliable modality, Hyper-LLaVA quantifies the task matching ambiguity within each modality and uses it to adaptively balance task matching across modalities. Building on these complementary intra- and inter-modality task matching enhancements, Hyper-LLaVA outperforms state-of-the-art approaches by large margins.
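To make the two ideas concrete, the following is a minimal illustrative sketch, not the Hyper-LLaVA implementation. It rests on several assumptions that the abstract does not state: features are taken to lie inside the unit Poincaré ball, each task's distribution is approximated by a small bank of stored feature points, task matching ambiguity is measured by the normalized entropy of the per-modality matching probabilities, and the fusion weight is proportional to one minus that entropy. All function names (`poincare_distance`, `task_probs`, `normalized_entropy`, `route`) are hypothetical.

```python
import math


def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points of the unit Poincare ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # Clamp the denominator so points near the boundary stay numerically safe.
    denom = max((1.0 - sq_u) * (1.0 - sq_v), eps)
    return math.acosh(1.0 + 2.0 * sq_diff / denom)


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def task_probs(sample, task_banks):
    """Assumed proxy for sample-to-distribution matching: the mean
    hyperbolic distance from the sample to each task's stored points."""
    scores = []
    for bank in task_banks:
        mean_dist = sum(poincare_distance(sample, p) for p in bank) / len(bank)
        scores.append(-mean_dist)  # closer task -> higher score
    return softmax(scores)


def normalized_entropy(probs):
    """Entropy scaled to [0, 1]; 1 means the tasks are indistinguishable."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def route(visual, textual, visual_banks, textual_banks):
    """Fuse per-modality task probabilities, down-weighting the modality
    whose matching is more ambiguous (assumed weighting rule)."""
    p_v = task_probs(visual, visual_banks)
    p_t = task_probs(textual, textual_banks)
    c_v = max(0.0, 1.0 - normalized_entropy(p_v))  # visual confidence
    c_t = max(0.0, 1.0 - normalized_entropy(p_t))  # textual confidence
    total = c_v + c_t
    w_v = 0.5 if total == 0.0 else c_v / total  # fall back to equal weights
    fused = [w_v * a + (1.0 - w_v) * b for a, b in zip(p_v, p_t)]
    return fused.index(max(fused)), w_v


# Toy example: the visual banks of both tasks coincide (ambiguous modality),
# while the textual banks are well separated.
visual_banks = [[[0.1, 0.0]], [[0.1, 0.0]]]
textual_banks = [[[0.0, 0.1]], [[0.0, -0.8]]]
task_id, visual_weight = route([0.1, 0.0], [0.0, -0.7], visual_banks, textual_banks)
```

In this toy case the visual modality cannot separate the two tasks, so its weight collapses toward zero and the routing decision is driven by the textual modality. With equal-weight fusion, the ambiguous modality would instead dilute the textual evidence.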