OmniFit: Bridging Modalities via Layer-Adaptive Token Compression for Omnimodal Large Language Models
Zining Wang ⋅ Zhihang Yuan ⋅ Yingjie Zhai ⋅ Wenshuo Li ⋅ Han Shu ⋅ Ruihao Gong ⋅ Jinyang Guo ⋅ Xianglong Liu
Abstract
Emerging Omni-modal Large Language Models (OmniLLMs) enable real-time interaction across video, audio, and text but suffer from prohibitive computational costs due to the quadratic complexity of processing continuous streaming inputs. Existing token compression strategies remain suboptimal, as they typically rely on biased modality-centric priors or enforce uniform retention policies, neglecting the heterogeneity across layers and the critical role of cross-modality alignment. To address these challenges, we propose OmniFit, a training-free framework that decouples interaction profiling from inference execution. OmniFit incorporates Layer-Adaptive Heterogeneity Profiling (LAHP) to dynamically allocate computational budgets based on layer-wise redundancy and modality preferences, preserving tokens according to the characteristics of each layer. Furthermore, we introduce Alignment-Rectified Token Selection (ARTS), a lightweight mechanism that efficiently identifies tokens semantically aligned with cross-modal cues. Extensive experiments on 3 model series across 10 benchmarks demonstrate that OmniFit establishes a new Pareto frontier, retaining 98\% of model performance with only 20\% token usage while achieving up to a 2.31$\times$ end-to-end inference speedup and 2.5$\times$ VRAM savings, significantly outperforming state-of-the-art methods.
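To make the two components concrete, the sketch below illustrates the general idea of layer-adaptive token compression as described in the abstract: a per-layer budget allocated from layer-wise redundancy (in the spirit of LAHP) and a selection step that keeps the video tokens most aligned with text cues (in the spirit of ARTS). The allocation rule, similarity measure, and all tensor shapes are assumptions for illustration, not the paper's actual method.

```python
import torch
import torch.nn.functional as F

def layer_token_budget(redundancy: torch.Tensor, total_budget: int) -> torch.Tensor:
    """Allocate per-layer token budgets inversely proportional to redundancy.
    (Hypothetical allocation rule; LAHP's profiling procedure is not specified here.)"""
    keep_ratio = 1.0 - redundancy                 # less redundant layers keep more tokens
    weights = keep_ratio / keep_ratio.sum()
    return (weights * total_budget).round().long()

def select_aligned_tokens(video_tokens: torch.Tensor, text_tokens: torch.Tensor,
                          budget: int) -> torch.Tensor:
    """Keep the `budget` video tokens most similar to the mean text embedding,
    a stand-in for cross-modal alignment-based selection (ARTS details are assumed)."""
    query = text_tokens.mean(dim=0, keepdim=True)                  # (1, d)
    scores = F.cosine_similarity(video_tokens, query, dim=-1)      # (num_video_tokens,)
    idx = scores.topk(budget).indices
    return video_tokens[idx]

# Toy usage: 4 layers sharing a total retention budget of 512 tokens.
redundancy = torch.tensor([0.9, 0.7, 0.5, 0.3])
budgets = layer_token_budget(redundancy, total_budget=512)
video = torch.randn(1024, 256)   # 1024 video tokens, hidden size 256
text = torch.randn(32, 256)      # 32 text tokens
kept = select_aligned_tokens(video, text, budget=int(budgets[0]))
print(budgets.tolist(), kept.shape)
```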