PRIM: Cooperative Dynamic Token Compression for Efficient Large Multimodal Models
Abstract
Large multimodal models (LMMs) have shown strong capabilities in audio-visual understanding by jointly reasoning over visual, auditory, and linguistic inputs. However, processing long-form audio-visual content often requires a large number of tokens, leading to substantial computational and memory overhead during inference. Existing efficiency-oriented methods typically apply uniform compression or pruning strategies, which overlook modality-specific characteristics and instruction-dependent reasoning behaviors in multimodal models. In this work, we present PRIM, an inference framework for efficient multimodal reasoning that systematically compresses audio-visual representations based on attention dynamics and instruction relevance. Motivated by an attention-based analysis revealing modality imbalance and layer-wise redundancy in LMMs, PRIM introduces a cooperative compression pipeline that spans both the multimodal encoders and the language model. Specifically, PRIM performs early text-conditioned audio-visual fusion to externalize cross-modal interactions, applies attention-guided and frequency-aware strategies to remove redundant audio and video tokens, and further adapts token retention inside the language model according to task demands. Extensive experiments on multiple audio-visual benchmarks show that PRIM consistently achieves superior efficiency--accuracy trade-offs across diverse tasks and datasets. These results indicate that cooperative multimodal compression, as realized in PRIM, provides an effective pathway toward scalable and efficient audio-visual reasoning.
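To make the attention-guided retention idea mentioned above concrete, the following PyTorch-style sketch shows one generic way to keep only the audio or video tokens that receive the most attention from instruction (text) tokens. This is a minimal illustration under our own assumptions: the function name, the mean-over-text scoring rule, and the `keep_ratio` budget are hypothetical and are not PRIM's exact procedure.

```python
# Illustrative sketch only: a generic attention-guided token-retention step.
# The scoring rule, budget, and names below are assumptions for illustration
# and do not reproduce PRIM's actual algorithm.
import torch


def retain_tokens_by_attention(tokens: torch.Tensor,
                               attn: torch.Tensor,
                               text_idx: torch.Tensor,
                               keep_ratio: float = 0.3) -> torch.Tensor:
    """Keep the tokens most attended to by instruction (text) tokens.

    tokens:   (B, N, D)   audio or video token embeddings
    attn:     (B, H, N, N) attention weights from one layer
    text_idx: (T,)         positions of instruction tokens among the N slots
    """
    # Average over heads, then take the attention each token receives
    # from the instruction-token queries.
    attn_from_text = attn.mean(dim=1)[:, text_idx, :]           # (B, T, N)
    scores = attn_from_text.mean(dim=1)                         # (B, N)

    # Retain the top-k highest-scoring tokens per sample, preserving order.
    k = max(1, int(keep_ratio * tokens.size(1)))
    topk = scores.topk(k, dim=-1).indices.sort(dim=-1).values   # (B, k)
    return torch.gather(tokens, 1,
                        topk.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))


if __name__ == "__main__":
    B, H, N, D = 2, 8, 64, 128
    tokens = torch.randn(B, N, D)
    attn = torch.softmax(torch.randn(B, H, N, N), dim=-1)
    text_idx = torch.arange(56, 64)     # assume the last 8 positions are text
    kept = retain_tokens_by_attention(tokens, attn, text_idx, keep_ratio=0.25)
    print(kept.shape)                   # torch.Size([2, 16, 128])
```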