EchoingPixels: Aliasing-Resistant Joint Token Reduction for Audio-Visual LLMs
Abstract
Audio-Visual Large Language Models (AV-LLMs) grapple with the prohibitive computational cost of processing massive, redundant audio and video token streams. Existing unimodal compression techniques fail to capture the heterogeneous, interdependent information density of joint audio-visual signals. Furthermore, we identify a fundamental and previously overlooked theoretical bottleneck in sparse token reduction: positional aliasing. We demonstrate that aggressive sparse sampling of standard position-encoded sequences violates the Nyquist limit relative to the effective token interval, causing phase-wrapping collisions that corrupt temporal monotonicity. To address this, we introduce EchoingPixels, a framework for aliasing-resistant joint token reduction. First, our Cross-Modal Semantic Sieve performs extractive selection over the joint audio-visual stream, learning to allocate token budgets dynamically according to joint-modality saliency rather than fixed per-modality ratios. Second, to resolve the aliasing issue, we derive Sync-RoPE, a mechanism that acts as a spectral low-pass filter on Rotary Positional Embeddings. By adapting the encoding bandwidth to the sparse sampling rate, Sync-RoPE preserves monotonic temporal relationships in the reduced stream. Extensive experiments show that EchoingPixels matches the performance of full-token models while using only 5-20% of the original tokens, validating that a theoretically grounded approach to sparse learning offers a robust solution for efficient AV-LLMs.
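To make the Nyquist argument concrete, here is a brief reconstruction using the standard RoPE definition (the notation is ours; the abstract itself states no formulas):

\[
\theta_j = b^{-2j/d}, \qquad j = 0, \dots, \tfrac{d}{2}-1,
\]

so the \(j\)-th rotary pair of a token at position \(p\) carries phase \(p\,\theta_j\). Under uniform sparse sampling with stride \(s\), the relative phase between the \(m\)-th and \(n\)-th kept tokens is \(s\,(m-n)\,\theta_j \bmod 2\pi\). This is alias-free only while

\[
s\,\theta_j < \pi,
\]

the Nyquist bound for an effective token interval of \(s\); every frequency with \(s\,\theta_j \ge \pi\) wraps, so distinct temporal offsets collide onto the same phase and the encoded phase is no longer monotone in \(|m-n|\).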
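The abstract describes the Sieve only at the level of its selection rule (the real module is learned). As a toy illustration of joint budget allocation versus fixed per-modality ratios, a minimal NumPy sketch in which all names and the scoring are placeholders of ours:

    import numpy as np

    def joint_topk(audio_scores: np.ndarray, video_scores: np.ndarray, budget: int):
        """Select `budget` tokens by a single top-k over the concatenated saliency
        scores, so the audio/video split emerges from the content rather than
        being fixed a priori (e.g. 'keep 10% of each modality')."""
        scores = np.concatenate([audio_scores, video_scores])
        keep = np.argsort(scores)[-budget:]              # global top-k indices
        n_a = audio_scores.shape[0]
        audio_keep = keep[keep < n_a]                    # indices into the audio stream
        video_keep = keep[keep >= n_a] - n_a             # indices into the video stream
        return np.sort(audio_keep), np.sort(video_keep)

    # A speech-heavy clip: audio saliency dominates, so it absorbs the whole budget.
    rng = np.random.default_rng(0)
    a, v = rng.uniform(0.5, 1.0, 200), rng.uniform(0.0, 0.4, 800)
    audio_keep, video_keep = joint_topk(a, v, budget=100)
    print(len(audio_keep), len(video_keep))              # -> 100 0

The point of the sketch is only the allocation rule: a single ranking over both modalities lets the budget follow joint saliency instead of a preset ratio.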
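The abstract does not spell out the Sync-RoPE construction; one natural reading of "adapting the encoding bandwidth to the sparse sampling rate" is to divide every RoPE frequency by the sampling stride. A minimal NumPy sketch under that assumption (function names and the rescaling rule are ours, not the authors'):

    import numpy as np

    def rope_freqs(dim: int, base: float = 10000.0) -> np.ndarray:
        """Standard RoPE angular frequencies: theta_j = base**(-2j/dim)."""
        return base ** (-np.arange(0, dim, 2) / dim)

    def sync_rope_freqs(dim: int, stride: int, base: float = 10000.0) -> np.ndarray:
        """Assumed low-pass variant: dividing by the stride makes the phase step
        per *kept* token equal to theta_j, exactly the step dense RoPE applies
        per consecutive token, so the whole band stays below the Nyquist bound pi."""
        return rope_freqs(dim, base) / stride

    if __name__ == "__main__":
        dim, stride, kept = 8, 64, 8
        pos = np.arange(kept) * stride  # original indices of the tokens surviving reduction

        # Fastest rotary pair (theta_0 = 1 rad/token): the phase step per kept token
        # is stride * theta_0 = 64 rad >> pi, so wrapped phases collide and re-order:
        # the token at offset 6*stride lands at a *smaller* phase than the one at 1*stride.
        print(np.round(np.mod(pos * rope_freqs(dim)[0], 2 * np.pi), 2))
        # -> [0.   1.17 2.34 3.5  4.67 5.84 0.73 1.89]

        # Rescaled frequencies reproduce, over the sparse stream, exactly the phase
        # progression dense RoPE assigns to consecutive positions 0..kept-1:
        dense = np.arange(kept)[:, None] * rope_freqs(dim)[None, :]
        sparse = pos[:, None] * sync_rope_freqs(dim, stride)[None, :]
        print(np.allclose(dense, sparse))  # -> True

Whether the actual mechanism rescales, clamps, or otherwise filters the spectrum, the invariant illustrated here (phase step per kept token bounded by pi) is what preserves monotone temporal relationships in the reduced stream.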