AD-BTS: Adaptive Dual-Branch Token Sparsification via Spatial Information Density
Abstract
High-resolution visual encoders in multimodal large language models (MLLMs) substantially improve fine-grained perception, yet incur prohibitive computational costs. Existing token pruning methods are effective on natural images but struggle with spatially sparse structured inputs (e.g., charts), where critical high-frequency information is sparse, localized, and structurally essential. To address this challenge, we propose Adaptive Dual-Branch Token Sparsification (AD-BTS), a density-aware framework that dynamically allocates computation according to input signal characteristics. Specifically, AD-BTS introduces a Gradient-based Routing Gate (GRG) that uses lightweight pixel-level gradient statistics to estimate structural flatness and guide routing. Based on this estimate, AD-BTS activates either a Redundancy Selection Branch (RSB) for aggressive token pruning with a frozen encoder, or a Structural Fusion Branch (SFB) with conditional LoRA and context fusion to preserve sparse structural information. Extensive experiments on Qwen2.5-VL demonstrate that AD-BTS establishes a new Pareto frontier between efficiency and accuracy. Under extreme compression (20% token retention), AD-BTS outperforms the strongest baseline by 12.1% on ChartQA while achieving a 1.8× prefill speedup, effectively reconciling computational efficiency with structural robustness.
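To make the routing idea concrete, the following is a minimal sketch of a gradient-based gate in the spirit of GRG: it derives a pixel-level gradient statistic and routes the input to one of the two branches. The edge-density statistic, the `flatness_threshold` parameter, and the specific finite-difference gradients are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def gradient_routing_gate(image, flatness_threshold=0.05):
    """Hypothetical sketch of a gradient-based routing gate.

    Routes spatially sparse structured inputs (low gradient density,
    e.g., charts) to the Structural Fusion Branch (SFB), and dense
    natural images to the Redundancy Selection Branch (RSB).
    """
    img = image.astype(np.float64)
    # Lightweight pixel-level gradient statistics via finite differences.
    gy, gx = np.gradient(img)
    grad_mag = np.sqrt(gx**2 + gy**2)
    # Fraction of pixels carrying non-negligible gradient energy:
    # a low density signals a mostly flat, structured input.
    edge_density = np.mean(grad_mag > 0.1 * grad_mag.max())
    if edge_density < flatness_threshold:
        return "SFB"  # preserve sparse structure (conditional LoRA + fusion)
    return "RSB"      # natural image: aggressive pruning, frozen encoder

# A mostly blank "chart-like" image with one thin line routes to SFB,
# while a densely textured image routes to RSB.
chart = np.zeros((64, 64))
chart[32, :] = 1.0
print(gradient_routing_gate(chart))  # "SFB"
```

In practice the gate would operate before the visual encoder, so its cost is negligible relative to the token-pruning savings it enables.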