Semantic Impact–Driven Visual Scheduling in Vision-Language Models
Abstract
Vision-Language Models (VLMs) suffer from high inference latency due to long visual token sequences. To enable efficient, on-demand use of visual information, we argue that visual necessity should be assessed by its semantic impact on the output distribution, rather than inferred from intermediate interaction signals such as attention weights. We propose a training-free framework built on a token-embedding subspace decomposition that we term the prediction-conditioned Semantic Lens. Specifically, at fixed decoding intervals, we perform QR decomposition on the Top-K candidate token embeddings to construct an orthogonal semantic basis. We then introduce Semantic Impact–Driven Visual Scheduling (SIVS), which measures how visual inputs affect model predictions by projecting vision-induced hidden-state variations onto this Semantic Lens. SIVS provides a geometrically grounded, impact-driven criterion for dynamic visual KV-cache scheduling. Empirical results demonstrate that SIVS achieves approximately 87% visual KV compression while retaining over 99% of model performance.
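As a rough illustration of the criterion sketched above, the following PyTorch snippet computes a per-step impact score: it builds the Semantic Lens via QR decomposition of the Top-K candidate token embeddings and projects the vision-induced hidden-state difference onto that basis. This is a minimal sketch under stated assumptions; the function name, tensor shapes, and the masked forward pass used to obtain the vision-free hidden state are illustrative, not the paper's implementation.

```python
import torch

def semantic_impact_score(logits, embed_matrix, h_with_vision, h_no_vision, k=8):
    """Sketch of the SIVS semantic-impact criterion at one decoding step.

    Assumed (illustrative) inputs:
      logits:        (vocab,) next-token logits at the current step
      embed_matrix:  (vocab, d) token embedding table
      h_with_vision: (d,) hidden state with visual KV entries attended
      h_no_vision:   (d,) hidden state with visual KV entries masked out
      k:             number of Top-K candidate tokens (assumed value)
    """
    # 1. Select the Top-K candidate tokens from the output distribution.
    topk_ids = logits.topk(k).indices            # (k,)
    candidates = embed_matrix[topk_ids]          # (k, d)

    # 2. QR decomposition of the candidate embeddings yields an
    #    orthonormal basis Q spanning the prediction-conditioned
    #    semantic subspace (the "Semantic Lens").
    Q, _ = torch.linalg.qr(candidates.T)         # Q: (d, k), orthonormal columns

    # 3. Project the vision-induced hidden-state variation onto the lens;
    #    the projection norm measures how much the visual input shifts
    #    the model's prediction, and can drive visual KV scheduling.
    delta = h_with_vision - h_no_vision          # (d,)
    return (Q.T @ delta).norm()
```

A scheduler could then compare this score against a threshold at each fixed decoding interval, retaining the visual KV cache only when the projected impact is large; the thresholding policy here is an assumption, as the abstract does not specify one.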