Ask Less, See More: Communication-Conditioned Token Pruning for Vehicle-to-Vehicle Cooperative Autonomous Driving with Multimodal Large Language Models
Abstract
Multimodal Large Language Models (MLLMs) have recently emerged as a promising paradigm for vehicle-to-vehicle (V2V) cooperative autonomous driving, enabling language-based joint perception, prediction, and decision-making in safety-critical scenarios with severe occlusions. However, existing V2V–MLLM frameworks rely on dense token-level sharing and fusion, leading to high communication and inference costs. Moreover, conventional V2V perception methods are limited to feature-sharing paradigms without language reasoning, and existing generic token pruning strategies fail to account for LiDAR-specific spatial structure and multi-agent fusion. To address these limitations, we propose the V2V Communication-Conditioned MLLM framework (V2V-CCM), a dual-stage cooperative communication framework that broadcasts request messages to all agents and uses them to identify redundant visual tokens. Specifically, the Question Semantic Message (QSM) encodes the global question intent to guide question-relevant token selection, while a second, LiDAR-derived semantic message summarizes LiDAR features to identify spatially redundant tokens that are already observed and therefore need not be transmitted. By integrating this strategy into the dual-stage framework, our method substantially reduces communication and inference costs while preserving question-relevant tokens and discarding spatially redundant ones. Extensive experiments on the V2V-QA and V2V-GoT-QA datasets demonstrate that V2V-CCM consistently outperforms existing pruning methods and achieves state-of-the-art performance.
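As a rough illustration of the idea described above (not the paper's actual implementation), the sketch below prunes a responding agent's visual tokens using two signals: relevance to a broadcast question embedding and spatial redundancy with respect to the requester's own LiDAR coverage. All function names, shapes, and thresholds here are illustrative assumptions.

```python
# Minimal sketch of communication-conditioned token pruning (assumed interface,
# not the authors' code): keep tokens that are question-relevant and not already
# observed by the requesting agent.
import torch
import torch.nn.functional as F

def prune_tokens(tokens, token_xy, question_emb, covered_mask_fn, keep_ratio=0.25):
    """
    tokens:          (N, D) visual token embeddings of the responding agent
    token_xy:        (N, 2) BEV coordinates of each token
    question_emb:    (D,)   embedding of the broadcast question intent (QSM)
    covered_mask_fn: callable mapping (N, 2) coords -> (N,) bool, True where the
                     requester's LiDAR summary already covers that region
    keep_ratio:      fraction of tokens to transmit after pruning
    """
    # Question relevance: cosine similarity between each token and the question embedding.
    relevance = F.cosine_similarity(tokens, question_emb.unsqueeze(0), dim=-1)  # (N,)

    # Spatial redundancy: tokens whose location is already observed need not be sent.
    redundant = covered_mask_fn(token_xy)  # (N,) bool

    # Suppress redundant tokens, then keep the top-k most question-relevant ones.
    scores = relevance.masked_fill(redundant, float("-inf"))
    k = max(1, int(keep_ratio * tokens.size(0)))
    keep_idx = scores.topk(k).indices
    return tokens[keep_idx], keep_idx
```

Under these assumptions, only the selected subset of tokens would be transmitted and fused by the MLLM, which is the source of the communication and inference savings claimed in the abstract.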