X-MoGe: A Cross-Modal Adaptation Framework with Mixture-of-Experts and Geometry Guidance for Heterogeneous Collaborative Perception
Abstract
Multi-agent collaborative perception extends the perception range and improves the robustness of autonomous driving systems. However, most existing methods assume homogeneous sensors and perception networks, an assumption that rarely holds in practice: real-world agents differ in sensing modalities and carry independently trained models, which introduces significant semantic and geometric inconsistencies and limits effective collaboration. To address these issues, we propose X-MoGe, a cross-modal adaptation framework for heterogeneous collaborative perception that combines a Mixture-of-Experts design with geometry-guided fusion. Specifically, we introduce a Pixel-level Mixture-of-Experts (P-MoE) module that adaptively models modality-specific semantic characteristics under heterogeneous sensing conditions. In addition, a geometry-guided feature fusion module incorporates explicit geometric priors to enforce spatial alignment and consistency in the bird's-eye-view (BEV) space. Extensive experiments on the OPV2V and DAIR-V2X datasets demonstrate that X-MoGe achieves state-of-the-art performance with strong robustness and scalability in heterogeneous collaborative perception.
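To make the pixel-level routing idea concrete, the following is a minimal sketch of per-pixel Mixture-of-Experts gating over a BEV feature map, assuming a PyTorch-style implementation. The class and parameter names (PixelMoE, num_experts, the 1x1-convolution experts) are hypothetical illustrations and are not taken from the paper's actual architecture.

```python
# Illustrative sketch only: per-pixel soft MoE gating over BEV features.
import torch
import torch.nn as nn


class PixelMoE(nn.Module):
    """Routes each BEV pixel's feature through a soft combination of experts."""

    def __init__(self, channels: int, num_experts: int = 4):
        super().__init__()
        # One lightweight expert per modality-specific semantic pattern (assumed).
        self.experts = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in range(num_experts)
        )
        # Gating network predicts a per-pixel distribution over experts.
        self.gate = nn.Conv2d(channels, num_experts, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) BEV feature map from one agent.
        weights = torch.softmax(self.gate(x), dim=1)                   # (B, E, H, W)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, C, H, W)
        # Weighted sum over experts, computed independently at every pixel.
        return (weights.unsqueeze(2) * expert_out).sum(dim=1)          # (B, C, H, W)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 100, 100)  # toy BEV features
    out = PixelMoE(channels=64)(feats)
    print(out.shape)  # torch.Size([2, 64, 100, 100])
```

The per-pixel softmax lets the gate assign different expert mixtures to different spatial locations, which is what distinguishes pixel-level routing from the more common image-level or token-level MoE gating.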