MFH-NAS: A Hybrid Neural Architecture Search Framework for Multimodal Fusion Object Detection
Abstract
Multimodal fusion object detection suffers from a substantial modality gap at corresponding backbone stages, which makes predefined stage-aligned fusion insufficient for capturing cross-stage interactions. We propose MFH-NAS, a hybrid neural architecture search (NAS) framework that automatically discovers fusion architectures to better exploit cross-modal complementarity. MFH-NAS searches at two levels: local fusion primitives (fusion operator design) and stage-level fusion connectivity (fusion stage selection). It couples differentiable search with evolutionary search: the differentiable component learns architecture parameters for the local fusion primitives, while the evolutionary component explores global fusion topologies, including stage selection and cross-stage connection patterns. This joint search balances exploitation and exploration, mitigates premature convergence, and yields fusion structures with stronger cross-stage interactions. We evaluate MFH-NAS on three public benchmarks: LLVIP, RGBT-Tiny, and M3FD. MFH-NAS consistently outperforms handcrafted fusion-stage designs and prior stage-searching NAS baselines, improving mAP@0.5 from 85.3% to 88.2% over strong fixed-stage fusion methods and delivering gains across all benchmarks.
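The hybrid search described above can be sketched as an alternating loop: a differentiable step refines per-stage weights over candidate fusion primitives, and an evolutionary step mutates a binary genome encoding which backbone stages are fused. The sketch below is illustrative only; all names (`FUSION_PRIMITIVES`, `proxy_score`, `hybrid_search`) and the toy scoring function are assumptions, not the paper's actual implementation, which would train and evaluate real detectors.

```python
import math
import random

# Hypothetical sketch of a hybrid NAS loop (assumed structure, not the
# paper's code): differentiable weights select local fusion primitives;
# an evolutionary loop selects stage-level fusion connectivity.

FUSION_PRIMITIVES = ["add", "concat", "attention"]  # candidate local fusion ops
NUM_STAGES = 4  # backbone stages that may host a fusion block

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def proxy_score(genome, alphas):
    # Stand-in for validation mAP: rewards more fused stages and confident
    # primitive choices. A real system would train/evaluate a detector here.
    return sum(genome) + sum(max(softmax(a)) for a in alphas)

def differentiable_step(alphas, lr=0.1):
    # Toy gradient-ascent update on primitive confidence, standing in for
    # backprop through a continuously relaxed fusion supernet.
    new = []
    for a in alphas:
        p = softmax(a)
        i = p.index(max(p))
        new.append([v + lr if j == i else v - lr / 2 for j, v in enumerate(a)])
    return new

def evolve(population, alphas, n_offspring=8, mut_rate=0.2):
    # Mutate stage-connectivity genomes and keep the best (elitist selection).
    ranked = sorted(population, key=lambda g: proxy_score(g, alphas), reverse=True)
    parents = ranked[: max(1, len(ranked) // 2)]
    children = []
    for _ in range(n_offspring):
        g = list(random.choice(parents))
        for k in range(len(g)):
            if random.random() < mut_rate:
                g[k] = 1 - g[k]  # flip: fuse / don't fuse this stage
        children.append(g)
    merged = sorted(parents + children, key=lambda g: proxy_score(g, alphas), reverse=True)
    return merged[: len(population)]

def hybrid_search(rounds=5, pop_size=6, seed=0):
    random.seed(seed)
    alphas = [[0.0] * len(FUSION_PRIMITIVES) for _ in range(NUM_STAGES)]
    population = [[random.randint(0, 1) for _ in range(NUM_STAGES)]
                  for _ in range(pop_size)]
    for _ in range(rounds):
        alphas = differentiable_step(alphas)     # exploitation: local primitives
        population = evolve(population, alphas)  # exploration: global topology
    best_genome = max(population, key=lambda g: proxy_score(g, alphas))
    best_ops = [FUSION_PRIMITIVES[p.index(max(p))] for p in map(softmax, alphas)]
    return best_genome, best_ops

best_genome, best_ops = hybrid_search()
print(best_genome, best_ops)
```

The alternation mirrors the exploitation/exploration split in the abstract: gradient-style updates converge quickly on operator choices, while the population-based step keeps the stage-connectivity search from collapsing to a single topology too early.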