Remove the Ambiguity: Few-shot Multimodal Anomaly Detection Using Crossmodal Feature Replacers
Abstract
Reconstruction-based multimodal anomaly detection is fundamentally challenged by the one-to-many crossmodal mapping problem: a single 3D feature may correspond to multiple valid RGB appearances, often leading to collapsed reconstructions and degraded detection performance. We propose Crossmodal Feature Replacer (CFR), a self-supervised ensemble framework that resolves ambiguous crossmodal reconstructions. CFR first learns cyclic mappings between RGB and 3D features while constructing modality-specific and paired memory banks. It then employs an attention-based retrieval network to identify reliable crossmodal candidates. During inference, unreliable reconstructed features are selectively replaced with high-confidence retrieved features, yielding an unambiguous representation for anomaly detection. We evaluate CFR on MVTec 3D-AD and Eyecandies under few-shot settings. Extensive experiments show that CFR consistently outperforms state-of-the-art methods, achieving 92.3 (82.7) AUPRO@30\% and 74.2 (75.9) I-AUROC on MVTec 3D-AD (Eyecandies) in the 1-shot setting, demonstrating its effectiveness in addressing crossmodal reconstruction ambiguity.
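The selective replacement step described above can be sketched as follows. This is a minimal illustration in NumPy, not the paper's actual implementation: the confidence scores, threshold `tau`, and feature shapes are all illustrative assumptions.

```python
import numpy as np

def replace_unreliable(reconstructed, retrieved, confidence, tau=0.5):
    """Replace reconstructed features whose confidence falls below tau
    with the corresponding retrieved (memory-bank) features.
    Note: tau and the per-feature confidence scores are hypothetical
    stand-ins for the reliability estimate used by the method."""
    mask = confidence < tau        # per-feature reliability mask
    out = reconstructed.copy()
    out[mask] = retrieved[mask]    # swap in high-confidence candidates
    return out

# Toy example: 4 feature vectors of dimension 3
recon = np.zeros((4, 3))                  # ambiguous reconstructions
retr = np.ones((4, 3))                    # retrieved candidates
conf = np.array([0.9, 0.2, 0.7, 0.1])     # features 1 and 3 are unreliable
fused = replace_unreliable(recon, retr, conf)
```

In this sketch, the low-confidence rows (indices 1 and 3) are taken from the retrieved features, while the reliable rows keep their reconstructed values, mirroring the ensemble behavior the abstract describes.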