Neural-Inspired Modeling of Auditory Selection and Compensation for Audio-Visual Speech Separation
Abstract
Current audio-visual speech separation (AVSS) models typically rely on implicit multimodal fusion, but the absence of explicit modality alignment and reliability modeling often causes semantic misalignment and contaminates speech representations. The brain addresses this with a hierarchy: top-down auditory selection uses visual priors to maintain target-consistent acoustics, while bottom-up cross-modal compensation integrates temporally aligned articulatory cues to reconstruct and stabilize speech. Guided by this principle, we present Neuro-SCNet, an AVSS architecture that makes selection and compensation explicit and reliability-aware. The Auditory Selection Mechanism applies a top-down, visually guided gain along the audio pathway to isolate target time-frequency units and suppress distractors. It preserves the auditory trace through an identity bypass while adding controlled visual refinements via a residual path, and a synchrony-driven gate reduces the influence of low-confidence visual cues. Additionally, a lightweight pre-alignment stage estimates and corrects small temporal offsets in the visual features, and a compact magnitude-phase encoder preserves fine acoustic detail to stabilize reconstruction. Evaluations on LRS2, LRS3, and VoxCeleb2 show state-of-the-art separation with improved efficiency, supporting the value of explicit selection and reliability-aware compensation.
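To make the selection-and-compensation idea concrete, the following minimal PyTorch sketch illustrates one plausible form of the mechanism the abstract describes: an identity bypass that preserves the auditory trace, a visually guided gain on a residual path, and a synchrony-driven gate that scales the visual refinement down when the cue is unreliable. It is an illustrative assumption, not the Neuro-SCNet implementation; the module name GatedVisualRefinement, the feature dimensions, and the cosine-similarity synchrony proxy are all hypothetical.

import torch
import torch.nn as nn


class GatedVisualRefinement(nn.Module):
    """Illustrative sketch of reliability-gated audio-visual fusion."""

    def __init__(self, audio_dim: int, visual_dim: int):
        super().__init__()
        # Project visual (lip) features into the audio feature space.
        self.vis_proj = nn.Linear(visual_dim, audio_dim)
        # Visually guided gain over audio channels (top-down selection).
        self.gain = nn.Sequential(nn.Linear(audio_dim, audio_dim), nn.Sigmoid())
        # Synchrony-driven gate: maps an audio-visual agreement score to [0, 1].
        self.gate = nn.Sequential(nn.Linear(1, 1), nn.Sigmoid())

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, time, audio_dim) acoustic features
        # visual: (batch, time, visual_dim) frame-aligned visual features
        v = self.vis_proj(visual)                        # (B, T, audio_dim)
        refined = self.gain(v) * audio                   # visually guided gain
        # Crude synchrony proxy: per-frame cosine similarity between streams.
        sync = torch.cosine_similarity(audio, v, dim=-1)  # (B, T)
        g = self.gate(sync.unsqueeze(-1))                # (B, T, 1) reliability gate
        # Identity bypass plus gated residual refinement.
        return audio + g * refined


if __name__ == "__main__":
    fuse = GatedVisualRefinement(audio_dim=256, visual_dim=512)
    a = torch.randn(2, 100, 256)   # 2 utterances, 100 frames
    v = torch.randn(2, 100, 512)
    print(fuse(a, v).shape)        # torch.Size([2, 100, 256])

Under this reading, a gate value near zero leaves the audio pathway essentially untouched, which is what makes the compensation reliability-aware rather than an unconditional fusion.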