Sim2Real Evaluation of Visual-to-Echo Distillation for Binaural Depth Prediction
Abstract
Echo reflections encode physical cues about object distance, geometry, and surface material that are useful for spatial reasoning. Prior work has incorporated echo reflections as a modality for depth prediction, either through direct fusion or through cross-modal knowledge distillation from vision to audio, but evaluation has been confined to simulated environments such as Replica and Matterport3D, leaving real-world viability untested. In this short paper, we evaluate Visual2Echo Compositional Contrastive Learning (V2E-CCL), a knowledge distillation framework that predicts depth from binaural echoes by aligning cross-modal representations in a shared latent space, on real binaural recordings from the BatVision dataset. We report and analyse our findings against the strongest audio-only baseline, demonstrating that vision-to-echo distillation generalises beyond simulation. Code will be made available upon acceptance.