Poster in Affinity Event: The 6th Muslims in ML (MusIML) Workshop

Sim2Real evaluation of Visual-to-Echo Distillation for Binaural Depth Prediction

Nazrul Ismail ⋅ Wee Hong Ong ⋅ Owais A Malik

Abstract

Echo reflections encode physical cues about object distance, geometry, and surface material that are useful for spatial reasoning. Prior work has proposed incorporating echo reflections as a modality in depth prediction, either through direct fusion or through cross-modal knowledge distillation from vision to audio, but evaluation has been confined to simulated environments such as Replica and Matterport3D, leaving real-world viability untested. In this short paper, we evaluate Visual2Echo Compositional Contrastive Learning (V2E-CCL), a knowledge distillation framework that predicts depth from binaural echoes by aligning cross-modal representations in a shared latent space, on real binaural recordings from the BatVision dataset. We report and analyse our findings against the strongest audio-only baseline, demonstrating that vision-to-echo distillation generalises beyond simulation. Code will be made available upon acceptance.
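
To illustrate what aligning echo and visual representations in a shared latent space can look like, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) alignment loss. It is not the V2E-CCL objective itself: the function names, the temperature value, the embedding sizes, and the random toy data are all assumptions made for illustration, and the compositional component of the actual framework is not modelled here.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def contrastive_alignment_loss(echo_emb, visual_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss (hypothetical helper, not the paper's code).

    Row i of echo_emb and row i of visual_emb are assumed to come from the
    same scene (a positive pair); all other rows in the batch act as negatives.
    """
    e = l2_normalize(echo_emb)
    v = l2_normalize(visual_emb)
    logits = e @ v.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(len(e))
    loss_e2v = -log_softmax(logits)[idx, idx].mean()    # echo -> visual
    loss_v2e = -log_softmax(logits.T)[idx, idx].mean()  # visual -> echo
    return 0.5 * (loss_e2v + loss_v2e)

# Toy data (assumed): 8 scenes, 32-dimensional embeddings.
rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 32))
aligned_echo = visual + 0.01 * rng.normal(size=(8, 32))  # echo close to its visual pair
random_echo = rng.normal(size=(8, 32))                   # echo unrelated to visual

loss_aligned = contrastive_alignment_loss(aligned_echo, visual)
loss_random = contrastive_alignment_loss(random_echo, visual)
print(loss_aligned < loss_random)
```

In a distillation setting of this kind, the visual embeddings would typically come from a frozen teacher network and only the echo encoder would be updated, so that a lower loss means the echo representation has moved closer to its paired visual representation.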