Poster
BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models
Susan Liang · Dejan Markovic · Israel D. Gebru · Steven Krenn · Todd Keebler · Jacob Sandakly · Frank Yu · Samuel Hassel · Chenliang Xu · Alexander Richard
West Exhibition Hall B2-B3 #W-321
Binaural rendering aims to synthesize binaural audio that mimics natural hearing from a mono audio signal and the locations of the speaker and listener. Although many methods have been proposed to solve this problem, they struggle with rendering quality and streamable inference. Synthesizing high-quality binaural audio that is indistinguishable from real-world recordings requires precise modeling of binaural cues, room reverb, and ambient sounds. Additionally, real-world applications demand streaming inference. To address these challenges, we propose a flow-matching-based streaming binaural speech synthesis framework called BinauralFlow. We consider binaural rendering to be a generation problem rather than a regression problem and design a conditional flow matching model to render high-quality audio. Moreover, we design a causal U-Net architecture that estimates the current audio frame solely from past information, tailoring generative models for streaming inference. Finally, we introduce a continuous inference pipeline incorporating streaming STFT/ISTFT operations, a buffer bank, a midpoint solver, and an early skip schedule to improve rendering continuity and speed. Quantitative and qualitative evaluations demonstrate the superiority of our method over state-of-the-art approaches. A perceptual study further reveals that our model is nearly indistinguishable from real-world recordings, with a 42% confusion rate.
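The abstract mentions a midpoint solver combined with an early skip schedule for fast flow-matching inference. As a rough illustration of that idea (not the paper's implementation), the sketch below integrates a velocity field with the midpoint (RK2) method, starting partway along the flow to mimic skipping early steps; `velocity_fn` is a hypothetical stand-in for the trained network.

```python
import numpy as np

def midpoint_solve(velocity_fn, x0, num_steps=4, t_start=0.0):
    """Integrate dx/dt = v(x, t) from t_start to 1 with the midpoint method.

    `velocity_fn` is a hypothetical stand-in for a trained flow-matching
    network; setting `t_start > 0` mimics an early-skip schedule that
    begins integration partway along the flow to save solver steps.
    """
    x = np.array(x0, dtype=float)
    t = t_start
    dt = (1.0 - t_start) / num_steps
    for _ in range(num_steps):
        # Midpoint rule: take a half step, then evaluate the velocity there.
        x_mid = x + 0.5 * dt * velocity_fn(x, t)
        x = x + dt * velocity_fn(x_mid, t + 0.5 * dt)
        t += dt
    return x

# Toy linear flow v(x, t) = x: the exact solution is x(1) = x(0) * e.
x1 = midpoint_solve(lambda x, t: x, np.ones(4), num_steps=100)
```

The midpoint method trades one extra network evaluation per step for second-order accuracy, which is why few steps can suffice at inference time.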
Imagine you’re wearing headphones and you hear someone speaking: not just the voice, but where the person is in the room and how far away they are. This 3D-like listening experience is called “binaural audio,” and it’s key to making virtual reality, games, and immersive media feel lifelike.

Today, creating this kind of audio usually requires specialized equipment or software that either can’t generate high enough quality or can’t do it fast enough for live situations like gaming or live chat.

Our research presents a new method called BinauralFlow that uses advanced machine learning to create high-quality, realistic binaural speech from regular (mono) audio, on the fly and in real time. Instead of just copying sound, our system makes the sound nearly indistinguishable from real recordings.

This has the potential to transform how we experience digital sound, enabling more natural-sounding virtual meetings, immersive virtual worlds, and even more convincing avatars or virtual assistants. By improving both speed and realism, BinauralFlow brings us a step closer to seamless, lifelike audio experiences in our everyday digital lives.