

Invited talk in Workshop: Machine Learning for Audio Synthesis

Self-supervised learning for speech generation

Wei-Ning Hsu


Abstract:

Self-supervised learning (SSL) for speech has demonstrated great success on inference tasks such as speech recognition, but it is less studied for generative tasks, where the goal is to synthesize speech. In this talk, I will share our recent work on building unconditional and conditional generative speech models that leverage SSL. Instead of representing speech with traditional features such as spectrograms, we showed that discrete units derived from self-supervised models serve as better generative modeling targets for several tasks. Specifically, we presented the first text-free spoken language models for prosodically rich speech as well as for spoken dialogues, and achieved state-of-the-art performance on speech-to-speech translation without intermediate text output.
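To make the "discrete units as generative targets" idea concrete, below is a minimal sketch of one common way such units are obtained: quantizing frame-level SSL features with k-means and collapsing consecutive repeats. The feature matrix here is a random stand-in for representations from a model such as HuBERT, and the codebook size of 100 is an arbitrary illustrative choice, not the talk's actual configuration.

```python
# Minimal sketch: continuous SSL features -> discrete unit sequence.
# The "features" are random stand-ins for frame-level representations
# from a self-supervised speech model (e.g., HuBERT); the codebook
# size (100) is an arbitrary illustrative choice.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for SSL features: (num_frames, feature_dim), e.g. one row
# per 20 ms frame. A real pipeline would extract these from audio.
features = rng.normal(size=(500, 768)).astype(np.float32)

# Quantize frames into a discrete vocabulary with k-means. Each cluster
# index acts like a "pseudo-text" token for downstream modeling.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
units = kmeans.predict(features)  # shape (num_frames,), values in [0, 100)

# Collapse consecutive repeats so the sequence reflects content changes
# rather than the fixed frame rate (a common step in unit-based modeling).
deduped = [int(units[0])] + [int(u) for prev, u in zip(units, units[1:]) if u != prev]

print(f"{len(units)} frames -> {len(deduped)} units")
print("first 20 units:", deduped[:20])
```

The repeat-collapsing step matters because raw frame-rate unit sequences are dominated by durations; removing repeats lets a language model over units focus on content, with duration modeled separately if needed.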
