Invited talk
Workshop: Machine Learning for Audio Synthesis
Self-supervised learning for speech generation
Wei-Ning Hsu
Self-supervised learning (SSL) for speech has demonstrated great success on inference tasks such as speech recognition. However, it is less studied for generative tasks, where the goal is to synthesize speech. In this talk, I will share our recent work on building unconditional and conditional generative speech models that leverage SSL. Instead of representing speech with traditional features such as spectrograms, we show that discrete units derived from self-supervised models serve as better generative modeling targets for several tasks. Specifically, we present the first text-free spoken language models capable of generating prosodically rich speech as well as spoken dialogues, and achieve state-of-the-art performance on speech-to-speech translation without intermediate text output.
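To make the core idea concrete, the sketch below illustrates one common way to obtain such discrete units: encode speech with a pretrained SSL model and quantize the frame-level features with k-means, yielding a "pseudo-text" unit sequence that a language model can be trained on. This is a minimal illustration, not the exact pipeline from the talk; the choice of model (torchaudio's HuBERT Base), feature layer, number of clusters (K=100), and the input file name are all illustrative assumptions.

```python
# Minimal sketch: speech -> SSL frame features -> k-means -> discrete units.
# Model, layer, and K are illustrative assumptions, not the talk's exact setup.
import torch
import torchaudio
from sklearn.cluster import KMeans

# Load a pretrained self-supervised speech model (HuBERT Base from torchaudio).
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("speech.wav")  # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # extract_features returns one frame-level feature tensor per layer.
    features, _ = model.extract_features(waveform)
hidden = features[6].squeeze(0)  # an intermediate layer; layer choice is an assumption

# Quantize frames into K discrete units (K=100 is a common but illustrative choice).
kmeans = KMeans(n_clusters=100, n_init=10).fit(hidden.numpy())
units = kmeans.predict(hidden.numpy())  # one unit id per ~20 ms frame

# In practice, consecutive duplicate units are often collapsed before the
# sequence is fed to a unit language model as pseudo-text.
print(units[:20])
```

In a full system, the resulting unit sequences replace text as the modeling target: a language model is trained directly on units for unconditional generation, and a separate unit-to-waveform vocoder converts generated units back into speech.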