Self-supervised learning (SSL) for speech has demonstrated great success on inference tasks such as speech recognition. However, it is less studied for generative tasks, where the goal is to synthesize speech. In this talk, I will share our recent work on building unconditional and conditional generative speech models that leverage SSL. Instead of representing speech with traditional features such as spectrograms, we show that discrete units derived from self-supervised models serve as better generative modeling targets for several tasks. Specifically, we present the first text-free spoken language models for prosodically rich speech as well as spoken dialogues, and achieve state-of-the-art performance on speech-to-speech translation without intermediate text output.
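The pipeline the abstract alludes to can be sketched in miniature: frame-level SSL features are quantized into a sequence of discrete units (in practice via k-means over features from a model such as HuBERT), and a language model is then trained over those unit sequences as if they were text. Below is a hedged, self-contained illustration; random vectors stand in for real SSL features, and a bigram count model stands in for the spoken language model. All names and sizes are illustrative, not the authors' actual configuration.

```python
import numpy as np

def kmeans_quantize(feats, k=8, iters=10, seed=0):
    """Assign each feature frame to one of k discrete units via k-means.

    Stand-in for the quantization step; real systems cluster features
    from a pretrained SSL model rather than raw or random vectors.
    """
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest center (squared Euclidean).
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        units = d.argmin(1)
        # Recompute each center; keep the old one if its cluster is empty.
        for j in range(k):
            if (units == j).any():
                centers[j] = feats[units == j].mean(0)
    return units

def bigram_unit_lm(units, k=8):
    """A minimal 'unit language model': smoothed bigram transition probs."""
    counts = np.zeros((k, k))
    for a, b in zip(units[:-1], units[1:]):
        counts[a, b] += 1
    # Add-one smoothing, then normalize rows to probabilities.
    return (counts + 1) / (counts + 1).sum(1, keepdims=True)

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))   # stand-in for SSL frame features
units = kmeans_quantize(feats)       # "pseudo-text": discrete unit sequence
probs = bigram_unit_lm(units)        # generative model over units
```

In the actual systems described in the talk, the bigram model is replaced by a large autoregressive Transformer over the unit sequences, and a separate unit-to-speech vocoder maps generated units back to a waveform.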
Author Information
Wei-Ning Hsu (FAIR)
More from the Same Authors
- 2022 : Panel Discussion »
  Mirco Ravanelli · Chris Donahue · Zhifeng Kong · Wei-Ning Hsu · Rachel Manzelli · Sadie Allen
- 2022 Poster: data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language »
  Alexei Baevski · Wei-Ning Hsu · Qiantong Xu · Arun Babu · Jiatao Gu · Michael Auli
- 2022 Oral: data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language »
  Alexei Baevski · Wei-Ning Hsu · Qiantong Xu · Arun Babu · Jiatao Gu · Michael Auli