

Poster

Speech Self-Supervised Learning Using Diffusion Model Synthetic Data

Heting Gao · Kaizhi Qian · Junrui Ni · Chuang Gan · Mark Hasegawa-Johnson · Shiyu Chang · Yang Zhang

Hall C 4-9 #212
Poster presentation
Wed 24 Jul 4:30 a.m. PDT — 6 a.m. PDT

Oral presentation: Oral 4F Labels
Wed 24 Jul 7:30 a.m. PDT — 8:30 a.m. PDT

Abstract:

While self-supervised learning (SSL) in speech has greatly reduced the reliance of speech processing systems on annotated corpora, the success of SSL still hinges on the availability of a large-scale unannotated corpus, which remains impractical for many low-resource languages or under privacy concerns. Some existing work seeks to alleviate the problem with data augmentation, but most approaches are confined to perturbing real speech and do not introduce new variation in prosody, speakers, or speech content, which is important for SSL. Motivated by the recent finding that diffusion models have superior capabilities for modeling data distributions, we propose DiffS4L, a pretraining scheme that augments the limited unannotated data with synthetic data at different levels of variation, generated by a diffusion model trained on that same limited data. An SSL model is then pretrained on the real and the synthetic speech. Our experiments show that DiffS4L can significantly improve the performance of SSL models, for example reducing the WER of the pretrained HuBERT model by 6.26 percentage points on the English ASR task. Notably, we find that synthetic speech at all levels of variation, i.e., new prosody, new speakers, and even new content (despite the new content being mostly babble), accounts for a significant share of the performance improvement. The code is available at github.com/Hertin/DiffS4L.
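
The abstract describes a three-step pipeline: train a diffusion model on the small unannotated corpus, sample synthetic speech at increasing levels of variation (new prosody, new speakers, new content), and pretrain the SSL model on the union of real and synthetic data. The minimal sketch below illustrates only that data flow; the names DiffusionSynthesizer, Utterance, and the variation-level labels are hypothetical placeholders standing in for the authors' actual components (see github.com/Hertin/DiffS4L for the real implementation).

import random
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    waveform: List[float]  # placeholder for raw audio samples
    source: str            # "real" or the synthetic variation level


class DiffusionSynthesizer:
    """Stand-in for the diffusion model trained on the limited real corpus."""

    def __init__(self, real_corpus: List[Utterance]):
        self.real_corpus = real_corpus  # the only data the diffusion model sees

    def sample(self, level: str) -> Utterance:
        # A real system would run reverse diffusion here to generate speech with
        # new prosody ("prosody"), a new speaker ("speaker"), or entirely new,
        # mostly babble, content ("content"). This stub just copies a real
        # utterance and tags it with the requested variation level.
        seed = random.choice(self.real_corpus)
        return Utterance(waveform=list(seed.waveform), source=f"synthetic/{level}")


def build_pretraining_corpus(real_corpus: List[Utterance],
                             synth_per_level: int) -> List[Utterance]:
    """Combine real speech with diffusion-generated speech at every variation level."""
    synthesizer = DiffusionSynthesizer(real_corpus)
    corpus = list(real_corpus)
    for level in ("prosody", "speaker", "content"):
        corpus.extend(synthesizer.sample(level) for _ in range(synth_per_level))
    return corpus


if __name__ == "__main__":
    real = [Utterance(waveform=[0.0] * 16000, source="real") for _ in range(4)]
    corpus = build_pretraining_corpus(real, synth_per_level=2)
    # The combined corpus would then feed a standard SSL recipe, e.g. HuBERT pretraining.
    print(f"{len(corpus)} utterances ready for SSL pretraining")

Per the abstract, the notable finding is that all three variation levels, not only the most realistic one, contribute to the downstream WER improvement.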
