Poster
Speech Self-Supervised Learning Using Diffusion Model Synthetic Data
Heting Gao · Kaizhi Qian · Junrui Ni · Chuang Gan · Mark Hasegawa-Johnson · Shiyu Chang · Yang Zhang
Hall C 4-9 #212
Wed 24 Jul 7:30 a.m. PDT — 8:30 a.m. PDT
While self-supervised learning (SSL) in speech has greatly reduced the reliance of speech processing systems on annotated corpora, the success of SSL still hinges on the availability of a large-scale unannotated corpus, which is still often impractical for many low-resource languages or under privacy concerns. Some existing work seeks to alleviate the problem by data augmentation, but most works are confined to introducing perturbations to real speech and do not introduce new variations in speech prosody, speakers, and speech content, which are important for SSL. Motivated by the recent finding that diffusion models have superior capabilities for modeling data distributions, we propose DiffS4L, a pretraining scheme that augments the limited unannotated data with synthetic data with different levels of variations, generated by a diffusion model trained on the limited unannotated data. Finally, an SSL model is pre-trained on the real and the synthetic speech. Our experiments show that DiffS4L can significantly improve the performance of SSL models, such as reducing the WER of the HuBERT pretrained model by 6.26 percentage points in the English ASR task. Notably, we find that the synthetic speech with all levels of variations, i.e. new prosody, new speakers, and even new content (despite the new content being mostly babble), accounts for significant performance improvement. The code is available at github.com/Hertin/DiffS4L.