Contributed talk in Workshop: Machine Learning for Audio Synthesis
Speech De-warping: Unsupervised Pre-training for Data-Efficient Text-to-Speech on Low Resource Languages
MYOUNGSEO SONG
Neural text-to-speech (TTS) models can synthesize natural human speech when trained on large amounts of transcribed speech. However, collecting such large-scale transcribed data is expensive. In this paper, we propose an unsupervised pre-training method that uses large amounts of untranscribed speech to reduce the amount of paired data required to train a sequence-to-sequence TTS model. The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones. To make the warping and de-warping semantically meaningful, we train a self-supervised phoneme segmentation model and use its segments to warp the spectrograms at a pseudo-phoneme level. In addition, as a byproduct of our pre-training process, we can optionally apply segment-based data augmentation in the fine-tuning stage to further improve data efficiency. We empirically demonstrate the effectiveness of our method in a low-resource language scenario, achieving outstanding performance compared to various baselines.
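The warping step described above can be illustrated with a minimal sketch. The abstract does not specify the warping function, so this example assumes each pseudo-phoneme segment is stretched or compressed along the time axis by a random factor using linear interpolation. The function name `warp_segments`, the scale range, and the hand-written boundaries are hypothetical; in the described method the boundaries would come from the self-supervised segmentation model.

```python
import numpy as np

def warp_segments(mel, boundaries, rng, min_scale=0.5, max_scale=1.5):
    """Randomly time-scale each segment of a mel-spectrogram.

    mel:        array of shape (n_mels, T)
    boundaries: increasing frame indices, starting at 0 and ending at T
                (assumed here; stands in for segmentation model output)
    Returns the warped spectrogram of shape (n_mels, T_warped).
    """
    warped = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        segment = mel[:, start:end]
        seg_len = end - start
        # Assumed warping rule: scale the segment length by a random factor.
        scale = rng.uniform(min_scale, max_scale)
        new_len = max(1, int(round(seg_len * scale)))
        # Resample every mel bin along time with linear interpolation.
        src_positions = np.linspace(0, seg_len - 1, num=new_len)
        frame_index = np.arange(seg_len)
        resampled = np.stack(
            [np.interp(src_positions, frame_index, row) for row in segment]
        )
        warped.append(resampled)
    return np.concatenate(warped, axis=1)

# Toy data: 80 mel bins, 100 frames, four hypothetical segments.
rng = np.random.default_rng(0)
mel = rng.random((80, 100))
boundaries = [0, 20, 45, 70, 100]
warped = warp_segments(mel, boundaries, rng)

# Pre-training pair under this sketch: input is `warped`, target is `mel`.
print(warped.shape[0], mel.shape[1])
```

Under these assumptions, the model would receive `warped` as input and be trained to reconstruct `mel`, so no transcripts are needed to form the training pair.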