

Session: Natural Language and Speech Processing 1


Thu 12 July 7:00 - 7:20 PDT

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

RJ Skerry-Ryan · Eric Battenberg · Ying Xiao · Yuxuan Wang · Daisy Stanton · Joel Shor · Ron Weiss · Robert Clark · Rif Saurous

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.
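As a rough illustration of the conditioning mechanism described in the abstract, the sketch below (PyTorch) shows a reference encoder that compresses a reference mel spectrogram into a single fixed-length prosody embedding and broadcasts it over the text encodings before the decoder. Layer sizes and names (ReferenceEncoder, prosody_dim, condition) are illustrative assumptions, not the authors' implementation.

# Minimal sketch, assuming mel-spectrogram references and a Tacotron-style
# text encoder; sizes are placeholders, not the paper's exact architecture.
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Compresses a reference mel spectrogram into one prosody embedding."""
    def __init__(self, n_mels=80, prosody_dim=128):
        super().__init__()
        # Small 2-D conv stack over (time, mel) followed by a GRU; the final
        # GRU state is the fixed-length prosody embedding.
        self.convs = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.gru = nn.GRU(64 * (n_mels // 4), prosody_dim, batch_first=True)

    def forward(self, ref_mel):                      # ref_mel: (B, T, n_mels)
        x = self.convs(ref_mel.unsqueeze(1))         # (B, C, T', M')
        x = x.permute(0, 2, 1, 3).flatten(2)         # (B, T', C*M')
        _, h = self.gru(x)                           # h: (1, B, prosody_dim)
        return h.squeeze(0)                          # (B, prosody_dim)

def condition(text_encodings, prosody):              # (B, L, D), (B, P)
    # Broadcast the prosody embedding over every text-encoder timestep, so the
    # same reference prosody can be paired with different text at synthesis time.
    prosody = prosody.unsqueeze(1).expand(-1, text_encodings.size(1), -1)
    return torch.cat([text_encodings, prosody], dim=-1)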

Thu 12 July 7:20 - 7:40 PDT

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

Yuxuan Wang · Daisy Stanton · Yu Zhang · RJ Skerry-Ryan · Eric Battenberg · Joel Shor · Ying Xiao · Ye Jia · Fei Ren · Rif Saurous

In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style -- independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
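The token-bank idea can be sketched in a few lines: a reference embedding attends over a small set of learned tokens, and the attention-weighted sum becomes the style embedding. This is a simplified single-head version under assumed sizes (n_tokens, token_dim); the paper's GST layer uses multi-head attention.

# Minimal sketch of a GST-style layer; names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GSTLayer(nn.Module):
    def __init__(self, ref_dim=128, n_tokens=10, token_dim=256):
        super().__init__()
        # Randomly initialised token bank, trained jointly with the TTS model
        # and without any style labels.
        self.tokens = nn.Parameter(torch.randn(n_tokens, token_dim))
        self.query = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):                    # (B, ref_dim)
        q = self.query(ref_embedding)                    # (B, token_dim)
        attn = F.softmax(q @ self.tokens.t(), dim=-1)    # (B, n_tokens)
        return attn @ torch.tanh(self.tokens)            # style: (B, token_dim)

# At inference, the style can also be set by hand-picked token weights with no
# reference audio at all, e.g. to vary speaking style independently of text:
#   weights = torch.zeros(1, 10); weights[0, 3] = 1.0
#   style = weights @ torch.tanh(gst.tokens)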

Thu 12 July 7:40 - 7:50 PDT

Fitting New Speakers Based on a Short Untranscribed Sample

Eliya Nachmani · Adam Polyak · Yaniv Taigman · Lior Wolf

Learning-based Text To Speech systems have the potential to generalize from one speaker to the next and thus require only a relatively short sample of any new voice. However, this promise is currently largely unrealized. We present a method designed to capture a new speaker from a short untranscribed audio sample. This is done by employing an additional network that, given an audio sample, places the speaker in the embedding space. This network is trained as part of the speech synthesis system using various consistency losses. Our results demonstrate greatly improved performance both on the dataset speakers and, more importantly, when fitting new voices, even from very short samples.
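The core mechanism can be illustrated with a small sketch: an auxiliary network maps a short, untranscribed audio sample to a point in the speaker-embedding space, and a consistency loss ties that prediction to the lookup embedding the synthesizer already uses for known speakers. Everything below (SpeakerPredictor, the MSE consistency term, all sizes) is an assumed simplification, not the paper's exact losses.

# Minimal sketch of audio-to-speaker-embedding fitting; names are assumptions.
import torch
import torch.nn as nn

class SpeakerPredictor(nn.Module):
    def __init__(self, n_mels=80, spk_dim=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 128, batch_first=True)
        self.proj = nn.Linear(128, spk_dim)

    def forward(self, mel):                           # (B, T, n_mels)
        _, h = self.rnn(mel)                          # h: (1, B, 128)
        return self.proj(h.squeeze(0))                # (B, spk_dim)

# Training-time consistency: the embedding predicted from audio should match
# the learned lookup embedding of the same (known) training speaker.
speaker_table = nn.Embedding(num_embeddings=100, embedding_dim=64)
predictor = SpeakerPredictor()
mel = torch.randn(8, 200, 80)
speaker_id = torch.randint(0, 100, (8,))
loss = nn.functional.mse_loss(predictor(mel), speaker_table(speaker_id))

# For a new voice, predictor(new_sample) yields a usable speaker embedding
# directly, with no transcript and no retraining of the synthesizer.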

Thu 12 July 7:50 - 8:00 PDT

Towards Binary-Valued Gates for Robust LSTM Training

Zhuohan Li · Di He · Fei Tian · Wei Chen · Tao Qin · Liwei Wang · Tie-Yan Liu

Long Short-Term Memory (LSTM) is one of the most widely used recurrent structures in sequence modeling. It aims to use gates to control information flow (e.g., whether to skip some information or not) in the recurrent computations, although its practical implementation based on soft gates only partially achieves this goal. In this paper, we propose a new way to train LSTMs that pushes the output values of the gates towards 0 or 1. By doing so, we can better control the information flow: the gates are mostly open or closed, instead of in a middle state, which makes the results more interpretable. Empirical studies show that (1) although it seems we restrict the model capacity, there is no performance drop: we achieve better or comparable performance thanks to better generalization; and (2) the outputs of the gates are not sensitive to their inputs: we can easily compress the LSTM unit in multiple ways, e.g., low-rank approximation and low-precision approximation. The compressed models are even better than the uncompressed baselines.
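One common way to realize "gates pushed towards 0 or 1" is a straight-through estimator: binarize the gate in the forward pass while letting gradients flow through the soft sigmoid. The paper itself uses a Gumbel-Softmax-style trick on the input and forget gates; the sketch below only illustrates the general hard-forward / soft-backward idea and is not the authors' method.

# Minimal sketch of a binarized gate via a straight-through estimator.
import torch

def hard_gate(pre_activation, temperature=1.0):
    soft = torch.sigmoid(pre_activation / temperature)   # soft gate in (0, 1)
    hard = (soft > 0.5).float()                          # binarized gate
    # Forward pass uses the hard value; backward pass sees the gradient of
    # the soft sigmoid (soft - soft.detach() is zero-valued but differentiable).
    return hard + (soft - soft.detach())

# Example: a forget gate that is effectively fully open or fully closed.
f = hard_gate(torch.randn(4, 256, requires_grad=True))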