Oral
Thu Jul 12th 04:20 -- 04:40 PM @ A3
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
Yuxuan Wang · Daisy Stanton · Yu Zhang · RJ Skerry-Ryan · Eric Battenberg · Joel Shor · Ying Xiao · Ye Jia · Fei Ren · Rif Saurous

In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style, independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
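
As a rough illustration of how a bank of embeddings and soft labels could fit together, the sketch below forms a style embedding as a weighted sum of token embeddings. It is not the authors' implementation: the abstract does not say how the weights are computed, so the softmax over raw scores, the bank size (10 tokens of dimension 256), the random token values, and the names `softmax` and `style_embedding` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def style_embedding(token_bank, scores):
    """Combine a bank of style tokens into one style embedding.

    token_bank: array of shape (num_tokens, dim), one row per token.
    scores: array of shape (num_tokens,), unnormalized per-token scores
            (hypothetical; in a trained system these would come from the
            model or be set by hand to control style).
    Returns (weights, embedding): the soft weights, which sum to 1, and
    the weighted sum of the token rows.
    """
    weights = softmax(np.asarray(scores, dtype=float))
    return weights, weights @ token_bank


# Hypothetical sizes and random values, standing in for trained tokens.
rng = np.random.default_rng(0)
token_bank = rng.normal(size=(10, 256))

# Emphasize token 3 by giving it a larger score than the others.
scores = np.zeros(10)
scores[3] = 5.0

weights, style = style_embedding(token_bank, scores)
```

In this sketch, `weights` plays the role of the soft labels: raising one token's score shifts the resulting embedding toward that token without touching the text input.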