Poster in Workshop: Challenges in Deployable Generative AI
E3-VITS: Emotional End-to-End TTS with Cross-speaker Style Transfer
Wonbin Jung · Junhyeok Lee
Keywords: [ text-to-speech ] [ emotional text-to-speech ] [ end-to-end text-to-speech ] [ speech synthesis ]
Previous emotional TTS models rely on a two-stage pipeline or on additional labels, which makes their training process complex and their labeling cost high. This paper presents E3-VITS, an end-to-end emotional TTS model that addresses these limitations. E3-VITS synthesizes high-quality speech in multi-speaker conditions, supports emotional speech synthesis from either a reference speech sample or a textual description, and enables cross-speaker emotion transfer with a disjoint dataset. To implement E3-VITS, we propose batch-permuted style perturbation, which generates audio samples with unpaired emotion to improve the quality of cross-speaker emotion transfer. Results show that E3-VITS outperforms the baseline model in naturalness, speaker similarity, emotion similarity, and inference speed.
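
The abstract does not spell out how batch-permuted style perturbation works, so the following is only a minimal sketch of one plausible reading: the style (emotion) vectors in a training batch are shuffled so that each speaker is paired with the style of a different utterance. The function name `permute_styles`, the dictionary fields, and the two-dimensional style vectors are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import random


def permute_styles(style_embeddings, rng=None):
    """Shuffle style vectors across a batch (illustrative assumption only).

    Returns (permuted, perm), where permuted[i] == style_embeddings[perm[i]].
    """
    rng = rng or random.Random()
    perm = list(range(len(style_embeddings)))
    rng.shuffle(perm)
    permuted = [style_embeddings[j] for j in perm]
    return permuted, perm


# Hypothetical batch: each item carries a speaker id and a toy style vector.
batch = [
    {"speaker": "spk_a", "style": [0.9, 0.1]},
    {"speaker": "spk_b", "style": [0.2, 0.8]},
    {"speaker": "spk_c", "style": [0.5, 0.5]},
    {"speaker": "spk_d", "style": [0.0, 1.0]},
]

styles = [item["style"] for item in batch]
permuted, perm = permute_styles(styles, random.Random(0))

# Pair each speaker with a style drawn from another position in the batch.
unpaired = [
    {
        "speaker": item["speaker"],
        "style": style,
        "style_from": batch[src]["speaker"],
    }
    for item, style, src in zip(batch, permuted, perm)
]

for row in unpaired:
    print(row)
```

A plain shuffle can leave some items paired with their own style; whether the actual method excludes such fixed points, and at which stage of the model the permuted styles are applied, is not stated in the abstract.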