Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech
Abstract
Integrating large language models (LLMs) into text-to-speech (TTS) systems has improved speech expressiveness, yet controllable emotional expression remains challenging. Existing approaches primarily rely on external conditioning or global activation steering, offering limited insight into how emotional variation is represented internally and restricting fine-grained control. In this work, we analyze emotion-related variation in the speech-semantic hidden states of LLM-based TTS models. To this end, we leverage sparse autoencoders (SAEs) to map these representations to sparse latent features and examine their emotion-related activation patterns. Our evaluations indicate that emotional variation is distributed across multiple sparse latent features, revealing a more interpretable internal representation. Building on this observation, we introduce a feature-level intervention framework that enables targeted and bidirectional emotion induction and suppression without modifying backbone parameters. We further show that distinct latent features correlate with specific acoustic attributes (e.g., pitch), suggesting that emotional expression arises from coordinated latent contributions rather than a single global shift.
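To make the SAE mapping concrete, the following Python sketch shows one common formulation: an overcomplete ReLU encoder and a linear decoder, trained with a reconstruction loss plus an L1 sparsity penalty. All names, dimensions, and the penalty coefficient are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: maps a dense hidden state h to a sparse
    latent code z and reconstructs h from z. The latent dimension is
    typically much larger than the model dimension (overcomplete)."""

    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        # ReLU zeroes most latents, so only a few features fire per token.
        return F.relu(self.encoder(h))

    def forward(self, h: torch.Tensor):
        z = self.encode(h)
        h_hat = self.decoder(z)
        return h_hat, z


def sae_loss(h, h_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparsity;
    # the coefficient here is an arbitrary placeholder.
    return F.mse_loss(h_hat, h) + l1_coeff * z.abs().mean()
```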
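The feature-level intervention can likewise be sketched as an inference-time edit of the SAE latents. The hypothetical helper below (reusing the SparseAutoencoder from the previous sketch; the function name, alpha, and feature_ids are illustrative) scales a chosen set of emotion-related latent features up or down and decodes the result, leaving backbone parameters untouched.

```python
import torch


def steer_hidden_state(sae, h, feature_ids, alpha):
    """Bidirectional feature-level intervention (illustrative sketch).

    alpha > 0 amplifies the selected latents (emotion induction);
    alpha < 0 attenuates them (suppression). Only the hidden state h
    is edited at inference time; no backbone weights change."""
    h_hat, z = sae(h)
    error = h - h_hat  # preserve what the SAE fails to reconstruct
    z_edit = z.clone()
    # Latents are non-negative after ReLU; clamp guards alpha < -1.
    z_edit[..., feature_ids] = torch.clamp(
        z[..., feature_ids] * (1.0 + alpha), min=0.0
    )
    return sae.decoder(z_edit) + error
```

In practice such an edit would be applied via a forward hook on the relevant transformer layer; keeping the reconstruction-error term means the intervention changes only the targeted features rather than the whole hidden state.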