Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models
Abstract
Spoken Language Models (SLMs) have revolutionized speech synthesis by bypassing traditional linguistic front-ends, yet they remain constrained by digital resource disparities across languages. We investigate these challenges within the Southeast Asian linguistic landscape, using phonetically complex Thai and data-scarce Lao as representative cases for low-resource SLMs. Scaling experiments reveal that reliance on synthetic data triggers a Stability-Expressivity Gap, characterized by a non-monotonic degradation we term Synthetic Erosion. To bridge this gap, we propose two self-alignment frameworks. Disentanglement-Guided Self-Alignment (DGSA) recovers expressivity for complex languages by exploiting prosody-timbre separation. For regimes where authentic references are exceptionally limited, Temperature-Driven Self-Critique (TDSC) stabilizes generation through automated exploration and filtering. Our methods achieve state-of-the-art results, including the first zero-shot voice-cloning capability for Lao, establishing a scalable pathway to high-fidelity synthesis across the global linguistic long-tail. Audio samples are available at: \url{https://anonymous.4open.science/api/repo/multilantts-demo-EEF6/file/index.html?v=2de23271}.