PADS-TAL: Padding-Annealed Diffusion Sampling in Text-Aware Latent Space for Robust and Diverse Text-to-Music Generation
Abstract
Text-to-Music diffusion models are increasingly used in real-world applications, yet deployment remains challenging: generations can collapse to limited patterns even with diverse initial noise and prompts, and inference-time diversity control often harms text alignment and fidelity by distorting key prompt cues established in early denoising. To address this, we propose Padding-Annealed Diffusion Sampling, which perturbs only a padding-indexed subspace while keeping non-padding conditioning fixed, enabling controlled exploration with reduced semantic drift. However, in a text-unaware VAE latent space, such exploration is less likely to stay within genre-faithful neighborhoods, limiting genre-consistent diversity. We therefore introduce Text-Aware Latent space that aligns local neighborhoods with text-implied genre structure, promoting genre-consistent diversity. Together, the two techniques form a unified pipeline that, compared to prior methods that perturb the full conditioning, achieves a better text alignment--diversity trade-off: at comparable text alignment, it delivers 15.4\% higher diversity with a relatively small fidelity drop, and further improves within-genre diversity by 71.6\%. Generated samples are available at https://pads-tal.github.io/PADS-TAL.io