STAR-VAE: Structured Topology-Aware Regularization for Audio Reconstruction and Generation
Abstract
Continuous Variational Autoencoders (VAEs) serve as the standard continuous tokenizer for modern neural audio generation systems, enabling high-fidelity reconstruction while providing a compact, smooth latent space for downstream generative priors. However, continuous VAEs face a fundamental conflict among \textit{compression rate}, \textit{reconstruction fidelity}, and \textit{latent space topology}, a challenge we formalize as the \textbf{Rate-Distortion-Regularity Trilemma}. This trilemma stems from a critical \textit{topological mismatch}: the isotropic Gaussian prior of standard VAEs imposes a \textit{flat} latent geometry that fails to accommodate audio's \textit{hierarchical} nature, in which low-frequency components are structured and compressible while high-frequency components are stochastic and incompressible. The result is \textit{disordered information packing}, where crucial semantic features are randomly interleaved with high-entropy noise. To resolve this conflict, we propose \textbf{Structured Topology-Aware Regularization (STAR)}, a general training strategy that reshapes the latent geometry by imposing a growth-based constraint field, routing structural and textural information into channel subspaces with matching capacities. STAR applies to any VAE architecture and resolves the trilemma even in plain CNN-based VAEs. To fully exploit its potential, we present \textbf{STAR-VAE}, which combines STAR with a hybrid CNN-Mamba architecture that pairs local feature extraction with linear-complexity global context modeling. We further propose \textbf{STAR-Gen}, an LLM-based Flow Matching framework that leverages STAR-VAE's structured latent space for high-fidelity generation free of vector quantization artifacts.
Empirical results demonstrate that STAR-VAE resolves the trilemma, achieving state-of-the-art reconstruction fidelity and improved semantic preservation across diverse audio domains. Its structured latent space benefits both conventional diffusion models and our \textbf{STAR-Gen} paradigm, which achieves state-of-the-art performance in text-to-audio generation. The project page is available at~\url{https://STAR-VAE.github.io}.