Scaling Transformers for End-to-End Discrete Audio Tokenization
Yitian Gong ⋅ Kuangwei Chen ⋅ Zhaoye Fei ⋅ Xiaogui Yang ⋅ Ke Chen ⋅ Yang Wang ⋅ Kexin Huang ⋅ Mingshu Chen ⋅ Ruixiao Li ⋅ Qinyuan Cheng ⋅ Shimin Li ⋅ Xipeng Qiu
Abstract
Discrete audio tokenizers are fundamental to equipping large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures; these designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should instead be learned fully end-to-end with a homogeneous, scalable architecture. Based on this perspective, we propose $\textbf{TAC}$, an audio tokenizer built entirely from homogeneous, causal Transformer blocks, whose encoder, quantizer, and decoder are jointly optimized from scratch for high-fidelity reconstruction of general audio. Across speech, sound, and music, this simple design scales gracefully: TAC consistently outperforms prior codecs over a wide range of bitrates while improving predictably with model scale. Notably, leveraging TAC’s discrete tokens, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems; TAC further enables competitive ASR performance without auxiliary encoders. Our findings position TAC as a unified, scalable interface for the next generation of native audio foundation models.
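To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of an end-to-end Transformer tokenizer: a causal Transformer encoder, a nearest-neighbour vector quantizer with a straight-through estimator, and a causal Transformer decoder, trained jointly against a reconstruction loss. All module names, dimensions, and the specific quantizer are illustrative assumptions, not the paper's actual implementation; the sketch omits TAC's input frontend, quantizer design, and training losses beyond simple reconstruction.

```python
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator.
    (Illustrative; codebook/commitment losses are omitted for brevity.)"""

    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, time, dim) -- pick the closest code per frame.
        codes = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        dists = torch.cdist(z, codes)          # (B, T, K)
        ids = dists.argmin(dim=-1)             # (B, T) discrete tokens
        z_q = self.codebook(ids)               # (B, T, dim)
        # Straight-through: the encoder receives gradients as if
        # quantization were the identity map.
        return z + (z_q - z).detach(), ids


def causal_blocks(dim: int, heads: int, depth: int) -> nn.TransformerEncoder:
    """A stack of homogeneous Transformer blocks (causality via attn mask)."""
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)


class TransformerTokenizer(nn.Module):
    """Encoder -> quantizer -> decoder, all learned jointly from scratch."""

    def __init__(self, dim=256, heads=4, depth=4, codebook_size=1024):
        super().__init__()
        self.encoder = causal_blocks(dim, heads, depth)
        self.quantizer = VectorQuantizer(codebook_size, dim)
        self.decoder = causal_blocks(dim, heads, depth)

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, dim) framed audio features; the real model
        # would include a waveform frontend, assumed away here.
        mask = nn.Transformer.generate_square_subsequent_mask(frames.size(1))
        z = self.encoder(frames, mask=mask)
        z_q, ids = self.quantizer(z)
        return self.decoder(z_q, mask=mask), ids


# A reconstruction loss drives the whole stack end-to-end.
model = TransformerTokenizer()
x = torch.randn(2, 100, 256)
recon, tokens = model(x)
loss = nn.functional.mse_loss(recon, x)
loss.backward()
```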