ICML Poster UniAudio: Towards Universal Audio Generation with Large Language Models

Dongchao Yang · Jinchuan Tian · Xu Tan · Rongjie Huang · Songxiang Liu · Haohan Guo · Xuankai Chang · Jiatong Shi · Sheng Zhao · Jiang Bian · Zhou Zhao · Xixin Wu · Helen M Meng

2024 Poster

Abstract:

Audio generation is a major branch of generative AI research. Unlike prior work in this area, which is typically task-specific and relies on heavy domain knowledge, this paper advocates building universal audio generation models that handle various tasks in a unified manner. Since recent research on large language models (LLMs) has demonstrated their strong ability to handle multiple tasks, this work presents UniAudio, an LLM-based audio generation model that supports a wide range of audio generation tasks. Given various input conditions, such as phonemes, text descriptions, or audio itself, UniAudio can generate speech, sound, music, and singing voice. UniAudio is built with 100k hours of multi-source, openly available audio data and is scaled to 1B parameters. The audio tokenization method and language model architecture are also specifically designed for both performance and efficiency. Experimentally, UniAudio supports 11 audio generation tasks and achieves competitive results on all of them. We also show that UniAudio can support new tasks seamlessly via simple fine-tuning.
