
Workshop: ES-FoMo: Efficient Systems for Foundation Models

Audio-Journey: Efficient Visual+LLM-aided Audio Encodec Diffusion

Juncheng Li · Jackson Michaels · Laura Yao · Lijun Yu · Zach Wood-Doughty · Florian Metze


Despite recent progress, machine learning for the audio domain is limited by the availability of high-quality data. Visual information already present in a video should complement the information in its audio. In this paper, we leverage state-of-the-art (SOTA) Large Language Models (LLMs) to augment the existing weak labels of an audio dataset into enriched captions; we adopt a SOTA video-captioning model to automatically generate video captions, and we again use LLMs to merge the audio and visual captions into a rich, large-scale dataset. Using this dataset, we train a latent diffusion model on Encodec embeddings. Furthermore, we leverage the trained diffusion model to generate even more audio data in the same format. In our experiments, we first verify that our audio+visual captions are of high quality against baselines and ground truth (a 12.5% gain in semantic score over baselines). Moreover, we demonstrate that we can train a classifier from scratch using the diffusion-generated data, or use the diffusion model to enhance classification models on the AudioSet test set, working in conjunction with mixup or other augmentation methods for notable performance gains. Our approach exemplifies a promising method for augmenting low-resource audio datasets. The samples, models, and implementation will be available at \url{}.
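The mixup augmentation mentioned in the abstract blends pairs of training examples, both features and labels, with a Beta-distributed mixing weight. A minimal sketch (the function name, list-based representation, and the default alpha are illustrative assumptions, not details from the paper):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two (feature, label) pairs with a Beta(alpha, alpha)-sampled weight.

    x1, x2 are feature vectors; y1, y2 are one-hot (or soft) label vectors.
    Representation as plain Python lists is an illustrative simplification.
    """
    lam = random.betavariate(alpha, alpha)  # mixing coefficient in (0, 1)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```

In the setting described above, such mixing could be applied between real and diffusion-generated examples, so the classifier sees interpolations of both data sources.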
