

Poster

From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation

Kun Su · Xiulong Liu · Eli Shlizerman


Abstract:

Videos encompass both visual and auditory data, creating perceptually rich experiences in which the two modalities complement each other. They are thus a valuable type of media for training models that investigate the interplay between audio and visual elements. Previous studies of audio-visual modalities have primarily focused on either audio-visual representation learning or generative modeling of one modality conditioned on the other, leaving a disconnect between these two branches. A unified framework that both learns representations and generates modalities has not yet been developed. In this work, we introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation. Rather than working with raw video frames and audio data, we perform representation learning and generative modeling within latent spaces. We use a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively, and then pre-train the model on visual-conditioned masked audio token prediction. This training strategy enables the model to engage in contextual learning while simultaneously acquiring video-to-audio generation capability. After the pre-training phase, we can employ an iterative-decoding approach to rapidly generate audio tokens conditioned on visual features. Since VAB is a unified model, its backbone can be fine-tuned for various audio-visual downstream tasks. Our experiments showcase the efficiency of VAB in producing high-quality audio from video and its capability to acquire semantic audio-visual features, leading to competitive results in audio-visual retrieval and classification.
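
The sketch below illustrates, in PyTorch, the two pieces the abstract describes: visual-conditioned masked audio token prediction for pre-training, and an iterative (confidence-based, MaskGIT-style) decoding loop for video-to-audio generation. The codebook size, sequence length, feature dimensions, network, masking schedule, and helper names are illustrative assumptions, and random tensors stand in for the outputs of a real pre-trained audio tokenizer and image encoder; this is a minimal sketch under those assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions): discrete audio tokens and frame-level visual features
# are simulated with random tensors in place of a pre-trained audio tokenizer and a
# frozen image encoder. Model size, masking schedule, and decoding loop are illustrative.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1024          # audio codebook size (assumed)
MASK_ID = VOCAB       # extra id used as the [MASK] token
SEQ_LEN = 256         # length of the audio token sequence (assumed)
VIS_DIM = 512         # visual feature dimension (assumed)
D_MODEL = 512

class MaskedAudioPredictor(nn.Module):
    """Transformer that predicts masked audio tokens conditioned on visual features."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB + 1, D_MODEL)        # +1 for [MASK]
        self.pos_emb = nn.Parameter(torch.zeros(1, SEQ_LEN, D_MODEL))
        self.vis_proj = nn.Linear(VIS_DIM, D_MODEL)             # project visual features
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, audio_tokens, vis_feats):
        # Prepend projected visual features so audio positions can attend to them.
        x = self.tok_emb(audio_tokens) + self.pos_emb
        v = self.vis_proj(vis_feats)                             # (B, T_v, D)
        h = self.encoder(torch.cat([v, x], dim=1))
        return self.head(h[:, v.size(1):])                      # logits at audio positions

def training_step(model, audio_tokens, vis_feats, optimizer):
    """Visual-conditioned masked audio token prediction (one pre-training step)."""
    B, T = audio_tokens.shape
    mask_ratio = 0.15 + 0.85 * torch.rand(B, 1)                  # random ratio per sample
    mask = torch.rand(B, T) < mask_ratio
    inputs = audio_tokens.masked_fill(mask, MASK_ID)
    logits = model(inputs, vis_feats)
    loss = F.cross_entropy(logits[mask], audio_tokens[mask])     # loss only on masked slots
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def iterative_decode(model, vis_feats, steps=8):
    """Start fully masked; each step keeps the most confident predictions."""
    B = vis_feats.size(0)
    tokens = torch.full((B, SEQ_LEN), MASK_ID, dtype=torch.long)
    unknown = torch.ones(B, SEQ_LEN, dtype=torch.bool)
    for step in range(steps):
        conf, pred = model(tokens, vis_feats).softmax(-1).max(-1)
        conf = conf.masked_fill(~unknown, float("inf"))          # keep already-fixed tokens
        # Cosine schedule: number of positions still masked after this step.
        keep_masked = int(SEQ_LEN * math.cos(math.pi / 2 * (step + 1) / steps))
        thresh = conf.sort(dim=-1).values[:, keep_masked].unsqueeze(-1)
        accept = unknown & (conf >= thresh)
        tokens = torch.where(accept, pred, tokens)
        unknown = unknown & ~accept
    return tokens   # would be converted to a waveform by the audio tokenizer's decoder

# Toy usage with random stand-ins for tokenizer / image-encoder outputs.
model = MaskedAudioPredictor()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
audio = torch.randint(0, VOCAB, (2, SEQ_LEN))
video = torch.randn(2, 16, VIS_DIM)                              # 16 frame-level features
print("loss:", training_step(model, audio, video, opt))
print("generated tokens:", iterative_decode(model, video).shape)
```

Because the same masked-prediction backbone produces both contextual audio-visual features (for retrieval and classification after fine-tuning) and audio tokens (via the decoding loop), one model serves representation learning and generation, which is the unification the abstract argues for.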
