AudioMosaic: Contrastive Masked Audio Representation Learning
Abstract
Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data and has achieved remarkable progress in recent years. While most prior work relies on generative reconstruction objectives, contrastive approaches remain relatively underexplored, in part because effective augmentation strategies are computationally expensive and contrastive pre-training typically requires large batch sizes. In this work, we introduce AudioMosaic, an audio encoder for general audio understanding. During pre-training, AudioMosaic applies time–frequency masking to spectrogram patches to form paired views, a simple and efficient augmentation strategy that substantially reduces augmentation cost and thereby supports large-batch training. The AudioMosaic encoder learns discriminative utterance-level representations that transfer well across datasets, domains, and acoustic conditions. Extensive experiments demonstrate that AudioMosaic achieves state-of-the-art performance on multiple standard benchmarks. Moreover, we show that the pretrained AudioMosaic encoder enhances audio perception when integrated with large language models (LLMs).
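The abstract does not specify the masking scheme, ratios, loss, or encoder architecture; the following is a minimal PyTorch sketch of how patch-level time–frequency masking could produce two views of the same spectrogram for a contrastive (InfoNCE) objective. The patch sizes, mask ratio, temperature, and the placeholder encoder are all illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def mask_patches(spec, patch_f=16, patch_t=16, mask_ratio=0.5):
    """Randomly zero a subset of (patch_f x patch_t) spectrogram patches.
    spec: (batch, n_mels, n_frames); dimensions are assumed divisible by
    the patch sizes. All hyperparameters here are illustrative assumptions."""
    b, n_mels, n_frames = spec.shape
    gf, gt = n_mels // patch_f, n_frames // patch_t
    keep = torch.rand(b, gf, gt, device=spec.device) > mask_ratio  # True = keep patch
    mask = keep.repeat_interleave(patch_f, 1).repeat_interleave(patch_t, 2)
    return spec * mask.to(spec.dtype)

def info_nce(z1, z2, temperature=0.07):
    """Symmetric InfoNCE between the utterance-level embeddings of two views."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Two independently masked views of the same spectrogram form a positive pair;
# the other items in the batch serve as negatives.
spec = torch.randn(8, 64, 256)  # (batch, mel bins, frames)
encoder = torch.nn.Sequential(  # placeholder standing in for the AudioMosaic encoder
    torch.nn.Flatten(), torch.nn.Linear(64 * 256, 128))
loss = info_nce(encoder(mask_patches(spec)), encoder(mask_patches(spec)))
```

Because masking only zeroes existing spectrogram values, both views can be generated on the fly with negligible overhead, which is consistent with the abstract's claim that the augmentation keeps cost low enough for large-batch training.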