Poster Wed, Jul 8, 2026 • 6:30 PM – 8:15 PM PDT HALL A #707

SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

Xiaoyu Yang ⋅ Yifan Yang ⋅ Zengrui Jin ⋅ Ziyun Cui ⋅ Wen Wu ⋅ Baoxiang Li ⋅ Chao Zhang ⋅ Phil Woodland

Abstract

Self-supervised learning (SSL) has significantly advanced acoustic representation learning. However, most existing models are optimised for either speech or audio event understanding, resulting in a persistent gap between these two domains. We address this gap with SPEAR (SPEech and Audio Representations), a self-supervised framework that distils complementary knowledge from a speech-focused SSL teacher and a general-audio SSL teacher into a single unified model. SPEAR applies multi-codebook vector quantisation to continuous teacher representations to produce fine-grained discrete tokens that capture both semantic and acoustic information. To effectively integrate these heterogeneous representations, SPEAR jointly predicts them given a masked input with an asymmetric pre-training loss. We further improve robustness in complex sound scenes through a novel token mixing mechanism. Extensive experiments demonstrate that SPEAR consistently outperforms existing unified speech and audio models. SPEAR establishes a new state-of-the-art on the SUPERB benchmark, surpassing WavLM Large on 12 of 15 tasks, while achieving competitive performance on the HEAR benchmark. These results position SPEAR as a versatile foundation for general-purpose speech and audio representation learning. The code and pre-trained models will be released.
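
As a rough illustration of what turning continuous teacher representations into several discrete tokens per frame can look like, the sketch below uses a simple product-quantisation-style stand-in: each frame is split into sub-vectors, and each sub-vector is assigned to its nearest centroid in its own codebook. This is not necessarily the quantiser SPEAR uses. The function name `multi_codebook_quantise`, the array shapes, and the random codebooks are all assumptions made for illustration; in practice the codebooks would be learned from teacher outputs.

```python
import numpy as np


def multi_codebook_quantise(features, codebooks):
    """Map each frame to one token index per codebook (illustrative only).

    features:  (T, D) continuous teacher representations.
    codebooks: (N, K, D // N) array holding N codebooks of K centroids each.
    Returns an integer array of shape (T, N).
    """
    num_books, _, sub_dim = codebooks.shape
    num_frames = features.shape[0]
    # Split each D-dimensional frame into N contiguous sub-vectors.
    chunks = features.reshape(num_frames, num_books, sub_dim)
    # Squared distance from every sub-vector to every centroid: (T, N, K).
    dists = ((chunks[:, :, None, :] - codebooks[None]) ** 2).sum(axis=-1)
    # Nearest centroid index per frame and codebook.
    return dists.argmin(axis=-1)


# Hypothetical sizes: 50 frames, 16-dim features, 4 codebooks of 8 entries.
rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 16))
books = rng.normal(size=(4, 8, 4))
tokens = multi_codebook_quantise(feats, books)
```

Under these assumptions, `tokens[t]` holds four indices for frame `t`, one per codebook. In a framework like the one described, such per-frame token tuples from each teacher would serve as prediction targets for the student at masked positions.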

Lay Summary

When humans listen to the world, we effortlessly process spoken words and background sound events at the same time. However, existing AI technologies have traditionally separated these tasks, using distinct models for human speech and for general audio. Combining them into a single system is challenging because the two tasks require the computer to focus on entirely different acoustic details. To bridge this gap, we propose SPEAR. Instead of starting from scratch, SPEAR learns by observing two specialised 'teacher' AI models: one an expert in speech, the other in general audio. We translate complex sound waves into a highly detailed, shared digital alphabet that both domains can understand. Furthermore, we train the system on artificially mixed, overlapping sounds, teaching the AI to untangle complex sound scenes, such as speech with background noise or overlapping speech. SPEAR successfully bridges the gap between speech and general audio, outperforming existing unified models while achieving performance comparable to that of single-domain expert models. This work provides a versatile foundation for future AI systems, allowing them to understand real-world acoustic environments holistically, just as we do.