Fri Jul 17 12:05 AM -- 10:00 AM (PDT)
Self-supervision in Audio and Speech
Mirco Ravanelli · Dmitriy Serdyuk · R Devon Hjelm · Bhuvana Ramabhadran · Titouan Parcollet

Workshop Home Page

Even though supervised learning on large annotated corpora remains the dominant approach in machine learning, self-supervised learning is gaining considerable popularity. Applying self-supervised learning to audio and speech sequences, however, remains particularly challenging. Speech signals are not only high-dimensional, long, variable-length sequences; they also carry a complex hierarchical structure (e.g., phonemes, syllables, words) that is difficult to infer without supervision. Moreover, speech exhibits substantial variability due to differences in speaker identity, accent, recording conditions, and noise, all of which further increase the complexity of the task.

We believe that self-supervised learning will play a crucial role in the future of artificial intelligence, and that considerable research effort is needed to take full advantage of it in audio and speech applications. With this initiative, we wish to foster progress in the field and to encourage a discussion among experts and practitioners from both academia and industry who may bring different points of view on the topic. Furthermore, we plan to extend the debate across disciplines, encouraging discussion of how insights from other fields (e.g., computer vision and robotics) can be applied to speech, and how findings on speech can transfer to other sequence processing tasks. The workshop is conceived to promote communication and the exchange of ideas between the machine learning and speech communities. Through a series of invited talks, contributed presentations, poster sessions, and a panel discussion, we aim to foster a scientific discussion at a level of detail not possible during the main ICML conference.
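Several of the talks below build on contrastive prediction (e.g., Contrastive Predictive Coding and wav2vec). As a rough illustration of the idea, and not a description of any presenter's system, the sketch below trains nothing: it just evaluates an InfoNCE-style objective in which a representation at time t must score the true next frame higher than random distractors. All names, shapes, and the toy linear "encoder" are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frames, W):
    """Toy linear 'encoder': project raw audio frames to latent vectors."""
    return frames @ W

def info_nce_loss(context, positives, negatives):
    """InfoNCE: each context vector should score its true future frame
    higher than K distractor frames, via a softmax over dot products."""
    pos = np.sum(context * positives, axis=-1)            # (T,) positive scores
    neg = context @ negatives.T                           # (T, K) distractor scores
    logits = np.concatenate([pos[:, None], neg], axis=1)  # positive in column 0
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[:, 0].mean()                        # cross-entropy on column 0

# Fake "audio": 50 frames of 64 raw samples each (placeholder data).
frames = rng.standard_normal((50, 64))
W = rng.standard_normal((64, 32)) * 0.1
z = encode(frames, W)

context = z[:-1]                            # representation at time t
future = z[1:]                              # true frame at t+1 (positive)
negatives = rng.standard_normal((10, 32))   # K = 10 random distractors

loss = info_nce_loss(context, future, negatives)
print(float(loss))
```

In a real system the encoder is a learned network, negatives are drawn from other positions or utterances, and the loss is minimized by gradient descent; this sketch only shows the shape of the objective.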

Opening Remarks (Introduction)
Invited Talk: Representation learning on sequential data with latent priors (Talk)
Invited Talk: Contrastive Predictive Coding for audio representation learning (Talk)
Q&A Invited Talks (Q&A)
Adversarial representation learning for private speech generation (Contributed Talk)
Investigating Self-supervised Pre-training for End-to-end Speech Translation (Contributed Talk)
Analysis of Predictive Coding Models for Phonemic Representation Learning in Small Datasets (Contributed Talk)
COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations (Contributed Talk)
Understanding Self-Attention of Self-Supervised Audio Transformers (Contributed Talk)
Q&A Contributed Talks (Q&A)
Invited Talk: Denoising and real-vs-corrupted classification as two fundamental paradigms in self-supervised learning (Talk)
Invited Talk: Unsupervised pre-training of bidirectional speech encoders via masked reconstruction (Talk)
Q&A Invited Talks (Q&A)
Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision (Contributed Talk)
OtoWorld: Towards Learning to Separate by Learning to Move (Contributed Talk)
Language Agnostic Speech Embeddings for Emotion Classification (Contributed Talk)
End-to-End ASR: from Supervised to Semi-Supervised Learning with Modern Architectures (Contributed Talk)
Using Self-Supervised Learning of Birdsong for Downstream Industrial Audio Classification (Contributed Talk)
Q&A Contributed Talks (Q&A)
Invited Talk: Self-Supervised Video Models from Sound and Speech, Lorenzo Torresani (Talk)
Invited Talk: Sights and sounds in 3D spaces (Talk)
Invited Talk: Self-supervised learning of speech representations with wav2vec (Talk)
Q&A Invited Talks (Q&A)
Self-supervised Pitch Detection by Inverse Audio Synthesis (Contributed Talk)
Unsupervised Speech Separation Using Mixtures of Mixtures (Contributed Talk)
Bootstrapping Unsupervised Deep Music Separation from Primitive Auditory Grouping Principles (Contributed Talk)
Q&A Contributed Talks and Closing Remarks (Q&A)