Fri Jul 17 12:05 AM -- 10:00 AM (PDT)
Self-supervision in Audio and Speech
Mirco Ravanelli · Dmitriy Serdyuk · R Devon Hjelm · Bhuvana Ramabhadran · Titouan Parcollet

Although supervised learning on large annotated corpora is still the dominant approach in machine learning, self-supervised learning is gaining considerable popularity. Applying self-supervised learning to audio and speech sequences, however, remains particularly challenging. Speech signals are not only high-dimensional, long, and variable-length sequences, but also have a complex hierarchical structure (e.g., phonemes, syllables, words) that is difficult to infer without supervision. Moreover, speech exhibits substantial variability due to different speaker identities, accents, recording conditions, and noise, all of which greatly increase the complexity of the problem.

We believe that self-supervised learning will play a crucial role in the future of artificial intelligence, and we think that substantial research effort is needed to take advantage of it efficiently in audio and speech applications. With this initiative, we wish to foster more progress in the field, and we hope to encourage discussion among experts and practitioners from both academia and industry who might bring different points of view on this topic. Furthermore, we plan to extend the debate to multiple disciplines, encouraging discussions on how insights from other fields (e.g., computer vision and robotics) can be applied to speech, and how findings on speech can be used in other sequence processing tasks. The workshop is designed to promote communication and the exchange of ideas between the machine learning and speech communities. Through a series of invited talks, contributed presentations, poster sessions, and a panel discussion, we want to foster a fruitful scientific discussion at a level of detail that is not possible during the main ICML conference.

Opening Remarks (Introduction)
Mirco Ravanelli
Invited Talk: Representation learning on sequential data with latent priors (Talk)
Jan Chorowski
Invited Talk: Contrastive Predictive Coding for audio representation learning (Talk)
Aäron van den Oord
Q&A Invited Talks (Q&A)
Adversarial representation learning for private speech generation (Contributed Talk)
David Ericsson
Investigating Self-supervised Pre-training for End-to-end Speech Translation (Contributed Talk)
Ha Nguyen
Analysis of Predictive Coding Models for Phonemic Representation Learning in Small Datasets (Contributed Talk)
María Andrea Cruz Blandón
COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations (Contributed Talk)
Xavier Favory
Understanding Self-Attention of Self-Supervised Audio Transformers (Contributed Talk)
Shu-wen Yang
Q&A Contributed Talks (Q&A)
Invited Talk: Denoising and real-vs-corrupted classification as two fundamental paradigms in self-supervised learning (Talk)
Aapo Hyvärinen
Invited Talk: Unsupervised pre-training of bidirectional speech encoders via masked reconstruction (Talk)
Karen Livescu
Q&A Invited Talks (Q&A)
Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision (Contributed Talk)
Abhinav Shukla
OtoWorld: Towards Learning to Separate by Learning to Move (Contributed Talk)
Omkar Ranadive
Language Agnostic Speech Embeddings for Emotion Classification (Contributed Talk)
Apoorv Nandan
End-to-End ASR: from Supervised to Semi-Supervised Learning with Modern Architectures (Contributed Talk)
Jacob Kahn
Using Self-Supervised Learning of Birdsong for Downstream Industrial Audio Classification (Contributed Talk)
Patty Ryan
Q&A Contributed Talks (Q&A)
Invited Talk: Self-Supervised Video Models from Sound and Speech (Talk)
Lorenzo Torresani
Invited Talk: Sights and sounds in 3D spaces (Talk)
Kristen Grauman
Invited Talk: Self-supervised learning of speech representations with wav2vec (Talk)
Alexei Baevski
Q&A Invited Talks (Q&A)
Self-supervised Pitch Detection by Inverse Audio Synthesis (Contributed Talk)
Jesse Engel
Unsupervised Speech Separation Using Mixtures of Mixtures (Contributed Talk)
Scott Wisdom
Bootstrapping Unsupervised Deep Music Separation from Primitive Auditory Grouping Principles (Contributed Talk)
Prem Seetharaman
Q&A Contributed Talks and Closing Remarks (Q&A)