Self-supervision in Audio and Speech
Mirco Ravanelli · Dmitriy Serdyuk · R Devon Hjelm · Bhuvana Ramabhadran · Titouan Parcollet

Fri Jul 17 12:05 AM -- 10:00 AM (PDT)
Event URL: https://icml-sas.gitlab.io

Even though supervised learning using large annotated corpora is still the dominant approach in machine learning, self-supervised learning is gaining considerable popularity. Applying self-supervised learning to audio and speech sequences, however, remains particularly challenging. Speech signals are not only high-dimensional, long, and variable-length sequences, but also entail a complex hierarchical structure that is difficult to infer without supervision (e.g., phonemes, syllables, words). Moreover, speech is characterized by substantial variability due to different speaker identities, accents, recording conditions, and noise, all of which greatly increase the level of complexity.

We believe that self-supervised learning will play a crucial role in the future of artificial intelligence, and we think that great research effort is needed to efficiently take advantage of it in audio and speech applications. With our initiative, we wish to foster more progress in the field, and we hope to encourage a discussion amongst experts and practitioners from both academia and industry that might bring different points of view on this topic. Furthermore, we plan to extend the debate to multiple disciplines, encouraging discussions on how insights from other fields (e.g., computer vision and robotics) can be applied to speech, and how findings on speech can be used on other sequence processing tasks. The workshop is conceived to promote communication and exchange of ideas between the machine learning and speech communities. Through a series of invited talks, contributed presentations, poster sessions, and a panel discussion, we want to foster a fruitful scientific discussion that is not possible at that level of detail during the main ICML conference.

Fri 12:05 a.m. - 12:15 a.m.

Introduction to the workshop.

Link to the video: https://slideslive.com/38930727/opening-remarks-selfsupervision-in-audio-and-speech

Mirco Ravanelli
Fri 12:15 a.m. - 12:40 a.m.

Unsupervised learning of data representations is still an open problem of machine learning. However, the data often has a latent structure which can be exploited to improve learned representations. We will consider two domains having a rich latent structure: speech and handwriting. Both can be interpreted as time signals that encode a natural language message. We show how matching certain properties of the implied latent representation, such as using discrete latent units, explicit modeling of duration, or learned latent dynamics can improve representations obtained using deep neural autoencoders.

Link to the video: https://slideslive.com/38930728/representation-learning-on-sequential-data-with-latent-priors

Jan Chorowski
Fri 12:40 a.m. - 1:05 a.m.

Link to the video: https://slideslive.com/38930729/contrastive-learning-in-audio

Aäron van den Oord
Fri 1:05 a.m. - 1:30 a.m.
Q&A Invited Talks (Q&A)
Fri 1:30 a.m. - 1:45 a.m.

As more and more data is collected in various settings across organizations, companies, and countries, there has been an increase in the demand for user privacy. Developing privacy-preserving methods for data analytics is thus an important area of research. In this work we present a model based on generative adversarial networks (GANs) that learns to obfuscate specific sensitive attributes in speech data. We train a model that learns to hide sensitive information in the data, while preserving the meaning in the utterance. The model is trained in two steps: first to filter sensitive information in the spectrogram domain, and then to generate new and private information independent of the filtered one. The model is based on a U-Net CNN that takes mel-spectrograms as input. A MelGAN is used to invert the spectrograms back to raw audio waveforms. We show that it is possible to hide sensitive information such as gender by generating new data, trained adversarially to maintain utility and realism.

Link to the video: https://slideslive.com/38930731/adversarial-representation-learning-for-private-speech-generation

David Ericsson
Fri 1:45 a.m. - 2:00 a.m.

Self-supervised learning from raw speech has been proven beneficial to improve automatic speech recognition (ASR). We investigate here its impact on end-to-end automatic speech translation (AST) performance. We use a contrastive predictive coding (CPC) model pre-trained from unlabeled speech as a feature extractor for a downstream AST task. We show that self-supervised pre-training is particularly efficient in low resource settings and that fine-tuning CPC models on the AST training data further improves performance. Even in higher resource settings, ensembling AST models trained with filter-bank and CPC representations leads to near state-of-the-art models without using any ASR pre-training. This might be particularly beneficial when one needs to develop a system that translates from speech in a language with poorly standardized orthography or even from speech in an unwritten language.

Link to the video: https://slideslive.com/38930733/investigating-selfsupervised-pretranining-for-endtoend-speech-translation

Ha Nguyen
Fri 2:00 a.m. - 2:15 a.m.

Neural network models using predictive coding are interesting from the viewpoint of computational modelling of human language acquisition, where the objective is to understand how linguistic units could be learned from speech without any labels. Even though several promising predictive-coding-based learning algorithms have been proposed in the literature, it is currently unclear how well they generalise to different languages and training dataset sizes. In addition, although such models have been shown to be effective phonemic feature learners, it is unclear whether minimisation of the predictive loss functions of these models also leads to optimal phoneme-like representations. The present study investigates the behaviour of two predictive coding models, Autoregressive Predictive Coding (APC) and Contrastive Predictive Coding (CPC), in a phoneme discrimination task (ABX task) for two languages with different dataset sizes. Our experiments show a strong correlation between the autoregressive loss and the phoneme discrimination scores with the two datasets. However, to our surprise, the CPC model shows rapid convergence already after one pass over the training data, and, on average, its representations outperform those of APC on both languages.
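
The ABX phoneme discrimination score mentioned above can be illustrated with a minimal sketch. It assumes each token is already summarised as one fixed-length vector and uses Euclidean distance; practical ABX evaluations typically compare frame sequences with an alignment-based distance, and the embeddings below are made up for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def abx_error_rate(triplets, distance=euclidean):
    """Fraction of (A, B, X) triplets in which X is not closer to A than to B.

    Convention used here (an assumption of this sketch): X belongs to the
    same category as A, so a triplet is correct when d(A, X) < d(B, X).
    Ties count as half an error.
    """
    if not triplets:
        raise ValueError("need at least one triplet")
    errors = 0.0
    for a, b, x in triplets:
        d_ax = distance(a, x)
        d_bx = distance(b, x)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
    return errors / len(triplets)

# Hypothetical 2-D embeddings for two phoneme categories.
triplets = [
    ([0.0, 0.1], [1.0, 0.9], [0.1, 0.0]),  # X near A: correct
    ([0.2, 0.0], [0.9, 1.0], [0.0, 0.2]),  # X near A: correct
    ([0.0, 0.0], [1.0, 1.0], [0.8, 0.9]),  # X near B: error
    ([0.1, 0.1], [1.1, 0.9], [0.2, 0.1]),  # X near A: correct
]
print(abx_error_rate(triplets))  # 0.25
```

A lower error rate means the representation separates the two categories better, which is how two sets of learned features can be compared without training a classifier.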

Link to the video: https://slideslive.com/38930734/analysis-of-predictive-coding-models-for-phonemic-representations-in-small-datasets

María Andrea Cruz Blandón
Fri 2:15 a.m. - 2:30 a.m.

Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features. To achieve high performance, DNNs often need a large amount of annotated data, which can be difficult and costly to obtain. In this paper, we propose a method for learning audio representations by aligning the learned latent representations of audio and associated tags. Aligning is done by maximizing the agreement of the latent representations of audio and tags, using a contrastive loss. The result is an audio embedding model which reflects acoustic and semantic characteristics of sounds. We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks (namely, sound event recognition, music genre classification, and musical instrument classification), and investigate what type of characteristics the model captures. Our results show that our method is on par with the state of the art in the considered tasks and that the embeddings produced with our method are well correlated with some acoustic descriptors.
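
As a rough illustration of "maximizing the agreement" between paired audio and tag embeddings, the sketch below computes an InfoNCE-style contrastive loss over a small batch. The exact loss form, the cosine similarity, and the `temperature` value are assumptions of this sketch, not details taken from the abstract, and the embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_alignment_loss(audio_embs, tag_embs, temperature=0.1):
    """Mean cross-entropy of selecting the matching tag embedding.

    Row i of audio_embs is paired with row i of tag_embs; every other tag
    embedding in the batch acts as a negative example.
    """
    n = len(audio_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(audio_embs[i], tag_embs[j]) / temperature
                  for j in range(n)]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total / n

# Hypothetical embeddings: correct pairing versus swapped pairing.
audio = [[1.0, 0.0], [0.0, 1.0]]
tags = [[0.9, 0.1], [0.1, 0.9]]
print(contrastive_alignment_loss(audio, tags))        # small: pairs agree
print(contrastive_alignment_loss(audio, tags[::-1]))  # large: pairs disagree
```

Minimising this quantity pulls each audio embedding towards its own tag embedding and away from the tag embeddings of the other items in the batch.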

Link to the video: https://slideslive.com/38930732/coala-coaligned-autoencoders-for-learning-semantically-enriched-audio-representation

Xavier Favory
Fri 2:30 a.m. - 2:45 a.m.

Self-supervised Audio Transformers (SAT) enable great success in many downstream speech applications like ASR, but how they work has not been widely explored yet. In this work, we present multiple strategies for the analysis of attention mechanisms in SAT. We categorize attentions into explainable categories, where we discover each category possesses its own unique functionality. We provide a visualization tool for understanding multi-head self-attention, importance ranking strategies for identifying critical attention, and attention refinement techniques to improve model performance.

Link to the video: https://slideslive.com/38930730/understanding-selfattention-of-selfsupervised-audio-transformers

Shu-wen Yang
Fri 2:45 a.m. - 3:10 a.m.
Q&A Contributed Talks (Q&A)
Fri 4:00 a.m. - 4:25 a.m.

The basic idea in self-supervised learning (SSL) is to turn an unsupervised learning task into a supervised task, and use well-known supervised methods to solve it. Even though the data initially has no labels or targets to enable supervised learning, we artificially define a "pretext" supervised task, with some labels or targets of our choosing. Here, I focus on two widely used and fundamental paradigms for SSL. First, adding Gaussian noise to the data and then learning to denoise it is a special case of the more general SSL principle of corrupting the data and learning to repair it. Second, classification can be used for SSL by first corrupting the data and then learning to discriminate between the original data and the corrupted version; in the extreme case, this means learning to discriminate between the data and pure noise. While these are very intuitive principles, a sophisticated theoretical analysis is possible in both cases. In particular, deep connections to energy-based modelling and nonlinear independent component analysis can be shown.
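
The first paradigm (corrupt with Gaussian noise, then learn to denoise) can be shown on a toy problem. The sketch below assumes scalar Gaussian data and a one-parameter linear denoiser, both chosen only for illustration; for this case the fitted gain should approach the known optimum signal_var / (signal_var + noise_var).

```python
import random

def fit_linear_denoiser(clean, noisy):
    """Least-squares scalar w minimising sum((w * y - x) ** 2) over pairs."""
    numerator = sum(x * y for x, y in zip(clean, noisy))
    denominator = sum(y * y for y in noisy)
    return numerator / denominator

rng = random.Random(0)
signal_std, noise_std = 2.0, 1.0

# The "labels" of the pretext task are simply the uncorrupted data.
clean = [rng.gauss(0.0, signal_std) for _ in range(20000)]
noisy = [x + rng.gauss(0.0, noise_std) for x in clean]

w = fit_linear_denoiser(clean, noisy)
w_theory = signal_std ** 2 / (signal_std ** 2 + noise_std ** 2)
print(w, w_theory)  # the fitted gain should be close to 0.8
```

No external labels were needed: the supervised target was manufactured from the data itself, and the learned gain encodes the variance of the data distribution.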

Link to the video: https://slideslive.com/38930735/denoising-and-realvscorrupted-classification-as-two-fundamental-paradigms-in-selfsupervised-learning

Aapo Hyvarinen
Fri 4:25 a.m. - 4:50 a.m.

We propose an approach for pre-training speech representations via a masked reconstruction loss. Our pre-trained encoder networks are bidirectional and can therefore be used directly in typical bidirectional speech recognition models. The pre-trained networks can then be fine-tuned on a smaller amount of labelled data for speech recognition. In addition, we address the problem of domain differences between the pre-training and fine-tuning data, by adding an explicit adaptation layer during fine-tuning. Experiments with this approach on the LibriSpeech and Wall Street Journal corpora show promising results. The gain from pre-training is additive to that from supervised data augmentation.
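
A masked reconstruction objective of this kind can be sketched as follows. The span-masking scheme, the use of zeros as the mask value, and the L1 error restricted to masked frames are assumptions made for illustration; the abstract does not specify them.

```python
import random

def mask_frames(frames, mask_prob, span, rng):
    """Zero out random spans of frames; return (masked_frames, mask_flags)."""
    n = len(frames)
    flags = [False] * n
    for start in range(n):
        if rng.random() < mask_prob:
            for t in range(start, min(start + span, n)):
                flags[t] = True
    masked = [[0.0] * len(f) if flags[t] else list(f)
              for t, f in enumerate(frames)]
    return masked, flags

def masked_l1_loss(predicted, target, flags):
    """Mean absolute error computed over masked frames only."""
    total, count = 0.0, 0
    for p, y, is_masked in zip(predicted, target, flags):
        if is_masked:
            total += sum(abs(a - b) for a, b in zip(p, y))
            count += len(y)
    return total / count if count else 0.0

rng = random.Random(0)
# Hypothetical 10-frame, 2-dimensional feature sequence.
frames = [[float(t), float(t) + 0.5] for t in range(10)]
masked, flags = mask_frames(frames, mask_prob=0.2, span=2, rng=rng)

# In training, an encoder would map `masked` to a prediction of `frames`.
print(masked_l1_loss(masked, frames, flags))  # error if the mask is left as is
print(masked_l1_loss(frames, frames, flags))  # 0.0 for a perfect reconstruction
```

Because the encoder sees context on both sides of each masked span, a network trained this way can be bidirectional, which is the property the abstract highlights.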

Link to the video: https://slideslive.com/38930736/unsupervised-pretraining-of-bidirectional-speech-encoders-via-masked-reconstruction

Karen Livescu
Fri 4:50 a.m. - 5:15 a.m.
Q&A Invited Talks (Q&A)
Fri 5:15 a.m. - 5:30 a.m.

The intuitive interaction between the audio and visual modalities is valuable for cross-modal self-supervised learning. This concept has been demonstrated for generic audiovisual tasks like video action recognition and acoustic scene classification. However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio). The visual pretext task drives the audio representations to capture information related to lip movements. This enriches the audio encoder with visual information and the encoder can be used for evaluation without the visual modality. Our method attains competitive performance with respect to existing self-supervised audio features on established isolated word classification benchmarks, and significantly outperforms other methods at learning from fewer labels. Notably, our method also outperforms fully supervised training, thus providing a strong initialization for speech related tasks. Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.

Link to the video: https://slideslive.com/38930737/learning-speech-representations-from-raw-audio-by-joint-audiovisual-selfsuperision

Abhinav Shukla
Fri 5:30 a.m. - 5:45 a.m.

We present OtoWorld, an interactive environment in which agents must learn to listen in order to solve navigational tasks. The purpose of OtoWorld is to facilitate reinforcement learning research in computer audition, where agents must learn to listen to the world around them to navigate. OtoWorld is built on three open source libraries: OpenAI Gym for environment and agent interaction, PyRoomAcoustics for ray-tracing and acoustics simulation, and nussl for training deep computer audition models. OtoWorld is the audio analogue of GridWorld, a simple navigation game. OtoWorld can be easily extended to more complex environments and games. To solve one episode of OtoWorld, an agent must move towards each sounding source in the auditory scene and "turn it off". The agent receives no other input than the current sound of the room. The sources are placed randomly within the room and can vary in number. The agent receives a reward for turning off a source. We present preliminary results on the ability of agents to win at OtoWorld. OtoWorld is open-source and available.

Link to the video: https://slideslive.com/38930738/otoworld-toward-learning-to-separate-by-learning-to-move

Omkar Ranadive
Fri 5:45 a.m. - 6:00 a.m.

In this paper, we propose a technique for learning speech representations or embeddings in a self-supervised manner, and show their performance on an emotion classification task. We also investigate the usefulness of these embeddings for languages different from the pretraining corpus. We employ a convolutional encoder model and a contrastive loss function on augmented log-Mel spectrograms to learn meaningful representations from an unlabelled speech corpus. Emotion classification experiments are carried out on the SAVEE corpus, German EmoDB, and the CaFE corpus. We find that: (1) these pretrained embeddings perform better than MFCCs, openSMILE features, and PASE+ encodings for the emotion classification task; (2) these embeddings improve accuracies in the emotion classification task on languages different from that used in pretraining, thus confirming language-agnostic behaviour.

Link to the video: https://slideslive.com/38930739/language-agnostic-speech-embeddings-for-emotion-classification

Apoorv Nandan
Fri 6:00 a.m. - 6:15 a.m.

We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard Librispeech dataset, and leverage additional unlabeled data from Librivox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance with the supervised dataset alone, semi-supervision improves all models across architectures and loss functions and bridges much of the performance gaps between them. In doing so, we reach a new state-of-the-art for end-to-end acoustic models decoded with an external language model in the standard supervised learning setting, and a new absolute state-of-the-art with semi-supervised training. Finally, we study the effect of leveraging different amounts of unlabeled audio, propose several ways of evaluating the characteristics of unlabeled audio which improve acoustic modeling, and show that acoustic models trained with more audio rely less on external language models.
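
The pseudo-labeling loop itself is independent of the acoustic model. The sketch below replaces the speech recogniser with a one-dimensional nearest-centroid classifier, which is purely a stand-in for illustration; the data, labels, and function names are hypothetical.

```python
def fit_centroids(points, labels):
    """Train the toy model: one mean value per class label."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Label of the centroid nearest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def pseudo_label_round(labeled_x, labeled_y, unlabeled_x):
    """One round: teacher labels the unlabeled data, student trains on both."""
    teacher = fit_centroids(labeled_x, labeled_y)
    pseudo_y = [predict(teacher, x) for x in unlabeled_x]
    student = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo_y)
    return teacher, student

labeled_x, labeled_y = [0.0, 4.0], ["a", "b"]
unlabeled_x = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

teacher, student = pseudo_label_round(labeled_x, labeled_y, unlabeled_x)
print(teacher)  # centroids estimated from the two labeled points only
print(student)  # centroids shifted towards the unlabeled data
```

In the speech setting the teacher's predictions would be decoded transcripts rather than class labels, but the structure (train, label, retrain on the union) is the same.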

Link to the video: https://slideslive.com/38930740/endtoend-asr-from-supervised-to-semisupervised-learning-with-modern-architectures

Jacob Kahn
Fri 6:15 a.m. - 6:30 a.m.

In manufacturing settings, workers rely on their sense of hearing and their knowledge of what sounds correct to identify machine quality problems from the pitch, rhythm, timbre, and other characteristics of machine operation. Using machine learning to classify these sounds has broad applications for automating the manual quality recognition work currently being done, including machine operator training, quality control detection, and diagnostics across manufacturing and mechanical service industries. We previously established that models taking input pitch information from music domains can dramatically improve classification performance on industrial machine audio by leveraging the CREPE pretrained pitch model. In this work we explore the use of self-supervised learning on pitch-intensive birdsong rather than the CREPE model. To reduce our reliance on a pretrained pitch model and reduce the quantity of labeled industrial audio required, we implement self-supervised representation learning using plentiful, license-free, unlabeled, pitch-intensive wild birdsong recordings, with audio data augmentation, to perform classification on industrial audio. We show that: (1) we can preprocess the unlabeled birdsong data with unsupervised methods to eliminate low-signal samples and mask low-frequency noise, leaving just the desirable chirp-rich samples; (2) we can identify effective representations and approaches for learning birdsong pitch content by comparing select self-supervised pretext tasks, namely temporal sequence prediction and sequence generation; (3) we can identify effective augmentation methods for learning pitch by comparing the impact of a variety of audio data augmentation methods on self-supervised learning; and (4) downstream fine-tuned models deliver improved performance classifying industrial motor audio.
We demonstrate that motorized sound classification models using self-supervised learning with a dataset of pitch-intensive birdsong, combined with select data augmentation, achieve better results than using the pre-trained CREPE pitch model.

Link to the video: https://slideslive.com/38930741/using-selfsupervised-learning-of-birdsong-for-downstream-industrial-audio-classification

Patty Ryan
Fri 6:30 a.m. - 6:55 a.m.
Q&A Contributed Talks (Q&A)
Fri 6:55 a.m. - 7:20 a.m.

Existing manually-annotated datasets for video understanding differ substantially in their label spaces. Coupled with the limited sizes of these collections, this causes fully-supervised video models to transfer poorly across datasets and tasks.

Link to the video: https://slideslive.com/38930742/selfsupervised-video-models-from-sound-and-speech

Lorenzo Torresani
Fri 7:20 a.m. - 7:45 a.m.

Link to the video: https://slideslive.com/38930743/sights-and-sounds-in-3d-spaces

Kristen Grauman
Fri 7:45 a.m. - 8:10 a.m.

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. We set a new state of the art on both the 100 hour subset of Librispeech as well as on TIMIT phoneme recognition. When lowering the amount of labeled data to one hour, our model outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 5.7/10.1 WER on the clean/noisy test sets of Librispeech. This demonstrates the feasibility of speech recognition with limited amounts of labeled data. Fine-tuning on all of Librispeech achieves 1.9/3.5 WER using a simple baseline model architecture.

Link to the video: https://slideslive.com/38930744/selfsupervised-learning-of-speech-representations-with-wav2vec

Alexei Baevski
Fri 8:10 a.m. - 8:40 a.m.
Q&A Invited Talks (Q&A)
Fri 8:40 a.m. - 8:55 a.m.

Audio scene understanding, parsing sound into a hierarchy of meaningful parts, is an open problem in representation learning. Sound is a particularly challenging domain due to its high dimensionality, sequential dependencies and hierarchical structure. Differentiable Digital Signal Processing (DDSP) greatly simplifies the forward problem of generating audio by introducing differentiable synthesizer and effects modules that combine strong signal priors with end-to-end learning. Here, we focus on the inverse problem, inferring synthesis parameters to approximate an audio scene. We demonstrate that DDSP modules can enable a new approach to self-supervision, generating synthetic audio with differentiable synthesizers and training feature extractor networks to infer the synthesis parameters. By building a hierarchy from sinusoidal to harmonic representations, we show that it is possible to use such an inverse modeling approach to disentangle pitch from timbre, an important task in audio scene understanding.

Link to the video: https://slideslive.com/38930745/selfsupervised-pitch-detection-by-inverse-audio-synthesis

Jesse Engel
Fri 8:55 a.m. - 9:10 a.m.

Supervised approaches to single-channel speech separation rely on synthetic mixtures, so that the individual sources can be used as targets. Good performance depends upon how well the synthetic mixture data match real mixtures. However, matching synthetic data to the acoustic properties and distribution of sounds in a target domain can be challenging. Instead, we propose an unsupervised method that requires only single-channel acoustic mixtures, without ground-truth source signals. In this method, existing mixtures are mixed together to form a mixture of mixtures, which the model separates into latent sources. We propose a novel loss that allows the latent sources to be remixed to approximate the original mixtures. Experiments show that this method can achieve competitive performance on speech separation compared to supervised methods. In a semi-supervised learning setting, our method enables domain adaptation by incorporating unsupervised mixtures from a matched domain. In particular, we demonstrate that significant improvement to reverberant speech separation performance can be achieved by incorporating reverberant mixtures.
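
The remixing idea can be sketched as a loss that searches over all ways of assigning the estimated sources to the two original mixtures and keeps the best one. The squared-error measure and the exhaustive search below are assumptions chosen for clarity, not necessarily the signal-level loss used in the talk, and the separation model itself is omitted.

```python
from itertools import product

def remix(sources, assignment, target_index):
    """Sum the sources that the assignment routes to one mixture."""
    length = len(sources[0])
    out = [0.0] * length
    for src, idx in zip(sources, assignment):
        if idx == target_index:
            for t in range(length):
                out[t] += src[t]
    return out

def squared_error(u, v):
    """Sum of squared differences between two equal-length signals."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mixture_invariant_loss(mix1, mix2, est_sources):
    """Return (error, assignment) minimising the total remix error.

    Each estimated source is assigned to exactly one of the two mixtures;
    all 2 ** len(est_sources) assignments are tried.
    """
    best = None
    for assignment in product((0, 1), repeat=len(est_sources)):
        err = (squared_error(remix(est_sources, assignment, 0), mix1)
               + squared_error(remix(est_sources, assignment, 1), mix2))
        if best is None or err < best[0]:
            best = (err, assignment)
    return best

# Hypothetical 4-sample signals: mix1 contains two sources, mix2 contains one.
s1, s2, s3 = [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]
mix1 = [a + b for a, b in zip(s1, s2)]
mix2 = list(s3)

# Pretend a model separated mix1 + mix2 perfectly, in arbitrary order.
loss, assignment = mixture_invariant_loss(mix1, mix2, [s3, s1, s2])
print(loss, assignment)  # 0.0 (1, 0, 0)
```

The loss never needs the isolated sources as targets: only the two observed mixtures are used, which is what makes the training unsupervised.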

Link to the video: https://slideslive.com/38930746/unsupervised-speech-separation-using-mixtures-of-mixtures

Scott Wisdom
Fri 9:10 a.m. - 9:25 a.m.

Separating an audio scene, such as a cocktail party with multiple overlapping voices, into meaningful components (e.g., individual voices) is a core task in computer audition, analogous to image segmentation in computer vision. Deep networks are the state-of-the-art approach. They are typically trained on synthetic audio mixtures made from isolated sound source recordings so that ground truth for the separation is known. However, the vast majority of available audio is not isolated. The human brain performs an initial segmentation of the audio scene using primitive cues that are broadly applicable to many kinds of sound sources. We present a method to train a deep source separation model in an unsupervised way by bootstrapping using multiple primitive cues. We apply our method to train a network on a large set of unlabeled music recordings to separate vocals from accompaniment without the need for ground truth isolated sources or artificial training mixtures. A companion notebook with audio examples and code for experiments is available: https://github.com/pseeth/bootstrapping-computer-audition.

Link to the video: https://slideslive.com/38930747/bootstrapping-unsupervised-deep-music-separation-from-primitive-auditory-grouping-principles

Prem Seetharaman
Fri 9:25 a.m. - 10:00 a.m.
Q&A Contributed Talks and Closing Remarks (Q&A)

Author Information

Mirco Ravanelli (Mila)
Dmitriy Serdyuk (Mila, University of Montreal)
R Devon Hjelm (Microsoft Research / Mila)
Bhuvana Ramabhadran (Google)
Titouan Parcollet (University of Oxford)

More from the Same Authors