The 1st Machine Learning for Audio Synthesis workshop at ICML will cover novel methods and applications of audio generation via machine learning. These include, but are not limited to: speech modeling, generation of environmental and other ambient sounds, novel generative models, music generation in the form of raw audio, and text-to-speech methods. Audio synthesis plays a significant and fundamental role in many audio-based machine learning systems, including smart speakers and voice-based interaction systems, real-time voice modification systems, and music or other content generation systems.

We plan to solicit original workshop papers in these areas, some of which will be presented as contributed talks and spotlights. Alongside these presentations will be talks from invited speakers, a poster session with an interactive live demo session, and an invited speaker panel.

We believe that a machine learning workshop focused on generation in the audio domain provides a good opportunity to bring together practitioners of audio generation tools and core machine learning researchers interested in audio, in order to forge new directions in this important area of research.
Fri 5:55 a.m. - 6:00 a.m.
Opening remarks
Brian Kulis
Fri 6:00 a.m. - 6:30 a.m.
A hierarchical representation learning approach for source separation, transcription, and music generation (Invited talk)
With interpretable music representation learning, music source separation problems are well connected with transcription problems, and transcription problems can be transformed into music arrangement problems. In particular, Gus will discuss two recently developed models. The first uses pitch-timbre disentanglement to achieve source separation, transcription, and synthesis. The second uses cross-modal chord-texture disentanglement to solve audio-to-symbolic piano arrangement. Finally, Gus will present his vision of a unified hierarchical representation-learning framework that bridges music understanding and generation.
Gus Xia
Fri 6:30 a.m. - 7:00 a.m.
Frontiers and challenges in music audio generation (Invited talk)
Despite notable recent progress on generative modeling of text, images, and speech, generative modeling of music audio remains a challenging frontier for machine learning. A primary obstacle to modeling audio is the extreme sequence length of audio waveforms, which are impractical to model directly with standard methods. A challenge more specific to music audio is scaling to critical capacity, an elusive threshold of model size beyond which coherent generation emerges. In this talk, I will present strategies from my work that seek to overcome the practical challenges of modeling audio by either (1) exploring featurizations that reduce superfluous information in waveforms, or (2) proposing new methods that can process waveforms directly. I will also share insights from ongoing work on achieving critical capacity for generating broad music audio, i.e., music audio not constrained to a particular instrument or genre.
Chris Donahue
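The first strategy above, featurizations that reduce superfluous information, is commonly realized with log-mel spectrograms. Below is a minimal sketch (not taken from the talk) using librosa; the file name and parameter values are illustrative.

```python
# Minimal sketch (not from the talk): the log-mel spectrogram is one common
# featurization that discards superfluous waveform detail, turning 16,000
# samples per second into roughly 62 frames of 80 mel bins per second.
import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=16000)            # illustrative file name
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel, ref=np.max)           # compress dynamic range
print(y.shape, log_mel.shape)                            # (n_samples,) vs. (80, n_frames)
```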
Fri 7:00 a.m. - 7:20 a.m.
DrumGAN VST: A Plugin for Drum Sound Analysis/Synthesis with Autoencoding Generative Adversarial Networks (Contributed talk)
In contemporary popular music production, drum sound design is commonly performed by cumbersome browsing and processing of pre-recorded samples in sound libraries. One can also use specialized synthesis hardware, typically controlled through low-level, musically meaningless parameters. Today, the field of Deep Learning offers methods to control the synthesis process via learned high-level features and allows generating a wide variety of sounds. In this paper, we present DrumGAN VST, a plugin for synthesizing drum sounds using a Generative Adversarial Network. DrumGAN VST operates on 44.1 kHz sample-rate audio, offers independent and continuous instrument class controls, and features an encoding neural network that maps sounds into the GAN's latent space, enabling resynthesis and manipulation of pre-existing drum sounds. We provide numerous sound examples and a demo of the proposed VST plugin.
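As a rough illustration of the analysis/synthesis loop described in the abstract, the hypothetical sketch below encodes an existing drum sound into a GAN latent code and resynthesizes it under edited class controls. The Encoder and Generator modules and the three-way class controls are placeholders, not the DrumGAN VST implementation.

```python
# Hypothetical sketch of the analysis/synthesis loop: an encoder maps an existing
# drum sound into the GAN's latent space, the latent code plus continuous
# instrument-class controls are edited, and the generator resynthesizes audio.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(latent_dim))
    def forward(self, audio):
        return self.net(audio)

class Generator(nn.Module):
    def __init__(self, latent_dim=128, n_classes=3, n_samples=16384):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + n_classes, n_samples), nn.Tanh())
    def forward(self, z, class_controls):
        return self.net(torch.cat([z, class_controls], dim=-1))

encoder, generator = Encoder(), Generator()
audio = torch.randn(1, 16384)                      # a pre-existing drum sound (placeholder)
z = encoder(audio)                                 # project the sound into the latent space
controls = torch.tensor([[0.9, 0.1, 0.0]])         # assumed continuous kick/snare/cymbal controls
resynth = generator(z, controls)                   # edited resynthesis of the input sound
```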
Fri 7:20 a.m. - 7:40 a.m.
Generating Detailed Music Datasets with Neural Audio Synthesis (Contributed talk)
Generative models are increasingly able to generate realistic, high-quality data in the domains of both symbolic music (i.e., MIDI) and raw audio. These models have also been trained in ways that are increasingly controllable, allowing deliberate and systematic manipulation of outputs to have desired characteristics. However, despite the demonstrated benefits of using synthetic data to improve low-resource learning in other domains, research has not yet leveraged generative models to create large-scale datasets suitable for modern deep learning models in the music domain. In this work, we address this gap by pairing a generative model of MIDI (Coconet trained on Bach Chorales) with a structured audio synthesis model (MIDI-DDSP trained on URMP). We demonstrate a system capable of producing unlimited amounts of realistic chorale music with rich annotations through controlled synthesis of model-generated MIDI. We call this system the Chamber Ensemble Generator (CEG), and use it to generate a large dataset of chorales (CocoChorales). We demonstrate that data generated using our approach improves state-of-the-art models for music transcription and source separation, and we release both the system and the dataset as an open-source foundation for future work.
Yusong Wu
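A hedged sketch of the two-stage pipeline the abstract describes: a symbolic model generates MIDI, a structured synthesis model renders it to audio, and the intermediate variables double as annotations. The two helper functions are hypothetical stand-ins (stubbed with placeholder data), not the released CEG or MIDI-DDSP API.

```python
# Hedged sketch of the two-stage generation pipeline with stubbed helpers.
import numpy as np

def sample_coconet_chorale(seed: int) -> dict:
    """Stage 1 stand-in: a symbolic generative model returns a four-part chorale as notes."""
    rng = np.random.default_rng(seed)
    return {"notes": rng.integers(40, 80, size=(4, 32))}   # placeholder MIDI pitches per voice

def render_midi_ddsp(midi: dict) -> tuple[dict, dict]:
    """Stage 2 stand-in: a structured synthesis model renders notes to audio and exposes its parameters."""
    stems = [np.zeros(16000 * 10) for _ in midi["notes"]]  # placeholder per-instrument audio
    return {"mix": sum(stems), "stems": stems}, {"f0": None, "dynamics": None}

def generate_annotated_example(seed: int) -> dict:
    midi = sample_coconet_chorale(seed)                    # symbolic generation
    audio, synth_params = render_midi_ddsp(midi)           # note-level audio synthesis
    return {
        "audio_mix": audio["mix"],                         # mixed chorale audio
        "audio_stems": audio["stems"],                     # per-instrument stems (source separation targets)
        "midi": midi,                                      # ground-truth notes (transcription targets)
        "synthesis_params": synth_params,                  # fine-grained performance annotations
    }

dataset = [generate_annotated_example(seed) for seed in range(3)]  # scale as needed; generation is unbounded
```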
Fri 7:40 a.m. - 8:00 a.m.
Adversarial Audio Synthesis with Complex-valued Polynomial Networks (Contributed talk)
Time-frequency (TF) representations in audio synthesis have been increasingly modeled with real-valued networks. However, overlooking the complex-valued nature of TF representations can result in suboptimal performance and require additional modules (e.g., for modeling the phase). To this end, we introduce complex-valued polynomial networks, called APOLLO, that integrate such complex-valued representations in a natural way. Concretely, APOLLO captures high-order correlations of the input elements using high-order tensors as scaling parameters. By leveraging standard tensor decompositions, we derive different architectures and enable modeling richer correlations. We outline such architectures and showcase their performance in audio generation across four benchmarks. As a highlight, APOLLO yields a 17.5% improvement over adversarial methods and 8.2% over state-of-the-art diffusion models on the SC09 dataset in audio generation. Our models can encourage the systematic design of other efficient architectures on the complex field.
Grigorios Chrysos
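To make the idea of factorized high-order correlations concrete, the sketch below implements a generic degree-2 polynomial expansion over complex-valued inputs in PyTorch. It illustrates the mechanism only and is not the APOLLO architecture; all dimensions and factor matrices are arbitrary.

```python
# Illustrative sketch: the full high-order tensor of interaction weights is
# factorized into small matrices, and elementwise products of complex linear
# projections capture second-order correlations between input elements.
import torch

torch.manual_seed(0)
d_in, d_hidden, d_out = 64, 32, 64

A = torch.randn(d_hidden, d_in, dtype=torch.cfloat) * 0.1   # factor matrices of the
B = torch.randn(d_hidden, d_in, dtype=torch.cfloat) * 0.1   # decomposed weight tensor
C = torch.randn(d_out, d_hidden, dtype=torch.cfloat) * 0.1
W1 = torch.randn(d_out, d_in, dtype=torch.cfloat) * 0.1     # first-order (linear) term

def poly2(z: torch.Tensor) -> torch.Tensor:
    """Degree-2 complex polynomial of the input z (shape: [batch, d_in])."""
    second_order = (z @ A.T) * (z @ B.T)        # Hadamard product of two complex projections
    return z @ W1.T + second_order @ C.T        # linear term + factorized quadratic term

z = torch.randn(8, d_in, dtype=torch.cfloat)    # e.g., complex STFT frames
y = poly2(z)
print(y.shape, y.dtype)                         # torch.Size([8, 64]) torch.complex64
```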
Fri 8:00 a.m. - 8:30 a.m.
Break
Fri 8:30 a.m. - 9:00 a.m.
Cooperative conversational AI (Invited talk)
The development of machines that effectively converse with humans is a challenging problem that requires combining complex technologies, such as speech recognition, dialogue systems, and speech synthesis. Current solutions mainly rely on independent modules combined in plain unidirectional pipelines. To reach higher levels of human-computer interaction, we have to radically rethink current conversational AI architectures within a novel cooperative framework. We need to replace standard pipelines with "cooperative networks of deep networks" in which all the modules automatically learn how to cooperate, communicate, and interact. This keynote will discuss some novel ideas toward this ambitious goal and will introduce a novel toolkit called SpeechBrain, designed to easily implement this holistic approach to conversational AI.
Mirco Ravanelli
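For context, SpeechBrain already exposes the individual modules (ASR, TTS, speaker recognition, and so on) that such a cooperative system would connect. A minimal, hedged usage example follows; the module path and model identifier reflect recent SpeechBrain releases and may differ by version, and the "cooperative" framework in the talk goes beyond chaining independently trained modules like this.

```python
# Hedged example of loading one pretrained SpeechBrain module (an ASR model) and
# using it on a file; in the cooperative framework, modules like this would learn
# to interact rather than being chained in a fixed pipeline.
from speechbrain.pretrained import EncoderDecoderASR  # path may be speechbrain.inference in newer versions

asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",   # pretrained recipe hosted on Hugging Face
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr.transcribe_file("example.wav"))                # one module of a larger conversational pipeline
```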
Fri 9:00 a.m. - 10:30 a.m.
Lunch
Fri 10:30 a.m. - 12:00 p.m.
Poster & Demo Session
Fri 12:00 p.m. - 12:30 p.m.
Self-supervised learning for speech generation (Invited talk)
Self-supervised learning (SSL) for speech has demonstrated great success on inference tasks such as speech recognition. However, it is less studied for generative tasks where the goal is to synthesize speech. In this talk, I will share our recent work on building unconditional and conditional generative speech models leveraging SSL. Instead of representing speech with traditional features like spectrograms, we showed that discrete units derived from self-supervised models serve as better generative modeling targets for several tasks. Specifically, we presented the first text-free spoken language models for prosodically rich speech as well as spoken dialogues, and achieved state-of-the-art performance on speech-to-speech translation without intermediate text output.
Wei-Ning Hsu
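A hedged sketch of how such discrete units are commonly derived from a self-supervised speech model: frame-level SSL features are clustered with k-means, each frame is mapped to its nearest centroid ID, and consecutive repeats are collapsed. The features below are random placeholders; in practice they would come from a model such as HuBERT.

```python
# Hedged sketch of discrete-unit extraction from SSL features via k-means.
import numpy as np
from sklearn.cluster import KMeans

n_frames, feat_dim, n_units = 500, 768, 100
ssl_features = np.random.randn(n_frames, feat_dim)       # placeholder for real SSL features

kmeans = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(ssl_features)
frame_units = kmeans.predict(ssl_features)               # one unit ID per frame

# Collapse consecutive duplicates to get a compact unit sequence, the typical
# target for a unit language model or a unit-to-speech synthesizer.
unit_sequence = [int(u) for i, u in enumerate(frame_units) if i == 0 or u != frame_units[i - 1]]
print(len(frame_units), len(unit_sequence))
```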
Fri 12:30 p.m. - 1:00 p.m.
DiffWave: A Versatile Diffusion Model for Audio Synthesis (Invited talk)
DiffWave is a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive and converts a white-noise signal into a structured waveform through a Markov chain with a constant number of steps at synthesis time. DiffWave produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrograms, class-conditional generation, and unconditional generation. DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43) while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models on the challenging unconditional generation task in terms of audio quality and sample diversity.
Zhifeng Kong
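The constant-step Markov chain mentioned above is the standard reverse diffusion loop. The sketch below shows generic DDPM ancestral sampling for waveforms, not DiffWave's exact schedule or network; eps_model is a placeholder for the conditional noise-prediction network.

```python
# Illustrative sketch of the reverse diffusion loop: starting from white noise,
# a fixed number of denoising steps turn x_T into a waveform x_0.
import torch

T = 50                                                    # constant number of steps at synthesis
betas = torch.linspace(1e-4, 0.05, T)                     # assumed noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def sample(eps_model, cond, n_samples=1, n_audio=16000):
    """eps_model(x_t, t, cond) predicts the noise at step t (e.g., conditioned on a mel spectrogram)."""
    x = torch.randn(n_samples, n_audio)                   # x_T: pure white noise
    for t in reversed(range(T)):
        eps = eps_model(x, torch.full((n_samples,), t), cond)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # add noise except at the last step
    return x                                              # x_0: the synthesized waveform
```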
Fri 1:00 p.m. - 1:20 p.m.
Break
Fri 1:20 p.m. - 1:40 p.m.
Speech De-warping: Unsupervised Pre-training for Data-Efficient Text-to-Speech on Low Resource Languages (Contributed talk)
Neural text-to-speech (TTS) models can synthesize natural human speech when trained on large amounts of transcribed speech. However, collecting such large-scale transcribed data is expensive. In this paper, we propose an unsupervised pre-training method that reduces the amount of paired data required to train a sequence-to-sequence TTS model by utilizing large untranscribed speech data. The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones. For semantically meaningful warping/de-warping, we train a self-supervised phoneme segmentation model and use the resulting segments to warp the spectrograms at a pseudo-phoneme level. In addition, as a byproduct of our pre-training process, we can optionally apply segment-based data augmentation in the fine-tuning stage to further improve data efficiency. We empirically demonstrate the effectiveness of our method in a low-resource language scenario, achieving outstanding performance compared to various baselines.
Myoungseo Song
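Below is a hedged sketch of the segment-level warping used as the pre-training corruption, as an illustration of the idea rather than the paper's exact procedure: a mel spectrogram is split at pseudo-phoneme boundaries and each segment is randomly stretched or shrunk, so the model must learn to restore the original durations.

```python
# Hedged sketch of pseudo-phoneme-level time warping of a mel spectrogram.
import numpy as np

def warp_segments(mel: np.ndarray, boundaries: list[int], rng: np.random.Generator) -> np.ndarray:
    """mel: [n_mels, n_frames]; boundaries: frame indices from a phoneme-segmentation model."""
    segments = np.split(mel, boundaries, axis=1)
    warped = []
    for seg in segments:
        factor = rng.uniform(0.5, 2.0)                       # random per-segment time stretch
        new_len = max(1, int(round(seg.shape[1] * factor)))
        idx = np.linspace(0, seg.shape[1] - 1, new_len).round().astype(int)
        warped.append(seg[:, idx])                           # nearest-frame resampling along time
    return np.concatenate(warped, axis=1)

rng = np.random.default_rng(0)
mel = np.random.randn(80, 200)                               # placeholder mel spectrogram
warped_mel = warp_segments(mel, boundaries=[40, 90, 150], rng=rng)
# Pre-training objective: predict `mel` (de-warped) from `warped_mel`.
```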
Fri 1:40 p.m. - 2:00 p.m.
DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs (Contributed talk)
This paper presents DiffGAN-TTS, a novel denoising diffusion probabilistic model (DDPM)-based text-to-speech (TTS) model that achieves high-fidelity and efficient speech synthesis. DiffGAN-TTS is based on denoising diffusion generative adversarial networks (GANs), which adopt an adversarially trained expressive model to approximate the denoising distribution. We show with multi-speaker TTS experiments that DiffGAN-TTS can generate high-fidelity speech samples with only 4 denoising steps. We present an active shallow diffusion mechanism to further speed up inference, together with a two-stage training scheme in which a basic TTS acoustic model trained at stage one provides valuable prior information for a DDPM trained at stage two. Our experiments show that DiffGAN-TTS can achieve high synthesis performance with only 1 denoising step.
Songxiang Liu
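The few-step synthesis relies on the denoising diffusion GAN mechanism: an adversarially trained generator predicts a clean sample directly, and the next latent is drawn from the diffusion posterior, so a handful of large steps suffice. The sketch below illustrates that generic mechanism, not the DiffGAN-TTS code; the generator, schedule, and shapes are placeholders.

```python
# Illustrative sketch of few-step sampling with a denoising diffusion GAN:
# the generator predicts x_0, and x_{t-1} is drawn from q(x_{t-1} | x_t, x_0).
import torch

T = 4                                                      # very few denoising steps
betas = torch.linspace(0.1, 0.5, T)                        # assumed coarse noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def sample(generator, text_cond, shape=(1, 80, 200)):
    """generator(x_t, t, text_cond) is trained adversarially to output a clean mel spectrogram x_0."""
    x = torch.randn(shape)                                  # start from noise
    for t in reversed(range(T)):
        x0_pred = generator(x, torch.full((shape[0],), t), text_cond)
        if t == 0:
            return x0_pred
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        # Gaussian posterior q(x_{t-1} | x_t, x0_pred): mean coefficients and variance.
        coef_x0 = torch.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
        coef_xt = torch.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
        var = betas[t] * (1.0 - ab_prev) / (1.0 - ab_t)
        x = coef_x0 * x0_pred + coef_xt * x + torch.sqrt(var) * torch.randn_like(x)
    return x
```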
Fri 2:00 p.m. - 3:00 p.m.
Panel Discussion (Panel)
A panel of invited speakers; two moderators will facilitate the discussion, including questions from the audience.
Mirco Ravanelli · Chris Donahue · Zhifeng Kong · Wei-Ning Hsu · Rachel Manzelli · Sadie Allen
Author Information
Rachel Manzelli (Modulate)
Brian Kulis (Boston University and Amazon)
Sadie Allen (Boston University)
Sander Dieleman (DeepMind)
Yu Zhang (Google)