
Session

Generative Models 4

Fri 13 July 7:00 - 7:20 PDT

Parallel WaveNet: Fast High-Fidelity Speech Synthesis

Aäron van den Oord · Yazhe Li · Igor Babuschkin · Karen Simonyan · Oriol Vinyals · Koray Kavukcuoglu · George van den Driessche · Edward Lockhart · Luis C Cobo · Florian Stimberg · Norman Casagrande · Dominik Grewe · Seb Noury · Sander Dieleman · Erich Elsen · Nal Kalchbrenner · Heiga Zen · Alex Graves · Helen King · Tom Walters · Dan Belov · Demis Hassabis

The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time production setting. This paper introduces Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet with no significant difference in quality. The resulting system is capable of generating high-fidelity speech samples more than 20 times faster than real time, a 1000x speedup relative to the original WaveNet, and capable of serving multiple English and Japanese voices in a production setting.
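
The distillation objective admits a compact sketch: the parallel student flow proposes samples, the frozen teacher WaveNet scores them, and the KL divergence between the two densities is minimised. Below is a minimal PyTorch-style illustration of one loss evaluation; the `student`/`teacher` interfaces, the `log_prob` method, and the returned log-scales are assumptions made for this sketch, not the paper's actual code.

```python
import torch

def density_distillation_loss(student, teacher, noise):
    # student: parallel flow mapping logistic noise to audio, returning the
    #   sampled waveform and the per-timestep log-scales of its transform
    #   (assumed interface).
    # teacher: frozen, pre-trained autoregressive WaveNet exposing a
    #   log_prob(x) method (also an assumed interface).
    x, log_scales = student(noise)            # draw x ~ P_student in parallel
    # A flow's entropy is available in closed form; the additive constant
    # is dropped because it carries no gradient:
    #   H(P_student) = E_z[ sum_t log s(z, t) ] + const
    entropy = log_scales.sum(dim=-1).mean()
    # Cross-entropy H(P_student, P_teacher), Monte-Carlo estimated with the
    # student's own samples; gradients flow through x into the student only.
    cross_entropy = -teacher.log_prob(x).mean()
    return cross_entropy - entropy            # = KL(P_student || P_teacher) + const
```

Because the student is a flow, its samples and entropy come cheaply in a single parallel pass, which is what removes the sample-by-sample bottleneck at generation time.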

Fri 13 July 7:20 - 7:40 PDT

Autoregressive Quantile Networks for Generative Modeling

Georg Ostrovski · Will Dabney · Remi Munos

We introduce autoregressive implicit quantile networks (AIQN), an approach to generative modeling fundamentally different from those commonly used, which implicitly captures the distribution using quantile regression. AIQN achieves superior perceptual quality and improvements in evaluation metrics without incurring a loss of sample diversity. The method can be applied to many existing models and architectures. In this work we extend the PixelCNN model with AIQN and demonstrate results on CIFAR-10 and ImageNet using Inception scores, FID, non-cherry-picked samples, and inpainting results. We consistently observe that AIQN yields a highly stable algorithm that improves perceptual quality while maintaining a highly diverse distribution.
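
As a rough illustration of the quantile-regression objective behind implicit quantile networks: the model is conditioned on a quantile level tau drawn uniformly at random and trained so its output matches the tau-quantile of the target distribution, via an asymmetrically weighted Huber loss. The PyTorch sketch below shows that loss; tensor shapes and the `kappa` threshold are illustrative assumptions, not the paper's code.

```python
import torch

def quantile_huber_loss(pred, target, tau, kappa=1.0):
    # pred:   predicted tau-quantile for each output dimension, shape (B, D)
    # target: observed values, shape (B, D)
    # tau:    quantile levels fed to the network, sampled U(0, 1), shape (B, D)
    u = target - pred                          # quantile regression residual
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))
    # The asymmetric weight |tau - 1{u < 0}| turns the symmetric Huber loss
    # into a quantile regression: under-estimates are penalised in proportion
    # to tau, over-estimates in proportion to (1 - tau).
    weight = (tau - (u.detach() < 0).float()).abs()
    return (weight * huber / kappa).mean()
```

At sampling time, drawing a fresh tau per dimension and reading off the network's output yields a sample, since the network approximates the inverse CDF.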

Fri 13 July 7:40 - 7:50 PDT

Stochastic Video Generation with a Learned Prior

Emily Denton · Rob Fergus

Generating video frames that accurately predict future world states is challenging. Existing approaches either fail to capture the full distribution of outcomes, or yield blurry generations, or both. In this paper we introduce a video generation model with a learned prior over stochastic latent variables at each time step. Video frames are generated by drawing samples from this prior and combining them with a deterministic estimate of the future frame. The approach is simple and easily trained end-to-end on a variety of datasets. Sample generations are both varied and sharp, even many frames into the future, and compare favorably to those from existing approaches.
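
The training objective pairs per-frame reconstruction with a KL term that ties the learned prior to the inference posterior; at generation time, latents are sampled from the prior instead. A minimal PyTorch-style sketch of one loss evaluation follows; the module interfaces (encoder, posterior, prior, decoder, all recurrent or stateful in practice) and the `beta` weight are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def svg_lp_loss(frames, encoder, posterior, prior, decoder, beta=1e-4):
    # frames: (T, B, C, H, W) video clip. The four modules and their
    # call signatures are assumed interfaces for this sketch.
    recon, kl = 0.0, 0.0
    for t in range(1, frames.size(0)):
        h_prev, h_t = encoder(frames[t - 1]), encoder(frames[t])
        mu_q, logvar_q = posterior(h_t)    # q(z_t | x_{1:t}): used in training
        mu_p, logvar_p = prior(h_prev)     # p(z_t | x_{1:t-1}): the learned prior
        z_t = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
        x_hat = decoder(h_prev, z_t)       # deterministic estimate + sampled latent
        recon = recon + F.mse_loss(x_hat, frames[t])
        # KL between the two diagonal Gaussians, summed over latent dims:
        kl_t = 0.5 * (logvar_p - logvar_q
                      + (logvar_q.exp() + (mu_q - mu_p).pow(2)) / logvar_p.exp()
                      - 1.0)
        kl = kl + kl_t.sum(dim=-1).mean()
    return recon + beta * kl               # generation: draw z_t from the prior
```

Learning the prior (rather than fixing it to a standard Gaussian) lets the model predict when uncertainty is high, e.g. just before an object changes direction.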

Fri 13 July 7:50 - 8:00 PDT

Disentangled Sequential Autoencoder

Yingzhen Li · Stephan Mandt

We present a VAE architecture for encoding and generating high-dimensional sequential data, such as video or audio. Our deep generative model learns a latent representation of the data which is split into a static and a dynamic part, allowing us to approximately disentangle latent time-dependent features (dynamics) from features which are preserved over time (content). This architecture gives us partial control over generating content and dynamics by conditioning on either of these sets of features. In our experiments on artificially generated cartoon video clips and voice recordings, we show that we can transfer the content of one sequence to another by such content swapping. For audio, this allows us to convert a male speaker into a female speaker and vice versa, while for video we can separately manipulate shapes and dynamics. Furthermore, we give empirical evidence for the hypothesis that stochastic RNNs as latent state models are more efficient at compressing and generating long sequences than deterministic ones, which may be relevant for applications in video compression.
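
Content swapping follows directly from the factorised latent: take the static code from one sequence and the dynamic codes from another, then decode. A minimal PyTorch-style sketch, with all module names and interfaces assumed for illustration:

```python
import torch

def content_swap(encoder_f, encoder_z, decoder, seq_content, seq_dynamics):
    # seq_content, seq_dynamics: (T, B, ...) sequences. encoder_f returns the
    # static latent f, encoder_z the per-timestep dynamic latents z_{1:T};
    # all three module interfaces are assumptions made for this sketch.
    f = encoder_f(seq_content)     # "content": features preserved over time
    z = encoder_z(seq_dynamics)    # "dynamics": time-dependent features, (T, B, d)
    # Pair the swapped-in content code with each timestep's dynamics code and
    # decode: the output shows seq_content's appearance (or speaker identity)
    # moving (or speaking) like seq_dynamics.
    return torch.stack([decoder(z[t], f) for t in range(z.size(0))])
```

The male-to-female voice conversion reported in the abstract is this operation applied to audio: the static code carries speaker identity while the dynamic codes carry what is being said.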