Session: Deep Generative Models
Sum-of-Squares Polynomial Flow
Priyank Jaini · Kira A. Selby · Yaoliang Yu
The triangular map is a recent construct in probability theory that allows one to transform any source probability density into any target density. Based on triangular maps, we propose a general framework for high-dimensional density estimation, by specifying one-dimensional transformations (equivalently, conditional densities) and appropriate conditioner networks. This framework (a) reveals the commonalities and differences of existing autoregressive and flow-based methods, (b) allows a unified understanding of the limitations and representational power of these recent approaches, and (c) motivates us to uncover a new Sum-of-Squares (SOS) flow that is interpretable, universal, and easy to train. We perform several synthetic experiments on various density geometries to demonstrate the benefits (and shortcomings) of such transformations. SOS flows achieve competitive results in simulations and on several real-world datasets.
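As background to this abstract, here is a minimal NumPy sketch of a one-dimensional sum-of-squares transform: integrating a sum of squared polynomials yields a monotone (hence invertible) map whose derivative gives the change-of-variables term. This is an illustrative sketch of the construction described in the abstract, not the authors' code; coefficient shapes and function names are assumptions.

```python
import numpy as np

def sos_transform(z, a, c=0.0):
    """One-dimensional SOS transform T(z) = c + int_0^z sum_k (sum_l a[k, l] u^l)^2 du.

    The integrand is a sum of squared polynomials, hence non-negative, so T is
    monotonically increasing and therefore invertible. Integrating term by term
    gives a closed-form polynomial in z.
    """
    K, L = a.shape                       # K polynomials of degree L - 1 each
    # Coefficients of sum_k p_k(u)^2, a polynomial of degree 2(L - 1).
    sq = np.zeros(2 * L - 1)
    for k in range(K):
        sq += np.convolve(a[k], a[k])    # polynomial squaring via coefficient convolution
    # Antiderivative evaluated at z (its value at 0 vanishes): int u^m du = z^(m+1)/(m+1).
    powers = np.arange(1, 2 * L)
    T = c + np.sum(sq * z[..., None] ** powers / powers, axis=-1)
    # Derivative dT/dz, needed for the change-of-variables log-density.
    dT = np.sum(sq * z[..., None] ** (powers - 1), axis=-1)
    return T, dT

# toy usage: push standard-normal samples through a random monotone SOS transform
rng = np.random.default_rng(0)
z = rng.standard_normal(5)
a = rng.standard_normal((2, 3))          # 2 polynomials of degree 2
T, dT = sos_transform(z, a)
log_density_correction = -np.log(dT)     # term added to the source log-density
```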
FloWaveNet : A Generative Flow for Raw Audio
Sungwon Kim · Sang-gil Lee · Jongyoon Song · Jaehyeon Kim · Sungroh Yoon
Most modern text-to-speech architectures use a WaveNet vocoder to synthesize high-fidelity waveform audio, but its ancestral sampling scheme limits practical applications, most notably through high inference time. The recently proposed Parallel WaveNet and ClariNet have achieved real-time audio synthesis by incorporating inverse autoregressive flow (IAF) for parallel sampling. However, these approaches require a two-stage training pipeline with a well-trained teacher network and can only produce natural sound by using probability distillation along with heavily engineered auxiliary loss terms. We propose FloWaveNet, a flow-based generative model for raw audio synthesis. FloWaveNet requires only a single-stage training procedure and a single maximum-likelihood loss, without any additional auxiliary terms, and it is inherently parallel due to the characteristics of generative flow. The model can efficiently sample raw audio in real time, with clarity comparable to previous two-stage parallel models. The code and samples for all models, including our FloWaveNet, are available on GitHub.
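For readers unfamiliar with generative flows, the sketch below shows a generic affine coupling step of the kind flow-based models stack: invertible by construction, with a cheap log-determinant, which is what makes single-stage maximum-likelihood training possible. FloWaveNet's actual blocks (and its dilated-convolution conditioner) are more elaborate; the toy conditioner and names here are illustrative assumptions.

```python
import numpy as np

def affine_coupling_forward(x, conditioner):
    """One affine coupling step: split x, transform one half conditioned on the other.

    Because only x_b is transformed, the Jacobian is triangular and its
    log-determinant is just the sum of the log-scales, which keeps exact
    maximum-likelihood training tractable; inversion needs only (y_a, s, t).
    """
    x_a, x_b = np.split(x, 2, axis=-1)
    log_s, t = conditioner(x_a)              # scale/shift predicted from the untouched half
    y_b = x_b * np.exp(log_s) + t
    y = np.concatenate([x_a, y_b], axis=-1)
    log_det = np.sum(log_s, axis=-1)         # contribution to the flow log-likelihood
    return y, log_det

def affine_coupling_inverse(y, conditioner):
    y_a, y_b = np.split(y, 2, axis=-1)
    log_s, t = conditioner(y_a)
    x_b = (y_b - t) * np.exp(-log_s)
    return np.concatenate([y_a, x_b], axis=-1)

# stand-in conditioner (a real audio model would use a dilated-convolution network here)
def toy_conditioner(x_a):
    return np.tanh(x_a), 0.1 * x_a           # (log-scale, shift)

x = np.random.default_rng(1).standard_normal((4, 8))
y, log_det = affine_coupling_forward(x, toy_conditioner)
assert np.allclose(x, affine_coupling_inverse(y, toy_conditioner))
```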
Are Generative Classifiers More Robust to Adversarial Attacks?
Yingzhen Li · John Bradshaw · Yash Sharma
There is a rising interest in studying the robustness of deep neural network classifiers against adversaries, with both advanced attack and defence techniques being actively developed. However, most recent work focuses on discriminative classifiers, which only model the conditional distribution of the labels given the inputs. In this paper, we propose and investigate the deep Bayes classifier, which improves classical naive Bayes with conditional deep generative models. We further develop detection methods for adversarial examples, which reject inputs with low likelihood under the generative model. Experimental results suggest that deep Bayes classifiers are more robust than deep discriminative classifiers, and that the proposed detection methods are effective against many recently proposed attacks.
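A minimal sketch of the generative-classifier idea described above: classify by Bayes' rule and reject inputs whose best class-conditional likelihood is too low. The estimator of log p(x|y) is a caller-supplied callable (in the paper it comes from a conditional deep generative model); the rejection rule and toy Gaussian class-conditionals below are illustrative assumptions, not the authors' exact detection method.

```python
import numpy as np

def generative_classify(x, log_px_given_y, log_prior, reject_threshold=None):
    """Bayes-rule classification with an optional likelihood-based rejection test.

    log_px_given_y(x, y) should return an estimate of log p(x | y). Inputs whose
    best class-conditional likelihood falls below the threshold are flagged as
    suspicious (possible adversarial examples).
    """
    classes = np.arange(len(log_prior))
    log_joint = np.array([log_px_given_y(x, y) + log_prior[y] for y in classes])
    y_hat = int(np.argmax(log_joint))
    rejected = (reject_threshold is not None
                and np.max(log_joint - log_prior) < reject_threshold)
    return y_hat, rejected

# toy class-conditionals: unit-variance Gaussians with different means
means = np.array([-2.0, 0.0, 2.0])
log_px = lambda x, y: -0.5 * (x - means[y]) ** 2 - 0.5 * np.log(2 * np.pi)
y_hat, rejected = generative_classify(1.8, log_px, np.log(np.ones(3) / 3),
                                      reject_threshold=-5.0)
```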
A Gradual, Semi-Discrete Approach to Generative Network Training via Explicit Wasserstein Minimization
Yucheng Chen · Matus Telgarsky · Chao Zhang · Bolton Bailey · Daniel Hsu · Jian Peng
This paper provides a simple procedure to fit generative networks to target distributions, with the goal of a small Wasserstein distance (or other optimal transport cost). The approach is based on two principles: (a) if the source randomness of the network is a continuous distribution (the "semi-discrete" setting), then the Wasserstein distance is realized by a deterministic optimal transport mapping; (b) given an optimal transport mapping between a generator network and a target distribution, the Wasserstein distance may be reduced via a regression between the generated data and the mapped target points. The procedure therefore alternates these two steps, forming an optimal transport mapping and regressing against it, gradually adjusting the generator network towards the target distribution. Mathematically, this approach is shown to minimize the Wasserstein distance to both the empirical target distribution and its underlying population counterpart. Empirically, good performance is demonstrated on the training and test sets of the MNIST and Thin-8 datasets. The paper closes with a discussion of the unsuitability of the Wasserstein distance for certain tasks, as has been identified in prior work (Arora et al., 2017; Huang et al., 2017).
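A rough sketch of one round of the alternation described above, under simplifying assumptions: equal-sized batches, so the optimal transport matching reduces to a linear assignment, and a closed-form linear "generator" standing in for a neural network trained by regression. Names and the toy generator are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_regression_step(generate, regress, z_batch, targets):
    """(a) Form an optimal transport matching between generated and target points,
    (b) regress the generator toward the matched targets."""
    g = generate(z_batch)
    cost = np.sum((g[:, None, :] - targets[None, :, :]) ** 2, axis=-1)
    rows, cols = linear_sum_assignment(cost)        # discrete OT for equal-size batches
    regress(z_batch[rows], targets[cols])           # fit G(z_i) toward its matched target

# toy linear generator W z + b, refit in closed form against the matched targets
rng = np.random.default_rng(0)
W, b = rng.standard_normal((2, 2)), np.zeros(2)
generate = lambda z: z @ W.T + b

def regress(z, y):
    global W, b
    Z = np.hstack([z, np.ones((len(z), 1))])
    sol, *_ = np.linalg.lstsq(Z, y, rcond=None)     # least-squares "training step"
    W, b = sol[:-1].T, sol[-1]

targets = rng.standard_normal((64, 2)) * 0.3 + 2.0
for _ in range(5):
    ot_regression_step(generate, regress, rng.standard_normal((64, 2)), targets)
```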
Disentangling Disentanglement in Variational Autoencoders
Emile Mathieu · Tom Rainforth · N Siddharth · Yee-Whye Teh
We develop a generalisation of disentanglement in variational auto-encoders (VAEs)—decomposition of the latent representation—characterising it as the fulfilment of two factors: a) the latent encodings of the data having an appropriate level of overlap, and b) the aggregate encoding of the data conforming to a desired structure, represented through the prior. Decomposition permits disentanglement, i.e. explicit independence between latents, as a special case, but also allows for a much richer class of properties to be imposed on the learnt representation, such as sparsity, clustering, independent subspaces, or even intricate hierarchical dependency relationships. We show that the beta-VAE varies from the standard VAE predominantly in its control of latent overlap and that for the standard choice of an isotropic Gaussian prior, its objective is invariant to rotations of the latent representation. Viewed from the decomposition perspective, breaking this invariance with simple manipulations of the prior can yield better disentanglement with little or no detriment to reconstructions. We further demonstrate how other choices of prior can assist in producing different decompositions and introduce an alternative training objective that allows the control of both decomposition factors in a principled manner.
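For reference, the beta-VAE objective that this abstract contrasts with the standard VAE, in a minimal NumPy form assuming a diagonal-Gaussian encoder and an isotropic Gaussian prior. This is background to the discussion of latent overlap, not the paper's new training objective; the function name and inputs are illustrative.

```python
import numpy as np

def beta_vae_objective(log_px_given_z, mu, log_var, beta=4.0):
    """Per-example beta-VAE objective: reconstruction term minus beta-weighted KL.

    For q(z|x) = N(mu, diag(exp(log_var))) and p(z) = N(0, I), the KL term has the
    closed form below. beta > 1 shrinks posteriors toward the prior, increasing
    latent overlap; beta = 1 recovers the standard VAE ELBO.
    """
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)
    return log_px_given_z - beta * kl

# toy usage: 2-D latents for a batch of 3 examples
obj = beta_vae_objective(log_px_given_z=np.array([-1.2, -0.8, -2.0]),
                         mu=np.zeros((3, 2)), log_var=np.zeros((3, 2)))
```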
EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE
Chao Ma · Sebastian Tschiatschek · Konstantina Palla · Jose Miguel Hernandez-Lobato · Sebastian Nowozin · Cheng Zhang
Making decisions requires information relevant to the task at hand. Many real-life decision-making situations allow further relevant information to be acquired at a specific cost. For example, in assessing the health status of a patient we may decide to take additional measurements such as diagnostic tests or imaging scans before making a final assessment. Acquiring more relevant information enables better decision making, but may be costly. How can we trade off the desire to make good decisions by acquiring further information with the cost of performing that acquisition?
To this end, we propose a principled framework, named EDDI (Efficient Dynamic Discovery of high-value Information), based on the theory of Bayesian experimental design. In EDDI we propose a novel partial variational autoencoder (Partial VAE) to efficiently handle missing data with different missing patterns. EDDI combines this Partial VAE with an acquisition function that maximizes expected information gain on a set of target variables. EDDI is efficient and demonstrates that dynamic discovery of high-value information is possible; we show cost reduction at the same decision quality and improved decision quality at the same cost in benchmarks and two health-care applications. We are confident that there is great potential for realizing these gains in real-world decision support systems.
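A schematic sketch of the greedy acquisition loop this kind of framework performs: score each unobserved feature by its estimated expected information gain about the targets and acquire the best one. The gain estimator and the measurement oracle are left as caller-supplied callables (in EDDI the gain is approximated with the Partial VAE); all names below are illustrative assumptions.

```python
def greedy_acquisition(observed, candidates, info_gain, acquire, budget):
    """Greedy information-acquisition loop.

    info_gain(observed, i): estimated expected information gain about the target
    variables from additionally observing feature i, given what is observed so far.
    acquire(i): pays the cost and returns the actual measurement for feature i.
    """
    observed = dict(observed)
    for _ in range(budget):
        remaining = [i for i in candidates if i not in observed]
        if not remaining:
            break
        best = max(remaining, key=lambda i: info_gain(observed, i))
        observed[best] = acquire(best)     # query the real value (measurement, test, ...)
    return observed

# toy usage with a fake gain estimator and a fake measurement oracle
obs = greedy_acquisition({}, range(5),
                         info_gain=lambda observed, i: -abs(i - 2),  # pretend feature 2 is best
                         acquire=lambda i: float(i),
                         budget=2)
print(obs)   # {2: 2.0, 1: 1.0}
```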
A Wrapped Normal Distribution on Hyperbolic Space for Gradient-Based Learning
Yoshihiro Nagano · Shoichiro Yamaguchi · Yasuhiro Fujita · Masanori Koyama
Hyperbolic space is a geometry that is known to be well-suited for representation learning of data with an underlying hierarchical structure. In this paper, we present a novel hyperbolic distribution called the pseudo-hyperbolic Gaussian, a Gaussian-like distribution on hyperbolic space whose density can be evaluated analytically and differentiated with respect to its parameters. Our distribution enables gradient-based learning of probabilistic models on hyperbolic space that could not be considered before. We can also sample from this hyperbolic probability distribution without resorting to auxiliary means such as rejection sampling. As applications of our distribution, we develop a hyperbolic analogue of the variational autoencoder and a method for probabilistic word embedding on hyperbolic space. We demonstrate the efficacy of our distribution on various datasets including MNIST, Atari 2600 Breakout, and WordNet.
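A sketch of the sampling path such a construction follows, written in the Lorentz (hyperboloid) model: draw a tangent vector at the hyperboloid origin, parallel-transport it to mu, then apply the exponential map. The formulas are standard Lorentz-model expressions, assumed here as a plausible rendering of the procedure; shapes, names, and the isotropic scale are illustrative, and the density evaluation is omitted.

```python
import numpy as np

def lorentz_inner(x, y):
    # Minkowski inner product <x, y>_L = -x0*y0 + sum_i xi*yi
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def sample_wrapped_normal(mu, sigma, rng):
    """Sample from a wrapped normal on the hyperboloid: Gaussian in the tangent
    space at the origin, parallel transport to mu, exponential map onto the manifold.
    Every step is differentiable and invertible, which keeps the density tractable."""
    n = mu.shape[-1] - 1
    origin = np.zeros_like(mu)
    origin[0] = 1.0
    v = np.concatenate([[0.0], rng.standard_normal(n) * sigma])   # tangent vector at origin
    alpha = -lorentz_inner(origin, mu)
    u = v + lorentz_inner(mu - alpha * origin, v) / (alpha + 1.0) * (origin + mu)  # transport
    norm_u = np.sqrt(np.maximum(lorentz_inner(u, u), 1e-12))
    return np.cosh(norm_u) * mu + np.sinh(norm_u) * u / norm_u    # exponential map at mu

rng = np.random.default_rng(0)
mu = np.array([np.cosh(1.0), np.sinh(1.0), 0.0])    # a point on the 2-D hyperboloid
x = sample_wrapped_normal(mu, 0.5, rng)
assert abs(lorentz_inner(x, x) + 1.0) < 1e-6        # the sample stays on the manifold
```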
Emerging Convolutions for Generative Normalizing Flows
Emiel Hoogeboom · Rianne Van den Berg · Max Welling
Generative flows are attractive because they admit exact likelihood optimization and efficient image synthesis. Recently, Kingma & Dhariwal (2018) demonstrated with Glow that generative flows are capable of generating high-quality images. We generalize the 1 × 1 convolutions proposed in Glow to invertible d × d convolutions, which are more flexible since they operate on both channel and spatial axes. We propose two methods to produce invertible convolutions that have receptive fields identical to standard convolutions: emerging convolutions are obtained by chaining specific autoregressive convolutions, and periodic convolutions are decoupled in the frequency domain. Our experiments show that the flexibility of d × d convolutions significantly improves the performance of generative flow models on galaxy images, CIFAR10, and ImageNet.
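As background, a NumPy sketch of the Glow-style invertible 1 × 1 convolution that this work generalizes; the emerging and periodic d × d constructions themselves are not reproduced here, and the function name and shapes are illustrative.

```python
import numpy as np

def invertible_1x1_conv(x, W):
    """Glow-style invertible 1x1 convolution.

    x: feature map of shape (H, W_spatial, C); W: a C x C mixing matrix.
    The same C x C matrix acts at every spatial position, so the
    log-Jacobian-determinant is H * W_spatial * log|det W| and the inverse is a
    1x1 convolution with W^{-1}. Emerging d x d convolutions keep invertibility
    while also mixing over the spatial axes.
    """
    h, w, c = x.shape
    y = x @ W.T                                        # per-pixel channel mixing
    log_det = h * w * np.linalg.slogdet(W)[1]
    return y, log_det

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
W = np.linalg.qr(rng.standard_normal((4, 4)))[0]       # orthogonal init keeps it invertible
y, log_det = invertible_1x1_conv(x, W)
assert np.allclose(x, y @ np.linalg.inv(W).T)          # exact inversion
```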
A Large-Scale Study on Regularization and Normalization in GANs
Karol Kurach · Mario Lucic · Xiaohua Zhai · Marcin Michalski · Sylvain Gelly
Generative adversarial networks (GANs) are a class of deep generative models which aim to learn a target distribution in an unsupervised fashion. While they have been successfully applied to many problems, training a GAN is a notoriously challenging task that requires a significant amount of hyperparameter tuning, neural architecture engineering, and a non-trivial number of "tricks". The success in many practical applications, coupled with the lack of a measure to quantify the failure modes of GANs, has resulted in a plethora of proposed losses, regularization and normalization schemes, and neural architectures. In this work we take a sober view of the current state of GANs from a practical perspective. We discuss common pitfalls and reproducibility issues, open-source our code on GitHub, and provide pre-trained models on TensorFlow Hub.
Variational Annealing of GANs: A Langevin Perspective
Chenyang Tao · Shuyang Dai · Liqun Chen · Ke Bai · Junya Chen · Chang Liu · Ruiyi (Roy) Zhang · Georgiy Bobashev · Lawrence Carin
The generative adversarial network (GAN) has recently received considerable attention as a model for data synthesis that does not require an explicit specification of a likelihood function. There has been commensurate interest in leveraging likelihood estimates to improve GAN training. To enrich the understanding of this fast-growing yet almost exclusively heuristic-driven subject, we elucidate the theoretical roots of some of the empirical attempts to stabilize and improve GAN training through the introduction of likelihoods. We highlight new insights from the variational theory of diffusion processes to derive a likelihood-based regularization scheme for GAN training, and present a novel approach to train GANs with an unnormalized distribution instead of empirical samples. To substantiate our claims, we provide experimental evidence on how our theoretically-inspired new algorithms improve upon current practice.
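As background to the diffusion-process view mentioned above, here is a minimal sketch of unadjusted Langevin dynamics targeting an unnormalized density; only the score (gradient of the log-density) is needed, so the normalizing constant never enters. This is illustrative background, not the authors' training algorithm, and the names and step sizes are assumptions.

```python
import numpy as np

def langevin_sample(grad_log_p, x0, step=1e-2, n_steps=1000, rng=None):
    """Unadjusted Langevin dynamics:
    x_{t+1} = x_t + (step / 2) * grad log p(x_t) + sqrt(step) * noise.
    Works with an unnormalized density p, since only its score is used."""
    rng = rng or np.random.default_rng()
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + 0.5 * step * grad_log_p(x) + np.sqrt(step) * rng.standard_normal(x.shape)
    return x

# toy target: standard Gaussian, whose score is -x
samples = np.array([langevin_sample(lambda x: -x, np.zeros(1), rng=np.random.default_rng(i))
                    for i in range(200)])
print(samples.mean(), samples.std())   # should be roughly 0 and 1
```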