
Stein’s Method for Machine Learning and Statistics
Francois-Xavier Briol · Lester Mackey · Chris Oates · Qiang Liu · Larry Goldstein

Sat Jun 15 08:30 AM -- 06:00 PM (PDT) @ 104 A
Event URL: https://steinworkshop.github.io/

Stein's method is a technique from probability theory for bounding the distance between probability measures using differential and difference operators. Although the method was initially designed as a technique for proving central limit theorems, it has recently caught the attention of the machine learning (ML) community and has been used for a variety of practical tasks. Recent applications include generative modeling, global non-convex optimisation, variational inference, de novo sampling, constructing powerful control variates for Monte Carlo variance reduction, and measuring the quality of (approximate) Markov chain Monte Carlo algorithms. Stein's method has also been used to develop goodness-of-fit tests and was the foundational tool in one of the NeurIPS 2017 Best Paper awards.

Although Stein's method has already had significant impact in ML, most of the applications only scratch the surface of this rich area of research in probability theory. There would be significant gains to be made by encouraging both communities to interact directly, and this inaugural workshop would be an important step in this direction. More precisely, the aims are: (i) to introduce this emerging topic to the wider ML community, (ii) to highlight the wide range of existing applications in ML, and (iii) to bring together experts in Stein's method and ML researchers to discuss and explore potential further uses of Stein's method.

Sat 8:30 a.m. - 8:45 a.m.
Overview of the day (Talk)
Francois-Xavier Briol
Sat 8:45 a.m. - 9:45 a.m.

Stein's identity states that a mean zero, variance one random variable W has the standard normal distribution if and only if E[Wf(W)] = E[f'(W)] for all f in F, where F is the class of functions for which both sides of the equality exist. Stein (1972) used this identity as the key element in his novel method for proving the Central Limit Theorem. The method generalizes to many distributions beyond the normal, allows one to operate on random variables directly rather than through their transforms, provides non-asymptotic bounds to their limit, and handles a variety of dependence structures. In addition to distributional approximation, Stein's identity has connections to concentration of measure, Malliavin calculus, and statistical procedures including shrinkage and unbiased risk estimation.
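The identity is easy to verify numerically. A minimal Monte Carlo sketch (not part of the talk) with the test function f = tanh, so that f' = 1 - tanh²:

```python
import numpy as np

# Monte Carlo check of Stein's identity E[W f(W)] = E[f'(W)] for W ~ N(0, 1),
# with f = tanh. Both estimates should agree up to Monte Carlo error.
rng = np.random.default_rng(0)
w = rng.standard_normal(200_000)

lhs = np.mean(w * np.tanh(w))        # E[W f(W)]
rhs = np.mean(1.0 - np.tanh(w)**2)   # E[f'(W)]
print(abs(lhs - rhs))                # small, of order 1/sqrt(n)
```

For a non-normal W the two sides generally differ, which is what makes the identity a characterization of the normal distribution.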

Larry Goldstein
Sat 9:45 a.m. - 10:30 a.m.

Training neural networks is a challenging non-convex optimization problem. Stein’s method provides a novel way to transform the optimization problem into a tensor decomposition problem with guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with polynomial sample complexity in the relevant parameters, such as the input dimension and the number of neurons. Our training method is based on tensor decomposition, which provably converges to the global optimum under a set of mild non-degeneracy conditions. This provides insights into the role of the generative process in the tractability of supervised learning.
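The tensor decomposition step can be illustrated with the orthogonal tensor power method. The sketch below (my own minimal illustration, not the speaker's code) recovers one component of a symmetric rank-3 tensor with orthonormal components:

```python
import numpy as np

# Orthogonal tensor power iteration: recover a component of
# T = sum_k w_k * a_k (x) a_k (x) a_k, with orthonormal components a_k.
w = np.array([2.0, 1.5, 1.0])
A = np.eye(3)                                # a_k = standard basis vectors
T = np.einsum('k,ki,kj,kl->ijl', w, A, A, A)

v = np.array([1.0, 0.3, 0.2])                # initial guess, closest to a_1
v /= np.linalg.norm(v)
for _ in range(20):                          # v <- T(I, v, v), normalized
    v = np.einsum('ijl,j,l->i', T, v, v)
    v /= np.linalg.norm(v)

lam = np.einsum('ijl,i,j,l->', T, v, v, v)   # recovered weight
print(np.round(v, 3), round(lam, 3))         # ~ a_1 and ~ w_1 = 2
```

Convergence here is quadratic in the overlap with the dominant component, which is what gives tensor methods their global guarantees in the orthogonal case.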

Anima Anandkumar
Sat 10:30 a.m. - 11:00 a.m.
Sat 11:00 a.m. - 11:45 a.m.

I will describe a nonparametric, kernel-based test to assess the relative goodness of fit of latent variable models with intractable unnormalized densities. The test generalises the kernel Stein discrepancy (KSD) tests of (Liu et al., 2016; Chwialkowski et al., 2016; Yang et al., 2018; Jitkrittum et al., 2018), which require exact access to unnormalized densities. We rely on the simple idea of using an approximate observed-variable marginal in place of the exact, intractable one. As the main theoretical contribution, we show that the new test has a well-controlled type-I error once the threshold has been properly corrected. In the case of models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative maximum mean discrepancy test, which cannot exploit the latent structure.
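For readers new to the KSD, a minimal 1D V-statistic sketch (my illustration, assuming a standard-normal model and an RBF kernel, not the talk's code) shows how the discrepancy separates well-fitting from poorly-fitting samples:

```python
import numpy as np

def ksd2(x, score, sigma=1.0):
    """V-statistic estimate of the squared kernel Stein discrepancy in 1D,
    with RBF kernel k(x, y) = exp(-(x - y)^2 / (2 sigma^2))."""
    d = x[:, None] - x[None, :]
    K = np.exp(-d**2 / (2 * sigma**2))
    dKx = -K * d / sigma**2                      # d/dx k(x, y)
    dKy = K * d / sigma**2                       # d/dy k(x, y)
    dKxy = K * (1 / sigma**2 - d**2 / sigma**4)  # d^2/(dx dy) k(x, y)
    s = score(x)
    # Stein kernel: u(x,y) = s(x)s(y)k + s(x) dk/dy + s(y) dk/dx + d2k/dxdy
    U = s[:, None] * s[None, :] * K + s[:, None] * dKy + s[None, :] * dKx + dKxy
    return U.mean()

score = lambda x: -x                     # score function of N(0, 1)
rng = np.random.default_rng(0)
good = ksd2(rng.standard_normal(300), score)          # samples from the model
bad = ksd2(rng.standard_normal(300) + 2.0, score)     # shifted samples
print(good < bad)                        # shifted samples score a larger KSD
```

Note that only the score (gradient of the log density) is needed, so normalizing constants cancel; that is what makes KSD tests applicable to unnormalized models.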

Arthur Gretton
Sat 11:45 a.m. - 12:30 p.m.
Invited Talk - Andrew Duncan (Talk)
Andrew Duncan
Sat 12:30 p.m. - 1:45 p.m.
Lunch and Poster Session (Poster)
Sat 1:45 p.m. - 2:30 p.m.

Implicit models, which allow for the generation of samples but not for point-wise evaluation of probabilities, are omnipresent in real-world problems tackled by machine learning and a hot topic of current research. Some examples include data simulators that are widely used in engineering and scientific research, generative adversarial networks (GANs) for image synthesis, and hot-off-the-press approximate inference techniques relying on implicit distributions. Gradient-based optimization/sampling methods are often applied to train these models; however, without tractable densities, the objective functions often need to be approximated. In this talk I will motivate gradient estimation as another approximation approach for training implicit models and performing Monte Carlo based approximate inference. Based on this view, I will then present the Stein gradient estimator, which estimates the score function of an implicit model density. I will discuss connections of this approach to score matching, kernel methods, denoising auto-encoders, etc., and show application cases including entropy regularization for GANs and meta-learning for stochastic gradient MCMC algorithms.
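A hedged sketch of a ridge-regularized, kernel-based Stein score estimator in 1D (my own illustration in the spirit of this line of work, not the talk's code), tested on samples where the true score is known:

```python
import numpy as np

def stein_score_estimate(x, sigma=1.0, eta=0.01):
    """Estimate the score d/dx log q(x) at sample points x ~ q, using
    Stein's identity with an RBF kernel and ridge regularization eta."""
    d = x[:, None] - x[None, :]
    K = np.exp(-d**2 / (2 * sigma**2))
    # Stein's identity gives K g ~ -b, with b_i = sum_j d/dx_j k(x_i, x_j)
    b = (K * d / sigma**2).sum(axis=1)
    return -np.linalg.solve(K + eta * np.eye(len(x)), b)

rng = np.random.default_rng(0)
x = rng.standard_normal(300)
g = stein_score_estimate(x)          # true score of N(0, 1) is -x
print(np.corrcoef(g, -x)[0, 1])      # strongly correlated with the truth
```

The key point is that the estimator uses only samples, never density evaluations, which is exactly what implicit models provide.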

Yingzhen Li
Sat 2:30 p.m. - 3:15 p.m.

Particle-based variational inference methods (ParVIs), such as models associated with Stein variational gradient descent, have gained attention in the Bayesian inference literature, for their capacity to yield flexible and accurate approximations. We explore ParVIs from the perspective of Wasserstein gradient flows, and make both theoretical and practical contributions. We unify various finite-particle approximations that existing ParVIs use, and recognize that the approximation is essentially a compulsory smoothing treatment, in either of two equivalent forms. This novel understanding reveals the assumptions and relations of existing ParVIs, and also inspires new ParVIs. We propose an acceleration framework and a principled bandwidth-selection method for general ParVIs; these are based on the developed theory and leverage the geometry of the Wasserstein space. Experimental results show the improved convergence by the acceleration framework and enhanced sample accuracy by the bandwidth-selection method.
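The best-known ParVI, Stein variational gradient descent (SVGD), is short enough to sketch in full. A minimal 1D version (assuming an RBF kernel with the usual median bandwidth heuristic; a sketch, not the paper's code):

```python
import numpy as np

# Minimal 1D SVGD: transport particles toward the target N(2, 1).
rng = np.random.default_rng(0)
x = rng.standard_normal(100) - 5.0           # particles, started far off

def grad_log_p(x):                           # score of the target N(2, 1)
    return -(x - 2.0)

for _ in range(500):
    d = x[:, None] - x[None, :]
    h = np.median(np.abs(d))**2 / np.log(len(x)) + 1e-8  # median heuristic
    K = np.exp(-d**2 / h)                    # RBF kernel matrix
    # SVGD update: attraction (kernel-weighted scores) + repulsion (kernel grads)
    phi = (K @ grad_log_p(x) + (2.0 * d / h * K).sum(axis=1)) / len(x)
    x = x + 0.1 * phi
print(x.mean(), x.std())                     # ~ 2 and ~ 1
```

The repulsion term is the "compulsory smoothing" at issue in the abstract: without it, all particles would collapse onto the mode.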

Sat 3:15 p.m. - 3:45 p.m.
Sat 3:45 p.m. - 4:30 p.m.

After discussing the Ornstein-Uhlenbeck process and how it arises in the context of Stein's method, we turn to an analysis of the stochastic gradient descent method that drives deep learning. We show that certain noise that often arises in the training process induces an Ornstein-Uhlenbeck process on the learned parameters. This process exerts a weak regularization effect on training: once training reaches its stable points, it will provably have found the "simplest possible" hypothesis consistent with the training data. At a higher level, we argue that some of the big mysteries of the success of deep learning may be revealed by analyses of the subtle stochastic processes that govern deep learning training, a natural focus for this community.
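For reference, the Ornstein-Uhlenbeck process dX_t = -θ X_t dt + σ dW_t has stationary distribution N(0, σ²/(2θ)), which a simple Euler-Maruyama simulation (my illustration, unrelated to the talk's experiments) confirms:

```python
import numpy as np

# Euler-Maruyama simulation of the Ornstein-Uhlenbeck process
#   dX_t = -theta * X_t dt + sigma * dW_t,
# whose stationary distribution is N(0, sigma^2 / (2 * theta)).
rng = np.random.default_rng(0)
theta, sigma, dt = 1.0, 0.5, 0.01
x = np.zeros(10_000)                       # many independent chains
for _ in range(2_000):                     # run well past the mixing time 1/theta
    x += -theta * x * dt + sigma * np.sqrt(dt) * rng.standard_normal(len(x))
print(x.var(), sigma**2 / (2 * theta))     # both ~ 0.125
```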

Sat 4:30 p.m. - 5:15 p.m.

In this talk, we present a general Kolmogorov bound for normal approximation, proved using Stein’s method. We will apply this bound to random measures using Palm theory and coupling, with applications to stochastic geometry. We will also apply this bound to Stein couplings, which include local dependence, exchangeable pairs, size-bias couplings and more as special cases. This talk is based on joint work with Adrian Roellin and Aihua Xia.
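The kind of bound discussed here can be sanity-checked numerically. A sketch (my illustration, not from the talk) computing the Kolmogorov distance between a Binomial(n, p) variable, i.e. a sum of i.i.d. Bernoullis, and its matching normal, where Berry-Esseen-type bounds predict decay of order 1/sqrt(n):

```python
import math

def kolmogorov_dist(n, p):
    """Kolmogorov distance between Binomial(n, p) and N(np, np(1-p))."""
    mu, sd = n * p, math.sqrt(n * p * (1 - p))
    pmf, cdf, dist = (1 - p) ** n, 0.0, 0.0
    for k in range(n + 1):
        phi = 0.5 * (1 + math.erf((k - mu) / (sd * math.sqrt(2))))
        dist = max(dist, abs(cdf - phi))    # left limit of the CDF at k
        cdf += pmf
        dist = max(dist, abs(cdf - phi))    # value of the CDF at k
        pmf *= (n - k) / (k + 1) * p / (1 - p)   # binomial pmf recursion
    return dist

print(kolmogorov_dist(100, 0.3), kolmogorov_dist(400, 0.3))  # roughly halves
```

Quadrupling n roughly halves the distance, consistent with the 1/sqrt(n) rate that Stein's method makes non-asymptotic and explicit.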

Louis Chen
Sat 5:15 p.m. - 6:00 p.m.
Panel Discussion - All speakers (Panel)

Author Information

Francois-Xavier Briol (University of Cambridge)
Lester Mackey (Microsoft Research)
Chris Oates (Newcastle University)
Qiang Liu (UT Austin)
Larry Goldstein (University of Southern California)
