Stein's method is a technique from probability theory for bounding the distance between probability measures using differential and difference operators. Although the method was initially designed as a technique for proving central limit theorems, it has recently caught the attention of the machine learning (ML) community and has been used for a variety of practical tasks. Recent applications include generative modeling, global non-convex optimisation, variational inference, de novo sampling, constructing powerful control variates for Monte Carlo variance reduction, and measuring the quality of (approximate) Markov chain Monte Carlo algorithms. Stein's method has also been used to develop goodness-of-fit tests and was the foundational tool in one of the NeurIPS 2017 Best Paper awards.
Although Stein's method has already had significant impact in ML, most of the applications only scratch the surface of this rich area of research in probability theory. There would be significant gains to be made by encouraging both communities to interact directly, and this inaugural workshop would be an important step in this direction. More precisely, the aims are: (i) to introduce this emerging topic to the wider ML community, (ii) to highlight the wide range of existing applications in ML, and (iii) to bring together experts in Stein's method and ML researchers to discuss and explore potential further uses of Stein's method.
Sat 8:30 a.m. - 8:45 a.m. | Overview of the day (Talk) | Francois-Xavier Briol
Sat 8:45 a.m. - 9:45 a.m. | Tutorial - Larry Goldstein: The Many Faces of a Simple Identity (Tutorial) | Larry Goldstein

Stein's identity states that a mean zero, variance one random variable W has the standard normal distribution if and only if E[Wf(W)] = E[f'(W)] for all f in F, where F is the class of functions for which both sides of the equality exist. Stein (1972) used this identity as the key element in his novel method for proving the Central Limit Theorem. The method generalizes to many distributions beyond the normal, allows one to operate on random variables directly rather than through their transforms, provides non-asymptotic bounds on the distance to the limit, and handles a variety of dependence structures. In addition to distributional approximation, Stein's identity has connections to concentration of measure, Malliavin calculus, and statistical procedures including shrinkage and unbiased risk estimation.
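As a quick illustration of the identity in the tutorial abstract, the following minimal NumPy sketch checks E[Wf(W)] = E[f'(W)] by Monte Carlo for W ~ N(0, 1); the test function f(w) = sin(w) is an arbitrary choice, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1_000_000)   # W ~ N(0, 1)

f = np.sin                           # any smooth, bounded test function
f_prime = np.cos

lhs = np.mean(w * f(w))              # Monte Carlo estimate of E[W f(W)]
rhs = np.mean(f_prime(w))            # Monte Carlo estimate of E[f'(W)]
print(lhs, rhs)                      # both approach exp(-1/2) ~ 0.6065
```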
Sat 9:45 a.m. - 10:30 a.m. | Invited Talk - Anima Anandkumar: Stein's method for understanding optimization in neural networks (Talk) | Anima Anandkumar

Training neural networks is a challenging non-convex optimization problem. Stein's method provides a novel way to transform the optimization problem into a tensor decomposition problem, enabling guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. Our training method is based on tensor decomposition, which provably converges to the global optimum under a set of mild non-degeneracy conditions. This provides insights into the role of the generative process in the tractability of supervised learning.
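For context on how a Stein-type identity can connect network training to tensor decomposition, the display below sketches the higher-order score-function identity used in this line of work (in the spirit of Janzamin, Sedghi and Anandkumar's score-function moment tensors); the notation and the stated regularity assumptions are mine, not taken from the talk.

```latex
% Let q(x) be the input density and S_m(x) its m-th order score function
% (S_1(x) = -\nabla_x \log q(x)). For a label y with E[y | x] = f(x), a
% higher-order analogue of Stein's identity gives, under suitable smoothness
% and decay conditions,
\[
  \mathbb{E}\big[\, y \otimes S_3(x) \,\big]
  \;=\; \mathbb{E}\big[\, \nabla_x^{(3)} f(x) \,\big].
\]
% For a two-layer network f(x) = \sum_k a_k\, \sigma(w_k^\top x), the right-hand
% side equals \sum_k a_k\, \mathbb{E}[\sigma'''(w_k^\top x)]\, w_k \otimes w_k \otimes w_k,
% a low-rank tensor whose components recover the hidden-layer weights w_k.
```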
Sat 10:30 a.m. - 11:00 a.m. | Break
Sat 11:00 a.m. - 11:45 a.m. | Invited Talk - Arthur Gretton: Relative goodness-of-fit tests for models with latent variables (Talk) | Arthur Gretton

I will describe a nonparametric, kernel-based test to assess the relative goodness of fit of latent variable models with intractable unnormalized densities. The test generalises the kernel Stein discrepancy (KSD) tests of Liu et al. (2016), Chwialkowski et al. (2016), Yang et al. (2018) and Jitkrittum et al. (2018), which require exact access to the unnormalized densities. We rely on the simple idea of using an approximate observed-variable marginal in place of the exact, intractable one. As the main theoretical contribution, we show that the new test has a well-controlled type-I error once the threshold is properly corrected. In the case of models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative maximum mean discrepancy test, which cannot exploit the latent structure.
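To make the KSD referenced above concrete, here is a minimal NumPy sketch of a (V-statistic) kernel Stein discrepancy estimate for a one-dimensional standard normal model with an RBF kernel; the bandwidth and sample sizes are illustrative choices, and the relative and latent-variable extensions from the talk are not shown.

```python
import numpy as np

def ksd_vstat(x, score, h=1.0):
    """V-statistic estimate of the squared kernel Stein discrepancy of samples x
    (1-D array) against a model with score function score(x) = d/dx log p(x),
    using an RBF kernel of bandwidth h."""
    s = score(x)                                   # model score at each sample
    d = x[:, None] - x[None, :]                    # pairwise differences x_i - x_j
    K = np.exp(-d**2 / (2 * h**2))                 # kernel matrix k(x_i, x_j)
    dKx = -d / h**2 * K                            # d/dx_i k(x_i, x_j)
    dKy = d / h**2 * K                             # d/dx_j k(x_i, x_j)
    ddK = (1.0 / h**2 - d**2 / h**4) * K           # d^2/(dx_i dx_j) k(x_i, x_j)
    # Stein kernel u_p(x_i, x_j) assembled term by term
    U = np.outer(s, s) * K + s[:, None] * dKy + s[None, :] * dKx + ddK
    return U.mean()

rng = np.random.default_rng(0)
score = lambda x: -x                               # score of the model p = N(0, 1)
print(ksd_vstat(rng.normal(0.0, 1.0, 500), score)) # small: samples match the model
print(ksd_vstat(rng.normal(1.0, 1.0, 500), score)) # larger: samples do not
```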
Sat 11:45 a.m. - 12:30 p.m. | Invited Talk - Andrew Duncan (Talk) | Andrew Duncan
Sat 12:30 p.m. - 1:45 p.m. | Lunch and Poster Session (Poster)
Sat 1:45 p.m. - 2:30 p.m. | Invited Talk - Yingzhen Li: Gradient estimation for implicit models with Stein's method (Talk) | Yingzhen Li

Implicit models, which allow for the generation of samples but not for point-wise evaluation of probabilities, are omnipresent in real-world problems tackled by machine learning and are a hot topic of current research. Some examples include data simulators that are widely used in engineering and scientific research, generative adversarial networks (GANs) for image synthesis, and hot-off-the-press approximate inference techniques relying on implicit distributions. Gradient-based optimization/sampling methods are often applied to train these models; however, without tractable densities, the objective functions often need to be approximated. In this talk I will motivate gradient estimation as another approximation approach for training implicit models and performing Monte Carlo-based approximate inference. Based on this view, I will then present the Stein gradient estimator, which estimates the score function of an implicit model density. I will discuss connections of this approach to score matching, kernel methods, denoising auto-encoders, etc., and show application cases including entropy regularization for GANs and meta-learning for stochastic gradient MCMC algorithms.
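As a rough illustration of the kind of kernel-based score estimation discussed in this talk, the sketch below implements a ridge-regularised estimator obtained by inverting a Monte Carlo version of Stein's identity, in the spirit of the Stein gradient estimator; the bandwidth h and ridge parameter eta are arbitrary illustrative values, and the implicit model is stood in for by ordinary Gaussian samples.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200)          # samples from an "implicit" model (here simply N(0, 1))
h, eta = 1.0, 0.01                    # RBF bandwidth and ridge regulariser (illustrative)

d = x[:, None] - x[None, :]           # d[i, j] = x_i - x_j
K = np.exp(-d**2 / (2 * h**2))        # kernel matrix K[i, j] = k(x_i, x_j)

# Monte Carlo Stein's identity: K @ g + grad_K ~ 0, where g_i approximates
# d/dx log q(x_i) and grad_K[j] = sum_i d/dx_i k(x_i, x_j).
grad_K = (-d / h**2 * K).sum(axis=0)
g_hat = -np.linalg.solve(K + eta * np.eye(len(x)), grad_K)

# For N(0, 1) the true score is -x; the estimate should roughly track it.
print(np.corrcoef(g_hat, -x)[0, 1])
```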
Sat 2:30 p.m. - 3:15 p.m. | Invited Talk - Ruiyi Zhang: On Wasserstein Gradient Flows and Particle-Based Variational Inference (Talk) | Ruiyi Zhang

Particle-based variational inference methods (ParVIs), such as models associated with Stein variational gradient descent, have gained attention in the Bayesian inference literature for their capacity to yield flexible and accurate approximations. We explore ParVIs from the perspective of Wasserstein gradient flows, and make both theoretical and practical contributions. We unify various finite-particle approximations that existing ParVIs use, and recognize that the approximation is essentially a compulsory smoothing treatment, in either of two equivalent forms. This novel understanding reveals the assumptions and relations of existing ParVIs, and also inspires new ParVIs. We propose an acceleration framework and a principled bandwidth-selection method for general ParVIs; these are based on the developed theory and leverage the geometry of the Wasserstein space. Experimental results show the improved convergence achieved by the acceleration framework and the enhanced sample accuracy achieved by the bandwidth-selection method.
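For readers unfamiliar with the ParVI family mentioned above, here is a minimal NumPy sketch of the standard Stein variational gradient descent (SVGD) update with an RBF kernel, targeting a one-dimensional standard normal; the step size, bandwidth, particle count and iteration budget are arbitrary illustrative choices, and none of the acceleration or bandwidth-selection machinery from the talk is included.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, 100)                 # particles, initialised far from the target
score = lambda x: -x                          # d/dx log p(x) for the target p = N(0, 1)
h, eps = 1.0, 0.1                             # kernel bandwidth and step size (illustrative)

for _ in range(500):
    d = x[:, None] - x[None, :]               # pairwise differences x_j - x_i
    K = np.exp(-d**2 / (2 * h**2))            # RBF kernel matrix k(x_j, x_i)
    grad_K = -d / h**2 * K                    # d/dx_j k(x_j, x_i)
    # SVGD direction: phi(x_i) = mean_j [ k(x_j, x_i) score(x_j) + d/dx_j k(x_j, x_i) ]
    phi = (K * score(x)[:, None] + grad_K).mean(axis=0)
    x = x + eps * phi

print(x.mean(), x.std())                      # should approach 0 and 1
```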
Sat 3:15 p.m. - 3:45 p.m. | Break
Sat 3:45 p.m. - 4:30 p.m. | Invited Talk - Paul Valiant: How the Ornstein-Uhlenbeck process drives generalization for deep learning (Talk) | Paul Valiant

After discussing the Ornstein-Uhlenbeck process and how it arises in the context of Stein's method, we turn to an analysis of the stochastic gradient descent method that drives deep learning. We show that certain noise that often arises in the training process induces an Ornstein-Uhlenbeck process on the learned parameters. This process is responsible for a weak regularization effect on the training that, once it reaches stable points, will provably have found the "simplest possible" hypothesis consistent with the training data. At a higher level, we argue how some of the big mysteries of the success of deep learning may be revealed by analyses of the subtle stochastic processes which govern deep learning training, a natural focus for this community.
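For readers who have not met the Ornstein-Uhlenbeck process before, the following short NumPy sketch simulates it with an Euler-Maruyama discretisation; the drift, diffusion and step-size values are illustrative only and are not taken from the talk.

```python
import numpy as np

# Ornstein-Uhlenbeck SDE: dX_t = -theta * X_t dt + sigma * dW_t.
# Its stationary distribution is N(0, sigma^2 / (2 * theta)).
theta, sigma, dt = 1.0, 0.5, 1e-2           # illustrative parameters
rng = np.random.default_rng(0)

x = np.full(10_000, 5.0)                    # many independent paths, started away from 0
for _ in range(5_000):
    x += -theta * x * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)

print(x.mean(), x.var())                    # should approach 0 and sigma^2/(2*theta) = 0.125
```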
Sat 4:30 p.m. - 5:15 p.m. | Invited Talk - Louis Chen: Palm theory, random measures and Stein couplings (Talk) | Louis Chen

In this talk, we present a general Kolmogorov bound for normal approximation, proved using Stein's method. We will apply this bound to random measures using Palm theory and coupling, with applications to stochastic geometry. We will also apply this bound to Stein couplings, which include local dependence, exchangeable pairs, size-bias couplings and more as special cases. This talk is based on joint work with Adrian Roellin and Aihua Xia.
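For reference, the two objects named in this abstract can be stated compactly; the display below follows the standard formulation of Chen and Roellin's Stein couplings rather than the exact statement from the talk.

```latex
% Kolmogorov distance between a random variable W and a standard normal Z:
\[
  d_K(W, Z) \;=\; \sup_{z \in \mathbb{R}} \big| \mathbb{P}(W \le z) - \Phi(z) \big|,
\]
% where \Phi is the standard normal distribution function.
%
% A triple of random variables (W, W', G) is called a Stein coupling if
\[
  \mathbb{E}\big[ G f(W') - G f(W) \big] \;=\; \mathbb{E}\big[ W f(W) \big]
\]
% for all functions f for which the expectations exist; exchangeable pairs and
% size-bias couplings arise as special cases for particular choices of W' and G.
```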
Sat 5:15 p.m. - 6:00 p.m. | Panel Discussion - All speakers (Panel)
Author Information
Francois-Xavier Briol (University of Cambridge)
Lester Mackey (Microsoft Research)
Chris Oates (Newcastle University)
Qiang Liu (UT Austin)
Larry Goldstein (University of Southern California)