Timezone: »

Negative Dependence: Theory and Applications in Machine Learning
Mike Gartrell · Jennifer Gillenwater · Alex Kulesza · Zelda Mariet

Fri Jun 14 08:30 AM -- 06:00 PM (PDT) @ 204
Event URL: https://negative-dependence-in-ml-workshop.lids.mit.edu »

Models of negative dependence are increasingly important in machine learning. Whether selecting training data, finding an optimal experimental design, exploring in reinforcement learning, or making suggestions with recommender systems, selecting high-quality but diverse items has become a core challenge. This workshop aims to bring together researchers who, using theoretical or applied techniques, leverage negative dependence in their work. We will delve into the rich underlying mathematical theory, understand key applications, and discuss the most promising directions for future research.

Fri 8:45 a.m. - 8:50 a.m.
Opening Remarks (-) [ Video
Fri 8:50 a.m. - 9:30 a.m.
[ Video

Discrete Determinantal Point Processes (DPPs) form a class of probability distributions that can describe the random selection of items from a finite or countable collection. They naturally arise in many problems in probability theory, and they have gained a lot of attention in machine learning, due to both their modeling flexibility and their tractability. In the finite case, a DPP is parametrized by a matrix, whose principal minors are the weights given by the DPP to each possible subset of items. When the matrix is symmetric, the DPP has a very special property, called Negative Association. Thanks to this property, symmetric DPPs enforce diversity within the randomly selected items, which is a feature that is sought for in many applications of Machine Learning, such as recommendation systems.

Victor-Emmanuel Brunel
Fri 9:30 a.m. - 10:10 a.m.


Aarti Singh
Fri 10:10 a.m. - 10:30 a.m.
[ Video

This paper focuses on Monte Carlo integration with determinantal point processes (DPPs) which enforce negative dependence between quadrature nodes. We survey the properties of two unbiased Monte Carlo estimators of the integral of inter- est: a direct one proposed by Bardenet & Hardy (2016) and a less obvious 60-year-old estimator by Ermakov & Zolotukhin (1960) that actually also relies on DPPs. We provide an efficient implementation to sample exactly a particular multidimensional DPP called multivariate Jacobi ensemble. This let us investigate the behavior of both estimators on toy problems in yet unexplored regimes.

Guillaume Gautier
Fri 10:30 a.m. - 11:00 a.m.
[Coffee Break] (Break)
Fri 11:00 a.m. - 11:40 a.m.
[ Video

Submodularity is an attractive framework in machine learning to model concepts such as diversity, dispersion, and cooperative costs, and is having an ever increasing impact on the field of machine learning. Deep learning is having a bit of success as well. In this talk, we will discuss synergies, where submodular functions and deep neural networks can be used together to their mutual benefit. First, we'll discuss deep submodular functions (DSFs), an expressive class of functions that include many widely used submodular functions and that are defined analogously to deep neural networks (DNN). We'll show that the class of DSFs strictly increases with depth and discuss applications. Second, we'll see how a modification to DNN autoencoders can produce features that can be used in DSFs. These DSF/DNN hybrids address an open problem which is how best to produce a submodular function for your application. Third, we'll see how submodular functions can speed up the training of models. In one case, submodularity can be used to produce a sequence of mini-batches that speeds up training of DNN systems. In another case, submodularity can produce a training data subset for which we can show faster convergence to the optimal solution in the convex case. Empirically, this method speeds up gradient methods by up to 10x for convex and 3x for non-convex (i.e., deep) functions.

The above discusses various projects that were performed jointly with Wenruo Bai, Shengjie Wang, Chandrashekhar Lavania, Baharan Mirzasoleiman, and Jure Leskovec.

Jeff Bilmes
Fri 11:40 a.m. - 12:00 p.m.
[ Video

Mini-batch gradient descent based methods are the de facto algorithms for training neural network architectures today. We introduce a mini-batch selection strategy based on submodular function maximization. Our novel submodular formulation captures the informativeness of each sample and diversity of the whole subset. We design an efficient, greedy algorithm which can give high-quality solutions to this NP-hard combinatorial optimization problem. Our extensive experiments on standard datasets show that the deep models trained using the proposed batch selection strategy provide better generalization than Stochastic Gradient Descent as well as a popular baseline sampling strategy across different learning rates, batch sizes, and distance metrics.

Vineeth N Balasubramanian
Fri 12:00 p.m. - 2:00 p.m.
[Lunch Break] (Break)
Fri 2:00 p.m. - 2:40 p.m.
[ Video

As we advance with resources, we move from reasoning on entities to reasoning on pairs and groups. We have beautiful frameworks: graphs, kernels, DPPs. However, the methods that work with pairs, relationships, and similarity are just slow. Kernel regression, or second-order gradient methods, or sampling from DPPs do not scale to large data, because of the costly construction and storing of matrix Kn. Prior work showed that sampling points according to their ridge leverage scores (RLS) generates small dictionaries with strong spectral approximation guarantees for Kn. However, computing exact RLS requires building and storing the whole kernel matrix. In this talk, we start with SQUEAK, a new online approach for kernel approximations based on RLS sampling that sequentially processes the data, storing a dictionary with a number of points that only depends on the effective dimension deff(gamma) of the dataset. The beauty of negative dependence, that we estimate on the fly, makes it possible to exclude huge portions of dictionary. With the small dictionary, SQUEAK never constructs the whole matrix Kn, runs in linear time O(nd_eff(gamma)^3) w.r.t. n, and requires only a single pass over the dataset. A distributed version of SQUEAK runs in as little as O(log(n)d_eff(gamma)^3) time. This online tool opens out a range of possibilities to finally have scalable, adaptive, and provably accurate kernel methods: semi-supervised learning or Laplacian smoothing on large graphs, scalable GP-UCB, efficient second-order kernel online learning, and even fast DPP sampling, some of these being featured in this workshop.

Michal Valko
Fri 2:40 p.m. - 3:00 p.m.
[ Video

We study the complexity of sampling from a distribution over all index subsets of the set {1,...,n} with the probability of a subset S proportional to the determinant of the submatrix LS of some n x n p.s.d. matrix L, where LS corresponds to the entries of L indexed by S. Known as a determinantal point process, this distribution is widely used in machine learning to induce diversity in subset selection. In practice, we often wish to sample multiple subsets S with small expected size k = E[|S|] << n from a very large matrix L, so it is important to minimize the preprocessing cost of the procedure (performed once) as well as the sampling cost (performed repeatedly). To that end, we propose a new algorithm which, given access to L, samples exactly from a determinantal point process while satisfying the following two properties: (1) its preprocessing cost is n x poly(k) (sublinear in the size of L) and (2) its sampling cost is poly(k) (independent of the size of L). Prior to this work, state-of-the-art exact samplers required O(n^3) preprocessing time and sampling time linear in n or dependent on the spectral properties of L.

Michal Derezinski
Fri 3:00 p.m. - 3:30 p.m.
[Coffee Break] (Break)
Fri 3:30 p.m. - 4:10 p.m.
[ Video

Conventionally, reinforcement learning is considered to be a framework for optimization: the aim for standard reinforcement learning algorithms is to recover an optimal or near-optimal policy that maximizes the reward over time. However, when considering more advanced reinforcement learning problems, from inverse reinforcement learning to unsupervised and hierarchical reinforcement learning, we often encounter settings where it is desirable to learn policies that match target distributions over trajectories or states, covering all modes, or else to simply learn collections of behaviors that are as broad and varied as possible. Information theory and probabilistic inference offer is a powerful set of tools for developing algorithms for these kinds of distribution matching problems. In this talk, I will outline methods that combine reinforcement learning, inference, and information theory to learn policies that match target distributions and acquire diverse behaviors, and discuss the applications of such methods for a variety of problems in artificial intelligence and robotics.

Sergey Levine
Fri 4:10 p.m. - 4:30 p.m.
[ Video

Ranking is a central task in machine learning and information retrieval. In this task, it is especially important to present the user with a slate of items that is appealing as a whole. This in turn requires taking into account interactions between items, since intuitively, placing an item on the slate affects the decision of which other items should be placed alongside it. In this work, we propose a sequence-to-sequence model for ranking called seq2slate. At each step, the model predicts the next "best" item to place on the slate given the items already selected. The sequential nature of the model allows complex dependencies between the items to be captured directly in a flexible and scalable way. We show how to learn the model end-to-end from weak supervision in the form of easily obtained click-through data. We further demonstrate the usefulness of our approach in experiments on standard ranking benchmarks as well as in a real-world recommendation system.

Ofer Meshi
Fri 4:30 p.m. - 4:40 p.m.
[Mini Break] (Break)
Fri 4:40 p.m. - 5:20 p.m.
[ Video

We explore active mini-batch selection using repulsive point processes for stochastic gradient descent (SGD). Our approach simultaneously introduces active bias and leads to stochastic gradients with lower variance. We show theoretically and empirically that our approach improves over standard SGD both in terms of convergence speed as well as final model performance.

Cheng Zhang
Fri 5:20 p.m. - 5:40 p.m.
[ Video
Gaussian processes (GP) are a popular Bayesian approach for the optimization of black-box functions. Despite their effectiveness in simple problems, GP-based algorithms hardly scale to complex high-dimensional functions, as their per-iteration time and space cost is at least quadratic in the number of dimensions $d$ and iterations~$t$. Given a set of $A$ alternative to choose from, the overall runtime $O(t^3A)$ quickly becomes prohibitive. In this paper, we introduce BKB (budgeted kernelized bandit), an approximate GP algorithm for optimization under bandit feedback that achieves near-optimal regret (and hence near-optimal convergence rate) with near-constant per-iteration complexity and no assumption on the input space or the GP's covariance. Combining a kernelized linear bandit algorithm (GP-UCB) with randomized matrix sketching technique (i.e., leverage score sampling), we prove that selecting inducing points based on their posterior variance gives an accurate low-rank approximation of the GP, preserving variance estimates and confidence intervals. As a consequence, BKB does not suffer from variance starvation, an important problem faced by many previous sparse GP approximations. Moreover, we show that our procedure selects at most $\widetilde{O}(d_{eff})$ points, where $d_{eff}$ is the \emph{effective} dimension of the explored space, which is typically much smaller than both $d$ and $t$. This greatly reduces the dimensionality of the problem, thus leading to a $\widetilde{O}(TAd_{eff}^2)$ runtime and $\widetilde{O}(A d_{eff})$ space complexity.
Michal Valko
Fri 5:40 p.m. - 6:00 p.m.
We introduce a novel use of herding to address the problem of selecting samples from a large unlabeled dataset to efficiently evaluate the risk of a given model. Herding is an algorithm which elaborately draws samples to approximate the underlying distribution. We use herding to select the most informative samples and show that the loss evaluated on $k$ samples produced by herding converges to the expected loss at a rate $\mathcal{O}(1/k)$, which is much faster than $\mathcal{O}(1/\sqrt{k})$ for iid random sampling. We validate our analysis on both synthetic data and real data, and further explore the empirical performance of herding-based sampling in different cases of high-dimensional data.
Zelai Xu
Fri 6:00 p.m. - 6:05 p.m.
Closing Remarks (-)

Author Information

Mike Gartrell (Criteo AI Lab)
Jennifer Gillenwater (Google Research NYC)
Alex Kulesza (Google)
Zelda Mariet (Massachusetts Institute of Technology)

More from the Same Authors