Timezone: »
Models of negative dependence are increasingly important in machine learning. Whether selecting training data, finding an optimal experimental design, exploring in reinforcement learning, or making suggestions with recommender systems, selecting highquality but diverse items has become a core challenge. This workshop aims to bring together researchers who, using theoretical or applied techniques, leverage negative dependence in their work. We will delve into the rich underlying mathematical theory, understand key applications, and discuss the most promising directions for future research.
Fri 8:45 a.m.  8:50 a.m.
[iCal]

Opening Remarks
()
Video »


Fri 8:50 a.m.  9:30 a.m.
[iCal]

VictorEmmanuel Brunel: Negative Association and Discrete Determinantal Point Processes
(Invited talk)
»
Video »
Discrete Determinantal Point Processes (DPPs) form a class of probability distributions that can describe the random selection of items from a finite or countable collection. They naturally arise in many problems in probability theory, and they have gained a lot of attention in machine learning, due to both their modeling flexibility and their tractability. In the finite case, a DPP is parametrized by a matrix, whose principal minors are the weights given by the DPP to each possible subset of items. When the matrix is symmetric, the DPP has a very special property, called Negative Association. Thanks to this property, symmetric DPPs enforce diversity within the randomly selected items, which is a feature that is sought for in many applications of Machine Learning, such as recommendation systems. 
VictorEmmanuel Brunel 
Fri 9:30 a.m.  10:10 a.m.
[iCal]

Aarti Singh: Experimental Design
(Invited talk)
»
TBD 
Aarti Singh 
Fri 10:10 a.m.  10:30 a.m.
[iCal]

On Two Ways to use Determinantal Point Processes for Monte Carlo Integration
(Contributed talk)
»
Video »
This paper focuses on Monte Carlo integration with determinantal point processes (DPPs) which enforce negative dependence between quadrature nodes. We survey the properties of two unbiased Monte Carlo estimators of the integral of inter est: a direct one proposed by Bardenet & Hardy (2016) and a less obvious 60yearold estimator by Ermakov & Zolotukhin (1960) that actually also relies on DPPs. We provide an efficient implementation to sample exactly a particular multidimensional DPP called multivariate Jacobi ensemble. This let us investigate the behavior of both estimators on toy problems in yet unexplored regimes. 
Guillaume Gautier 
Fri 10:30 a.m.  11:00 a.m.
[iCal]

[Coffee Break]
(Break)


Fri 11:00 a.m.  11:40 a.m.
[iCal]

Jeff Bilmes: Deep Submodular Synergies
(Invited Talk)
»
Video »
Submodularity is an attractive framework in machine learning to model concepts such as diversity, dispersion, and cooperative costs, and is having an ever increasing impact on the field of machine learning. Deep learning is having a bit of success as well. In this talk, we will discuss synergies, where submodular functions and deep neural networks can be used together to their mutual benefit. First, we'll discuss deep submodular functions (DSFs), an expressive class of functions that include many widely used submodular functions and that are defined analogously to deep neural networks (DNN). We'll show that the class of DSFs strictly increases with depth and discuss applications. Second, we'll see how a modification to DNN autoencoders can produce features that can be used in DSFs. These DSF/DNN hybrids address an open problem which is how best to produce a submodular function for your application. Third, we'll see how submodular functions can speed up the training of models. In one case, submodularity can be used to produce a sequence of minibatches that speeds up training of DNN systems. In another case, submodularity can produce a training data subset for which we can show faster convergence to the optimal solution in the convex case. Empirically, this method speeds up gradient methods by up to 10x for convex and 3x for nonconvex (i.e., deep) functions. The above discusses various projects that were performed jointly with Wenruo Bai, Shengjie Wang, Chandrashekhar Lavania, Baharan Mirzasoleiman, and Jure Leskovec. 
Jeff Bilmes 
Fri 11:40 a.m.  12:00 p.m.
[iCal]

Submodular Batch Selection for Training Deep Neural Networks
(Contributed Talk)
»
Video »
Minibatch gradient descent based methods are the de facto algorithms for training neural network architectures today. We introduce a minibatch selection strategy based on submodular function maximization. Our novel submodular formulation captures the informativeness of each sample and diversity of the whole subset. We design an efficient, greedy algorithm which can give highquality solutions to this NPhard combinatorial optimization problem. Our extensive experiments on standard datasets show that the deep models trained using the proposed batch selection strategy provide better generalization than Stochastic Gradient Descent as well as a popular baseline sampling strategy across different learning rates, batch sizes, and distance metrics. 
Vineeth N Balasubramanian 
Fri 12:00 p.m.  2:00 p.m.
[iCal]

[Lunch Break]
(Break)


Fri 2:00 p.m.  2:40 p.m.
[iCal]

Michal Valko: How Negative Dependence Broke the Quadratic Barrier for Learning with Graphs and Kernels
(Invited Talk)
»
Video »
As we advance with resources, we move from reasoning on entities to reasoning on pairs and groups. We have beautiful frameworks: graphs, kernels, DPPs. However, the methods that work with pairs, relationships, and similarity are just slow. Kernel regression, or secondorder gradient methods, or sampling from DPPs do not scale to large data, because of the costly construction and storing of matrix Kn. Prior work showed that sampling points according to their ridge leverage scores (RLS) generates small dictionaries with strong spectral approximation guarantees for Kn. However, computing exact RLS requires building and storing the whole kernel matrix. In this talk, we start with SQUEAK, a new online approach for kernel approximations based on RLS sampling that sequentially processes the data, storing a dictionary with a number of points that only depends on the effective dimension deff(gamma) of the dataset. The beauty of negative dependence, that we estimate on the fly, makes it possible to exclude huge portions of dictionary. With the small dictionary, SQUEAK never constructs the whole matrix Kn, runs in linear time O(nd_eff(gamma)^3) w.r.t. n, and requires only a single pass over the dataset. A distributed version of SQUEAK runs in as little as O(log(n)d_eff(gamma)^3) time. This online tool opens out a range of possibilities to finally have scalable, adaptive, and provably accurate kernel methods: semisupervised learning or Laplacian smoothing on large graphs, scalable GPUCB, efficient secondorder kernel online learning, and even fast DPP sampling, some of these being featured in this workshop. 
Michal Valko 
Fri 2:40 p.m.  3:00 p.m.
[iCal]

Exact Sampling of Determinantal Point Processes with Sublinear Time Preprocessing
(Contributed Talk)
»
Video »
We study the complexity of sampling from a distribution over all index subsets of the set {1,...,n} with the probability of a subset S proportional to the determinant of the submatrix LS of some n x n p.s.d. matrix L, where LS corresponds to the entries of L indexed by S. Known as a determinantal point process, this distribution is widely used in machine learning to induce diversity in subset selection. In practice, we often wish to sample multiple subsets S with small expected size k = E[S] << n from a very large matrix L, so it is important to minimize the preprocessing cost of the procedure (performed once) as well as the sampling cost (performed repeatedly). To that end, we propose a new algorithm which, given access to L, samples exactly from a determinantal point process while satisfying the following two properties: (1) its preprocessing cost is n x poly(k) (sublinear in the size of L) and (2) its sampling cost is poly(k) (independent of the size of L). Prior to this work, stateoftheart exact samplers required O(n^3) preprocessing time and sampling time linear in n or dependent on the spectral properties of L. 
Michal Derezinski 
Fri 3:00 p.m.  3:30 p.m.
[iCal]

[Coffee Break]
(Break)


Fri 3:30 p.m.  4:10 p.m.
[iCal]

Sergei Levine: Distribution Matching and Mutual Information in Reinforcement Learning
(Invited Talk)
»
Video »
Conventionally, reinforcement learning is considered to be a framework for optimization: the aim for standard reinforcement learning algorithms is to recover an optimal or nearoptimal policy that maximizes the reward over time. However, when considering more advanced reinforcement learning problems, from inverse reinforcement learning to unsupervised and hierarchical reinforcement learning, we often encounter settings where it is desirable to learn policies that match target distributions over trajectories or states, covering all modes, or else to simply learn collections of behaviors that are as broad and varied as possible. Information theory and probabilistic inference offer is a powerful set of tools for developing algorithms for these kinds of distribution matching problems. In this talk, I will outline methods that combine reinforcement learning, inference, and information theory to learn policies that match target distributions and acquire diverse behaviors, and discuss the applications of such methods for a variety of problems in artificial intelligence and robotics. 
Sergey Levine 
Fri 4:10 p.m.  4:30 p.m.
[iCal]

Seq2Slate: Reranking and Slate Optimization with RNNs
(Contributed Talk)
»
Video »
Ranking is a central task in machine learning and information retrieval. In this task, it is especially important to present the user with a slate of items that is appealing as a whole. This in turn requires taking into account interactions between items, since intuitively, placing an item on the slate affects the decision of which other items should be placed alongside it. In this work, we propose a sequencetosequence model for ranking called seq2slate. At each step, the model predicts the next "best" item to place on the slate given the items already selected. The sequential nature of the model allows complex dependencies between the items to be captured directly in a flexible and scalable way. We show how to learn the model endtoend from weak supervision in the form of easily obtained clickthrough data. We further demonstrate the usefulness of our approach in experiments on standard ranking benchmarks as well as in a realworld recommendation system. 
Ofer Meshi 
Fri 4:30 p.m.  4:40 p.m.
[iCal]

[Mini Break]
(Break)


Fri 4:40 p.m.  5:20 p.m.
[iCal]

Cheng Zhang: Active MiniBatch Sampling using Repulsive Point Processes
(Invited Talk)
»
Video »
We explore active minibatch selection using repulsive point processes for stochastic gradient descent (SGD). Our approach simultaneously introduces active bias and leads to stochastic gradients with lower variance. We show theoretically and empirically that our approach improves over standard SGD both in terms of convergence speed as well as final model performance. 
Cheng Zhang 
Fri 5:20 p.m.  5:40 p.m.
[iCal]

Gaussian Process Optimization with Adaptive Sketching: Scalable and No Regret
(Contributed Talk)
»
Video »
Gaussian processes (GP) are a popular Bayesian approach for the optimization of blackbox functions. Despite their effectiveness in simple problems, GPbased algorithms hardly scale to complex highdimensional functions, as their periteration time and space cost is at least quadratic in the number of dimensions $d$ and iterations~$t$. Given a set of $A$ alternative to choose from, the overall runtime $O(t^3A)$ quickly becomes prohibitive. In this paper, we introduce BKB (budgeted kernelized bandit), an approximate GP algorithm for optimization under bandit feedback that achieves nearoptimal regret (and hence nearoptimal convergence rate) with nearconstant periteration complexity and no assumption on the input space or the GP's covariance.
Combining a kernelized linear bandit algorithm (GPUCB) with randomized matrix sketching technique (i.e., leverage score sampling), we prove that selecting inducing points based on their posterior variance gives an accurate lowrank approximation of the GP, preserving variance estimates and confidence intervals. As a consequence, BKB does not suffer from variance starvation, an important problem faced by many previous sparse GP approximations. Moreover, we show that our procedure selects at most $\widetilde{O}(d_{eff})$ points, where $d_{eff}$ is the \emph{effective} dimension of the explored space, which is typically much smaller than both $d$ and $t$. This greatly reduces the dimensionality of the
problem, thus leading to a $\widetilde{O}(TAd_{eff}^2)$ runtime and $\widetilde{O}(A d_{eff})$ space complexity.

Michal Valko 
Fri 5:40 p.m.  6:00 p.m.
[iCal]

Towards Efficient Evaluation of Risk via Herding
(Contributed Talk)
»
We introduce a novel use of herding to address the problem of selecting samples from a large unlabeled dataset to efficiently evaluate the risk of a given model. Herding is an algorithm which elaborately draws samples to approximate the underlying distribution. We use herding to select the most informative samples and show that the loss evaluated on $k$ samples produced by herding converges to the expected loss at a rate $\mathcal{O}(1/k)$, which is much faster than $\mathcal{O}(1/\sqrt{k})$ for iid random sampling. We validate our analysis on both synthetic data and real data, and further explore the empirical performance of herdingbased sampling in different cases of highdimensional data.

Zelai Xu 
Fri 6:00 p.m.  6:05 p.m.
[iCal]

Closing Remarks
()

Author Information
Mike Gartrell (Criteo AI Lab)
Jennifer Gillenwater (Google Research NYC)
Alex Kulesza (Google)
Zelda Mariet (Massachusetts Institute of Technology)
More from the Same Authors

2020 Workshop: Negative Dependence and Submodularity: Theory and Applications in Machine Learning »
Zelda Mariet · Michal Derezinski · Mike Gartrell 
2019 Poster: Bounding User Contributions: A BiasVariance Tradeoff in Differential Privacy »
Kareem Amin · Alex Kulesza · andres munoz · Sergei Vassilvitskii 
2019 Oral: Bounding User Contributions: A BiasVariance Tradeoff in Differential Privacy »
Kareem Amin · Alex Kulesza · andres munoz · Sergei Vassilvitskii 
2019 Poster: A TreeBased Method for Fast Repeated Sampling of Determinantal Point Processes »
Jennifer Gillenwater · Alex Kulesza · Zelda Mariet · Sergei Vassilvitskii 
2019 Oral: A TreeBased Method for Fast Repeated Sampling of Determinantal Point Processes »
Jennifer Gillenwater · Alex Kulesza · Zelda Mariet · Sergei Vassilvitskii