Negative Dependence and Submodularity: Theory and Applications in Machine Learning

Zelda Mariet, Michal Derezinski, Mike Gartrell

Keywords:  submodularity    negative dependence    subset selection    recommender systems    experimental design    determinantal point processes  


Models of negative dependence and submodularity are increasingly important in machine learning. Whether selecting training data, finding an optimal experimental design, exploring in reinforcement learning and Bayesian optimization, or designing recommender systems, selecting high-quality yet diverse items has become a core challenge. This workshop aims to bring together researchers who, using theoretical or applied techniques, leverage negative dependence and submodularity in their work. Expanding upon last year's workshop, we will highlight recent developments in the rich mathematical theory of negative dependence, cover novel critical applications, and discuss the most promising directions for future research.

Chat is not available.

Timezone: »


Sat 5:00 a.m. - 5:15 a.m. [iCal]
Opening remarks (Talk)
Zelda Mariet, Mike Gartrell, Michal Derezinski
Sat 5:15 a.m. - 5:45 a.m. [iCal]

Reinforcement learning has seen major success in games and other artificial environments, but its applications in industries and real life are still limited. This limited applicability is partly due to the requirement of the large amount of the training data that needs to be collected through trial and error as well as the difficulty in effectively dealing with multiple or many agents. Diversity and negative dependence are a promising approach to resolve some of the major challenges in today’s reinforcement learning and have gained increasing attention in recent years. In this talk, we will briefly review some of the approaches to introducing diversity in reinforcement learning with a focus on the use of determinantal point processes for effective multi-agent reinforcement learning.

Takayuki Osogami
Sat 5:45 a.m. - 5:50 a.m. [iCal]
Diversity in reinforcement learning (Q&A)
Sat 5:50 a.m. - 6:20 a.m. [iCal]
Eigenvalues of many models of random matrices tend to repell each other. They sometimes repell so much that sample averages over these eigenvalues converge much faster than if the eigenvalues were i.i.d. I will first show how to transform this kind of result into a generic importance sampler with mean square error decreasing as $N^{-1-1/d}$, where $N$ is the number of integrand evaluations, and $d$ is the ambient dimension. This result crucially depends on a repulsive point process for the integration nodes, called an orthogonal polynomial ensemble, itself a particular case of determinantal point process (DPP). With more assumptions on the integrand and more general DPPs, I will then show how to obtain faster Monte Carlo rates. Further generalizing to mixtures of DPPs, I will finally show how to obtain tight integration rates for integrands in a large class of reproducing kernel Hilbert spaces. This last result involves a continuous equivalent to volume sampling, a discrete point process of recent interest in numerical linear algebra. This talk is intended to connect to a few other talks of the day, and is based on the following papers:
Rémi Bardenet
Sat 6:20 a.m. - 6:25 a.m. [iCal]
From random matrices to kernel quadrature: how repulsiveness can speed up Monte Carlo integration (Q&A)
Sat 6:25 a.m. - 6:40 a.m. [iCal]

By using the framework of Determinantal Point Processes (DPPs), some theoretical results concerning the interplay between diversity and regularization can be obtained. In this paper we show that sampling subsets with kDPPs results in implicit regularization in the context of ridgeless Kernel Regression. Furthermore, we leverage the common setup of state-of-the-art DPP algorithms to sample multiple small subsets and use them in an ensemble of ridgeless regressions. Our first empirical results indicate that ensemble of ridgeless regressors can be interesting to use for datasets including redundant information.

Joachim Schreurs, Michaël Fanuel, Johan Suykens
Sat 7:10 a.m. - 7:40 a.m. [iCal]

DPP MAP inference, the problem of finding the highest probability set under the distribution defined by a DPP, is of practical importance for a variety of areas such as recommender systems, active learning, and data compression. Unfortunately, finding the exact MAP solution is NP-hard. Often though, the standard greedy submodular maximization algorithm works well in practice for approximating the solution. In this talk, we discuss ways to speed up this simple greedy algorithm, as well as slower, but more accurate alternatives to it. We also discuss how to scale greedy for customized DPPs, where we want to solve the MAP problem multiple times with different weightings of item features. We conclude with a brief note on the complexity of MAP for nonsymmetric DPPs, where we show that greedy scales fairly well if we assume a particular kernel decomposition.

Jennifer Gillenwater
Sat 7:40 a.m. - 7:45 a.m. [iCal]
Scaling DPP MAP Inference Q&A (Q&A)
Sat 7:45 a.m. - 8:15 a.m. [iCal]

Probability distributions with strong notions of negative dependence arise in various forms in machine learning. Examples include diversity-inducing probabilistic models, interpretability, exploration and active learning, and randomized algorithms. While, perhaps surprisingly, being more delicate than its positive counterpart, negative dependence enjoys rich mathematical connections and properties that offer a promising toolbox for machine learning. In this talk, I will summarize some recently important notions of negative dependence, and their implications for sampling algorithms. These results exploit connections to the geometry of polynomials, log concavity, and submodular optimization. We will conclude with an example application of sampling minibatches for optimization.

Stefanie Jegelka
Sat 8:15 a.m. - 8:20 a.m. [iCal]
Negative Dependence and Sampling (Q&A)
Sat 8:20 a.m. - 8:35 a.m. [iCal]
In this paper, we propose scalable methods for maximizing a regularized submodular function $f = g- \ell$ expressed as the difference between a monotone submodular function $g$ and a modular function $\ell$. Indeed, submodularity is inherently related to the notions of diversity, coverage, and representativeness. In particular, finding the mode (i.e., the most likely configuration) of many popular probabilistic models of diversity, such as determinantal point processes, submodular probabilistic models, and strongly log-concave distributions, involves maximization of (regularized) submodular functions. Since a regularized function$ f$ can potentially take on negative values, the classic theory of submodular maximization, which heavily relies on the non-negativity assumption of submodular functions, may not be applicable. To circumvent this challenge, we develop the first streaming and distributed algorithm for maximizing regularized submodular functions. Furthermore, we show that how can we find the mode of a strongly log-concave (SLC) distribution by regularized submodular maximization.
Ehsan Kazemi, Amin Karbasi, Moran Feldman
Sat 9:30 a.m. - 10:15 a.m. [iCal]

Submodular maximization

zoom link

  • Online Algorithms for Budget-Constrained DR-Submodular Maximization, Omid Sadeghi, Reza Eghbali, Maryam Fazel
  • Constrained Maximization of Lattice Submodular Functions, Aytunc Sahin, Joachim Buhmann, Andreas Krause
  • Mode Finding for SLC Distributions via Regularized Submodular Maximization, Ehsan Kazemi, Shervin Minaee, Moran Feldman, Amin Karbasi

Determinantal point processes

zoom link

  • MO-PaDGAN: Generating Diverse Designs with Multivariate Performance Enhancement, Wei Chen, Faez Ahmed
  • On the Relationship Between Probabilistic Circuits and Determinantal Point Processes, Honghua Zhang, Steven J Holtzen, Guy Van den Broeck
  • Ensemble Kernel Methods, Implicit Regularization and Determinantal Point Processes, Joachim Schreurs, Michaël Fanuel, Johan Suykens

Negative dependence for inference and bipartite matching

zoom link

  • DisARM: An Antithetic Gradient Estimator for Binary Latent Variables, Zhe Dong, Andriy Mnih, George Tucker
  • On Diverse Bipartite b-Matching, Saba Ahmadi, Faez Ahmed, John P Dickerson, Mark Fuge, Samir Khuller
  • Negative Dependence Tightens Variational Bounds, Pierre-Alexandre Mattei, Jes Frellsen
Sat 10:15 a.m. - 10:45 a.m. [iCal]

In this talk I’ll describe a novel approach that yields algorithms whose parallel running time is exponentially faster than any algorithm previously known for a broad range of machine learning applications. The algorithms are designed for submodular function maximization which is the algorithmic engine behind applications such as clustering, network analysis, feature selection, Bayesian inference, ranking, speech and document summarization, recommendation systems, hyperparameter tuning, and many others. Since applications of submodular functions are ubiquitous across machine learning and data sets become larger, there is consistent demand for accelerating submodular optimization. The approach we describe yields simple algorithms whose parallel runtime is logarithmic in the size of the data rather than linear. I’ll introduce the frameworks we recently developed and present experimental results from various application domains.

Yaron Singer
Sat 10:45 a.m. - 10:50 a.m. [iCal]
Exponentially Faster Algorithms for Machine Learning (Q&A)
Sat 10:50 a.m. - 11:20 a.m. [iCal]

A central challenge in biotechnology is to be able to predict functional properties of a protein from its sequence, and thus (i) discover new proteins with specific functionality and (ii) better understand the functional effect of genomic mutations. Experimental breakthroughs in our ability to read and write DNA allows data on the relationship between sequence and function to be rapidly acquired. This data can be used to train and validate machine learning models that predict protein function from sequence. However, the cost and latency of wet-lab experiments requires methods that find good sequences in few experimental rounds, where each round contains large batches of sequence designs. In this setting, model-based optimization allows us to take advantage of sample inefficient methods to find diverse optimal sequence candidates to be tested in the wet-lab. These requirements are illustrated by a collaboration that involves the design and experimental validation of AAV capsid protein variants that assemble integral capsids and package their genome, for use in gene therapy applications.

Lucy Colwell
Sat 11:20 a.m. - 11:25 a.m. [iCal]
Searching for Diverse Biological Sequences (Q&A)
Sat 11:25 a.m. - 11:40 a.m. [iCal]

Submodular optimization over the integer lattice has many applications in machine learning. Although the constrained maximization of submodular functions with coordinate-wise concavity (also called {\em DR-Submodular} functions) is well studied, the maximization of {\em general} lattice submodular functions is considerably more challenging. In this work, we first show that we can optimize lattice submodular functions subject to a discrete (integer) polymatroid constraint using a recently proposed extension, called the Generalized Multilinear Extension. Then, we establish a bound on the rounding error for the discrete polymatroid constraint, which depends on the “distance” between the lattice submodular function to a DR-Submodular function. Lastly, we demonstrate the effectiveness of our algorithm on a Bayesian experimental design problem with repetition and a concave cost.

Aytunc Sahin, Joachim Buhmann, Andreas Krause
Sat 12:10 p.m. - 12:40 p.m. [iCal]

Randomized Numerical Linear Algebra (RandNLA) is an area which uses randomness, most notably random sampling and random projection methods, to develop improved algorithms for ubiquitous matrix problems, such as those that arise in scientific computing, data science, machine learning, etc. A seemingly different topic, but one which has a long history in pure and applied mathematics, is that of Determinantal Point Processes (DPPs), which are stochastic point processes, the probability distribution of which is characterized by sub-determinants of some matrix. Recent work has uncovered deep and fruitful connections between DPPs and RandNLA. For example, random sampling with a DPP leads to new kinds of unbiased estimators for classical RandNLA tasks, enabling more refined statistical and inferential understanding of RandNLA algorithms; a DPP is, in some sense, an optimal randomized method for many RandNLA problems; and a standard RandNLA technique, called leverage score sampling, can be derived as the marginal distribution of a DPP. This work will be reviewed, as will recent algorithmic developments, illustrating that, while not quite as efficient as simply applying a random projection, these DPP-based algorithms are only moderately more expensive.

Michael Mahoney
Sat 12:40 p.m. - 12:45 p.m. [iCal]
Determinantal Point Processes in Randomized Numerical Linear Algebra (Q&A)
Sat 12:45 p.m. - 1:00 p.m. [iCal]

Scaling probabilistic models to large realistic problems and datasets is a key challenge in machine learning. Central to this effort is the development of tractable probabilistic models (TPMs): models whose structure guarantees efficient probabilistic inference algorithms. The current landscape of TPMs is fragmented: there exist various kinds of TPMs with different strengths and weaknesses. Two of the most prominent classes of TPMs are determinantal point processes (DPPs) and probabilistic circuits (PCs). This paper provides the first systematic study of their relationship. We propose a unified analysis and shared language for discussing DPPs and PCs. Then we establish theoretical barriers for the unification of these two families, and prove that there are cases where DPPs have no compact representation as a class of PCs. We close with a perspective on the central problem of unifying these models.

Honghua Zhang, Steven Holtzen, Guy Van den Broeck
Sat 1:00 p.m. - 1:45 p.m. [iCal]
Panel discussion
Sat 1:45 p.m. - 2:00 p.m. [iCal]
Closing remarks (Talk)
Zelda Mariet, Michal Derezinski, Mike Gartrell