Session: Optimization
Submodular Maximization beyond Non-negativity: Guarantees, Fast Algorithms, and Applications
Christopher Harshaw · Moran Feldman · Justin Ward · Amin Karbasi
It is generally believed that submodular functions--and the more general class of $\gamma$-weakly submodular functions--may only be optimized under the non-negativity assumption $f(S) \geq 0$. In this paper, we show that once the function is expressed as the difference $f = g - c$, where $g$ is monotone, non-negative, and $\gamma$-weakly submodular and $c$ is non-negative modular, then strong approximation guarantees may be obtained. We present an algorithm for maximizing $g - c$ under a $k$-cardinality constraint which produces a random feasible set $S$ such that $\mathbb{E}[g(S) - c(S)] \geq (1 - e^{-\gamma} - \epsilon)\, g(\mathrm{OPT}) - c(\mathrm{OPT})$, whose running time is $O(\frac{n}{\epsilon} \log^2 \frac{1}{\epsilon})$, independent of $k$. We extend these results to the unconstrained setting by describing an algorithm with the same approximation guarantees and a faster $O(\frac{n}{\epsilon} \log \frac{1}{\epsilon})$ runtime. The main techniques underlying our algorithms are two-fold: the use of a surrogate objective which varies the relative importance between $g$ and $c$ throughout the algorithm, and a geometric sweep over possible $\gamma$ values. Our algorithmic guarantees are complemented by a hardness result showing that no polynomial-time algorithm which accesses $g$ through a value oracle can do better. We empirically demonstrate the success of our algorithms by applying them to experimental design on the Boston Housing dataset and directed vertex cover on the Email EU dataset.
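To make the surrogate-objective idea concrete, here is a minimal sketch of a distorted-greedy pass in the spirit of the abstract; the function names, the $(1 - \gamma/k)$ distortion schedule, and the accept-if-positive rule are our reading of the high-level description, not the authors' exact pseudocode.

```python
def distorted_greedy(g, c, ground_set, k, gamma=1.0):
    """Sketch: maximize g(S) - c(S) under |S| <= k by greedily adding the
    element with the best *distorted* gain, where g's weight grows over time.
    g: set function taking a Python set; c: dict of non-negative modular costs.
    """
    S = set()
    for i in range(k):
        weight = (1.0 - gamma / k) ** (k - i - 1)  # g matters more in later steps
        best, best_val = None, 0.0
        for e in ground_set:
            if e in S:
                continue
            val = weight * (g(S | {e}) - g(S)) - c[e]
            if val > best_val:
                best, best_val = e, val
        if best is not None:  # only add elements with positive distorted gain
            S.add(best)
    return S
```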
Online Algorithms for Rent-Or-Buy with Expert Advice
Sreenivas Gollapudi · Debmalya Panigrahi
We study the use of predictions by multiple experts (such as machine learning algorithms) to improve the performance of online algorithms. In particular, we consider the classical rent-or-buy problem (also called ski rental), and obtain algorithms that provably improve their performance over the adversarial scenario by using these predictions. We also prove matching lower bounds to show that our algorithms are the best possible, and perform experiments to empirically validate their performance in practice.
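The multi-expert algorithms are beyond an abstract-level sketch, but the single-prediction special case already shows how advice shifts the rent-or-buy decision. The sketch below follows the standard learning-augmented ski-rental recipe; the trust parameter and thresholds are illustrative assumptions, not this paper's construction.

```python
def ski_rental_with_prediction(buy_cost, predicted_days, trust, actual_days):
    """Rent costs 1 per day; buying costs buy_cost once. If the prediction says
    the season outlasts the buy cost, buy early (around day trust * buy_cost);
    otherwise delay buying (until day buy_cost / trust). trust in (0, 1]:
    smaller values mean more faith in the advice, at the price of robustness.
    """
    if predicted_days >= buy_cost:
        threshold = trust * buy_cost
    else:
        threshold = buy_cost / trust
    cost = 0
    for day in range(1, actual_days + 1):
        if day >= threshold:          # buy now and stop paying rent
            return cost + buy_cost
        cost += 1                     # rent one more day
    return cost
```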
Non-monotone Submodular Maximization with Nearly Optimal Adaptivity and Query Complexity
Matthew Fahrbach · Vahab Mirrokni · Morteza Zadimoghaddam
As a general optimization problem, submodular maximization has a wide range of applications in machine learning (e.g., active learning, clustering, and feature selection). In large-scale optimization, the parallel running time of an algorithm is governed by its adaptivity, which measures the number of sequential rounds needed if the algorithm can execute polynomially many independent oracle queries in parallel. While low adaptivity is ideal, it is not sufficient for an algorithm to be efficient in practice---there are many applications of distributed submodular optimization where the number of function evaluations becomes prohibitively expensive. Motivated by these applications, we study the adaptivity and query complexity of submodular maximization. In this paper, we give the first constant-approximation algorithm for maximizing a non-monotone submodular function subject to a cardinality constraint $k$ that runs in $O(\log(n))$ adaptive rounds. Additionally, our algorithm makes only $O(n \log(k))$ oracle queries in expectation. In our empirical study, we use three real-world applications to compare our algorithm with several benchmarks for non-monotone submodular maximization, and the results show that our algorithm finds competitive solutions using significantly fewer rounds and queries.
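As a rough illustration of how low adaptivity is achieved, the skeleton below batches all marginal-gain queries of a round against a fixed current set (so they could run in parallel) and then commits a random batch of survivors; the paper's actual algorithm adds the careful thresholding and downsampling needed to handle non-monotonicity with a constant-factor guarantee.

```python
import random

def adaptive_sampling_skeleton(f, ground_set, k, eps=0.1):
    """Illustrative low-adaptivity pattern, not the paper's algorithm.
    f: set function; ground_set: a Python set of elements.
    """
    S = set()
    threshold = max(f({e}) for e in ground_set)  # upper bound on marginal gains
    while len(S) < k and threshold > eps:
        # one adaptive round: every query below depends only on the current S,
        # so the |ground_set| evaluations can be issued in parallel
        candidates = [e for e in ground_set - S
                      if f(S | {e}) - f(S) >= threshold]
        if candidates:
            batch = random.sample(candidates, min(len(candidates), k - len(S)))
            S |= set(batch)
        threshold *= (1 - eps)  # geometric sweep over threshold values
    return S
```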
Categorical Feature Compression via Submodular Optimization
Mohammad Hossein Bateni · Lin Chen · Hossein Esfandiari · Thomas Fu · Vahab Mirrokni · Afshin Rostamizadeh
In the era of big data, learning from categorical features with very large vocabularies (e.g., 28 million for the Criteo click prediction dataset) has become a practical challenge for machine learning researchers and practitioners. We design a highly scalable vocabulary compression algorithm that seeks to maximize the mutual information between the compressed categorical feature and the target binary labels, and we furthermore show that its solution is guaranteed to be within a $1-1/e \approx 63\%$ factor of the globally optimal solution. Although in some settings entropy-based set functions are known to be submodular, this is not the case for the mutual information objective we consider (mutual information with respect to the target labels). To address this, we introduce a novel re-parametrization of the mutual information objective, which we prove is submodular, and also design a data structure to query the submodular function in amortized $O(\log n)$ time (where $n$ is the input vocabulary size). Our complete algorithm is shown to operate in $O(n \log n)$ time. Additionally, we design a distributed implementation in which the query data structure is decomposed across $O(k)$ machines such that each machine only requires $O(\frac{n}{k})$ space, while still preserving the approximation guarantee and using only logarithmic rounds of computation. We also provide an analysis of simple alternative heuristic compression methods to demonstrate that they cannot achieve any approximation guarantee. Using the large-scale Criteo learning task, we demonstrate better performance in retaining mutual information and also verify competitive learning performance compared to other baseline methods.
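A toy version of the pipeline helps fix ideas: items are sorted by the conditional label probability $P(y=1 \mid v)$ (the structural fact that makes contiguous bucketings sufficient) and then cut into $k$ buckets. The equal-mass cut below is a heuristic stand-in for the paper's submodular optimization, and all function names are ours.

```python
import numpy as np

def mutual_information(bucket_counts):
    """I(B; Y) from a (k, 2) array of per-bucket label counts."""
    joint = bucket_counts / bucket_counts.sum()
    pb = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log(joint / (pb * py))
    return np.nansum(terms)  # zero-probability cells contribute nothing

def compress_vocab(counts, k):
    """counts: (n, 2) array of per-item (label-0, label-1) counts."""
    p1 = counts[:, 1] / counts.sum(axis=1)
    order = np.argsort(p1)                       # sort items by P(y=1 | v)
    mass = counts[order].sum(axis=1).cumsum()
    cuts = np.searchsorted(mass, np.linspace(0, mass[-1], k + 1)[1:-1])
    buckets = np.split(order, cuts + 1)          # k contiguous buckets
    bucket_counts = np.array([counts[b].sum(axis=0) for b in buckets])
    return buckets, mutual_information(bucket_counts)
```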
Multi-Frequency Phase Synchronization
Tingran Gao · Zhizhen Zhao
We propose a novel formulation for phase synchronization---the statistical problem of jointly estimating alignment angles from noisy pairwise comparisons---as a nonconvex optimization problem that enforces consistency among the pairwise comparisons in multiple frequency channels. Inspired by harmonic retrieval in signal processing, we develop a simple yet efficient two-stage algorithm that leverages the multi-frequency information. We demonstrate in theory and practice that the proposed algorithm significantly outperforms state-of-the-art phase synchronization algorithms, at a mild computational cost incurred by using the extra frequency channels. We also extend our algorithmic framework to general synchronization problems over compact Lie groups.
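For intuition, the first (spectral) stage at a single frequency is standard and compact; a multi-frequency method combines such estimates across channels whose entries concentrate around $e^{ik(\theta_i - \theta_j)}$. The sketch below is that classical single-channel baseline, not the paper's two-stage algorithm.

```python
import numpy as np

def spectral_phase_sync(H):
    """Spectral estimate: H is Hermitian with H[i, j] ~ exp(1j*(theta_i - theta_j))
    plus noise; the leading eigenvector recovers the angles up to a global shift.
    """
    _, vecs = np.linalg.eigh(H)
    return np.angle(vecs[:, -1])

# tiny demo on synthetic data
theta = np.random.uniform(0, 2 * np.pi, 50)
z = np.exp(1j * theta)
noise = np.random.randn(50, 50) + 1j * np.random.randn(50, 50)
H = np.outer(z, z.conj()) + 0.25 * (noise + noise.conj().T)  # Hermitian noise
estimate = spectral_phase_sync(H)  # compare to theta modulo a global rotation
```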
Faster Algorithms for Binary Matrix Factorization
Ravi Kumar · Rina Panigrahy · Ali Rahimi · David Woodruff
We give faster approximation algorithms for well-studied variants of Boolean Matrix Factorization, where we are given a binary matrix $A \in \{0,1\}^{m \times n}$ and would like to find binary matrices $U \in \{0,1\}^{m \times k}$ and $V \in \{0,1\}^{k \times n}$ so as to minimize $\|U \cdot V - A\|_F^2$. In the first setting, $U \cdot V$ denotes multiplication over the integers, and we give an algorithm outputting a constant factor approximation in $2^{O(k^2 \log k)} \textrm{poly}(mn)$ time, improving on the previous $\min(2^{2^k}, 2^n) \textrm{poly}(mn)$ time. Our techniques generalize to finding $U \in \{0,1\}^{m \times k}$ and $V \in \{0,1\}^{k \times n}$ so as to approximately minimize $\|U \cdot V - A\|_p^p$, where $p \geq 1$ is any real number, in $2^{O(k^{\lceil p/2 \rceil + 1}\log k)} \textrm{poly}(mn)$ time. For $p = 1$, our results have a graph-theoretic interpretation. Namely, given an unweighted bipartite graph $G$ with $m$ vertices in the left part and $n$ vertices in the right part, how well can one approximate $G$ as a disjoint union $H$ of $k$ bipartite cliques (bicliques)? A natural notion of approximation is to find a disjoint union $H$ of $k$ bicliques so as to minimize the number of edges in the symmetric difference, i.e., to find $H$ with $$|E(G)\setminus E(H)| + |E(H)\setminus E(G)| \leq C \min_{\textrm{a disjoint union } H' \textrm{ of } k \textrm{ bicliques}} \left( |E(G)\setminus E(H')| + |E(H') \setminus E(G)| \right),$$ where $C > 1$ is a constant approximation factor. It is known that deciding if a bipartite graph is equal to the disjoint union of $k$ bicliques requires $2^{\Omega(k)} \cdot \textrm{poly}(mn)$ time under standard complexity-theoretic assumptions, and so this lower bound also applies to the approximate case. While there are algorithms to decide if $G$ can be expressed as the disjoint union of $k$ bicliques in $2^{O(k^2)} \cdot \textrm{poly}(mn)$ time, surprisingly, no algorithms were known for the approximate version of the problem. We give the first constant factor approximation algorithm for this problem, running in $2^{O(k^2 \log k)} \textrm{poly}(mn)$ time. Finally, we give the fastest known bicriteria constant factor approximation when the multiplication $U \cdot V$ is over the finite field $GF(2)$. We achieve $2^{O(k^3)} \textrm{poly}(mn)$ time to output binary rank-$O(k \log m)$ matrices $U$ and $V$ whose cost is as good as that of the best rank-$k$ approximation, improving the previous $\min(2^{2^k} mn, \min(m,n)^{O(\textrm{poly}(k))} \textrm{poly}(mn))$ time.
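One ingredient common to this line of work is that, once $U$ is fixed, each column of $V$ can be optimized independently by enumerating all $2^k$ binary patterns; the paper's contribution lies in how candidate $U$'s are found, so treat the following only as the brute-force inner step (integer-arithmetic product, squared Frobenius cost).

```python
import itertools
import numpy as np

def best_binary_V(A, U):
    """Given binary A (m x n) and U (m x k), return V (k x n) minimizing
    ||U V - A||_F^2 over V in {0,1}^(k x n), column by column.
    Exponential only in k: all 2^k patterns are scored at once.
    """
    m, k = U.shape
    patterns = np.array(list(itertools.product([0, 1], repeat=k)))  # (2^k, k)
    UV = patterns @ U.T                                             # (2^k, m)
    costs = ((UV[:, :, None] - A[None, :, :]) ** 2).sum(axis=1)     # (2^k, n)
    return patterns[np.argmin(costs, axis=0)].T                     # (k, n)
```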
Guided evolutionary strategies: augmenting random search with surrogate gradients
Niru Maheswaranathan · Luke Metz · George Tucker · Dami Choi · Jascha Sohl-Dickstein
Many applications in machine learning require optimizing a function whose true gradient is unknown or computationally expensive, but where surrogate gradient information, directions that may be correlated with the true gradient, is cheaply available. For example, this occurs when an approximate gradient is easier to compute than the full gradient (e.g. in meta-learning or unrolled optimization), or when a true gradient is intractable and is replaced with a surrogate (e.g. in reinforcement learning or training networks with discrete variables). We propose Guided Evolutionary Strategies (GES), a method for optimally using surrogate gradient directions to accelerate random search. GES defines a search distribution for evolutionary strategies that is elongated along a subspace spanned by the surrogate gradients and estimates a descent direction which can then be passed to a first-order optimizer. We analytically and numerically characterize the tradeoffs that result from tuning how strongly the search distribution is stretched along the guiding subspace and use this to derive a setting of the hyperparameters that works well across problems. We evaluate GES on several example problems, demonstrating an improvement over both standard evolutionary strategies and first-order methods that directly follow the surrogate gradient.
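The search distribution itself is easy to write down: with an orthonormal basis $U$ of the guiding subspace, perturbations are drawn with a covariance mixing the identity and $UU^\top$, and an antithetic ES estimator is formed. A minimal sketch, with hyperparameter names and defaults chosen for illustration:

```python
import numpy as np

def ges_gradient(f, x, surrogate_grads, alpha=0.5, sigma=0.1, pairs=10, beta=2.0):
    """Guided-ES descent-direction estimate. surrogate_grads: (n, k) array of
    surrogate gradient directions; alpha trades full-space exploration against
    the guiding subspace. Returns a vector to hand to a first-order optimizer.
    """
    n, k = surrogate_grads.shape
    U, _ = np.linalg.qr(surrogate_grads)  # orthonormal basis of the subspace
    grad = np.zeros(n)
    for _ in range(pairs):
        # sample from N(0, sigma^2 * (alpha/n * I + (1-alpha)/k * U U^T))
        eps = sigma * (np.sqrt(alpha / n) * np.random.randn(n)
                       + np.sqrt((1 - alpha) / k) * U @ np.random.randn(k))
        grad += (f(x + eps) - f(x - eps)) * eps   # antithetic pair
    return beta / (2 * sigma ** 2 * pairs) * grad
```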
Adaptive and Safe Bayesian Optimization in High Dimensions via One-Dimensional Subspaces
Johannes Kirschner · Mojmir Mutny · Nicole Hiller · Rasmus Ischebeck · Andreas Krause
Bayesian optimization is known to be difficult to scale to high dimensions, because the acquisition step requires solving a non-convex optimization problem in the same search space. In order to scale the method and keep its benefits, we propose an algorithm (LineBO) that restricts the problem to a sequence of iteratively chosen one-dimensional sub-problems. We show that our algorithm converges globally and obtains a fast local rate when the function is strongly convex. Further, if the objective has an invariant subspace, our method automatically adapts to the effective dimension without changing the algorithm. Our method scales well to high dimensions and makes use of a global Gaussian process model. When combined with the SafeOpt algorithm to solve the sub-problems, we obtain the first safe Bayesian optimization algorithm with theoretical guarantees applicable in high-dimensional settings. We evaluate our method on multiple synthetic benchmarks, where we obtain competitive performance. Further, we deploy our algorithm to optimize the beam intensity of a free electron laser with up to 40 parameters while satisfying the safe operation constraints.
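The core loop is simple to sketch: each outer iteration picks a one-dimensional subspace through the incumbent and runs the acquisition optimization only along that line. In the sketch below, acquire_on_line stands in for fitting the GP and maximizing the acquisition over the line, and its signature is our assumption; the paper also considers smarter direction choices than the random one used here.

```python
import numpy as np

def line_bo(objective, x0, n_iters, acquire_on_line):
    """Skeleton of the one-dimensional-subproblem idea (maximization)."""
    best_x = np.asarray(x0, dtype=float)
    best_y = objective(best_x)
    for _ in range(n_iters):
        d = np.random.randn(best_x.size)
        d /= np.linalg.norm(d)                         # random unit direction
        t = acquire_on_line(lambda s: best_x + s * d)  # 1-d acquisition step
        x = best_x + t * d
        y = objective(x)
        if y > best_y:
            best_x, best_y = x, y
    return best_x, best_y
```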
Semi-Cyclic Stochastic Gradient Descent
Hubert Eichner · Tomer Koren · Brendan McMahan · Nati Srebro · Kunal Talwar
We consider convex SGD updates with a block-cyclic structure, i.e., where each cycle consists of a small number of blocks, each with many samples from a possibly different, block-specific distribution. This situation arises, e.g., in Federated Learning, where the mobile devices available for updates at different times during the day have different characteristics. We show that such block-cyclic structure can significantly deteriorate the performance of SGD, but we propose a simple correction approach that allows prediction with the same performance guarantees as for i.i.d., non-cyclic sampling.
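One natural correction in this setting is "pluralistic": keep a separate iterate per block so each model is trained on an i.i.d. stream from its own block's distribution, then serve each block with its own model. A minimal sketch of that idea (whether it matches the paper's exact correction is our assumption from the abstract):

```python
import numpy as np

def per_block_sgd(grad, x0, sample_stream, n_blocks, lr=0.1):
    """grad(model, sample) -> gradient; sample_stream yields (block, sample)
    pairs in the block-cyclic order in which they arrive."""
    models = [np.array(x0, dtype=float) for _ in range(n_blocks)]
    for block, sample in sample_stream:
        models[block] -= lr * grad(models[block], sample)
    return models  # at serving time, block b uses models[b]
```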