Session
Large Scale Learning and Big Data 2
Near Optimal Frequent Directions for Sketching Dense and Sparse Matrices
Zengfeng Huang
Given a large matrix $A \in \mathbb{R}^{n\times d}$, we consider the problem of computing a sketch matrix $B \in \mathbb{R}^{\ell\times d}$ which is significantly smaller than, but still well approximates, $A$. We are interested in minimizing the covariance error $\|A^TA - B^TB\|_2$. We consider the problem in the streaming model, where the algorithm can only make one pass over the input with limited working space. The popular Frequent Directions algorithm of Liberty (2013) and its variants achieve the optimal space-error tradeoff. However, whether the running time can be improved remains an open question. In this paper, we almost settle the time complexity of this problem. In particular, we provide new space-optimal algorithms with faster running times. Moreover, we show that the running times of our algorithms are near-optimal unless the state-of-the-art running time of matrix multiplication can be improved significantly.
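For context, here is a minimal NumPy sketch of the classic Frequent Directions update that this paper accelerates (the per-row-shrink variant of Liberty's algorithm; the paper's near-optimal algorithms are not reproduced here). The function and variable names are illustrative, and the code assumes d >= ell.

```python
import numpy as np

def frequent_directions(A, ell):
    """Classic Frequent Directions (Liberty, 2013): one pass over the rows
    of A, maintaining an ell x d sketch B with
    ||A^T A - B^T B||_2 <= ||A||_F^2 / ell.  Assumes d >= ell."""
    n, d = A.shape
    B = np.zeros((ell, d))
    for row in A:
        B[-1] = row                          # last row is always zero before insertion
        _, s, Vt = np.linalg.svd(B, full_matrices=False)
        delta = s[-1] ** 2                   # smallest squared singular value
        s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
        B = s[:, None] * Vt                  # shrink: zeroes out the last row again
    return B
```

Practical variants buffer roughly 2*ell rows and shrink only when the buffer fills; this per-row version is only meant to make the space-error tradeoff concrete, not to reflect the running times studied in the paper.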
Loss Decomposition for Fast Learning in Large Output Spaces
En-Hsu Yen · Satyen Kale · Felix Xinnan Yu · Daniel Holtmann-Rice · Sanjiv Kumar · Pradeep Ravikumar
For problems with large output spaces, evaluating the loss function and its gradient is expensive, typically taking time linear in the size of the output space. Recently, methods have been developed to speed up learning via efficient data structures for Nearest-Neighbor Search (NNS) or Maximum Inner-Product Search (MIPS). However, the performance of such data structures typically degrades in high dimensions. In this work, we propose a novel technique to reduce the intractable high-dimensional search problem to several much more tractable lower-dimensional ones via dual decomposition of the loss function. At the same time, we demonstrate guaranteed convergence to the original loss via a greedy message passing procedure. In our experiments on multiclass and multilabel classification with hundreds of thousands of classes, as well as training skip-gram word embeddings with a vocabulary size of half a million, our technique consistently improves the accuracy of search-based gradient approximation methods and outperforms sampling-based gradient approximation methods by a large margin.
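As a rough illustration of why lower-dimensional searches help (this is not the paper's dual decomposition or greedy message passing), the hedged sketch below splits the output-embedding matrix column-wise, gathers top-k candidate classes per block, and rescores them exactly. All names (`W`, `blockwise_topk_candidates`, `num_blocks`) are assumptions for the example, and each block is searched exactly here where a real system would plug in a per-block MIPS index.

```python
import numpy as np

def blockwise_topk_candidates(W, x, num_blocks=4, k=50):
    """Split the d-dimensional inner-product search over a large output
    matrix W (num_classes x d) into num_blocks lower-dimensional searches,
    then merge and exactly rescore the per-block candidates.  Assumes
    num_classes > k; the blocks are searched independently here, whereas
    the paper couples them through dual decomposition."""
    d = x.shape[0]
    blocks = np.array_split(np.arange(d), num_blocks)
    candidates = set()
    for idx in blocks:
        scores = W[:, idx] @ x[idx]                      # low-dimensional partial scores
        candidates.update(np.argpartition(-scores, k)[:k].tolist())
    cand = np.array(sorted(candidates))
    full_scores = W[cand] @ x                            # exact rescoring of candidates
    return cand[np.argsort(-full_scores)]                # candidate classes, best first
```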
Ultra Large-Scale Feature Selection using Count-Sketches
Amirali Aghazadeh · Ryan Spring · Daniel LeJeune · Gautam Dasarathy · Anshumali Shrivastava · Richard Baraniuk
Feature selection is an important challenge in machine learning. It plays a crucial role in the explainability of machine-driven decisions that are rapidly permeating modern society. Unfortunately, the explosion in the size and dimensionality of real-world datasets poses a severe challenge to standard feature selection algorithms. Today, it is not uncommon for datasets to have billions of dimensions. At such scale, even storing the feature vector is impossible, causing most existing feature selection methods to fail. Workarounds like feature hashing, a standard approach to large-scale machine learning, help with computational feasibility, but at the cost of losing the interpretability of features. In this paper, we present MISSION, a novel framework for ultra large-scale feature selection that performs stochastic gradient descent while maintaining an efficient representation of the features in memory using a Count-Sketch data structure. MISSION retains the simplicity of feature hashing without sacrificing the interpretability of the features, while using only O(log^2(p)) working memory. We demonstrate that MISSION accurately and efficiently performs feature selection on real-world, large-scale datasets with billions of dimensions.
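To make the memory model concrete, here is a hedged sketch of stochastic gradient descent over a Count-Sketch of the weight vector, in the spirit of MISSION. This is not the authors' code: `CountSketch`, `sgd_step`, and the hash construction are illustrative, and the top-k heavy-hitter heap that MISSION keeps for feature selection is only mentioned in a comment.

```python
import numpy as np

class CountSketch:
    """Count-Sketch: an r x w table approximating a p-dimensional vector
    in O(r * w) memory (the abstract states O(log^2 p) working memory)."""
    def __init__(self, r=5, w=2**18, seed=0):
        self.r, self.w = r, w
        self.table = np.zeros((r, w))
        self.seeds = np.random.default_rng(seed).integers(1 << 31, size=r)

    def _loc(self, j):
        # hashed bucket index and +/-1 sign for feature j in each row
        buckets = [abs(hash((int(s), j, 1))) % self.w for s in self.seeds]
        signs = [1 if hash((int(s), j, 2)) % 2 else -1 for s in self.seeds]
        return buckets, signs

    def add(self, j, delta):
        buckets, signs = self._loc(j)
        for r in range(self.r):
            self.table[r, buckets[r]] += signs[r] * delta

    def estimate(self, j):
        buckets, signs = self._loc(j)
        return float(np.median([signs[r] * self.table[r, buckets[r]]
                                for r in range(self.r)]))

def sgd_step(sketch, x_sparse, y, lr=0.1):
    """One least-squares SGD step on a sparse example (dict: index -> value).
    The weight vector lives only inside the sketch; MISSION additionally
    keeps a small heap of coordinates with the largest |estimate|, and those
    heavy hitters are the selected features."""
    pred = sum(v * sketch.estimate(j) for j, v in x_sparse.items())
    grad = pred - y
    for j, v in x_sparse.items():
        sketch.add(j, -lr * grad * v)
```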
Approximate Leave-One-Out for Fast Parameter Tuning in High Dimensions
Shuaiwen Wang · Wenda Zhou · Haihao Lu · Arian Maleki · Vahab Mirrokni
We study the parameter tuning problem for the penalized regression model. Finding the optimal choice of the regularization parameter is challenging in high-dimensional regimes where both the number of observations n and the number of parameters p are large. We propose two frameworks to obtain a computationally efficient approximation, ALO, of the leave-one-out cross-validation (LOOCV) risk for nonsmooth losses and regularizers. Our two frameworks are based on the primal and dual formulations of the penalized regression model. We prove the equivalence of the two approaches under smoothness conditions. This equivalence enables us to justify the accuracy of both methods under such conditions. We use our approaches to obtain a risk estimate for several standard problems, including generalized LASSO, nuclear norm regularization, and support vector machines. We experimentally demonstrate the effectiveness of our results for non-differentiable cases.
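For intuition, here is a minimal sketch of the leave-one-out shortcut in the simplest smooth case, ridge regression, where correcting each residual by the hat-matrix diagonal is exact. The paper's contribution is extending ALO-style corrections to nonsmooth losses and regularizers, which this snippet does not attempt; the function name is illustrative.

```python
import numpy as np

def ridge_alo_risk(X, y, lam):
    """Leave-one-out squared-error risk for ridge regression without
    refitting n times.  For this smooth case the shortcut is exact:
    the LOO residual equals the full-fit residual divided by (1 - H_ii),
    where H = X (X^T X + lam I)^{-1} X^T is the hat matrix."""
    n, p = X.shape
    G = X.T @ X + lam * np.eye(p)
    beta = np.linalg.solve(G, X.T @ y)
    H_diag = np.einsum('ij,ji->i', X, np.linalg.solve(G, X.T))  # diag of X G^{-1} X^T
    loo_residuals = (y - X @ beta) / (1.0 - H_diag)
    return np.mean(loo_residuals ** 2)
```

For small n, this agrees with brute-force LOOCV (refitting n times) up to numerical error, which is a convenient sanity check before sweeping the regularization parameter lam.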
Semi-Supervised Learning on Data Streams via Temporal Label Propagation
Tal Wagner · Sudipto Guha · Shiva Kasiviswanathan · Nina Mishra
We consider the problem of labeling points on a fast-moving data stream when only a small number of labeled examples are available. In our setting, incoming points must be processed efficiently and the stream is too large to store in its entirety. We present a semi-supervised learning algorithm for this task. The algorithm maintains a small synopsis of the stream which can be quickly updated as new points arrive, and labels every incoming point by provably learning from the full history of the stream. Experiments on real datasets validate that the algorithm can quickly and accurately classify points on a stream with a small quantity of labeled examples.
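As a toy illustration of the labeling step only (not the paper's provably faithful synopsis of the full stream), the sketch below labels an incoming point by harmonic label propagation over a small buffer of stored points. The buffer, the Gaussian similarity, and the binary labels are assumptions for the example.

```python
import numpy as np

def propagate_label(buffer_X, buffer_y, x_new, sigma=1.0):
    """Label an incoming point via harmonic label propagation on a Gaussian
    similarity graph over a small buffer plus the new point.  buffer_y uses
    labels in {0, 1} and -1 for unlabeled points; the new point is treated
    as unlabeled.  The paper instead maintains a compact synopsis that
    provably reflects the entire stream; this only shows the propagation."""
    X = np.vstack([buffer_X, x_new])
    y = np.append(buffer_y, -1)
    W = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1) / (2 * sigma ** 2))
    L = np.diag(W.sum(axis=1)) - W                   # graph Laplacian
    labeled, unlabeled = y >= 0, y < 0
    # harmonic solution on unlabeled nodes: f_u = -L_uu^{-1} L_ul y_l
    f_u = np.linalg.solve(L[np.ix_(unlabeled, unlabeled)],
                          -L[np.ix_(unlabeled, labeled)] @ y[labeled])
    return int(f_u[-1] > 0.5)                        # label for the new point
```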