Session
Large Scale Learning and Systems
Composable Core-sets for Determinant Maximization: A Simple Near-Optimal Algorithm
Sepideh Mahabadi · Piotr Indyk · Shayan Oveis Gharan · Alireza Rezaei
"Composable core-sets" provide an efficient framework for solving optimization problems in massive data models. In this work, we consider the efficient construction of composable core-sets for the determinant maximization problem. This problem can also be cast as the MAP inference task for "determinantal point processes", which have recently gained considerable interest for modeling diversity and fairness. The problem was recently studied by Indyk et al. (2018), who designed composable core-sets with the optimal approximation bound of $O(k)^k$. On the other hand, the more practical "Greedy" algorithm has previously been used in similar contexts. In this work, we first provide a theoretical approximation guarantee of $C^{k^2}$ for the Greedy algorithm in the context of composable core-sets. We then propose a "Local Search"-based algorithm that, while remaining practical, achieves a nearly optimal approximation bound of $O(k)^{2k}$. Finally, we implement all three algorithms and demonstrate the effectiveness of our proposed algorithm on standard data sets.
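As a concrete illustration of the Greedy algorithm analyzed above, the following minimal sketch repeatedly adds the vector that most increases the volume (the determinant of the Gram matrix) of the selected set, which amounts to picking the vector with the largest residual after projecting onto the current span. The random dataset and sizes are illustrative, and the core-set/partitioning layer around the algorithm is omitted.

```python
import numpy as np

def greedy_det_max(V, k):
    # V: (n, d) array whose rows are candidate vectors; returns k row indices.
    R = V.astype(float).copy()          # residuals orthogonal to the chosen span
    chosen = []
    for _ in range(k):
        norms = (R * R).sum(axis=1)
        i = int(norms.argmax())         # largest residual = largest volume gain
        chosen.append(i)
        u = R[i] / np.sqrt(norms[i])    # orthonormal direction of the new pick
        R -= np.outer(R @ u, u)         # project all residuals off that direction
    return chosen

V = np.random.rand(100, 5)
S = greedy_det_max(V, k=3)
print(np.linalg.det(V[S] @ V[S].T))     # squared volume of the selected set
```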
Sublinear Time Nearest Neighbor Search over Generalized Weighted Space
Yifan Lei · Qiang Huang · Mohan Kankanhalli · Anthony Tung
Nearest Neighbor Search (NNS) over generalized weighted space is a fundamental problem with applications in many fields. However, to the best of our knowledge, there is no sublinear time solution to this problem. Based on the idea of Asymmetric Locality Sensitive Hashing (ALSH), we introduce a novel spherical asymmetric transformation and, building on it, propose the first two weight-oblivious hashing schemes, SL-ALSH and S2-ALSH. We further show that both schemes enjoy a quality guarantee and can answer NNS queries in sublinear time. Evaluations over three real datasets demonstrate the superior performance of the two proposed schemes.
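The paper's spherical transformation is not reproduced here; as background for how asymmetric transformations make sublinear hashing possible at all, the following sketch shows the classic inner-product ALSH of Shrivastava & Li (2014), where data and queries receive different preprocessing so that L2 proximity on the transformed vectors tracks a large inner product on the originals. The constants U and M are the usual illustrative choices, not values from this paper.

```python
import numpy as np

M, U = 3, 0.83                      # appended coordinates and scaling constant

def preprocess_data(X):
    # One common rescaling (not per-vector!) keeps inner products ordered.
    Xs = X * (U / np.linalg.norm(X, axis=1).max())
    norms = np.linalg.norm(Xs, axis=1, keepdims=True)
    extras = np.concatenate([norms ** (2 ** (i + 1)) for i in range(M)], axis=1)
    return np.concatenate([Xs, extras], axis=1)          # P(x)

def preprocess_query(q):
    q = q / np.linalg.norm(q)
    return np.concatenate([q, 0.5 * np.ones(M)])         # Q(q)

# ||P(x) - Q(q)||^2 = 1 + M/4 - 2 q.x' + ||x'||^(2^(M+1)); the last term
# vanishes as M grows, so an ordinary L2 LSH index over P(.) answers
# maximum-inner-product queries.
rng = np.random.default_rng(0)
X, q = rng.random((100, 10)), rng.random(10)
i_l2 = np.linalg.norm(preprocess_data(X) - preprocess_query(q), axis=1).argmin()
i_ip = (X @ q).argmax()
print(i_l2, i_ip)                   # typically the same index
```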
Compressing Gradient Optimizers via Count-Sketches
Ryan Spring · Anastasios Kyrillidis · Vijai Mohan · Anshumali Shrivastava
Many popular first-order optimization methods (e.g., Momentum, AdaGrad, Adam) accelerate the convergence rate of deep learning models. However, these algorithms require auxiliary parameters, which cost additional memory proportional to the number of parameters in the model. The problem is becoming more severe as deep learning models continue to grow larger in order to learn from complex, large-scale datasets. Our proposed solution is to maintain a linear sketch to compress the auxiliary variables. We demonstrate that our technique has the same performance as the full-sized baseline, while using significantly less space for the auxiliary variables. Theoretically, we prove that count-sketch optimization maintains the SGD convergence rate, while gracefully reducing memory usage for large models. On the large-scale 1-Billion Word dataset, we save 25% of the memory used during training (8.6 GB instead of 11.7 GB) with minimal accuracy and performance loss. For an Amazon extreme classification task with over 49.5 million classes, we also reduce the training time by 38% by increasing the mini-batch size 3.5x with our count-sketch optimizer.
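A minimal sketch of the underlying idea, assuming an AdaGrad-style optimizer whose accumulated squared gradients live in a count-sketch instead of a full d-vector; the sizes, hash construction, and median estimator below are illustrative rather than the paper's exact implementation.

```python
import numpy as np

d, r, w = 1_000_000, 3, 4096                  # d params; sketch is r*w << d
rng = np.random.default_rng(0)
A = rng.integers(1, 2**31 - 1, size=(r, 2))   # bucket hash coefficients
S = rng.integers(1, 2**31 - 1, size=(r, 2))   # sign hash coefficients
sketch = np.zeros((r, w))

def buckets(idx):                              # (r, k) bucket per hash row
    return (A[:, :1] * idx + A[:, 1:]) % w

def signs(idx):                                # (r, k) signs in {-1, +1}
    return ((S[:, :1] * idx + S[:, 1:]) % 2) * 2 - 1

def update(idx, vals):                         # fold values into the sketch
    b, s = buckets(idx), signs(idx)
    for row in range(r):
        np.add.at(sketch[row], b[row], s[row] * vals)

def query(idx):                                # median-of-rows estimate
    b, s = buckets(idx), signs(idx)
    return np.median(s * sketch[np.arange(r)[:, None], b], axis=0)

# Inside a training step: accumulate g^2 and adapt the step size from it.
idx = rng.integers(0, d, size=5)
g = rng.normal(size=5)
update(idx, g * g)
est = np.maximum(query(idx), 0.0)              # clip rare negative estimates
print(0.1 * g / (np.sqrt(est) + 1e-8))         # AdaGrad-style update direction
```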
Scalable Fair Clustering
Arturs Backurs · Piotr Indyk · Krzysztof Onak · Baruch Schieber · Ali Vakilian · Tal Wagner
We study the fair variant of the classic k-median problem introduced by Chierichetti et al. (NeurIPS 2017). In the standard k-median problem, given an input pointset P, the goal is to find k centers C and assign each input point to one of the centers in C such that the average distance of points to their cluster centers is minimized. In the fair variant of k-median, the points are colored, and the goal is to minimize the same average distance objective while ensuring that all clusters have an “approximately equal” number of points of each color.
Chierichetti et al. (NeurIPS 2017) proposed a two-phase algorithm for fair k-median. In the first step, the pointset is partitioned into subsets called fairlets that satisfy the fairness requirement and approximately preserve the k-median objective. In the second step, fairlets are merged into k clusters by one of the existing k-median algorithms. The running time of this algorithm is dominated by the first step, which takes super-quadratic time.
In this paper, we present a practical approximate fairlet decomposition algorithm that runs in nearly linear time. We complement our theoretical bounds with empirical evaluation.
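To make the fairlet notion concrete, the following toy decomposition handles the balanced two-color case by pairing each red point with a nearby unused blue point, so every fairlet contains one point of each color. This greedy pairing runs in quadratic time, which is precisely the cost the paper's near-linear algorithm avoids; it is for illustration only, not the proposed method.

```python
import numpy as np

def toy_fairlets(red, blue):
    # red, blue: (m, d) arrays with equally many points of each color.
    used = np.zeros(len(blue), dtype=bool)
    fairlets = []
    for i, p in enumerate(red):
        dist = np.linalg.norm(blue - p, axis=1)
        dist[used] = np.inf
        j = int(dist.argmin())          # nearest still-unused blue point
        used[j] = True
        fairlets.append((i, j))         # one red + one blue = one fairlet
    return fairlets

rng = np.random.default_rng(0)
red, blue = rng.random((50, 2)), rng.random((50, 2))
pairs = toy_fairlets(red, blue)
# Each fairlet is then represented by one of its points, and any k-median
# algorithm clusters these representatives in the second phase.
print(pairs[:3])
```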
Conditional Gradient Methods via Stochastic Path-Integrated Differential Estimator
Alp Yurtsever · Suvrit Sra · Volkan Cevher
We propose a class of novel variance-reduced stochastic conditional gradient methods. By adopting the recent stochastic path-integrated differential estimator (SPIDER) technique of Fang et al. (2018) for the classical Frank-Wolfe (FW) method, we introduce SPIDER-FW for finite-sum minimization as well as the more general expectation minimization problem. SPIDER-FW enjoys superior complexity guarantees in the non-convex setting, while matching the best known FW variants in the convex case. We also extend our framework à la the conditional gradient sliding (CGS) of Lan and Zhou (2016), and propose SPIDER-CGS to further reduce the stochastic first-order oracle complexity. Our numerical evidence supports our theoretical findings and demonstrates the superiority of SPIDER-FW and SPIDER-CGS.
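A hedged sketch of the SPIDER-FW recipe on a toy finite-sum least-squares problem over the L1 ball: the gradient estimate is updated recursively from minibatch differences, with periodic full-gradient refreshes, and each iterate takes a Frank-Wolfe step through a linear minimization oracle. The epoch length, batch size, and step sizes below are illustrative, not the schedules from the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, radius = 500, 20, 1.0
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

def grad(x, idx):                          # minibatch least-squares gradient
    return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

def lmo_l1(g):                             # argmin_{||s||_1 <= radius} <g, s>
    s = np.zeros_like(g)
    j = np.abs(g).argmax()
    s[j] = -radius * np.sign(g[j])
    return s

x_prev = x = np.zeros(d)
v = grad(x, np.arange(n))                  # full gradient at epoch start
for t in range(1, 200):
    if t % 20 == 0:                        # periodic full-gradient refresh
        v = grad(x, np.arange(n))
    else:                                  # SPIDER recursion
        idx = rng.integers(0, n, size=32)
        v = grad(x, idx) - grad(x_prev, idx) + v
    x_prev, gamma = x, 2.0 / (t + 2)
    x = x + gamma * (lmo_l1(v) - x)        # Frank-Wolfe step
print(0.5 * np.mean((A @ x - b) ** 2))     # final objective value
```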
Fault Tolerance in Iterative-Convergent Machine Learning
Aurick Qiao · Bryon Aragam · Bingjing Zhang · Eric Xing
Machine learning (ML) training algorithms often possess an inherent self-correcting behavior due to their iterative-convergent nature. Recent systems exploit this property to achieve adaptability and efficiency in unreliable computing environments by relaxing the consistency of execution and allowing calculation errors to be self-corrected during training. However, the behavior of such systems is only well understood for specific types of calculation errors, such as those caused by staleness, reduced precision, or asynchronicity, and for specific algorithms, such as stochastic gradient descent. In this paper, we develop a general framework to quantify the effects of calculation errors on iterative-convergent algorithms. We then use this framework to derive a worst-case upper bound on the cost of arbitrary perturbations to model parameters during training and to design new strategies for checkpoint-based fault tolerance. Our system, SCAR, reduces the cost of partial failures by 78%–95% compared with traditional checkpoint-based fault tolerance across a variety of ML models and training algorithms, providing near-optimal performance in recovering from failures.
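The abstract does not spell out SCAR's strategies, so the following toy simulation only illustrates the general partial-recovery idea it builds on: when one parameter shard is lost, roll back just that shard to the last checkpoint and let the iterative-convergent training absorb the perturbation, instead of rolling back every worker. The "training" update is a stand-in for SGD.

```python
import numpy as np

rng = np.random.default_rng(1)
shards = [rng.normal(size=100) for _ in range(4)]   # params on 4 workers

def train_step(shards):
    return [s - 0.1 * s for s in shards]            # stand-in for an SGD step

checkpoint = [s.copy() for s in shards]
for step in range(50):
    shards = train_step(shards)
    if step % 10 == 0:
        checkpoint = [s.copy() for s in shards]     # periodic checkpoint
    if step == 25:                                  # worker 2 fails:
        shards[2] = checkpoint[2].copy()            # restore ONLY its shard
print([float(np.linalg.norm(s)) for s in shards])   # shard 2 lags, then reconverges
```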
Static Automatic Batching in TensorFlow
Ashish Agarwal
Dynamic neural networks are becoming increasingly common, yet they are hard to implement efficiently. On-the-fly operation batching for such models is suboptimal and suffers from runtime overheads, while writing manually batched versions can be hard and error-prone. To address this, we extend TensorFlow with pfor, a parallel-for loop optimized using static loop vectorization. With pfor, users can express computation using nested loops and conditional constructs, yet get performance resembling that of a manually batched version. Benchmarks demonstrate speedups of one to two orders of magnitude on a range of tasks, from Jacobian computation to TreeLSTMs.
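The pfor machinery is exposed in current TensorFlow through tf.vectorized_map, which is built on it. A small example in that spirit, computing per-example gradients of a toy linear model: the computation is written for a single example and vectorized across the batch without a Python-level loop. The model and shapes are illustrative.

```python
import tensorflow as tf

w = tf.Variable(tf.random.normal([10, 5]))
x = tf.random.normal([32, 10])                 # a batch of 32 examples

def per_example_grad(xi):
    # Written for ONE example xi of shape [10]; pfor batches it for us.
    with tf.GradientTape() as tape:
        y = tf.linalg.matvec(w, xi, transpose_a=True)   # shape [5]
        loss = tf.reduce_sum(tf.square(y))
    return tape.gradient(loss, w)

grads = tf.vectorized_map(per_example_grad, x)  # vectorized, not looped
print(grads.shape)                              # (32, 10, 5)
```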
Improving Neural Network Quantization without Retraining using Outlier Channel Splitting
Ritchie Zhao · Yuwei Hu · Jordan Dotzel · Christopher De Sa · Zhiru Zhang
Quantization can improve the execution latency and energy efficiency of neural networks on both commodity GPUs and specialized accelerators. The majority of existing literature focuses on training quantized DNNs, while this work examines the less-studied topic of quantizing a floating-point model without (re)training. DNN weights and activations follow a bell-shaped distribution post-training, while practical hardware uses a linear quantization grid. This leads to challenges in dealing with outliers in the distribution. Prior work has addressed this by clipping the outliers or using specialized hardware. In this work, we propose outlier channel splitting (OCS), which duplicates channels containing outliers, then halves the channel values. The network remains functionally identical, but affected outliers are moved toward the center of the distribution. OCS requires no additional training and works on commodity hardware. Experimental evaluation on ImageNet classification and language modeling shows that OCS can outperform state-of-the-art clipping techniques with only minor overhead.
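A minimal sketch of the OCS transformation on a linear layer's weight matrix, assuming whole input channels (columns) are split and that the matching input coordinate is duplicated by the preceding layer; the paper applies the same idea to convolutional channels and studies how many channels to split.

```python
import numpy as np

np.random.seed(0)
W = np.random.randn(4, 8)
W[2, 3] = 10.0                                # plant an outlier weight

c = int(np.abs(W).max(axis=0).argmax())       # channel holding the largest outlier
W2 = np.hstack([W, W[:, [c]] / 2.0])          # append a half-strength duplicate...
W2[:, c] /= 2.0                               # ...and halve the original channel

x = np.random.randn(8)
x2 = np.append(x, x[c])                       # preceding layer duplicates its output
print(np.allclose(W @ x, W2 @ x2))            # True: functionally identical
print(np.abs(W).max(), np.abs(W2).max())      # the outlier moved toward the center
```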
Memory-Optimal Direct Convolutions for Maximizing Classification Accuracy in Embedded Applications
Albert Gural · Boris Murmann
In the age of the Internet of Things (IoT), embedded devices ranging from ARM Cortex M0s with hundreds of kilobytes of RAM to Arduinos with 2KB of RAM are expected to perform increasingly intelligent classification tasks, such as voice and gesture recognition, activity tracking, and biometric security. While convolutional neural networks (CNNs), together with spectrogram preprocessing, are a natural solution to many of these classification tasks, storage of the network's activations often exceeds the hard memory constraints of embedded platforms. This paper presents memory-optimal direct convolutions as a way to push classification accuracy as high as possible under strict hardware memory constraints, at the expense of extra compute. This explores the opposite end of the compute-memory trade-off curve from standard approaches, which minimize latency at the expense of extra memory. We evaluate classification accuracy across a variety of small image and time series datasets, employing memory-optimal CNNs and memory-efficient spectrogram preprocessing. We also validate the memory-optimal CNN technique with an Arduino implementation of the 10-class MNIST classification task, fitting the network specification, weights, and activations entirely within 2KB of SRAM and achieving a classification accuracy of 99.15%, state of the art for small-scale embedded systems.
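One way to see the memory-for-compute trade described above is a streaming direct convolution that keeps only a K-row ring buffer of the input resident instead of full input and output activation maps. This sketch assumes a single channel, stride 1, and valid padding, and illustrates the general idea rather than the authors' exact scheduler.

```python
import numpy as np

def streaming_conv2d(row_source, kernel, width):
    # Cross-correlation (as in CNNs) with only K input rows live at once.
    K = kernel.shape[0]                      # kernel is K x K
    ring = np.zeros((K, width))              # ring buffer of K input rows
    out_rows = []
    for i, row in enumerate(row_source):     # rows arrive one at a time
        ring[i % K] = row
        if i >= K - 1:
            # Reassemble the K most recent rows in arrival order.
            window = np.vstack([ring[(i - K + 1 + r) % K] for r in range(K)])
            out = np.array([(window[:, j:j + K] * kernel).sum()
                            for j in range(width - K + 1)])
            out_rows.append(out)             # emit one output row, then forget
    return np.array(out_rows)

# Usage: feed image rows from a generator; peak activation memory is
# K input rows plus one output row, not the full feature maps.
img = np.random.rand(8, 8)
k = np.ones((3, 3)) / 9.0
res = streaming_conv2d(iter(img), k, width=8)
print(res.shape)                             # (6, 6)
```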
DL2: Training and Querying Neural Networks with Logic
Marc Fischer · Mislav Balunovic · Dana Drachsler-Cohen · Timon Gehr · Ce Zhang · Martin Vechev
We present DL2, a system for training and querying neural networks with logical constraints. Using DL2, one can declaratively specify domain knowledge to be enforced during training, or pose queries on the model with the goal of finding inputs that satisfy a set of constraints. DL2 works by translating logical constraints into a differentiable loss with desirable mathematical properties, which is then minimized with standard gradient-based methods. We evaluate DL2 by training networks with interesting constraints in unsupervised, semi-supervised, and supervised settings. Our experimental evaluation demonstrates that DL2 is more expressive than prior approaches combining logic and neural networks, and that its resulting loss is better suited for optimization. Further, we show that for a number of queries, DL2 can find the desired inputs within seconds (even for large models such as ResNet-50 on ImageNet).
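A sketch of the core translation, in the spirit of the rules described in the paper: each constraint becomes a non-negative differentiable term that is zero exactly when the constraint holds, with comparisons becoming hinge terms, conjunction a sum, and disjunction a product. The toy constraint and variable names are illustrative.

```python
import torch

def leq(a, b):                 # loss for a <= b: zero iff satisfied
    return torch.clamp(a - b, min=0.0)

def conj(*ls):                 # phi1 and phi2 and ...: sum of losses
    return sum(ls)

def disj(*ls):                 # phi1 or phi2 or ...: product of losses
    out = ls[0]
    for l in ls[1:]:
        out = out * l
    return out

# Example: 0 <= p <= 1 and (p <= 0.2 or p >= 0.8), for a model output p.
p = torch.tensor(0.6, requires_grad=True)
zero, one = torch.tensor(0.0), torch.tensor(1.0)
loss = conj(leq(zero, p), leq(p, one),
            disj(leq(p, torch.tensor(0.2)), leq(torch.tensor(0.8), p)))
loss.backward()
print(loss.item(), p.grad.item())   # 0.08, -0.2: descent pushes p toward 0.8
```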