Pegasos: Primal Estimated sub-GrAdient SOlver for SVM

Shai Shalev-Shwartz - Hebrew University, Israel
Yoram Singer - Google Inc., USA
Nathan Srebro - Toyota Technological Institute at Chicago, USA
We describe and analyze a simple and effective iterative algorithm for solving the optimization problem cast by Support Vector Machines (SVM). Our method alternates between stochastic gradient descent steps and projection steps. We prove that the number of iterations required to obtain a solution of accuracy ε is O(1/ε). In contrast, previous analyses of stochastic gradient descent methods require Ω(1/ε²) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is O(d/(λε)), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach can seamlessly be adapted to employ non-linear kernels while working solely on the primal objective function. We demonstrate the efficiency and applicability of our approach by conducting experiments on large text classification problems, comparing our solver to existing state-of-the-art SVM solvers. For example, it takes less than 5 seconds for our solver to converge when solving a text classification problem from Reuters Corpus Volume 1 (RCV1) with 800,000 training examples.
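
The alternation between stochastic sub-gradient steps and projection steps described above can be illustrated with a minimal sketch for the linear-kernel case. The abstract does not state the step size, the projection radius, or how examples are sampled, so the choices below are assumptions made for illustration: one uniformly sampled example per iteration, a step size of 1/(λt), and projection onto the ball of radius 1/√λ. The function names `pegasos_train` and `predict` are hypothetical, and dense feature lists are used for simplicity rather than the sparse representation that the O(d/(λε)) run-time bound relies on.

```python
import math
import random


def pegasos_train(examples, labels, lam, num_iters, seed=0):
    """Pegasos-style sketch for a linear SVM (hinge loss + L2 regularization).

    examples: list of dense feature lists; labels: list of +1 / -1.
    Assumed (not given in the abstract): step size 1/(lam*t), one random
    example per iteration, projection onto the ball of radius 1/sqrt(lam).
    """
    rng = random.Random(seed)
    dim = len(examples[0])
    w = [0.0] * dim
    radius = 1.0 / math.sqrt(lam)

    for t in range(1, num_iters + 1):
        # Sample a single training example uniformly at random.
        i = rng.randrange(len(examples))
        x, y = examples[i], labels[i]

        eta = 1.0 / (lam * t)  # assumed step-size schedule
        margin = y * sum(wj * xj for wj, xj in zip(w, x))

        # Sub-gradient step on (lam/2)*||w||^2 + max(0, 1 - y*<w, x>).
        shrink = 1.0 - eta * lam
        if margin < 1.0:
            w = [shrink * wj + eta * y * xj for wj, xj in zip(w, x)]
        else:
            w = [shrink * wj for wj in w]

        # Projection step: rescale w if it left the ball of radius 1/sqrt(lam).
        norm = math.sqrt(sum(wj * wj for wj in w))
        if norm > radius:
            scale = radius / norm
            w = [scale * wj for wj in w]

    return w


def predict(w, x):
    """Classify x by the sign of the inner product <w, x>."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


if __name__ == "__main__":
    # Tiny linearly separable toy set, purely for illustration.
    X = [[2.0, 1.0], [1.0, 2.0], [3.0, 3.0],
         [-2.0, -1.0], [-1.0, -2.0], [-3.0, -3.0]]
    Y = [1, 1, 1, -1, -1, -1]
    w = pegasos_train(X, Y, lam=0.1, num_iters=1000)
    print([predict(w, x) for x in X])
```

Each iteration touches only one example, so the cost per step depends on the number of features in that example and not on the size of the training set, which is the property the run-time claim rests on.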
