Session

Optimization: Convex and Non-convex

Wed 12 June 14:00 - 14:20 PDT

Almost surely constrained convex optimization

Olivier Fercoq · Ahmet Alacaoglu · Ion Necoara · Volkan Cevher

We propose a stochastic gradient framework for solving stochastic composite convex optimization problems with a (possibly infinite) number of linear inclusion constraints that need to be satisfied almost surely. We use smoothing and homotopy techniques to handle the constraints without the need for matrix-valued projections. We show that our stochastic gradient algorithm achieves an $\mathcal{O}(\log(k)/\sqrt{k})$ convergence rate for general convex objectives and an $\mathcal{O}(\log(k)/k)$ convergence rate for restricted strongly convex objectives. These rates are known to be optimal up to logarithmic factors, even without constraints. We demonstrate the performance of our algorithm with numerical experiments on basis pursuit, hard-margin support vector machines, and portfolio optimization, and show that it achieves state-of-the-art practical performance.
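
A minimal sketch of the kind of update the abstract describes, assuming a single linear inequality constraint $a^\top x \le b$ is sampled per iteration and handled with a growing quadratic penalty as a stand-in for the smoothing/homotopy scheme (the `grad_f` and `sample_constraint` callables, the penalty schedule, and the stepsize are illustrative assumptions, not the authors' exact algorithm):

```python
import numpy as np

def as_constrained_sgd(grad_f, sample_constraint, x0, steps=1000):
    """Illustrative sketch: stochastic gradient on a convex objective with
    linear inequality constraints a^T x <= b enforced via an increasing
    quadratic penalty (a simple smoothing/homotopy stand-in, not the
    authors' exact update)."""
    x = x0.copy()
    for k in range(1, steps + 1):
        a, b = sample_constraint()          # one random constraint per step
        beta = np.sqrt(k)                   # homotopy: penalty grows with k
        viol = max(0.0, a @ x - b)          # constraint violation
        g = grad_f(x) + beta * viol * a     # gradient of f + (beta/2)*viol^2
        x -= g / np.sqrt(k)                 # O(1/sqrt(k)) stepsize
    return x
```

The point the abstract emphasizes is that no projection onto the full (possibly infinite) constraint set is ever computed; each iteration only touches one sampled constraint.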

Wed 12 June 14:20 - 14:25 PDT

Generalized Majorization-Minimization

Sobhan Naderi Parizi · Kun He · Reza Aghajani · Stan Sclaroff · Pedro Felzenszwalb

Non-convex optimization is ubiquitous in machine learning. Majorization-Minimization (MM) is a powerful iterative procedure for optimizing non-convex functions that works by optimizing a sequence of bounds on the function. In MM, the bound at each iteration is required to touch the objective function at the optimizer of the previous bound. We show that this touching constraint is unnecessary and overly restrictive. We generalize MM by relaxing this constraint, and propose a new optimization framework, named Generalized Majorization-Minimization (G-MM), that is more flexible. For instance, G-MM can incorporate application-specific biases into the optimization procedure without changing the objective function. We derive G-MM algorithms for several latent variable models and show empirically that they consistently outperform their MM counterparts in optimizing non-convex objectives. In particular, G-MM algorithms appear to be less sensitive to initialization.
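
To make the "touching constraint" concrete, here is a generic MM loop under an assumed interface in which `build_bound(x)` returns an object representing a majorizer of the objective anchored at `x`; standard MM requires the bound to equal the objective at `x`, and G-MM, as described above, relaxes exactly that requirement:

```python
def mm(objective, build_bound, x0, iters=50):
    """Minimal Majorization-Minimization sketch with an assumed interface:
    build_bound(x) returns a surrogate g with g(y) >= objective(y) for all y
    and, in standard MM, g(x) == objective(x) (the 'touching' constraint).
    G-MM, per the abstract, drops that touching requirement while still
    producing a non-increasing sequence of bound values."""
    x = x0
    for _ in range(iters):
        bound = build_bound(x)       # majorizer anchored at the current iterate
        x = bound.minimize()         # minimize the surrogate, not the objective
    return x
```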

Wed 12 June 14:25 - 14:30 PDT

On the Computation and Communication Complexity of Parallel SGD with Dynamic Batch Sizes for Stochastic Non-Convex Optimization

Hao Yu · Rong Jin

For SGD-based distributed stochastic optimization, the two most important performance metrics are computation complexity, measured by the convergence rate in terms of the number of stochastic gradient accesses, and communication complexity, measured by the number of inter-node communication rounds. The classical data-parallel implementation of SGD over $N$ workers can achieve a linear speedup of its convergence rate but incurs an inter-node communication round at each batch. We study the benefit of dynamically increasing batch sizes in parallel SGD for stochastic non-convex optimization by characterizing the attained convergence rate and the required number of communication rounds. We show that for stochastic non-convex optimization under the Polyak-Lojasiewicz (P-L) condition, classical data-parallel SGD with exponentially increasing batch sizes can achieve the fastest known $O(1/(NT))$ convergence with linear speedup using only $\log(T)$ communication rounds. For general stochastic non-convex optimization, we propose a Catalyst-like algorithm that achieves the fastest known $O(1/\sqrt{NT})$ convergence with linear speedup using only $O(\sqrt{NT}\log(\frac{T}{N}))$ communication rounds.
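
An illustrative serial simulation of the data-parallel scheme with exponentially increasing batch sizes (worker count, batch schedule, and stepsize are assumptions for the sketch, not the paper's Catalyst-like algorithm): each outer iteration corresponds to one communication round, so doubling the batch size lets roughly $\log(T)$ rounds cover $T$ gradient accesses.

```python
import numpy as np

def parallel_sgd_dynamic_batches(stoch_grad, x0, workers=8, rounds=10,
                                 base_batch=4, lr=0.1):
    """Toy simulation of data-parallel SGD with exponentially increasing
    batch sizes (illustrative only). Each round, every worker averages
    `batch` stochastic gradients locally; the server then averages across
    workers and takes one step, so only `rounds` communication rounds are
    needed for the total number of gradient accesses."""
    x = x0.copy()
    for r in range(rounds):
        batch = base_batch * 2 ** r                     # doubling batch size
        worker_grads = [
            np.mean([stoch_grad(x) for _ in range(batch)], axis=0)
            for _ in range(workers)
        ]                                               # local computation only
        x -= lr * np.mean(worker_grads, axis=0)         # one communication round
    return x
```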

Wed 12 June 14:30 - 14:35 PDT

Simple Stochastic Gradient Methods for Non-Smooth Non-Convex Regularized Optimization

Michael Metel · Akiko Takeda

Our work focuses on stochastic gradient methods for optimizing a smooth non-convex loss function with a non-smooth non-convex regularizer. Research on this class of problems is quite limited, and until very recently no non-asymptotic convergence results had been reported. We present two simple stochastic gradient algorithms, for finite-sum and general stochastic optimization problems, which have superior convergence complexities compared to the current state of the art. We also demonstrate the superior practical performance of our algorithms for empirical risk minimization on well-known datasets.
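
A sketch of one natural algorithm in this problem class: a stochastic proximal gradient step paired with a non-convex regularizer that admits a cheap proximal map (here the $\ell_0$ penalty and its hard-thresholding prox, chosen purely for illustration; the paper's two algorithms may differ).

```python
import numpy as np

def prox_sgd(stoch_grad, prox_reg, x0, steps=1000, lr=0.01):
    """Sketch of stochastic proximal gradient for loss(x) + r(x), where r is
    non-smooth and possibly non-convex but admits an efficient proximal map
    (not necessarily either of the algorithms proposed in the paper)."""
    x = x0.copy()
    for _ in range(steps):
        x = prox_reg(x - lr * stoch_grad(x), lr)   # gradient step, then prox
    return x

def hard_threshold(v, lr, lam=0.1):
    """Proximal map of the non-convex L0 penalty lam * ||x||_0:
    keep v_i only if v_i^2 > 2 * lr * lam."""
    out = v.copy()
    out[v ** 2 < 2 * lr * lam] = 0.0
    return out
```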

Wed 12 June 14:35 - 14:40 PDT

Surrogate Losses for Online Learning of Stepsizes in Stochastic Non-Convex Optimization

Zhenxun Zhuang · Ashok Cutkosky · Francesco Orabona

Stochastic Gradient Descent (SGD) has played a central role in machine learning. However, it requires a carefully hand-picked stepsize for fast convergence, which is notoriously tedious and time-consuming to tune. Over the last several years, a plethora of adaptive gradient-based algorithms have emerged to ameliorate this problem. In this paper, we propose new surrogate losses that cast the problem of learning the optimal stepsizes for the stochastic optimization of a non-convex smooth objective function as an online convex optimization problem. This allows the use of no-regret online algorithms to compute optimal stepsizes on the fly. In turn, this yields an SGD algorithm with self-tuned stepsizes that guarantees convergence rates automatically adaptive to the level of noise.
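
For intuition only, here is a hypergradient-style heuristic in which the stepsize is itself updated by a simple online rule; this is a stand-in, not the paper's surrogate-loss construction, which casts stepsize learning as a bona fide online convex optimization problem with no-regret guarantees.

```python
import numpy as np

def sgd_self_tuned(stoch_grad, x0, steps=1000, lr=0.01, hyper_lr=1e-4):
    """Hypergradient-style sketch of SGD with a self-tuned stepsize: the
    stepsize is nudged up when consecutive stochastic gradients agree and
    down when they disagree. Only an illustration of on-the-fly stepsize
    learning, not the algorithm analyzed in the paper."""
    x = x0.copy()
    prev_g = np.zeros_like(x)
    for _ in range(steps):
        g = stoch_grad(x)
        lr = max(lr + hyper_lr * (g @ prev_g), 1e-8)  # online stepsize update
        x -= lr * g
        prev_g = g
    return x
```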

Wed 12 June 14:40 - 15:00 PDT

Efficient Dictionary Learning with Gradient Descent

Dar Gilboa · Sam Buchanan · John Wright

Randomly initialized first-order optimization algorithms are the method of choice for solving many high-dimensional nonconvex problems in machine learning, yet general theoretical guarantees cannot rule out convergence to critical points of poor objective value. For some highly structured nonconvex problems, however, the success of gradient descent can be understood by studying the geometry of the objective. We study one such problem, complete orthogonal dictionary learning, and provide convergence guarantees for randomly initialized gradient descent to the neighborhood of a global optimum. The resulting rates scale as low-order polynomials in the dimension even though the objective possesses an exponential number of saddle points. This efficient convergence can be viewed as a consequence of negative curvature normal to the stable manifolds associated with saddle points, and we provide evidence that this feature is shared by other important nonconvex problems as well.
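
A sketch of randomly initialized gradient descent for one dictionary column, using the common smooth sparsity surrogate $f(q) = \frac{1}{m}\sum_i \log\cosh(q^\top y_i)$ over the sphere; the data matrix `Y`, the surrogate, the stepsize, and the retraction are assumptions for illustration and may differ from the talk's exact formulation.

```python
import numpy as np

def sphere_gd_one_column(Y, q0, steps=500, lr=0.05):
    """Sketch: randomly initialized gradient descent on the sphere for one
    dictionary column, minimizing f(q) = mean(log cosh(q^T y_i)) subject to
    ||q|| = 1 (a standard smooth surrogate for complete orthogonal
    dictionary learning; not necessarily the talk's exact objective)."""
    q = q0 / np.linalg.norm(q0)
    for _ in range(steps):
        z = Y.T @ q                              # correlations with samples
        grad = Y @ np.tanh(z) / Y.shape[1]       # Euclidean gradient of f
        rgrad = grad - (q @ grad) * q            # project onto tangent space
        q = q - lr * rgrad
        q /= np.linalg.norm(q)                   # retract back to the sphere
    return q
```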

Wed 12 June 15:00 - 15:05 PDT

Plug-and-Play Methods Provably Converge with Properly Trained Denoisers

Ernest Ryu · Jialin Liu · Sicheng Wang · Xiaohan Chen · Zhangyang Wang · Wotao Yin

Plug-and-play (PnP) is a non-convex framework that integrates modern denoising priors, such as BM3D or deep learning-based denoisers, into ADMM or other proximal algorithms. An advantage of PnP is that one can use pre-trained denoisers when there is not sufficient data for end-to-end training. Although PnP has recently been studied extensively and has exhibited great empirical results, theoretical analysis addressing even the most basic question of convergence has been insufficient. In this paper, we theoretically establish convergence of PnP-FBS and PnP-ADMM, without using diminishing stepsizes, under a certain Lipschitz condition on the denoisers. We then propose a technique, which we call real spectral normalization, to train deep learning-based denoisers that satisfy the proposed Lipschitz condition. Finally, we present experimental results that validate the theory.
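
A minimal PnP-FBS sketch matching the description above: a gradient step on the data-fidelity term followed by a plug-in denoiser in place of a proximal operator. The `grad_f` and `denoise` callables and the constant stepsize `alpha` are placeholders; the Lipschitz condition and real spectral normalization from the paper are what make such a scheme provably convergent.

```python
import numpy as np

def pnp_fbs(grad_f, denoise, x0, steps=100, alpha=0.1):
    """Minimal Plug-and-Play forward-backward splitting (PnP-FBS) sketch:
    alternate a gradient step on the data-fidelity term with a plug-in
    denoiser. `denoise` could wrap BM3D or a trained network; here it is
    just an abstract callable."""
    x = x0.copy()
    for _ in range(steps):
        x = denoise(x - alpha * grad_f(x))   # forward (gradient) + backward (denoise)
    return x
```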

Wed 12 June 15:05 - 15:10 PDT

Riemannian adaptive stochastic gradient algorithms on matrix manifolds

Hiroyuki Kasai · Pratik Kumar Jawanpuria · Bamdev Mishra

Adaptive stochastic gradient algorithms in the Euclidean space have attracted much attention lately. Such explorations on Riemannian manifolds, on the other hand, are relatively new, limited, and challenging, because of the intrinsic non-linear structure of the underlying manifold and the absence of a canonical coordinate system. In machine learning applications, however, most manifolds of interest are represented as matrices with notions of row and column subspaces. In addition, the implicit manifold-related constraints may also lie on such subspaces; for example, the Grassmann manifold is the set of column subspaces. This rich structure should therefore not be lost by flattening matrices into a stack of vectors when developing optimization algorithms on manifolds.

We propose novel stochastic gradient algorithms for problems on Riemannian manifolds by adapting the row and column subspaces of gradients. Our algorithms are provably convergent and they achieve the convergence rate of order $O(\log(T)/\sqrt{T})$, where $T$ is the number of iterations. Our experiments illustrate that the proposed algorithms outperform existing Riemannian adaptive stochastic algorithms.
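
A rough sketch of adapting the row and column subspaces of matrix gradients, here with simple row/column second-moment accumulators on the Stiefel manifold (orthonormal columns) and a QR retraction; the actual adaptation scheme, manifolds, and retractions in the paper may differ.

```python
import numpy as np

def riemannian_adaptive_sgd(stoch_egrad, X0, steps=200, lr=0.01, eps=1e-8):
    """Sketch of adaptive stochastic gradient on the Stiefel manifold:
    accumulate row-wise and column-wise second moments of the Euclidean
    gradient, precondition, project to the tangent space, and retract with
    a QR factorization. Illustrative only."""
    X = X0.copy()
    row_acc = np.zeros(X.shape[0])
    col_acc = np.zeros(X.shape[1])
    for _ in range(steps):
        G = stoch_egrad(X)                              # Euclidean gradient
        row_acc += (G ** 2).sum(axis=1)                 # adapt the row subspace
        col_acc += (G ** 2).sum(axis=0)                 # adapt the column subspace
        G = G / np.sqrt(row_acc + eps)[:, None] / np.sqrt(col_acc + eps)[None, :]
        R = G - X @ (X.T @ G + G.T @ X) / 2             # tangent-space projection
        X, _ = np.linalg.qr(X - lr * R)                 # QR retraction
    return X
```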

Wed 12 June 15:10 - 15:15 PDT

Stochastic Optimization for DC Functions and Non-smooth Non-convex Regularizers with Non-asymptotic Convergence

Yi Xu · Qi Qi · Qihang Lin · Rong Jin · Tianbao Yang

Difference-of-convex (DC) functions cover a broad family of non-convex, possibly non-smooth and non-differentiable functions, and have wide applications in machine learning and statistics. Although deterministic algorithms for DC functions have been extensively studied, stochastic optimization, which is better suited to learning with big data, remains under-explored. In this paper, we propose new stochastic optimization algorithms and study their first-order convergence theories for solving a broad family of DC functions. We improve the existing algorithms and theories of stochastic optimization for DC functions from both practical and theoretical perspectives. Moreover, we extend the proposed stochastic algorithms for DC functions to problems with a general non-convex non-differentiable regularizer, which does not necessarily have a DC decomposition but enjoys an efficient proximal mapping. To the best of our knowledge, this is the first work to give non-asymptotic convergence guarantees for solving non-convex optimization whose objective has a general non-convex non-differentiable regularizer.
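
A sketch of the classical DC linearization idea in stochastic form, for an objective $g(x) - h(x) + r(x)$ with $g, h$ convex: replace the concave part $-h$ by its linearization at the current iterate and take a stochastic proximal-gradient step. The stepsize rule and callable interfaces are assumptions; the paper's algorithms and analysis are more refined.

```python
import numpy as np

def stochastic_dca(stoch_grad_g, stoch_grad_h, prox_r, x0, steps=1000):
    """Sketch of a stochastic method for a DC objective g(x) - h(x) + r(x):
    linearize the concave part -h at the current iterate using a stochastic
    (sub)gradient of h, then take a stochastic proximal-gradient step on the
    resulting convex model. Illustrates the DC idea, not the paper's exact
    algorithms or stepsize rules."""
    x = x0.copy()
    for k in range(1, steps + 1):
        lr = 1.0 / np.sqrt(k)
        v = stoch_grad_h(x)                    # subgradient of the concave part
        g = stoch_grad_g(x) - v                # gradient of the convexified model
        x = prox_r(x - lr * g, lr)             # proximal step on the regularizer
    return x
```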

Wed 12 June 15:15 - 15:20 PDT

Alternating Minimizations Converge to Second-Order Optimal Solutions

Qiuwei Li · Zhihui Zhu · Gongguo Tang

This work studies the second-order convergence of both standard alternating minimization and proximal alternating minimization. We show that under mild assumptions on the (nonconvex) objective function, both algorithms almost surely avoid strict saddle points under random initialization. Together with known first-order convergence results, this implies that both algorithms converge to a second-order stationary point. This solves an open problem on the second-order convergence of alternating minimization algorithms, which are widely used in practice to solve large-scale nonconvex problems thanks to their simple implementation, fast convergence, and superb empirical performance.
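
A concrete instance of proximal alternating minimization on a two-block nonconvex problem, low-rank matrix factorization, where each block update is an exact ridge-regularized least-squares solve from random initialization; the objective is chosen for illustration and is not necessarily the one analyzed in the talk.

```python
import numpy as np

def proximal_alt_min(Y, rank=5, iters=100, beta=1e-3):
    """Sketch of proximal alternating minimization for the two-block
    nonconvex problem min_{U,V} ||Y - U V^T||_F^2: each block is minimized
    exactly with a proximal term (beta/2) * ||block - previous||_F^2, which
    for this least-squares structure is a ridge-regularized solve."""
    m, n = Y.shape
    U = np.random.randn(m, rank)    # random initialization
    V = np.random.randn(n, rank)
    I = np.eye(rank)
    for _ in range(iters):
        # U-step: argmin_U ||Y - U V^T||^2 + beta ||U - U_prev||^2
        U = (Y @ V + beta * U) @ np.linalg.inv(V.T @ V + beta * I)
        # V-step: argmin_V ||Y - U V^T||^2 + beta ||V - V_prev||^2
        V = (Y.T @ U + beta * V) @ np.linalg.inv(U.T @ U + beta * I)
    return U, V
```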