Paper ID: 834
Title: On Graduated Optimization for Stochastic Non-Convex Problems

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
In this paper, the authors develop an approach for a particular class of non-convex functions that is closely related to the idea of graduated optimization.

Clarity - Justification:
The paper is well written and easy to follow. Both the new concepts and the algorithms are well presented.

Significance - Justification:
I think the significance of this work lies in introducing the so-called sigma-nice functions, which nicely capture the concept of granularity. It is the definition of sigma-niceness that allows one to start by optimizing a coarse-grained version of the target function and gradually narrow the solution down to a small domain. Although the authors show an example of a sigma-nice function, it is unclear whether it is possible to efficiently decide if a given function is sigma-nice. Finally, the empirical results on neural networks are not very convincing, since there are many heuristic approaches to non-convex optimization (e.g., annealing), and the authors should compare their approach to these existing heuristics (note that the authors did not show any evidence that the NN objective function is sigma-nice, so none of the analysis applies there).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
See the comments in the significance section.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
A nice approach with a strong assumption (niceness) on the class of non-convex functions. The function class is interesting, but it lacks motivation as to whether the assumption is related to the landscape of actual neural networks (beyond the fact that valleys are possible). The experiments show that the algorithm obtains better solution quality than standard mini-batch SGD, which is interesting. While it seems fine to demonstrate the phenomenon for a 1-layer neural network, at least a discussion and empirical study are in order for actual, deeper neural networks.

Clarity - Justification:
Clearly written.

Significance - Justification:
The main remaining question is whether the niceness property is (a) necessary and (b) captures a realistic aspect of any neural network. Quoting from the paper itself: "Is niceness necessary for convergence of first-order methods to a global optimum? Is there a more lenient property that better captures the power of graduated optimization?" Question (b) is even more pressing, as the authors do not attempt to motivate any relation between niceness and neural network objective functions beyond briefly mentioning that a NN can have valleys. In particular, the centering property is a very strong assumption and appears completely out of the blue. This needs to be motivated to give relevance to the work, despite the concise, proven theorems.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
n.a.
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This work presents a "graduated optimization" algorithm for non-convex problems and provides a performance guarantee. The proposed algorithm relies on solving smoothed versions of the objective function, where the strength of the smoothing is gradually reduced. The smoothing process helps eliminate sharp valleys that may slow down SGD. The guarantee requires the cost function to satisfy certain properties that the authors call \sigma-niceness. These properties ensure that following the solutions of the smoothed problems eventually leads to the global minimum of the function. Experiments with a neural network trained on down-sampled MNIST data show a clear improvement in quickly reducing the training error compared to SGD.

Clarity - Justification:
The paper is very well organized, and notions such as smoothing and valleys are nicely illustrated by plots and figures.

Significance - Justification:
This work represents very significant progress in the theoretical analysis of graduated optimization methods. Despite the empirical success of graduated optimization, its theoretical analysis has received attention only recently. This paper gives a new theoretical guarantee that is important in two ways: 1. It relies on intuitive properties of the cost function, such as \sigma-niceness (in contrast to other recent results on this method). 2. The smoothing is obtained by sampling, so a closed-form expression for the smoothed function is not required. Given the growing interest in sophisticated models that involve non-convex optimization for learning or inference, such as deep learning, this line of research may have a great impact by eventually producing more effective learning algorithms for such models.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1. I suppose the error plots correspond to the training error. Have you also tried monitoring the generalization error? Although the proposed algorithm is designed solely to minimize the training loss, if the capacity of the network is well controlled, the proposed method should give better generalization as well. That comparison would be nice, but it is not necessary.
2. Does the valley in Fig. 3 satisfy the condition of Lemma 3.4? I just want to make sure that the constraint on the magnitude of the exponential term, as stated in Lemma 3.4, can still lead to visually distinct valleys.
3. Have you considered smoother distributions for the smoothing, e.g., instead of the uniform distribution within a ball, a Gaussian distribution whose standard deviation is related to the radius of the ball? Working with smoother distributions for averaging is often more advantageous in practice, though it might make the analysis more complicated. I was wondering whether you have thought about this, and if so, what stopped you from using it instead of the uniform distribution? (A rough sketch of both smoothing variants is given after these reviews.)

Minor comment: in equation (2), x^2 should be replaced with ||x||^2.

=====
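The following is a minimal Python/NumPy sketch intended only to make concrete the graduated-optimization loop summarized in Review #3 (SGD on a sampled smoothing of the objective, with the smoothing radius gradually shrunk) and the alternative Gaussian smoothing raised in comment 3. The toy objective, function names, step sizes, and schedule are illustrative assumptions; this is not the paper's algorithm, constants, or a setting covered by its sigma-niceness guarantees.

import numpy as np

rng = np.random.default_rng(0)

def toy_grad(x):
    # Stochastic gradient of an illustrative non-convex objective
    # f(x) = ||x||^2 + sum(sin(3x)), with additive Gaussian noise.
    return 2 * x + 3 * np.cos(3 * x) + rng.normal(scale=0.1, size=x.shape)

def smoothed_grad(x, delta, grad, n_samples=10, gaussian=False):
    # Monte Carlo estimate of the gradient of the delta-smoothed objective:
    # average stochastic gradients at randomly perturbed points, either
    # uniformly within a ball of radius delta (as in the paper's construction)
    # or, per comment 3, with a Gaussian of standard deviation delta.
    g = np.zeros_like(x)
    for _ in range(n_samples):
        if gaussian:
            u = rng.normal(scale=delta, size=x.shape)
        else:
            direction = rng.normal(size=x.shape)
            direction /= np.linalg.norm(direction)
            u = delta * rng.uniform() ** (1.0 / x.size) * direction
        g += grad(x + u)
    return g / n_samples

def graduated_sgd(x0, grad, delta0=2.0, stages=6, steps=200, lr=0.05, gaussian=False):
    # Coarse-to-fine loop: optimize the heavily smoothed surrogate first,
    # then halve the smoothing radius before each subsequent stage.
    x, delta = np.asarray(x0, dtype=float).copy(), delta0
    for _ in range(stages):
        for _ in range(steps):
            x -= lr * smoothed_grad(x, delta, grad, gaussian=gaussian)
        delta /= 2.0
    return x

print(graduated_sgd(np.array([2.5]), toy_grad))                 # ball smoothing
print(graduated_sgd(np.array([2.5]), toy_grad, gaussian=True))  # Gaussian smoothing

The only design choice of note here is the perturbation distribution: gaussian=False samples uniformly from a ball of radius delta, matching the construction discussed in the reviews, while gaussian=True implements the smoother alternative suggested in comment 3.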