We thank the reviewers for their time and effort and appreciate their useful comments.

Reviewer 1:
===========

* Error plots: We have presented the training error. The error on the test set exhibits similar behaviour, and we agree it is a good idea to include it in the final version of the paper.

* “Does the valley in Fig 3 satisfy the condition of Lemma 3.4? Just to make sure the constraint on the magnitude of the exponential term as stated by Lemma 3.4 can still lead to distinct enough valleys visually.”
The 2D valley depicted in Fig 3 uses \lambda = 0.1 and \alpha = 0.5, which does not satisfy the condition of Lemma 3.4. However, the constant \alpha in Lemma 3.4 is not tight, and one can easily validate numerically that Fig 3 does describe a sigma-nice function with sigma = 0.5 (and a = 1). The validation is straightforward because, as in the proof of Lemma 3.4, showing that the valley is sigma-nice reduces to showing that one of its one-dimensional projections is sigma-nice.

* “Have you thought about smoother distributions for smoothing …I was wondering if you have thought about it, and if so, what stopped you to use that instead of uniform distribution?”
We stated our results for uniform-ball smoothing, since this seemed the most natural choice to us. The results can indeed be extended, without much difficulty, to Gaussian smoothing (the numerical sketch under Reviewer 3 below illustrates both variants).

Reviewer 2:
===========

* “While it seems ok to demonstrate the phenomenon for a 1-layer neural network, at least a discussion and empirical studies are in order for the situation of actual deeper neural networks.”
The experimental study serves as a proof of concept for the possible benefits of our suggested method, and also demonstrates the valley phenomenon on a real (small-scale) problem. We agree that further experimental study is in order; however, we believe this is the subject of a more comprehensive investigation, left for future work.

* “The main question remaining is if the niceness property is a) necessary and b) capturing a realistic aspect of any neural network… Question b) is even more pressing as the authors don't attempt to motivate any relation of niceness and neural network objective functions, beyond briefly mentioning that a NN can have valleys. In particular the centering property is a very strong assumption and appears completely out of the blue. This needs to be motivated to give relevance to the work, despite the concise proven theorems”
We agree that these are very interesting questions (we mention them as future directions), and consider this paper only a first step in the (very) challenging task of understanding when we can efficiently reach a global optimum in non-convex optimization. In this effort, we also give the first rigorous analysis of a common heuristic for non-convex optimization (graduated optimization), and show when it is guaranteed to converge to the global minimizer.

Reviewer 3:
===========

* “Although authors show an example of sigma-niceness functions, it is however unclear if it is possible to efficiently decide if any function is sigma-niceness… note that the authors did not show any evidence that the objective function of NN is sigma-niceness, and therefore none of the analysis applies here”
We show that the valley phenomenon that appears in the NN we investigated is captured by the sigma-nice property. This valley explains why SGD stalls, and demonstrates the benefits of smoothing the objective; a small numerical sketch of this effect is given below. This is a first step in exploring sigma-nice structure in practice, and among the first attempts to explore when non-convex function structure can lead to efficient optimization. Admittedly, much study is left for future work.
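To make the smoothing argument concrete, the following minimal numpy sketch illustrates the effect on a hypothetical 1D valley cross-section. The function g below, the smoothing radius, and all constants are illustrative stand-ins, not the function of Fig 3 or our NN objective. The sketch estimates, by sampling, the gradient of the smoothed surrogate \hat{f}_\delta(x) = E_u[f(x + \delta u)] with u uniform in the unit ball; replacing the uniform samples with Gaussian ones gives the Gaussian-smoothing variant raised by Reviewer 1.

    import numpy as np

    rng = np.random.default_rng(0)

    def g(x):
        # Hypothetical 1D valley cross-section (NOT the function of Fig 3):
        # a basin with global minimum at x = 0, a nearly flat plateau on
        # 1 <= x <= 5 (slope 1e-4), and a quadratic wall beyond x = 5.
        x = np.asarray(x, dtype=float)
        return np.where(x < 1.0, x**2,
               np.where(x < 5.0, 1.0 + 1e-4 * (x - 1.0),
                        1.0 + 4e-4 + (x - 5.0)**2))

    def smoothed_grad(f, x, delta, n=20000, eps=1e-3, dist="ball"):
        # Monte-Carlo gradient of the smoothed surrogate
        # \hat{f}_\delta(x) = E_u[f(x + \delta u)], with u uniform in the
        # unit ball (in 1D, the interval [-1, 1]) or standard Gaussian.
        # The same samples u are reused on both sides of the central
        # difference so that most of the Monte-Carlo noise cancels.
        u = rng.uniform(-1.0, 1.0, n) if dist == "ball" else rng.normal(0.0, 1.0, n)
        return np.mean(f(x + eps + delta * u) - f(x - eps + delta * u)) / (2 * eps)

    x0 = 3.0  # middle of the plateau, far from the global minimum at x = 0
    raw = float(g(x0 + 1e-3) - g(x0 - 1e-3)) / 2e-3
    print("raw gradient at x0             :", round(raw, 5))   # ~1e-4: plain (S)GD barely moves
    print("ball-smoothed grad (delta=2.5) :", round(float(smoothed_grad(g, x0, 2.5)), 3))
    print("gauss-smoothed grad (delta=2.5):", round(float(smoothed_grad(g, x0, 2.5, dist="gauss")), 3))

In this toy example the raw gradient on the plateau is 1e-4, so un-smoothed gradient steps make essentially no progress, whereas both smoothed gradients are of order 0.2-0.4 and push descent toward the basin that contains the global minimum; switching between uniform-ball and Gaussian smoothing only changes the sampling distribution.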
* “Finally, the empirical results on NN are not very convincing since there are many heuristic approaches that address the problem of non-convex optimization (e.g. annealing)”
We compare against SGD since it is the most basic and most popular optimization method in the field of NNs. The experiments we present aim at:
1. Demonstrating the stalling of SGD caused by the valley phenomenon.
2. Showing how graduated optimization can overcome these valleys and yield better performance than SGD.
Interestingly, graduated optimization can be viewed as a kind of annealing: when the smoothing (temperature) is high, the objective is coarse and the method tends to overcome local minima; as the smoothing (temperature) decreases, we approach the original objective and the method tends to follow the greedy (original gradient) direction. A schematic sketch of this coarse-to-fine loop is given below.
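To illustrate this annealing view, here is a schematic continuation of the numpy sketch given in the response to Reviewer 3 above (it reuses g and smoothed_grad defined there). The halving schedule for delta, the step size, and the number of rounds and inner steps are illustrative placeholders, not the constants of our algorithm or its analysis.

    def graduated_descent(f, x, delta0=4.0, rounds=6, inner_steps=200, lr=0.2):
        # Coarse-to-fine loop: start at a high "temperature" (large delta),
        # take noisy gradient steps on the smoothed surrogate, then halve
        # delta each round so the surrogate approaches the original objective.
        delta = delta0
        for _ in range(rounds):
            for _ in range(inner_steps):
                x = x - lr * smoothed_grad(f, x, delta)
            delta *= 0.5
        return x

    # Same gradient budget with plain (un-smoothed) gradient descent:
    x_plain = 3.0
    for _ in range(6 * 200):
        x_plain = x_plain - 0.2 * float(g(x_plain + 1e-3) - g(x_plain - 1e-3)) / 2e-3

    print("plain descent ends at x     =", round(float(x_plain), 3))                 # still ~3: stuck on the plateau
    print("graduated descent ends at x =", round(float(graduated_descent(g, 3.0)), 3))  # ~0: near the global minimum

On this toy valley, the un-smoothed descent stays on the plateau, while the coarse-to-fine (high-to-low temperature) loop first escapes the flat region under heavy smoothing and then, as delta shrinks, refines its iterate toward the global minimum of the original objective.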