Paper ID: 363
Title: On the Quality of the Initial Basin in Overspecified Neural Networks

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper analyzes the quality of the basin of attraction for learning in the over-specified regime for a ReLU network.

Clarity - Justification:
Given that this is a theoretical paper, I think the quality of the write-up is good. The appendix provides all the proofs and appears to be sufficiently detailed. Small improvements are possible, though such things are usually hard to write perfectly for everyone.

Significance - Justification:
The paper offers an interesting alternative explanation for the results in Dauphin et al. without trying to reduce the model to a spin glass model. This kind of theoretical construct could be very useful for understanding many of the standing theoretical questions about neural networks.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Quite an interesting paper. It tries to address, from a theoretical point of view, the results in Dauphin et al. Compared to other works, the authors do not try to reduce the problem to a spin glass model in order to prove the structure of the loss. Rather, they take into account the nature of the model (ReLU activations) and make a geometric argument for this property.

The results they show are not complete yet, in my opinion. For example, assumptions are made, such as the data consisting of singletons or being clustered in a particular way. The proof also focuses on a single-hidden-layer MLP. However, I think these constructs are interesting and might be employed in the future for more insightful proofs.

Regarding the over-specification of the model, I do not think it is extremely problematic. We know that the structure of the loss function changes in this regime (acquiring the particular structure where local minima are close to the global minimum). My understanding is that, in the over-specified case, the proof does not rely on simply encoding each training example in the weights of, say, the output layer. From this perspective, I think the result is insightful.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper provides a compelling theoretical explanation for the observation that overparametrized neural networks are often very easy to train. The paper gives a sequence of theorems, each providing a different view onto this problem. The first result essentially shows that if there is a path from a bad initial point (one that is worse than the all-zeros point) to a better solution, then there is a path along which the objective monotonically decreases. The authors show that a natural randomized initialization produces a point satisfying the assumptions of their theorem with high probability. The second result shows that for two-layer networks, adding extra neurons (nodes in the single hidden layer) while holding the activation weights of the other neurons fixed always improves the value of the optimal point in the basin of attraction. The third result shows that an impressive counterexample of Auer's, which demonstrates the difficulty of fitting a neural net with a single neuron, becomes *easier* when more neurons are introduced. The last results show that when the data are structured, the probability of initializing in a basin with a good local minimum is extremely high.

Clarity - Justification:
Beautifully written, well motivated, thoughtfully assembled.
Significance - Justification:
Understanding the success of neural networks in practice remains an important theoretical problem in machine learning. This paper provides a number of compelling ways to understand this good performance in realistic problem settings.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- Why is the second bullet of Definition 1 necessary?
- Line 405: What would it mean for there *not* to exist a continuous path? Does this happen only when the activation weights are constrained?
- Line 471: It would be nice to have a geometric picture or interpretation of this condition.
- Line 643: It might be nice to define the basin value closer to the first place it is used in the paper.
- Theorem 4: Clarify that the theorem holds for any singleton distribution; the expectation is over the randomness in the initialization.
- Line 816: Is it true that Bas(W, v) \leq \alpha iff the basin around W and v contains a global minimum? You might choose notation to make this clearer, for example using p^\star in place of \alpha.
- Line 841: The quantifier \exists is not necessary.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents an analysis of the optimization landscape associated with learning "large" ReLU networks. The result is theoretical, and the paper does not promise algorithmic success from the presented work. Nevertheless, the result is very interesting: from any random initialization of a large ReLU network, there is a path to the global minimum along which the cost function monotonically decreases. This is fascinating because it shows there is no need to hop from one basin of attraction to another (which makes life difficult for local descent methods) in order to reach the global minimum. This provides a solid argument for a kind of simplicity of the optimization landscapes of large ReLU networks and confirms some earlier empirical observations. What makes this paper more relevant is that the analysis applies to realistic networks (ReLU), unlike earlier analyses of oversimplified models such as polynomial networks or approximations by spin glass models.

Clarity - Justification:
The paper might benefit from the additional clarifications suggested in the detailed comments.

Significance - Justification:
Understanding the shape of the high-dimensional optimization landscapes associated with deep learning is a great open problem. Progress in this direction can provide insights for designing better algorithms, or confidence in the efficacy of existing algorithms, so I find this line of research very promising. What makes this paper more relevant is that the analysis applies to realistic networks (ReLU), unlike earlier analyses of oversimplified models such as polynomial networks or approximations by spin glass models. However, the specific contributions in this paper are somewhat limited, and it is not clear whether the provided insights could lead to algorithmic improvements. See the detailed comments for the argument.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, I have a very positive opinion of this work. The paper addresses a very interesting and difficult open problem and provides solid results. Nevertheless, I have some concerns, listed below.
1. EMPHASIZE THE LIMITATION OF THE RESULT MORE EXPLICITLY AND CLEARLY: The authors emphasize throughout the paper that their analysis is merely geometric and does not promise the success of methods like gradient descent. However, it may not be immediately clear to the reader *why* gradient descent may fail, and why merely having a monotonically decreasing cost value between the initial point and the global minimum is not enough for those methods to succeed. I think a visual illustrative example would be very helpful. For example, even a 1D plot showing that the monotonically decreasing path from the center to the global minimum has a smaller slope than the one that takes the center sharply to a local minimum would be very insightful.

2. EMPHASIZE THAT OVER-SPECIFIED NETWORKS MAY LEAD TO LARGER GENERALIZATION ERROR: Although it is somewhat obvious, please emphasize that working with over-specified networks only guarantees lower "training" error; it may actually increase the generalization error due to overfitting. I understand that your paper is purely about optimization and thus focuses on training error, but making this clarification is important for those who read this contribution from a learning perspective.

3. RECOMMENDING THE USE OF VISUAL ILLUSTRATIONS WHEREVER POSSIBLE (FOR IMPATIENT READERS TO GET A QUICK UNDERSTANDING OF THE CONTRIBUTION): The paper would generally benefit from some visual illustrations for those who want to quickly know what the paper is about before deciding whether to read it in depth. There are places where visual illustrations could be used. For example, for the brittle example (Auer's), how about two 1D plots showing the training cost along the line connecting the initial point to the global minimum, for the small and the large network? This would visually depict how the cost value changes when only the size of the network changes (hopefully not monotonically decreasing for the small net, but monotonically decreasing for the large one). A minimal code sketch of such a plot follows the reviews.

=====
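The following is a minimal sketch (not taken from the paper under review) of the 1D visualization suggested in Review #3, point 3: plot the training loss of a one-hidden-layer ReLU network along the straight line connecting a random initialization to a reference well-fitting solution, once for a small hidden layer and once for an over-specified one. The data, network sizes, and helper names (loss, line_profile) are hypothetical choices for illustration, and a hand-crafted teacher-matching point stands in for the global minimum.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic regression data: targets generated by a single "teacher" ReLU unit.
d, n = 5, 200
X = rng.normal(size=(n, d))
w_teacher = rng.normal(size=d)
y = np.maximum(X @ w_teacher, 0.0)

def loss(W, v):
    """Mean squared training error of the network x -> v^T relu(W x)."""
    pred = np.maximum(X @ W.T, 0.0) @ v
    return np.mean((pred - y) ** 2)

def line_profile(W0, v0, W1, v1, num=101):
    """Loss along the segment (1 - t) * (W0, v0) + t * (W1, v1)."""
    ts = np.linspace(0.0, 1.0, num)
    return ts, [loss((1 - t) * W0 + t * W1, (1 - t) * v0 + t * v1) for t in ts]

for k in (1, 50):  # small vs. over-specified hidden layer
    W0, v0 = rng.normal(size=(k, d)), rng.normal(size=k)   # random initialization
    # Reference endpoint: copy the teacher into the first hidden unit
    # (a proxy for a good/global solution, purely for illustration).
    W1 = np.zeros((k, d)); W1[0] = w_teacher
    v1 = np.zeros(k); v1[0] = 1.0
    ts, losses = line_profile(W0, v0, W1, v1)
    plt.plot(ts, losses, label=f"k = {k} hidden units")

plt.xlabel("interpolation parameter t (init -> reference solution)")
plt.ylabel("training loss")
plt.legend()
plt.show()
```

Plotting both profiles on one set of axes makes the reviewer's point directly visible: whether the loss along the connecting segment behaves more benignly for the larger network than for the small one in this toy setup.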