Paper ID: 451
Title: L1-regularized Neural Networks are Improperly Learnable in Polynomial Time

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proves that for a large class of activations, all networks of depth k with a constant L_1-norm bound L on the weights of each neuron are learnable. The complexity is polynomial in 1/eps and 1/delta, but much worse in k and L. The learning algorithm is a kernel SVM with a certain symmetric kernel that the authors define. The proof shows that the class of such networks is contained in the kernel space, and bounds the norms of those functions.

Clarity - Justification:
The paper is adequately written.

Significance - Justification:
This is a nice result that adds to the large body of work showing that certain function classes can be approximated (or, in this case, expressed) by kernel spaces.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
As I wrote in the previous section, the paper adds a nice result in the context of approximating function spaces by kernel spaces. The proof is also quite elegant. That being said, I think the obtained bounds are too large to be meaningful (due to the dependence on k and L). Also, because of that, and because the algorithm is a kernel SVM, I don't think the paper sheds significant light on neural network learning. Finally, the algorithmic contribution is marginal: after all, the algorithm is simply an SVM with a symmetric kernel.
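For concreteness, here is a minimal sketch of such a learner, assuming scikit-learn and using a hypothetical inverse-polynomial kernel k(x, x') = 1/(2 - <x, x'>) on unit-ball inputs as a stand-in (an illustrative choice, not necessarily the kernel the authors construct):

import numpy as np
from sklearn.svm import SVC

def inverse_poly_kernel(X, Y):
    # Symmetric PSD kernel k(x, y) = 1 / (2 - <x, y>), well-defined when ||x||, ||y|| <= 1.
    # Illustrative stand-in only; the paper defines its own kernel.
    return 1.0 / (2.0 - X @ Y.T)

# Toy data, rescaled into the unit ball so the kernel above is well-defined.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

clf = SVC(kernel=inverse_poly_kernel, C=1.0)  # the improper learner is an SVM, not a network
clf.fit(X, y)
print(clf.score(X, y))

The learning procedure itself is entirely standard; what is specific to the paper is the kernel construction and the bound on the norm of the target functions in the induced kernel space.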
=====

Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper studies the problem of improper learning of multi-layer neural nets under sparsity assumptions. The authors show that neural networks whose depth is bounded by a constant (k), and where the L_1 norm of the weights entering each neuron is also bounded by a constant (L), can be improperly learned by kernel methods, efficiently. The sample complexity and runtime of this method are polynomial in 1/\epsilon, \log(1/\delta), and a parameter F(k,L) that is a function of k and L only. The paper also shows that for the squared activation function, F(k,L) is polynomial in L (but doubly exponential in k). For sigmoid-like and ReLU-like functions, F(k,L) is no longer polynomial in L. The paper also shows that the exponential dependence of F(k,L) on L is inevitable for sigmoid-like and ReLU-like functions, via a reduction from the problem of learning intersections of half-spaces.

Clarity - Justification:
The paper is written clearly, and the overview of the approach and techniques is explained well.

Significance - Justification:
The significance of this paper is in showing that multi-layer neural networks are learnable when certain assumptions are made about the number of layers and the sparsity of the weight vectors. The paper also shows the power of kernel methods in improperly learning multi-layer neural networks, which has been of significant interest to the community in the last few years.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper studies the problem of improper learning of multi-layer neural nets under assumptions on the depth of the network and the sparsity of the weight vectors. It shows the power of kernel methods in improperly learning multi-layer neural networks, which has been of significant interest to the community in the last few years.

This is a nicely written paper on the topic, and the results are novel and significant. The paper is written clearly, and the authors explain the overall approach and techniques effectively before providing the details of the proofs (mostly left to the appendix). It would be nice to discuss the case where L is a logarithmic or log-log function rather than a constant. I think that in this case the current approach of the paper does not lead to efficient algorithms, but the hardness result also does not rule out the existence of such algorithms. Is that the case? It would be nice to discuss this.
=====

Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The main result of this paper is a way to (improperly) learn an l_1-sparsified neural network using a kernel.

Clarity - Justification:
Well written, with a clear account of the history of the question.

Significance - Justification:
It is a nice result on the learnability of neural networks.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This is a nice theoretical analysis that uses a kernel "trick" to capture ReLU and sigmoid-like layers in a neural network. Admittedly, the kernels used are slightly unwieldy, so this result is primarily of theoretical interest, but it continues and adds to a growing body of work connecting kernels and neural networks. On that note, though, the authors miss a number of important references:
* The original work by Cho and Saul on connecting kernels and neural networks, and the more recent work by Hazan and Jaakkola in AISTATS 2016 that continues this exploration.
* The results of Andoni et al. from ICML 2014. While this reference does not address the same problem the authors look at, it is a related piece of work on how to learn polynomials using a neural network, with provable guarantees. http://theory.stanford.edu/~valiant/papers/andoni14.pdf
=====