Paper ID: 62
Title: Why Regularized Auto-Encoders learn Sparse Representation?

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):

In this mostly theoretical paper, the authors try to establish a formal connection between the features learned by regularized auto-encoders and sparse representation. They study the conditions that favor sparsity, finding that these conditions are 1) a positive encoding bias gradient, and 2) activation functions that are monotonically increasing with negative saturation at zero. The ReLU, Softplus, and standard logistic sigmoid functions have this property, but the Maxout and tanh activation functions do not. The authors also claim that existing auto-encoders have regularizations satisfying the conditions of Corollaries 1 and 2. In the experimental part, the authors show that the bias gradient dominates the effect on the sparsity of hidden units, and that constraining the weight vectors to have a fixed length leads to better sparsity. On the other hand, without constraining the norm of the weight vectors, the experiments with the MNIST and CIFAR data sets gave qualitatively different results with respect to sparsity.

It is somewhat hard for me to say how valuable the results derived in this paper are. Generally speaking, it is very difficult, if not impossible, to analyze neural networks rigorously, because their criterion functions are generally non-convex, often with many local minima, and because of their distributed nonlinearities. The authors make several simplifying assumptions in their analyses. In Assumption 1 they require that the data is white, which is easy to achieve via preprocessing. But the other assumption they make, namely that every component of the reconstruction residual vector during auto-encoder training at any iteration is an i.i.d. random variable with a zero-mean Gaussian distribution and common standard deviation, is clearly only an approximation.

Clarity - Justification:

Even though the authors have 8 large double-column pages at their disposal, this paper suffers from a lack of space, which somewhat affects its readability. The theoretical derivations are somewhat difficult to follow in places.

Significance - Justification:

I don't think that this paper is a major breakthrough. On the other hand, it is not just a small incremental advance, but something between these two cases. My overall assessment is that this paper increases, to some extent, the understanding of the conditions that encourage sparsity in auto-encoders, and is weakly acceptable to the ICML conference.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

- Figures 1-5 are all too small due to the lack of space. In particular, the text in Figure 3 is indiscernible.
- The mathematical formulas for the ReLU, Maxout, Sigmoid, Tanh, and Softplus activation functions should be presented, even though most readers know them.
- The references are incomplete. Information such as "Technical Report, 2009" or "ICLR, 2014" is clearly insufficient. For example, for technical reports you should give the web site where they are available. There is no space limitation here, as an extra 10th page is at your disposal for references.
- In the footnote on page 6, the authors say "Some of the CIFAR-10 results have been moved to appendix due to lack of space". However, there are no experimental results in the appendix, only theoretical proofs.
- At the beginning of subsection 3.2, the authors state that
"... then increasing the value of the regularization coefficient \sigma^2 should lead to ... increasing sparsity". Somewhat later they say that "These plots show a stable decreasing sparsity trend with increasing regularization coefficient as predicted by our analysis." These statements seem to be in conflict with each other, or have I misunderstood something?

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper presents a study on regularized autoencoders with a single layer and tied weights. The goal of the work is to provide theoretical insight into why this type of neural network learns sparse distributed representations. The main result is a set of conditions on the activation and regularization functions that lead to sparse representations. The applicability of the findings is discussed in the context of a number of well-known autoencoders. Empirical evaluations are conducted to assess the findings from a practical point of view.

Clarity - Justification:

The organization and clarity of the paper are adequate. However, the wording could be improved. Also, I often found it difficult to follow some of the claims due to the presentation. Please see the detailed comments below.

Significance - Justification:

The paper analyzes under which conditions single-layer autoencoders lead to sparse representations. Several types of autoencoders are used in practice, and variants usually involve combinations of activation functions and regularization functions. A formal study shedding light on which properties are important for obtaining sparsity is of value. In that regard, I find the first two results important. On the other hand, in its current form, I do not find the result unifying the different types of autoencoders very convincing, in particular the approximations given in Theorem 3 for casting denoising autoencoders into the general form. It would also enhance the work to argue how these results could generalize to deep autoencoders, or at least to show empirically that the main findings hold.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

The presented results concern only the case of single-layer autoencoders. In that regard, the title and abstract seem a bit too generic; this should be clarified. Also, it might be interesting to include a discussion of how these observations could generalize to a deep setting, and maybe to evaluate this experimentally.

It would be good to cite other works that have looked at the expected pre-activation to analyze the sparsity of the activations, for instance the cited work (Lee et al., 2008), or:
- Xie, Junyuan, Linli Xu, and Enhong Chen. "Image denoising and inpainting with deep neural networks." NIPS, 2012.
- Cho, KyungHyun. "Simple sparsification improves sparse denoising autoencoders in denoising highly corrupted images." ICML, 2013.

To give some context, the authors could also cite works studying what auto-encoders learn from the data distribution, for example:
- Alain, Guillaume, and Yoshua Bengio. "What regularized auto-encoders learn from the data-generating distribution." JMLR, 2014.

I find the result of Theorem 3 unclear. Using a Taylor expansion around the mean of the corrupted sample given the clean sample would only be valid for very small corruptions, as the o(\sigma^2) term can otherwise be large. In that case, casting the generic DAE into the form given by Corollary 2 would be valid only for small distortions. Please clarify.
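To make this concern concrete, here is the kind of expansion I have in mind. The notation is mine and I assume additive isotropic Gaussian corruption \tilde{x} = x + \epsilon with \epsilon ~ N(0, \sigma^2 I), which may not match the paper's exact setup. Writing \ell(\tilde{x}) for the per-sample reconstruction loss and expanding around the clean input x,

\[
\mathbb{E}_{\epsilon}\left[\ell(\tilde{x})\right]
  = \ell(x)
  + \frac{\sigma^{2}}{2}\,\mathrm{tr}\!\left(\nabla^{2}_{x}\,\ell(x)\right)
  + o(\sigma^{2}),
\]

since the first-order term vanishes in expectation. The o(\sigma^2) remainder depends on higher-order derivatives of \ell and is negligible only for small \sigma, which is why the reduction to the form of Corollary 2 appears to be restricted to small distortions.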
On a separate note, I do not understand equation (7). J_DAE is an expectation over the input distribution, while the last term in (7) includes x, so I do not understand this notation. I looked at the supplementary material, and it is not very clear either. Please find some comments below:
- From (28) to (29), two expectations are taken (this should be clarified in (29)). Calling \Sigma_x the covariance of the corruption is confusing: the expectations in (29) are with respect to x, while the one in the definition of \Sigma_x is with respect to the corruption.
- The second-order derivative of the squared loss should be evaluated at x, so it should appear inside an expectation.

In the experimental section, the data preprocessing should be chosen to match Assumption 1. That is, the authors should apply whitening instead of only imposing zero mean and normalizing the standard deviation.

It would be interesting to plot the evolution over training of the average activation fraction of the units.

Minor comments:
- Although the reader can imagine what it means, it would be good to define, before Theorem 1, what exactly ''updating the coefficients along the negative gradient'' means (i.e., stating that \eta > 0).
- Figure 3 is not very easy to read when printed. Eliminate the white borders.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):

The authors study the reasons that make an autoencoder learn a sparse representation. Among other details, they study how the bias gradient affects sparsity and which activation functions yield sparser representations, and they analyze, both theoretically and empirically, some existing autoencoders with respect to the sparsity of the representations they provide.

Clarity - Justification:

The paper is well explained.

Significance - Justification:

I disagree with the claim made by the authors stating that “Sparsed Distributed Representation constitutes the fundamental reason behind the success of deep learning”, but it is still a relevant analysis.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

I disagree with the claim made by the authors stating that “Sparsed Distributed Representation constitutes the fundamental reason behind the success of deep learning”; if that is the case, some clarification or proof should be provided. Even though, in my opinion, the authors overestimate the relevance of sparse autoencoders, this is still a thorough analysis of the matter. It is widely known that l1 regularization encourages sparsity, but the authors bring more information on the matter.

The authors also state that “Sparse Distributed representation … captures the generation process of most real world data”. I think this claim needs some clarification concerning why this is the case.

Also concerning the loss function, the authors state that “the motivation behind this objective is to capture predominant repeating patterns in data”. In such a technical analysis of the problem, it is probably worth mentioning that using the MSE amounts to a Gaussian assumption: minimizing the squared error is equivalent to maximum likelihood under a Gaussian noise model on the reconstruction.

Finally, even though tanh does not satisfy the negative saturation property, since it is one of the most used activation functions it would be nice to see empirical evidence of the sparsity level it provides (see the measurement sketch after this review).

=====
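To make the suggestions about tracking the average activation fraction during training (Review #2) and reporting the empirical sparsity of a tanh encoder (Review #3) concrete, here is a minimal sketch of how the sparsity level could be measured. It assumes a single-layer encoder h = f(Wx + b), as studied in the paper; the function name, the threshold eps, and the random data are mine and purely illustrative, not the authors' code.

    import numpy as np

    def activation_fraction(X, W, b, act=np.tanh, eps=1e-2):
        """Fraction of hidden activations with magnitude above eps,
        averaged over the data set X (rows are samples).

        X : (n_samples, n_inputs) data matrix
        W : (n_inputs, n_hidden) encoder weights (tied weights assumed)
        b : (n_hidden,) encoder bias
        """
        H = act(X @ W + b)                 # hidden activations
        return float(np.mean(np.abs(H) > eps))

    # Example: compare a tanh encoder with a ReLU encoder on random data.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 64))
    W = rng.standard_normal((64, 256)) * 0.1
    b = np.zeros(256)

    relu = lambda z: np.maximum(z, 0.0)
    print("tanh active fraction:", activation_fraction(X, W, b, act=np.tanh))
    print("ReLU active fraction:", activation_fraction(X, W, b, act=relu))

Evaluating this quantity at regular intervals during training, on a held-out batch, would give the evolution curve requested above and would make the tanh comparison directly measurable.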