Paper ID: 258
Title: A Random Matrix Approach to Echo-State Neural Networks

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): In this paper, the properties of echo-state networks (ESNs) are studied based on the properties of random matrices.

Clarity - Justification: I think the paper is actually well written. However, it is not very accessible. I do find myself needing to look up certain terms when reading the paper. I guess part of it might be the fact that I'm not up to date with the theory of random matrices, compressed sensing, etc. However, IMHO, a sufficiently large part of the ICML community might not be either, and I truly believe that theoretical papers should put in extra effort to be much easier to follow and read. Also, a clear, intuitive explanation of the main contributions of the work could be sufficient. I think the authors are already doing this to some extent, but a more detailed intuitive explanation would help. From this perspective, I think this paper will not have a huge impact, simply because it will not be immediately accessible to many researchers, nor will that subset of researchers fully understand the contribution of the work.

Significance - Justification: I think the paper tries to provide some theoretical insight into the role of the eigenvalues of the fixed recurrent matrix and of activation noise in the model. I did not follow these constructs to the end, but they seem interesting. However, from a practical perspective, I'm not sure this work changed my intuition about the role of these elements; rather, it offers some theoretical justification for existing intuitions. Also, one downside is that it is not clear how learning interacts with these properties. For example, we know that the eigenvalues of the recurrent weights control the amount of memory you have (this was suggested by H. Jaeger in the original tech report on ESNs), and we know that this observation can be used to provide a good initialization for RNNs. However, we also know that for many tasks (and model modifications) this initialization becomes less important, and learning can drive the model into the right region of the parameter space.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Fundamentally, this work is a theoretical treatment of the properties of ESNs. From this perspective, I do not expect a lot of empirical evidence. However, I would have preferred a more accessible description of the results (possibly even relying on an appendix), as well as a more detailed and intuitive discussion of what the results are and why they are significant. What new insights do they provide? What intuitions do they justify? Something that would let a reader unfamiliar with the deep results used in the paper understand at least why the work is significant, without actually needing to understand the work itself. At least this is my personal view on theoretical work submitted to mainstream conferences like ICML.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper uses tools from random matrix theory to aid the analysis of an echo-state network with linear weights and internal noise. The conclusions are further instantiated for various models of the random weights, with simulations and discussions.
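To make the setting concrete for readers outside this literature, here is a minimal numerical sketch of the kind of model I understand to be analyzed (a linear reservoir with additive internal noise and a least-squares readout); the variable names, the delay task, and the noise/scaling choices below are my own illustration, not the paper's exact formulation.

import numpy as np

# Illustrative linear echo-state network with internal noise:
#   x_{t+1} = W x_t + w u_t + eta * eps_t, readout trained by least squares.
n, T = 200, 1000
rng = np.random.default_rng(0)

W = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))   # random reservoir
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))       # spectral radius < 1
w = rng.normal(size=n)                                # input weights
eta = 0.01                                            # internal noise level

u = rng.normal(size=T)                                # scalar input sequence
y = np.roll(u, 3)                                     # toy target: 3-step delayed input

x = np.zeros(n)
X = np.empty((T, n))
for t in range(T):
    x = W @ x + w * u[t] + eta * rng.normal(size=n)   # noisy state update
    X[t] = x

omega, *_ = np.linalg.lstsq(X, y, rcond=None)           # linear readout
nmse_train = np.mean((X @ omega - y) ** 2) / np.var(y)  # normalized train MSE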
Clarity - Justification: The tools from random matrix theory are introduced before getting into the actual problem that needs to employ those tools -- while this ordering is a bit unusual, the overall presentation of the paper is relatively clear. However, some missing details are said to be in "an extended version", which the reviewer is not able to access. Important parts like proofs of theorems should at least be included in an appendix.

Significance - Justification: I am not an expert on echo-state networks, so I cannot assess the significance of this work.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): This is an interesting paper that borrows tools from random matrix theory to analyze the asymptotic behavior of a special kind of echo-state network. I think at least the following issues need to be addressed before this could be considered an ICML publication.

1. This paper is about a very specific type of recurrent neural network -- the echo-state network, with extra assumptions such as linear connections and internal noise. The current title is too broad for this very specific work.

2. There is a mysterious "extended version" of this paper that is mentioned in many places. Several different outcomes or cases are mentioned but "left to the extended version". While it is generally OK to have an extended version, an archival conference paper should also be an independent and coherent article by itself.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper uses tools from random matrix theory to study the train and test error of linear echo-state networks with internal noise. The paper develops quantitative formulae for the train and test MSE as a function of the reservoir matrix and properties of the training setup, in the large-reservoir limit. Experiments show that the theoretical predictions match simulations extremely well, even for moderately sized networks. The solutions provide insight into the role of normal and non-normal matrices, and the role of internal noise in establishing stability to test-time input perturbations.

Clarity - Justification: The paper is clearly written and describes random matrix tools which will be novel to many deep learning researchers (they certainly were to me) with remarkable lucidity, excepting some points I have flagged in the minor comments.

Significance - Justification: This is an exciting paper which imports new tools from random matrix theory into the study of learning in recurrent networks. This sort of cross-pollination is extremely valuable. While the formulae relating network structure to performance are not fully explicit, they can be specialized to illuminate special cases, and they definitively answer how properties of the ESN impact performance. The focus on linear networks in the first instance is fully justified, and the results may be an important prerequisite to understanding nonlinear behavior.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The paper could be improved by a more thorough discussion of related work. In particular, it seems insufficiently generous to say that Ganguli, Huh, & Sompolinsky (2008) discuss only qualitative findings. The Fisher memory curve does quantitatively capture an aspect of the decay of memory over time, though it is not directly the train/test MSE.
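For concreteness (quoting from memory, so the exact normalization may be off): for linear dynamics x_t = W x_{t-1} + v s_t + z_t with isotropic state noise of variance \epsilon, the Fisher memory curve is J(k) = v^T (W^k)^T C_n^{-1} W^k v, with noise covariance C_n = \epsilon \sum_{k \ge 0} W^k (W^k)^T; it quantifies how much information about an input injected k steps in the past remains recoverable from the current network state, which is a quantitative statement even if it is not the train/test MSE.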
Moreover, in the section on normal and non-normal W, it is never mentioned that the significance of normal and non-normal matrices was substantially investigated by Ganguli et al. The significance of orthogonal matrices was also studied by White, Lee, and Sompolinsky (2004). The concluding remarks mention “debates in the field,” for which relevant citations would be helpful. It would also be useful to mention recent approaches which share certain features, such as Pasa & Sperduti, “Pre-training of Recurrent Neural Networks via Linear Autoencoders”, NIPS 2014, or Arjovsky, Shah, & Bengio, “Unitary Evolution Recurrent Neural Networks”, arXiv 2015.

There are some points where the paper appears to overpromise. The introduction promises that “we shall understand deeply the impact of normal versus non-normal matrices”, but in the relevant results section we are told that “the observed performance results suggest an outstanding performance advantage of non-normal versus normal matrix structures, which might deserve deeper future investigation.” Indeed, the normal/non-normal results do not appear to give rigorous insight into the differences between these two cases; it is stated without elaboration that the checkerboard pattern for R in the normal case “suggests an inappropriate spread of the reservoir energy”, but this is far from clear. The level of insight into normal/non-normal matrices seems on par with that given by Ganguli et al. The statement that “past works merely provided insights based on incomplete considerations” is probably also over-strong.

Minor comments:

Line 204: Clarify what (q)^+ indicates.

Line 432: It would be useful to briefly describe the Mackey-Glass model.

Fig. 2: It may be worth commenting on the relation between test and train performance, which appear to be a constant factor apart across the range of \eta. In particular, it would be helpful to situate this in the context of the underfitting/overfitting terminology common in machine learning; why is overfitting never observed in this framework? Intuitively, if \eta is acting as a regularizer, one might expect that a nonzero \eta would be optimal (as in the case of Fig. 7).

Section 4.1.2, opening paragraph: The repeated use of ‘latter’ and ‘former’ makes this section hard to follow (and perhaps one ‘former’ should really be a ‘latter’?).

Fig. 7: Make the blue circles indicating theoretical minima red, so as to associate them with the theoretical curves rather than with the Monte Carlo results.

===== Review #4 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper provides a theoretical framework that can be used to analyze the performance of linear echo-state networks with internal noise in both the training and the testing phase. The authors evaluate the performance of these networks while varying parameters and characteristics of the network connectivity matrix (reservoir). The quantitative metric of performance is the normalized mean-squared error between the true output and the estimated output. They show that the theoretical performance of the model matches Monte Carlo simulations while varying aspects such as the internal noise level and the normality of the connectivity matrix. They also introduce a “Multimemory” connectivity matrix, which is a block-diagonal combination of Haar matrices with different scaling factors, and which has performance that is better than or comparable to all of the scaled Haar matrices that make it up.
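For readers trying to picture the construction: my reading of the “Multimemory” matrix is a block-diagonal reservoir whose blocks are independent Haar-distributed orthogonal matrices, each scaled by a different factor; the block sizes and scaling factors in the sketch below are placeholders of my own, not values from the paper.

import numpy as np

def haar_orthogonal(m, rng):
    # Haar-distributed orthogonal matrix via QR of a Gaussian matrix,
    # with column signs fixed so the distribution is exactly Haar.
    Q, R = np.linalg.qr(rng.normal(size=(m, m)))
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
block_sizes = [100, 100, 100]        # placeholder block sizes
sigmas = [0.5, 0.7, 0.9]             # placeholder per-block scaling factors (< 1)

n = sum(block_sizes)
W = np.zeros((n, n))
i = 0
for m, sigma in zip(block_sizes, sigmas):
    W[i:i + m, i:i + m] = sigma * haar_orthogonal(m, rng)   # one scaled Haar block
    i += m
# Each block contributes its own memory timescale, set by its sigma.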
Clarity - Justification: The system was forcing me to make an above/below-average choice here, but I would rate the clarity as average. I understand that all the details cannot fit, but I found that the authors tried to cram in material that does not make sense without the extra details, and would just cite the extended paper for them.

Significance - Justification: The authors are correct that recurrent networks are an important area of study, and they have provided an analysis of the training/testing performance that seems to match well with the few experiments they have shown. It would be nice to see a more extensive simulation exploration. My biggest complaint with the results is that they have an implicit rather than explicit character, and I think that limits the insight one is able to draw from the theoretical results.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The authors are understating the amount of quantitative work that has been done on ESNs. Even the citations they give in the introduction contain quantitative analyses. Beyond that, the authors have not acknowledged at all the previous work in the literature that also uses random matrix analysis to investigate ESNs (though more in the memory-capacity context than the present context):

Ganguli, S., & Sompolinsky, H. (2010). Short-term memory in neuronal networks through dynamical compressed sensing. In Advances in Neural Information Processing Systems (pp. 667-675).

Charles, A. S., Yap, H. L., & Rozell, C. J. (2014). Short-term memory capacity in networks via the restricted isometry property. Neural Computation, 26(6), 1198-1235.

The setup of Thm. 1 is very unclear to me, including the intuition behind the basic notation being used (<->). It would be better to cut some details (which are mostly punts to the longer paper) and make the remaining material clearer. As an example, much of Section 4.2 is really hard to make sense of without reading the longer paper.

=====