Thank you to all the reviewers for your comments; this work will be better because of your careful attention.

To R_3: We very much agree that these toy tasks alone are insufficient for evaluating memory architectures. This is consistent with two of the main points of the paper. First, the methods in the recent references in lines 96-108 solve the tasks only to varying degrees of success, and most do not solve them for the sequence lengths reported in our paper. Second, since the linear transition RNN (LTRNN) is unlikely to be useful in many real-world tasks, yet solves the copy and addition tasks comparably to or better than sophisticated architectures such as the LSTM, solving these tasks should not be taken to imply success in more complex real-world tasks (where LSTMs outperform the LTRNN). Although solving these tasks alone is not sufficient to validate a model, we study their properties and limitations because many authors still use them for evaluation, and because we believe that measuring a model's performance on them can give insight into its capabilities. Simple, explicit solutions (and showing that they can be learned) illuminate exactly what the tasks are evaluating and help explain other researchers' experiments on these tasks. This is an important contribution.

Regarding capacity vs. sequence length: see Figure 1 and Sections 3.1.1 and 3.1.2. The size of the hidden state grows linearly with the length of the sequence to be memorized (S), grows logarithmically with the dictionary size (K), and does not grow with the length of time the sequence must be remembered (T). So if the sequence to be memorized is 10 symbols long, up to numerical errors it does not matter whether it must be remembered for 100 or 1e6 steps.

To address your other comments: for all models, each transformation matrix is initialized from $\mathcal{N}(0, 1/\sqrt{\text{number of inputs}})$. We will cite Saxe 2016 and add unitary RNN performance to the plots. We did experiments with $\|VV^T - I\|^2$ as a regularizer; it seemed to help in some cases but not others, and more experiments are needed to better understand its effect.

To R_4: Thank you for the references; we will update the paper to put our work in their context. Concerning the identity being orthogonal: as far as we can tell, the difference between random orthogonal and identity initializations is that in the first case the eigenvalues are uniform on the unit circle, and in the second they are a delta at 1 (see the short numerical sketch below). Note that this gives a very good explanation of what is happening in (Arjovsky et al., 2015), where the unitary RNN (initialized with eigenvalues on the unit circle) works well on the copy task but less so on the addition task. We did run some experiments with orthogonal initializations and non-linear transitions, and found that these did not solve the tasks for long timescales. This illustrates that there are two causes of the vanishing gradient problem: the exponential shrinking of the norm due to repeated application of the transition matrix, and the zeroing of gradient components that lie in the saturated region of the non-linearity; orthogonal initialization addresses only the first.

To R_6: We will fix the notation and acronyms, clarify that we are assuming a linear encoder for the copy task, and give a high-level overview of the solution in Section 3.1. Also, line 693 should read "that are perceived as a $\delta$". Our experiments are designed to give a partial answer to your question about whether the optimization can find these solutions: they show that the initializations corresponding to the relevant solutions are crucial for solving the tasks.
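To make the eigenvalue picture above concrete, here is a minimal numpy sketch, purely illustrative rather than the code used in the paper: the hidden size n, the QR construction of the orthogonal matrix, and the interpretation of $1/\sqrt{\text{fan-in}}$ as the standard deviation of the Gaussian are our assumptions for this example.

```python
import numpy as np

n = 64  # illustrative hidden size, not the one used in the paper

# Gaussian initialization, entries ~ N(0, 1/sqrt(fan_in))
# (1/sqrt(fan_in) taken as the standard deviation here).
W_gauss = np.random.randn(n, n) / np.sqrt(n)

# Random orthogonal matrix: eigenvalues lie on the unit circle,
# approximately uniformly spread in angle.
W_orth, _ = np.linalg.qr(np.random.randn(n, n))

# Identity: also orthogonal, but all eigenvalues sit at 1.
W_id = np.eye(n)

for name, W in [("gaussian", W_gauss), ("orthogonal", W_orth), ("identity", W_id)]:
    lam = np.linalg.eigvals(W)
    print(f"{name:10s} |lambda| in [{np.abs(lam).min():.2f}, {np.abs(lam).max():.2f}], "
          f"angle spread = {np.ptp(np.angle(lam)):.2f} rad")
```

The random orthogonal and identity matrices both have unit-modulus eigenvalues, but only the former spreads them around the circle, which is the property the copy-task solution relies on.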
The theoretical part of our work shows why earlier works with unitary RNNs can find the solution to the copy task: as long as the initialization has eigenvalues uniformly distributed on the unit circle, it is very close to a solution of the task at initialization. It also suggests an explanation for why a unitary RNN initialized this way performs worse on the addition task.

Regarding your question about the pooling architecture and the variable-length copy task: it does not help solve it; we will include this in the paper. More generally, we believe that l2 pooling, as a mechanism that allows the model to choose between cyclic processing and identity processing, is interesting and will prove more broadly useful. However, the experiments we report support our discussion of the explicit mechanisms and their relationship to the tasks. They add evidence that the difficulty in optimization lies in the model transitioning between oscillatory and steady dynamics, and that if the architecture includes a means to make this transition easily, the model can learn both tasks (see the sketch below). We agree that more experiments are needed to determine for which tasks it will be generally useful, and we will explore more practical applications of the pooling mechanism in future work.
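To illustrate what we mean by l2 pooling letting the model choose between cyclic and identity-like processing, here is a minimal sketch. This is not the pooling architecture from the paper: the single 2-d rotation block, the rotation frequency theta, and the readout choice are assumptions made only for this example.

```python
import numpy as np

# A single 2-d block of the hidden state, rotated at every step.
theta = 2 * np.pi / 7  # illustrative rotation frequency
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

h = np.array([1.0, 0.0])  # initial hidden state for this block
for t in range(5):
    h = R @ h
    pooled = np.sqrt(np.sum(h ** 2))  # l2 pool over the pair of units
    print(f"t={t}: h = {np.round(h, 3)}, l2 pool = {pooled:.3f}")

# The raw coordinates oscillate (cyclic dynamics), while the pooled value
# stays constant (steady, identity-like dynamics); reading out one or the
# other gives an easy way to move between the two regimes.
```

Whether the full pooling architecture exploits exactly this mechanism on practical tasks is, as noted above, something we plan to investigate in future work.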