Paper ID: 951
Title: Recurrent Orthogonal Networks and Long-Memory Tasks

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper provides theoretical insights into the recent success of orthogonal recurrent matrices for standard RNNs. The authors provide a "proof by construction" that a particular orthogonal initialization can solve the standard "copy" task introduced by Hochreiter, while another identity initialization can trivially solve the "addition" task. They then highlight the importance of initialization by showing how RNNs initialized for the copy task fail to solve the addition task, and vice versa. The authors propose a novel mitigation strategy, which is to incorporate pooling in the output layer, the rationale being that L2 pooling can discard phase information for tasks which are not "cyclical" in nature (e.g. the "addition" task).

Clarity - Justification:
(see detailed comments)

Significance - Justification:
(see detailed comments)

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I am left somewhat confused as to the contribution of this paper. Construction proofs are useful to show that a given model family is rich enough to solve the task, leaving open the question of whether optimization can find this solution. However, in this case, unitary RNNs have already been shown to learn the copy task. The step-by-step guide to processing in Section 3.1 definitely helps in understanding what is going on under the hood, in terms of low-level mechanisms; however, this seems to simply confirm the already known connection between orthogonal recurrent matrices and implementing memory via subspace rotations.

A direction of the paper which I found very interesting is the inability of the orthogonal RNNs to solve the variable-length copy task, and the observation regarding the pitfalls of synthetic tasks. This seems to imply that novel memory architectures ought to be benchmarked against a broad set of experiments, to ensure that the properties of the model generalize across tasks. However, the authors then move on to test their pooling RNN solely on the copy and addition tasks for which it was explicitly designed! Is pooling in the output layer a generally useful concept which readers should adopt? Or is this a specific solution to a specific problem? The paper unfortunately does not provide an answer.

Finally, I found the paper generally difficult to follow. There are quite a few instances where notation is not defined properly, and typos in the LaTeX math. Section 3.1 in particular was difficult to follow: it would benefit greatly from a high-level intuitive description of the initialization strategy before diving into the step-by-step guide.

Other:
* line 140: N is not defined to be the input dimension.
* line 181: remove the superfluous reference to (Wojciech Zaremba, 2014), as the architecture is the same as Hochreiter 97, the original inventor.
* line 294 vs eq 7: \tilde{U} should be a 2d x K matrix (not K x 2d) if right-multiplied with x (as in eq 7).
* line 315: the notation u_i_j is *not* defined anywhere. I assume this is the row of U obtained when multiplying U by a symbol a_i appearing at the j-th timestep. Very poor notation.
* line 315 completely ignores the input non-linearity. If the derivation assumes a linear encoder, this should be stated explicitly.
* line 323: shouldn't h_{d+1} be h_{2d+1}?
* line 414: s/regurgitating/copying
* line 460: should x_j[2] be x_j[1]?
* line 517: s/through/throw
* Section 4.1: the acronyms LT-ORNN and LT-IRNN are never defined.
* line 693: what is a delta-like oscillation?
* Figures 2, 3, 4, 6: the graphs are very noisy and difficult to read. The authors should present means and error bars across random initializations.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes different ways of solving the pathological tasks (Hochreiter 1997) for RNNs by using orthogonal transition weight matrices and different initialization techniques. The authors experimentally validate their RNN solutions on the "copying" and "addition" tasks.

Clarity - Justification:
The paper is very well written and the explanations are very clear.

Significance - Justification:
The fact that the authors mainly focused on solving toy tasks reduces the significance and the impact on the research community.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The direction of investigating the use of different types or families of structured matrices as the weight matrices of an RNN, or sometimes only as initializations, is a very interesting and promising research area. However, the authors of this paper explored this area by studying only very simplistic pathological toy tasks. Those tasks used to have some relevance in the past, but we now know that with the right initialization, enough capacity, the right hyper-parameters and the right model architecture, these tasks are relatively easy for RNNs to solve. I believe those tasks have lost relevance and are not really interesting, since they ignore most of the important aspects of the real-world tasks that RNNs are being used to solve. To be able to solve these tasks, the model just needs to keep the information in memory for a certain period of time. Nevertheless, the applications of interest for RNNs usually require more sophisticated operations on the memory, such as deleting irrelevant information from the memory and dealing with noise.

The simplifications made in the LT-RNN to make the math work out change the model significantly, and it is really not clear whether the same proposals would work on the sRNN as well. The LT-RNN is basically a linear RNN with a nonlinear feature extractor. It would be more useful for the community to show whether the proposals work with sRNNs as well, at least in practice.

In line 234, the authors mention a decoder. Which decoder? How did you initialize the LSTMs in your experiments?

The authors should cite Saxe et al. (2014) for the orthogonal initialization of the weights.

It would be interesting to see a plot/figure showing the relationship between the capacity and the length of the sequences. I would assume that, given enough capacity and the proper initialization, unitary/orthogonal RNNs might be able to solve these tasks up to an arbitrary length, since the memory of an RNN increases with the number of hidden units it has.

Figure 5 is very difficult to draw any conclusion from.

Minor comments:
* I think it would be interesting to include unitary RNNs as a comparison in your learning curves.
* I would like to see the relationship between the techniques proposed in this paper and Echo State Networks.
* Have you tried using ||VV^{\top} - I||_2^2 as a regularizer?
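For concreteness, a minimal NumPy sketch of what such a penalty could look like, interpreting the norm as the Frobenius norm and assuming V is the square recurrent transition matrix; the function name and the plain-gradient form are illustrative assumptions on my part, not something taken from the paper:

    import numpy as np

    def orthogonality_penalty(V):
        # Soft orthogonality penalty ||V V^T - I||_F^2 on the recurrent matrix V.
        # Returns the penalty and its gradient with respect to V; both would be
        # scaled by a small coefficient and added to the task loss and gradient.
        d = V.shape[0]
        R = V @ V.T - np.eye(d)   # residual from orthogonality
        penalty = np.sum(R ** 2)  # squared Frobenius norm of the residual
        grad = 4.0 * R @ V        # d/dV ||V V^T - I||_F^2 (R is symmetric)
        return penalty, grad

Such a soft penalty keeps the optimization unconstrained, in contrast to re-projecting V onto the orthogonal group after every update.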
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper investigates the performance of recurrent neural networks on several simple benchmark problems. Using a simple recurrent network architecture, it constructs hand-designed weights which solve a memory copy task and an addition task. These solutions provide insight into the computations and solution methods which a trained network may adopt. In particular, one method makes use of a nearly orthogonal transition matrix, while the other makes use of an identity matrix. Based on this observation, the paper investigates the role of initialization in allowing learned networks to solve the task, showing that orthogonal and identity initializations perform vastly better on the memory copy and addition tasks, respectively. Finally, the paper shows that a particular pooling-based architectural mechanism can combine the benefits of both initializations, allowing a simple network to learn both the addition and memory copy tasks.

Clarity - Justification:
The paper is clear and well-written.

Significance - Justification:
This paper contains many useful insights into the operation of RNNs. In particular, the hand-crafted solutions to two benchmark problems yield a good picture of the style of computations carried out by the RNN. The decisive role of initialization, which has been hinted at in prior work, will be interesting to many.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The pooling operation is a very interesting architectural idea in the context of recurrent networks, which enables a subset of neurons to consider only the magnitude, not the phase, of a signal. This is an idea (partially embodied in Arjovsky et al.'s architecture) which may find wide application.

There is a variety of related work which should be cited that touches on the memory properties of orthogonal matrices in recurrent networks and as initializations for learned networks. In particular, White, Lee, & Sompolinsky (2004), "Short-term memory in orthogonal networks", Phys. Rev. Lett., and Ganguli, Huh, & Sompolinsky (2008), "Memory traces in dynamical systems", Proceedings of the National Academy of Sciences, 105(48), 18970–5, both discuss orthogonal transition matrices in a related setup. Regarding orthogonal initializations to address the vanishing gradient problem, this was proposed and analyzed in Saxe, McClelland, & Ganguli (2014), "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks", International Conference on Learning Representations.

The authors have considered adding a pooling stage to allow one architecture to inherit the properties of both initialization types. Another option would be to initialize with a block-diagonal transition matrix, part of which is the identity and part of which is orthogonal (a rough sketch is given after these comments). This may provide an interesting benchmark, and small random elements in the off-diagonal blocks might allow whichever initialization is best performing to share information with the other.

It may be worth pointing out that an identity matrix is itself orthogonal; the underlying distinction here may be between symmetric and asymmetric orthogonal matrices. In light of the results of Ganguli et al. (2008), it may be interesting to also compare to the performance of non-normal matrices.
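To make the block-diagonal suggestion above concrete, here is a rough NumPy sketch; the function name, block sizes, coupling scale, and the QR-based construction of the orthogonal block are my own illustrative choices, not anything taken from the paper:

    import numpy as np

    def block_diagonal_init(d_id, d_orth, eps=1e-3, rng=np.random):
        # Transition matrix with an identity block, an orthogonal block
        # (Q factor of a random Gaussian matrix), and small random entries
        # in the off-diagonal blocks so the two parts can exchange information.
        d = d_id + d_orth
        W = eps * rng.randn(d, d)                  # weak random coupling everywhere
        W[:d_id, :d_id] = np.eye(d_id)             # identity block (addition-style memory)
        Q, _ = np.linalg.qr(rng.randn(d_orth, d_orth))
        W[d_id:, d_id:] = Q                        # orthogonal block (copy-style memory)
        return W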
Finally, a key simplification in the work is the linearity of the transition update; few other RNN methods use this. It may be interesting to try orthogonal initializations in the standard nonlinear setting (the sRNN), which would enable even orthogonal matrices scaled to have spectral norm greater than one to generate stable but chaotic activity over time.

Minor comments:
Several citations contain only the first author of multiple-authorship papers (e.g. ln 87, Martin Arjovsky, 2015; Quoc V. Le, 2015).

=====