We would like to thank the reviewers for their insightful comments; the enthusiasm towards the paper is greatly appreciated.
In this work we did indeed focus on long-term memory problems to test the strengths and weaknesses of the approach. The goal was to study more thoroughly the problem it tries to solve, the pitfalls current architectures fall into, and a new direction of research that might overcome them. We wanted to focus on the novel parameterization and explore its behaviour in more depth, as we believe this is more fundamental to scientific progress than simply shooting for incremental state-of-the-art performance.
It is likely that an end-to-end solution will involve elements from both our new approach and the extensive, relevant literature on gating mechanisms such as those in LSTMs and GRUs, and we are happy to explore these directions in future work.
Regarding capacity, while in principle the parameterization may seem restrictive, “ACDC: A Structured Efficient Linear Layer” by Moczulski et al. shows that by composing simple layers like ours (with real diagonals), one can represent any matrix (though they show that doing so wastes capacity). Of course, we impose the additional unitary restriction, but there is ample literature (see for example http://www.theory.caltech.edu/people/preskill/ph229/notes/chap6.pdf, section 6.1.3) showing that universal computation can be achieved with gates that use only unitary matrices.
We therefore do not think capacity is an immediate concern, especially if the uRNN is combined with other components in a gated RNN structure that retains the advantages of LSTMs and GRUs.
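To make the composition argument concrete, here is a minimal NumPy sketch (not the exact factorization in the paper, which also includes reflections and a permutation) composing unitary diagonals with normalized FFTs; the resulting map preserves vector norms, as any unitary matrix must:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

def unit_diag(n):
    # random complex diagonal with unit-modulus entries: a unitary diagonal matrix
    return np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, n))

d1, d2, d3 = unit_diag(n), unit_diag(n), unit_diag(n)

def apply_W(x):
    # compose: diagonal -> FFT -> diagonal -> inverse FFT -> diagonal
    # norm="ortho" makes the FFT itself unitary, so the whole product is unitary
    h = d1 * x
    h = np.fft.fft(h, norm="ortho")
    h = d2 * h
    h = np.fft.ifft(h, norm="ortho")
    return d3 * h

x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
y = apply_W(x)
print(np.allclose(np.linalg.norm(x), np.linalg.norm(y)))  # True: norms are preserved
```

Each diagonal factor contributes only n parameters, so richer transforms are obtained by composition rather than by a dense n-by-n weight matrix.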
In general, it is true that this work will appeal most to the deep learning community, but our parameterization of a unitary matrix is novel and may be useful to a range of researchers.
Computation time is discussed at some length in the paper, but it is true that we should report wall-clock times, and they will be presented in the final version. Overall, the models ran at roughly the same speed. The uRNN requires extra computation because of the log n factor introduced by the FFT, though the uRNN models were smaller than the LSTMs, even while achieving better performance.
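As a rough illustration of the size difference, the sketch below counts recurrence parameters under one plausible accounting of the factorization (three diagonals, two reflections parameterized by complex vectors, a fixed permutation, and fixed FFTs), compared with an LSTM's four n-by-n recurrent weight matrices; the exact counts in the paper may differ slightly:

```python
def urnn_recurrence_params(n):
    # 3 unitary diagonals (n real angles each) + 2 reflections (a complex
    # n-vector each, i.e. 2n real params); permutation and FFTs are fixed.
    # Total grows linearly in n.
    return 3 * n + 2 * (2 * n)

def lstm_recurrence_params(n):
    # 4 gates, each with an n-by-n recurrent weight matrix: quadratic in n.
    return 4 * n * n

n = 512
print(urnn_recurrence_params(n))  # 3584
print(lstm_recurrence_params(n))  # 1048576
```

The FFT adds an O(n log n) time factor per step, but the parameter count of the recurrence stays O(n), which is why the uRNN models can be much smaller than comparable LSTMs.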