We thank the reviewers for their useful comments. The final version will be clarified and improved per the reviewers' remarks; see also the comments below.

R1: TYPE VIOLATION

The type violation in Elman RNNs arises as follows. The linear transform V decomposes, via SVD, into an orthogonal transform Q^T, followed by a diagonal rescaling D, followed by an orthogonal transform P. Each orthogonal transform is a change of basis. The reviewer is correct that lat_2 and lat_4 are not necessarily different types: if V is symmetric then P and Q^T cancel out; see "Type-Preserving Transforms" starting line 325. However, if V is not symmetric (which is typical) then lat_2 and lat_4 live in different coordinate systems (see the sketch at the end of this response). Using the same output layer on every timestep does not solve the problem. As an analogy, suppose lat_2 represents numbers in binary and lat_4 in hexadecimal. If you add binary and hexadecimal numbers naively, treating the binary representation as hexadecimal without converting, the result is meaningless. If you did this regularly then you might adjust, more or less, to the resulting behavior. But you are better off just working in hexadecimal.

R2: FRAMEWORK

R2's main criticism is that the paper is too theoretical. We make three remarks. First, analysis is important. LSTMs solve a specific problem: vanishing and exploding gradients. Hochreiter's paper included detailed analysis. Strong typing solves a specific problem: incompatible bases in RNNs. We explain the problem at length from two perspectives (dot products and types) since it is subtle. Second, recent work by Greff et al. and Jozefowicz et al. applies massive computation to the search for simpler alternatives to LSTMs; no significant improvements were found. More broadly, design principles for RNNs and hybrids (neural Turing machines, queues, deques, etc.) are lacking. The paper adapts established principles from physics and programming to the quite different domain of RNN design; it therefore requires careful explanation. Third, although limited in scope, we believe the experiments make a compelling case for our (i) well-motivated, (ii) small tweaks, which (iii) significantly improve interpretability and (iv) slightly improve training performance.

R3: PHYSICS

R3's example is exactly right: differential equations that model physical systems are constructed so that units cancel out. The paper discusses combining and transforming types. Inverting types (e.g. meters per second) requires techniques that we currently lack. The book by Hart, "Multidimensional Analysis", extends linear algebra to incorporate units and may be useful in this regard. We defer the question to future work.

R3: GRADIENTS

The results reported in the paper use gradient clipping (a standard sketch is given at the end of this response). We checked the effect of removing gradient clipping on medium-sized LSTM and T-LSTM models on the PTB dataset. T-LSTM gradients are well-behaved without clipping, although test performance is not competitive. In contrast, LSTM gradients explode without clipping and the architecture is unusable. Since T-LSTM gradients do not explode, it is possible that carefully initialized T-LSTMs may be competitive without clipping. We defer the question to future work.

R3: MISC

We will remove the reference to a "spaghetti-like mess" and correct the typos.
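
APPENDIX: SKETCHES

Below is a minimal numpy sketch, ours rather than from the paper, illustrating the R1 point: for a generic (non-symmetric) V the SVD bases P and Q differ, so the pre- and post-transform states live in different coordinate systems, whereas for a symmetric V they coincide and the transform is type-preserving.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4

    # Generic (non-symmetric) recurrent transform V = P D Q^T:
    # Q^T rotates the state into one basis, P rotates it into another.
    V = rng.standard_normal((n, n))
    P, d, QT = np.linalg.svd(V)
    print(np.allclose(V, P @ np.diag(d) @ QT))  # True: V = P D Q^T
    print(np.allclose(P, QT.T))                 # False: input and output bases differ

    # Symmetric positive-definite V: P and Q^T coincide, so they cancel
    # and the transform is type-preserving.
    S = V @ V.T
    P2, d2, QT2 = np.linalg.svd(S)
    print(np.allclose(P2, QT2.T))               # True: same basis in and out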
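
For R3's convenience, here is a minimal sketch, again ours and not from the paper, of standard global-norm gradient clipping in the style of Pascanu et al.; the threshold max_norm below is an illustrative assumption, not the value used in our experiments.

    import numpy as np

    def clip_gradients(grads, max_norm):
        # Rescale a list of gradient arrays so that their global L2 norm
        # does not exceed max_norm; gradients below the threshold are
        # returned unchanged.
        total_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
        if total_norm > max_norm:
            grads = [g * (max_norm / total_norm) for g in grads]
        return grads

    # Example: clip to a global norm of 5.0 (illustrative threshold).
    grads = [np.ones((3, 3)) * 10.0, np.ones(3) * 10.0]
    clipped = clip_gradients(grads, max_norm=5.0)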