Paper ID: 516
Title: Unitary Evolution Recurrent Neural Networks

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper presents a method for parametrizing orthogonal matrices by moving to the complex domain and using unitary matrices, which generalize orthogonal matrices. This mitigates the vanishing/exploding gradient problem, as unitary/orthogonal matrices combined with a suitable non-linearity are norm-preserving in both the forward and backward pass. To make the use of unitary matrices computationally tractable, the authors define a reasonably flexible class of unitary matrices by composing several strongly restricted types of unitary matrices. The resulting approach can be implemented in standard frameworks without too much trouble, by representing the required complex numbers as pairs of coupled real numbers. Experiments are presented showing that settings exist in which the proposed method significantly outperforms standard alternatives.

Clarity - Justification:

The material was presented in a straightforward way, without unnecessary bravado.

Significance - Justification:

The vanishing/exploding gradient problem is certainly a valuable place to make contributions. Controlling growth of the hidden state and the backpropagated gradient, without resorting to rapidly saturating non-linearities, seems like a promising approach to reducing the effects of vanishing/exploding gradients. This paper applies some neat tricks to force forward activations and backpropagated gradients to neither explode nor vanish over many recurrent steps of computation, without relying on saturating non-linearities. This is a useful step in the right direction. The experiments are on "toy" problems though, and it's not clear how performance benefits on the copy task will transfer to, e.g., real-world language modelling.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Alternative approaches to parametrizing the recurrence in an RNN are worth exploring, especially when they take a large step away from introducing minor variations on the LSTM. Between the LSTM and GRU, that region of design space seems well-enough covered already. See, e.g., "LSTM: A Search Space Odyssey" by Greff et al., or "An Empirical Exploration of Recurrent Network Architectures" by Jozefowicz et al.

What reduces my enthusiasm about this paper is the lack of experiments on "realistic" problems. It would be nice to see if the proposed technique outperforms existing approaches on any tasks of practical interest. Similar sets of toy tasks have been used in the past to support research on, e.g., Hessian-free optimization. So far, that work hasn't significantly impacted practical applications of RNNs. That said, I think research on new architectures is more likely than improved optimizers to produce large jumps in practical performance, barring, of course, a miraculous breakthrough in optimization.

Another concern I have is that what's gained by introducing an efficient way of using unitary recurrent matrices is bought at the cost of heavily restricting the representational capacity of the recurrence. While several papers have shown that deep and recurrent networks often waste an unpleasant chunk of their (computationally costly) representational capacity, the restrictions imposed by the proposed method are quite severe.

Some minor typos:
- There's an extra vertical bar in Eq. 8
- Line 744: propogates -> propagates
- Line 694: unpermutted -> unpermuted
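For concreteness, a minimal numpy sketch of the kind of composed, norm-preserving recurrence step summarized above is given below. The particular factors (unit-modulus diagonal phases, Householder reflections, a fixed permutation, and unitary FFTs) and their ordering are assumptions made for illustration rather than a restatement of the paper's exact construction; the non-linearity is omitted, and the hidden state is kept as a complex vector rather than as coupled real pairs.

import numpy as np

def unitary_step(h, theta1, theta2, theta3, v1, v2, perm):
    # One application of a unitary matrix built by composing cheap,
    # norm-preserving factors; acts on a complex hidden state h of shape (n,).
    # The factor choice and order here are illustrative assumptions.
    def phase(theta, x):   # unit-modulus diagonal: |exp(i*theta)| = 1
        return np.exp(1j * theta) * x
    def reflect(v, x):     # complex Householder reflection I - 2 v v* / ||v||^2
        v = v / np.linalg.norm(v)
        return x - 2.0 * v * np.vdot(v, x)
    n = h.shape[0]
    x = phase(theta1, h)
    x = np.fft.fft(x) / np.sqrt(n)    # DFT scaled to be unitary
    x = reflect(v1, x)
    x = x[perm]                       # fixed permutation of coordinates
    x = phase(theta2, x)
    x = np.fft.ifft(x) * np.sqrt(n)   # inverse DFT scaled to be unitary
    x = reflect(v2, x)
    return phase(theta3, x)           # ||output|| == ||h|| up to float error

# Quick check that the composed map preserves the hidden-state norm.
rng = np.random.default_rng(0)
n = 8
h = rng.normal(size=n) + 1j * rng.normal(size=n)
thetas = [rng.uniform(-np.pi, np.pi, n) for _ in range(3)]
vs = [rng.normal(size=n) + 1j * rng.normal(size=n) for _ in range(2)]
out = unitary_step(h, *thetas, *vs, rng.permutation(n))
assert np.isclose(np.linalg.norm(out), np.linalg.norm(h))

Since every factor acts in O(n) or O(n log n) time and carries only O(n) parameters, a composition of this kind is the source of the method's efficiency, and also the reason its representational capacity is restricted relative to a full n-by-n recurrence matrix.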
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper extends previous work on using orthogonal weight matrices to improve optimization in neural networks. The authors introduce a technique for efficiently parameterizing unitary matrices for recurrent networks. This enables training RNNs for control experiments that demonstrate long-range memory and modeling capabilities at least as good as LSTMs, and far better in some cases.

Clarity - Justification:

The writing is strong, and the technique is described in sufficient detail that others should be able to reproduce the experiments.

Significance - Justification:

This is an interesting algorithmic advance for recurrent neural network architectures. The experiments don't fully validate whether this approach can advance the state of the art on challenging tasks, but it is certainly worth further investigation and could turn out to be an important step. It won't be very useful to those outside the core deep learning research community, but that's not a major downside.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Given that the proposed approach requires some relatively expensive computation compared to a regular RNN, it would be good to show training times for some of the experiments. The proposed approach performs well, but how much extra training time does it require?

The tasks used for evaluation are sufficient for introducing this approach, but the paper would be much stronger if results on language modeling or a similar task were added. Given the enhanced memory properties of the proposed approach, it would be really exciting to see how much it improves an RNN character language model, for example.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper proposes using norm-preserving matrices for the recurrent connections in recurrent neural nets to overcome, or at least alleviate, the problem of vanishing/exploding gradients. Since parametrizing such matrices directly is difficult, the authors propose a specific family of unitary matrices parametrized by a small number of parameters. Experimental results are very impressive and show that the uRNN model is effective in handling long-term dependencies and beats other models (in particular LSTMs, which were also originally proposed to handle this issue) on several tasks.

Clarity - Justification:

The paper is clearly structured. The problem of vanishing/exploding gradients is properly motivated, and an explanation of why norm preservation addresses part of the issue is provided with formal proof. The authors then describe why arbitrarily parametrizing these matrices in a learnable fashion is difficult and propose an alternative, simpler parametrization. The experiments are relevant and the discussion of results is convincing.

Significance - Justification:

I think the contribution of the paper is potentially very impactful. Vanishing/exploding gradients are a long-standing problem, and uRNNs seem to handle them better than any of the commonly used baselines. Notably, LSTMs have long been used with similar motivations, yet uRNNs can succeed even in cases where LSTMs fail. uRNNs could therefore have a large impact on sequential prediction tasks where very long-term dependencies are useful or essential.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

In general the paper is well written, makes a strong contribution, and has impressive experimental results, and therefore deserves acceptance. In addition to the proposed model, it offers a new perspective on tackling vanishing/exploding gradients, which could expand into a plethora of future work. The work is formally grounded: the authors provide formal proofs rather than relying only on experimental results, which strengthens the conclusions. The experiments are detailed with hyperparameter choices for reproducibility. The specific parametrization of unitary matrices is well motivated rather than appearing arbitrary: it seems expressive enough while keeping time and storage complexity low.

=====