Paper ID: 608
Title: Strongly-Typed Recurrent Neural Networks

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a new design principle for machine learning, and especially for recurrent models, in which each vector has a type; for instance, multiplying a vector by an unrestricted matrix always produces a new type. Recurrent neural networks do not follow this principle, as the horizontal mapping from the previous state to the current state cannot stay in the same type (the type of the state). Connections are drawn to the principle of dimensional homogeneity in physics and to principles from functional programming. Experiments on three new architectures designed using the principle do not improve the state of the art in all cases, but they show that the principle is clearly useful.

Clarity - Justification:
The paper is clearly written.

Significance - Justification:
The paper opens the door to easier design of new recurrent architectures that are constrained and thus expected to be better behaved. This might be a crucial step in the development of neural Turing machine-type solutions.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I think that the physics comparison might not be completely analogous. Say we have a system of springs and masses. The state consists of the locations and velocities of the masses. By solving the differential equation with some time step, we would get the next state as a function of the previous state, with some kind of weight matrix whose units (spring constants, masses, time) cancel out appropriately. So sometimes you can get back to the same type after all (see the sketch at the end of my comments). Perhaps the analogy could be further clarified.

I found the use of "spaghetti-like mess" a bit inappropriate for a scientific text.

The citation style on lines 312-316 could be better: Zaremba and Laurent should be \citet instead of \citep, and Ioffe & Szegedy should be cited immediately after Batch Normalization.

In Section 4.1, "size of the vocabulary" should be "size of the alphabet".

I was hoping that evidence of better-behaved gradients would be shown in the experiments. This might require more space, though. Perhaps you could at least mention whether optimization tricks such as gradient clipping were needed in each case.
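
To make the dimensional-analysis point concrete, here is a minimal sketch of the spring-mass update I have in mind, for a single mass m on a spring with constant k (the explicit-Euler discretization and the symbols are my own, not the paper's):

    % state: position x_t [m] and velocity v_t [m/s]; mass m [kg], spring constant k [N/m = kg/s^2]
    x_{t+1} = x_t + \Delta t \, v_t
    v_{t+1} = v_t - \frac{k}{m} \, \Delta t \, x_t

The "weights" here are not dimensionless: \Delta t carries seconds and (k/m)\Delta t carries 1/s, so \Delta t \, v_t lands back in metres and (k/m)\Delta t \, x_t in m/s. The transition matrix maps the state type to itself only because its entries have compensating units, which is why I think the analogy with unrestricted, dimensionless weight matrices should be spelled out more carefully.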

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes using a typed algebra for semantically interpretable recurrent models, and introduces typed variants of the commonly used recurrent neural network formulations (Elman RNNs, LSTMs, and GRUs).

Clarity - Justification:
I think the paper is structured very nicely. I also really liked the notation: it is simple, unambiguous, and consistent throughout. The type structures given for the typed variants make it very easy to follow the formulations and the type transitions. The authors clearly state their motivation for pursuing a type-consistent way of designing neural architectures.

Significance - Justification:
I think this paper is a good contribution. The most important novelty is providing a new perspective to guide architecture designers. The results, even if not very strong, are promising enough to show that this is a good first step, and they provide enough justification that the proposed formulations deserve further exploration and research.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Notes: One thing that confused me was the initial explanation of how Elman RNNs violate type safety. E.g., in the type structure shown (lines 300-303), how do we conclude that lat_2 and h are different types? Does this follow solely from the axioms of typed linear algebra? I was under the impression that when training an RNN, we assume these lie in the same vector space. That is, the RNN update equation, when considered as a function (parametrized by x_t) of the memory/state h_{t-1}, would be a function from the state space to itself. Isn't this enforced by using the same output layer regardless of timestep? Since all outputs y_t are assumed to lie in the same space and we apply the same function to h_t to get y_t, all h_t (t \in {1, ..., T}) must lie in the domain of the output function/layer. Similarly, all x_t interact with h_t in the same way regardless of the actual timestep, so a similar argument applies. Any elaboration or clarification on this point would be appreciated, since it is the main motivational crux of the paper.
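
To make my reading concrete, here is a minimal sketch (in NumPy-style Python) of the untyped Elman recurrence I have in mind; the function and variable names are my own, not the paper's:

    import numpy as np

    def elman_step(W, V, b, x_t, h_prev):
        # One Elman update: the same V is applied to h_{t-1} at every timestep,
        # so h_{t-1} and h_t are implicitly treated as living in the same space.
        return np.tanh(W @ x_t + V @ h_prev + b)

    def elman_unroll(W, V, b, U, xs, h0):
        # The same output map U is also reused at every step, which is what makes
        # me read all h_t as having one common "state" type.
        h, ys = h0, []
        for x_t in xs:
            h = elman_step(W, V, b, x_t, h)
            ys.append(U @ h)
        return ys

If I understand the paper's argument, the typed view objects precisely to the V @ h_prev term: an unrestricted matrix maps its input to a new type, so adding its output back into the state as if it still had the state's type is what violates type consistency. It would help if the text explained why the weight tying across timesteps illustrated above does not already pin all h_t to one common type.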

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces strongly-typed linear algebra, inspired by ideas from functional programming and physics (dimensional analysis). This theoretical framework allows the authors to define strongly-typed variants of popular RNN architectures, such as GRUs and LSTMs, as well as the classical Elman RNN. The main claim of the paper is that these strongly-typed RNN architectures have much better properties, such as well-defined and interpretable gradients, and that they optimize better. The paper empirically demonstrates that strongly-typed RNNs optimize better, i.e., they reach a lower training error with the same number of parameters as their inspiration sources. The experiments are on word-level and character-level language modeling on some small text datasets.

Clarity - Justification:
About 75% of the paper discusses strongly-typed linear algebra, while the rest details the strongly-typed RNN architectures and the experiments. Although this theoretical framework helps readers understand these RNN architectures, it is not crucial to have the theory discussed so prominently in the paper. When judging a new machine learning model, the most important factors to consider are its theoretical properties (such as relations to dynamic temporal convolutions) and its empirical performance. How the model was derived from inspiration from other fields does not seem crucial at all; it is just a nice-to-know fact that could have been dealt with in one paragraph before the conclusion. If this paper had done more empirical investigation of the RNN architectures and devoted space to that instead of deriving strongly-typed RNNs, it would have been more useful to the machine learning community.

Significance - Justification:
The paper has significant contributions. Recurrent neural networks are still an under-developed machine learning model. The paper introduces a significantly better class of RNN architectures, which have the potential to replace state-of-the-art RNN architectures such as LSTMs, since these strongly-typed variants are able to optimize much better with a smaller number of parameters.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
What is stopping these architectures from that is (1) the nonideal presentation, i.e., instead of detailing the theoretical framework that has merely "inspired" these architectures, the emphasis should be on their actual theoretical and empirical properties; and (2) the lack of a more thorough empirical investigation on real RNN tasks such as speech recognition and large-scale language modeling on the One Billion Word dataset.

=====