Paper ID: 920
Title: Associative Long Short-Term Memory

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The main contributions are:
- This paper proposes a new recurrent network architecture in which the LSTM cell state is replaced by an associative memory. New information can be added to the memory by adding the product of a key-value pair, and this information can then be queried by multiplying the memory with the complex conjugate of that key. Redundancy can be built in by having multiple such memories, each using a different permutation of the key (see the illustrative sketch appended after this review).
- The experiments include toy tasks as well as language modelling tasks designed to test different aspects of the system. They show that LSTM models equipped with this kind of memory are able to remember more information and learn faster than standard LSTMs. Other strong baselines are also compared against.

Clarity - Justification:
The paper is well written with clear explanations. The choice of tasks used for the experiments is judicious, and the discussion of the results is insightful.

Significance - Justification:
Incorporating an associative memory into LSTMs using complex-valued key-value pairs is a novel contribution. This form of storage has desirable properties which make it amenable to distributed information-processing systems like neural nets. Intuitively, it seems easier to remember something about a memory (a "key") than to remember where the memory is stored (an index in a table).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper proposes an interesting and novel way of adding associative memory to recurrent nets. The approach has several nice properties, for example:
- Redundancy can be increased to improve retrieval accuracy without affecting the number of trainable parameters.
- There is a smooth trade-off between retrieval accuracy and the amount of information stored.
Experiments clearly show that this improves performance over standard LSTMs as well as most other baselines (in expected ways).

Some suggestions for discussion:
- Would having sparse keys help reduce interference?
- How does the distribution of keys and values affect the storage? For example, what if the values come from a dense distribution but the keys are sparse? What do the learned distributions look like?
- In the current model, the dimensionality of the retrieved memory is the same as the dimensionality of the key. Would the model be more powerful if this were not a constraint? It seems we should be able to create hierarchical data stores: an incoming key does a look-up and is concatenated (or otherwise combined) with the result of the look-up to create a bigger key that can then look up a larger memory.
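The following is a minimal, illustrative NumPy sketch (not the authors' code) of the write/read mechanism summarised above: writing adds the element-wise product of a complex key and value, reading multiplies by the key's complex conjugate, and redundancy comes from keeping several copies of the memory, each addressed through a fixed permutation of the key. The dimensionality, number of copies, and all function names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64           # dimensionality of complex keys/values (illustrative)
N_COPIES = 4     # number of redundant memory copies (illustrative)
perms = [rng.permutation(D) for _ in range(N_COPIES)]  # one fixed permutation per copy

def random_key():
    """Unit-modulus complex key: only the phases carry information."""
    return np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, D))

def write(memories, key, value):
    """Add the key-value binding (element-wise complex product) to every copy."""
    for c in range(N_COPIES):
        memories[c] += key[perms[c]] * value

def read(memories, key):
    """Unbind with the conjugate key in each copy, then average to reduce crosstalk."""
    return np.mean([np.conj(key[perms[c]]) * memories[c] for c in range(N_COPIES)], axis=0)

# Store five items, then retrieve the third one by its key.
memories = [np.zeros(D, dtype=complex) for _ in range(N_COPIES)]
keys = [random_key() for _ in range(5)]
values = [rng.standard_normal(D).astype(complex) for _ in range(5)]
for k, v in zip(keys, values):
    write(memories, k, v)
retrieved = read(memories, keys[2])
# High correlation with the stored value; approaches 1.0 as N_COPIES grows
# or as fewer items share the memory.
print(np.corrcoef(retrieved.real, values[2].real)[0, 1])
```

Because the stored item's key has unit modulus, conj(key) * key * value recovers that item's contribution exactly; the remaining crosstalk terms carry random phases that are decorrelated across the permuted copies, so averaging over copies attenuates them.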
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a new method to augment recurrent neural networks with extra memory without increasing the number of network parameters. The proposed RNN architecture implements an associative memory using limited-capacity storage, which becomes noisier with each retrieval. A noise-reduction strategy is proposed that uses redundant copies of the stored information (a rough numerical illustration of this trade-off is appended after this review). Experiments on synthetic benchmarks as well as real data demonstrate the benefits of the proposed approach.

Clarity - Justification:
Overall I found the paper quite easy to follow, with a few exceptions noted in the detailed comments. The point made in Section 7 (Why complex numbers?) should probably be moved earlier in the paper. Also, the background section could probably benefit from some kind of illustration.

Significance - Justification:
I think this is quite an interesting approach, and the experiments do a good job of demonstrating its effectiveness both on synthetic tasks and on real data. My main reservation is that perhaps more space should have been devoted to real tasks, and that some hyperparameter choices probably deserve more justification and/or experimentation.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Section 6
- A mini-batch size of 2 is an interesting choice and perhaps needs to be motivated better; the concern is that different methods may be more or less susceptible to variance in the gradient during training. Similarly, gradient clipping is a fairly standard approach and the decision not to use it might place LSTMs at a disadvantage compared to the proposed approach -- I'd like to see a sentence or two of motivation here (perhaps just as a footnote). I assume it actually *was* used for the Wikipedia prediction task, which uses stacked layers?
- Unless I missed it, I don't believe the paper compares to RNNs augmented with attention mechanisms. This seems like an omission given the motivation in Section 1. Why was this not done?
- For the arithmetic task it was necessary to augment the model to perform multiple reads at a time. However, the effect of this generalization was not investigated in the other tasks. Is there any reason to expect it to be helpful in other settings?
- Figure 10: some of the curves stop before others. I assume this is because held-out dev data was used as a stopping criterion, but I don't believe this is actually stated in the text.
- I would be interested to know how the Associative LSTM behaves as the network size is adjusted.
Section 7
- Perhaps move this point earlier in the paper.
Section 8
- One concern for the adoption of the proposed method as a neural network building block is that a correct and efficient implementation of Associative LSTMs may prove challenging. Are there any plans to release code to reproduce the experiments in the paper in an open-source NN framework?
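A rough, self-contained numerical illustration (with assumed, arbitrary parameters not taken from the paper) of the capacity/noise trade-off summarised above: retrieval error grows as more items share the same storage and shrinks as the number of redundant permuted copies increases. The function name and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 128  # dimensionality of complex keys/values (illustrative)

def retrieval_rmse(n_items, n_copies):
    perms = [rng.permutation(D) for _ in range(n_copies)]
    keys = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, (n_items, D)))  # unit-modulus keys
    values = rng.standard_normal((n_items, D))
    # Each copy stores the sum of (permuted key) * value bindings.
    mems = [sum(keys[i, p] * values[i] for i in range(n_items)) for p in perms]
    # Read item 0 back from each copy with its conjugate permuted key, then average.
    est = np.mean([np.conj(keys[0, p]) * m for p, m in zip(perms, mems)], axis=0)
    return float(np.sqrt(np.mean((est.real - values[0]) ** 2)))

for n_items in (2, 8, 32):
    for n_copies in (1, 4, 16):
        print(f"items={n_items:2d} copies={n_copies:2d} rmse={retrieval_rmse(n_items, n_copies):.3f}")
```

In this sketch the error grows roughly with the square root of the number of stored items and shrinks roughly with the square root of the number of copies, which matches the smooth accuracy/capacity trade-off noted in Review #1 and the noise-reduction strategy summarised in this review.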
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a differentiable associative memory structure borrowing from ideas in holographic reduced representations. In particular, the authors propose a modification of LSTMs that enables sequential reading and writing using a binding scheme in complex space in place of the normal cell-update "write" / gated-tanh "read". The authors perform experiments on a wide range of synthetic tasks in addition to character-based Wikipedia experiments.

Clarity - Justification:
There is a bit of background required to understand this paper, but the main ideas are reasonably clear.

Significance - Justification:
Overall, I think this paper is a solid contribution to the growing body of work on sequence-modelling tools and should be accepted. The authors provide a novel solution for augmenting current models with associative key-value type structures. Their solution is end-to-end and introduces new tools that will likely influence other researchers.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I finished the paper wondering about comparisons to Neural Turing Machines, which the authors reference when describing the advantages of their model but do not compare against experimentally at all. An external-memory architecture (e.g. MemNet or stack-augmented RNNs) is in some ways a much more appropriate baseline than uRNNs or permutation RNNs. I also wondered whether using a different (larger) mini-batch size (2 seems like a very particular choice) would change the experimental picture at all. Addressing these points in the paper would make the experimental section more convincing.

=====