Paper ID: 839
Title: Meta-Learning with Memory-Augmented Neural Networks

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors present a training procedure that allows existing Neural Turing Machines to be applied in a meta-learning setting. In particular, they show that the method enables efficient learning from a small number of instances of each class.

Clarity - Justification:
The paper is very well written and motivated.

Significance - Justification:
The paper describes an interesting application for NTMs that allows them to be used when only a small number of examples of each class is available, but one has the ability to learn across tasks. The learning setting presented, and its general applicability, are original and well motivated.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
It would have been interesting to describe real-world applications where the setting described by the authors is useful. Both tasks presented in the paper are somewhat artificial in that regard.

The NTM memory-usage behavior induced by the time-shifted presentation of labels is relatively straightforward. Would it not have been possible to initialize the parameters to encode that behavior right from the start of training?

In some sense, all the tasks are very similar because of the repeated use of inputs with similar structure. However, they are also very different, because the labels are drastically changed by complete randomization between episodes. This leads to a setup somewhat similar to transfer learning. Would it be relevant to consider this method in a transposed setting closer to domain adaptation, where the input representations differ between episodes?

I wonder to what extent this particular experimental setting favors the architecture. In a more realistic meta-learning scenario, one would expect less direct similarity between the input representations in each episode. On the other hand, the full randomization of labels also seems extreme: in a more realistic scenario, one would expect some similarity between the labels across episodes. I wonder whether this architecture would still succeed in such a context. This discussion seems related to the persistent-memory interference issue discussed in Section 4.2.1. I could imagine scenarios where more subtle strategies for reusing memory between episodes might actually prove beneficial.

In Eq. 4 you access the whole M_t at every step, which does not seem to scale to large memories. I am not an expert in database research, but I am sure that the similarity-based memory retrieval expressed in Eq. 4 has been explored already; it would be interesting to include prior art in this area as well. I appreciate that the 1-NN benchmark in Table 2 goes in that direction, but this seems to be a problem with implications beyond the ML community.
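To make the scaling concern above concrete, here is a minimal numpy sketch of the content-based read as I reconstruct it from Eqs. 1-4 (the function and variable names are mine, not the authors'); note that every single read touches all N rows of M_t:

    import numpy as np

    def content_read(k, M, eps=1e-8):
        # k: (d,) key emitted by the controller; M: (N, d) memory matrix M_t.
        # Eq. 1: cosine similarity between the key and every memory row --
        # this is the O(N) pass over the whole memory mentioned above.
        sims = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + eps)
        # Eq. 3: softmax over similarities gives the read weights w_t^r.
        w = np.exp(sims) / np.exp(sims).sum()
        # Eq. 4: the read vector r_t is a weighted sum of all memory rows.
        return w @ M, w

This exhaustive similarity pass is presumably the operation that approximate-retrieval methods from the database literature could accelerate.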
- Line 374: It is unclear to me what "no further learning" exactly means. I assume you still read/write to memory, but stop updating the parameters of the controller with gradient descent?
- Table 1: How come the feedforward and LSTM models are better than random on the 1st instance? The "educated guess" explanation does not seem to apply to these models.
- Please give slightly more details about the human experiments: how many instances, how many episodes, how many participants, etc.
- Line 493 describes MANN with and without LRU, but it is not clear which variants are referred to in Table 2 (the rows seem to refer to experiments with different numbers of classes).
- Line 744: It is not clear to me which "output variance" is referred to in Fig. 5a,b. Were there many training/prediction runs on the very same generated function?
- Fig. 5a,b: Please describe explicitly in the caption what the blue, red, and black lines show.
- Line 751: I assume you refer here to the log-likelihood under the original GP models that were used to generate each function. Please clarify.
- Line 802: I did not see results in the paper comparing to an LSTM on the regression task.
- Lines 172, 646: typos.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a form of the Neural Turing Machine that is particularly well suited to one-shot learning. It differs from the vanilla NTM in providing an external memory-access mechanism better suited to episodic learning tasks with very few examples. Experimental validation is provided on the Omniglot classification task, with some further validation from a comparison with Gaussian processes on regression.

Clarity - Justification:
The paper is in general well written and easy to follow. However, I found myself referring back to Graves et al.'s NTM paper for additional details, which could usefully be summarized briefly in the present paper for clarity and self-containedness.

Significance - Justification:
One-shot learning is an important problem in machine learning, and in some sense a « holy grail » on the journey to AI. The proposed model exhibits impressive one-shot (or low-shot) performance on the Omniglot dataset compared to a benchmark LSTM model. It would be useful to compare with other models previously proposed for this task (or to state that they are much worse than the LSTM). The regression experiment and comparison with Gaussian processes feels a little like an afterthought, and as such is somewhat disappointing. A more thorough evaluation, perhaps including some UCI regression datasets, would paint a more convincing picture of the model's generalization ability in this setting.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- Line 190: Should D run from 1 to t-1, and not to t? It looks like we are trying to predict the current y given all the previous data points.
- Line 245: « be able use » ==> « be able to use »
- Equation 1: If a cosine similarity is used, K can lie between -1 and +1. In this case, the softmax in Eq. 3 is not very discriminating between very similar vectors (K ≈ 1) and unrelated ones (K ≈ 0), since exp(0) = 1 and exp(1) ≈ 2.718. In other words, as formulated, the resulting difference in read weights between related and unrelated memory entries does not seem very high. The original NTM introduces a β multiplier in this softmax, which can make the result much peakier if needed. The lack of this « weight sharpener » should be discussed.
- Equation 5: It is not immediately clear that the weights in w_t^u should sum to one.
- Section 3.2: The whole section should give more intuition about how the LRUA module operates, and in particular about why the sentence at lines 344-346 is true, i.e. that only two memory slots can be written to, rather than the whole memory. This is presented as a key contribution, but it is not explained clearly; a sketch of my own reading follows after this list.
- Figure 2: The y-axis labels should really be « Fraction Correct », or the tick labels should be multiplied by 100.
- Line 534: Explain what a Dual LSTM is and give a citation.
- Section 4.2.2: What is the episode length for curriculum training?
- Section 4.3: What kernel was used to generate the data? It looks very different from the one used to make predictions.
- Section 4.3: Is there also an episodic setup for this task? If so, what is it, and after how many episodes is Figure 5 plotted? It seems that episodes would be necessary to learn a kind of « implicit kernel » for a given task — this is an object that would be interesting to study in its own right.
- Figure 5: How are the confidence bands computed for the MANN model?
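For concreteness, here is my reading of the LRUA write rule as a short numpy sketch (a reconstruction assuming a single read head; the gate alpha, the decay gamma, and all names are my own choices, not the authors' code):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lrua_write(M, k, w_r_prev, w_u_prev, alpha, gamma=0.95):
        # M: (N, d) memory; k: (d,) key to write.
        # w_r_prev: previous read weights ("most recently used" slots).
        # w_u_prev: previous usage weights w_{t-1}^u.
        # Least-used weights: 1 at the slot with smallest usage, 0 elsewhere.
        w_lu_prev = (w_u_prev <= w_u_prev.min()).astype(float)
        # The write weights interpolate between the most recently read slot
        # and the least-used slot, so for a sharp read at most two slots
        # receive significant weight -- this is how I understand lines 344-346.
        w_w = sigmoid(alpha) * w_r_prev + (1.0 - sigmoid(alpha)) * w_lu_prev
        M = M + np.outer(w_w, k)                   # additive write
        w_u = gamma * w_u_prev + w_r_prev + w_w    # usage decay + accumulation
        return M, w_u

If this reading is correct, a few sentences of this kind in Section 3.2 would make the mechanism much clearer.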
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper focuses on a recurrent neural architecture for meta-learning. The idea is based on the work of Hochreiter et al. (2001), in which one learns a sequential model (here, an RNN) that receives at time t an input x_t together with the label of the preceding input, and has to produce a prediction (a short sketch of this setup is given below, before the detailed comments). The proposed model is a recurrent neural network with a memory module close to the one proposed in Neural Turing Machines. The difference lies in the way information is written to this module (called LRUA): it writes memories to either the least used or the most recently used memory location. Experiments are made both on a one-shot classification problem using the Omniglot dataset and on a regression problem using functions generated by Gaussian processes. The experimental results are very nice and interesting, showing that this model is able to adapt its behavior to small sequences of inputs.

Clarity - Justification:
I find that the paper could be greatly improved, since some parts of the model are really not clear. Many notations are not defined and the paper lacks details. For example, the way the k_t are computed is not explained, and the description of the model is clearly too short. The way the model is trained is also unclear: is cross-validation used? What are the values of the hyper-parameters? What is the size of the memory? The way categories are encoded is not well explained, in particular the choice of using sequences of length 5 with 5 possible characters; why not use simple binary codes, for example? All these missing explanations make the results very difficult to reproduce. While one gets the general idea of the paper, the authors are 'hiding' much of the information needed to really understand how it is done.

Significance - Justification:
I really like the idea proposed in this paper. The basis of the paper is not new, but the authors have clearly done good work in adapting this idea to modern neural network architectures. The fact that the paper uses a particularly well-adapted memory module can be criticized, but the core of the article is very interesting. Concerning the experimental part, the results are nice and the model shows impressive performance. I really liked the fact that the model was also compared to the performance of humans. Since the Omniglot dataset is quite specific, it would be interesting to know the ability of the model to deal with more complex images, such as ImageNet.

The use of curriculum learning does not bring relevant information, because this approach is based on a well-tuned heuristic; I think this part could be removed, for example to make room for more details about the architecture and the experimental setting. Another problem concerns the comparison with other models: the authors compare the model only to simpler versions of itself, and do not use existing one-shot classification models. This could be improved.
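To make the setup concrete, here is a minimal sketch of the episode construction as I understand it (all names, and the use of -1 as the null label on the first step, are my own choices, not the authors' code):

    import random

    def make_episode(dataset, n_classes=5, n_samples_per_class=10):
        # dataset: dict mapping class id -> list of examples.
        # Labels are re-shuffled every episode, so the model must bind
        # classes to labels within the episode itself, not across training.
        classes = random.sample(list(dataset), n_classes)
        labels = random.sample(range(n_classes), n_classes)
        stream = [(x, y) for c, y in zip(classes, labels)
                  for x in random.sample(dataset[c], n_samples_per_class)]
        random.shuffle(stream)
        # Time-offset presentation: the model sees x_t together with the
        # previous label y_{t-1} and must predict the current label y_t.
        prev_labels = [-1] + [y for _, y in stream[:-1]]
        inputs = [(x, py) for (x, _), py in zip(stream, prev_labels)]
        targets = [y for _, y in stream]
        return inputs, targets

Writing the setup out like this also makes the reproducibility questions above precise: the number of classes per episode, the number of samples per class, and the label encoding all need to be specified for the results to be reproduced.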
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
As explained before, the paper proposes an interesting contribution to the field, but the way it is written and the lack of detail are a big problem, since one is not able to reproduce the model. The contribution would have to be reinforced by using a less specific memory module since, as is often the case in work on memory networks, the architecture of this module is perhaps the main reason for the performance of the model. The authors should provide evidence of how strong this particular choice is.

Pros:
* General idea
* Experimental results

Cons:
* Lack of details
* The way categories are encoded is not clear
* Lack of comparison with existing models
* Very specific architecture for the memory module, and no experiments measuring the impact of this choice

=====