We’d like to extend our thanks to the three reviewers for their excitement about the core ideas proposed in the paper.

R1 + R2 + R3: Lack of details. This was a primary concern across reviewers, and in hindsight we entirely agree with this assessment. Details will be provided in the main text and in a substantial supplementary information section to make the results reproducible and the paper more self-contained. We direct you to the section at the bottom, where we consolidate similar queries from all reviewers.

R1 + R3: Lack of comparison with existing models. We compare against: LSTMs and dual-layered LSTMs, feedforward networks, and kNN classifiers trained both with and without learned features. Other neural network implementations trained with pixel-based inputs would invariably suffer from massive episode-to-episode over-fitting, and/or would require tricky ensemble methods. We are unaware of any other architectures that could serve as fair comparisons without significant modification. Other external-memory-based neural network architectures may be capable of meta-learning, e.g., sufficiently altered “MemNets” (Weston et al. 2014). This is a message we hoped to convey, and it was our motivation for placing our NTM-like architecture under the broader class of “MANNs.”

R1: Very specific architecture for the memory module. The choice of LRU access was made to keep the architecture clean. The “Combined” access module (the original NTM implementation) also uses location-based addressing, and explaining the use (or uselessness) of this addressing in our regime may have obscured the paper’s message. Although we did work to streamline the memory module, with LRU out-performing the Combined module (Table 2), we’d like to slightly de-emphasize its specific role and/or power, as the Combined module still performs much better than all other baselines.
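To make the LRU-access discussion above concrete, here is a minimal sketch of least-recently-used write weighting. It is an illustration under assumptions, not our exact implementation: the function name `lru_write_weights` and the fixed scalar gate `alpha` (which would normally be a learned sigmoid gate) are ours for this example; full details will appear in the supplementary materials.

```python
import numpy as np

def lru_write_weights(w_read_prev, w_usage_prev, gamma=0.99, alpha=0.5):
    """Sketch of LRU-style write addressing (illustrative only).

    Usage weights decay by `gamma` each step and accumulate recent reads;
    writes interpolate between the previous read weights and a one-hot
    weighting on the least-used memory slot.
    """
    # Decayed usage: previous usage plus the most recent read weights.
    w_usage = gamma * w_usage_prev + w_read_prev
    # One-hot weighting on the least-used slot.
    w_lu = np.zeros_like(w_usage)
    w_lu[np.argmin(w_usage)] = 1.0
    # Interpolate between recent reads and the least-used slot.
    w_write = alpha * w_read_prev + (1.0 - alpha) * w_lu
    return w_write, w_usage
```

With gamma close to 1, slots that were read recently keep high usage and are protected, so new writes are steered toward stale slots.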
We’d like to re-emphasize the core idea: deep neural networks equipped with external memory modules appear generally capable of meta-learning, a previously unexplored notion, and one that produces impressive results.

R2: Encoding meta-learning behavior a priori. We believe the results are more powerful given that the architecture had to learn this strategy as the task solution. We believe it would be very difficult to tune the parameters a priori to induce this behavior.

R3: Read weights. We agree the equation is not very discriminating. We tried Euclidean distances for the tasks from Graves et al. and found that they exhibited worse performance. Weight sharpening was not used (temperature set to 1), and tuning this parameter may improve performance.

R1 + R2 + R3: Model and training details. Hyperparameters: the choice of these hyperparameters is not critical; many variations were tried, and the meta-learning effect is robust, which speaks to the power of this approach. RMSprop with learning rate = 1e-4 and max learning rate = 5e-1; memory size: 128 slots (R2: since this is accessed at each step, it does not scale to larger memory sizes; this is true, and is an entirely non-trivial problem); memory slot length: 40; LSTM controller hidden units: 200; number of read and write heads: 4 (R3: this is hard-coded, and many low-number values work here); episode lengths: 10 times the number of classes per episode; LRU decay parameter gamma = 0.99. We will further expand on model details in the supplementary materials.

Use of string labels. With one-hot vector labels, the LSTM controller must learn weights to >3000 output units. This adds an extra layer of training difficulty that is avoided by the combinatorial approach. By demonstrating that combinatorial labels can be easily used, we set the scene for scaling up to experiments where thousands of classes are remembered at once.
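As a concrete illustration of the read-weight point above, the following is a minimal sketch of content-based addressing via cosine similarity with an optional sharpening temperature. The function name `read_weights` is ours for this example; as noted, in our experiments the temperature was fixed to 1 (no sharpening).

```python
import numpy as np

def read_weights(key, memory, temperature=1.0):
    """Content-based read weights over memory rows (illustrative sketch).

    Cosine similarities between the key and each memory row are passed
    through a softmax; temperature < 1 would sharpen the distribution.
    """
    eps = 1e-8
    # Cosine similarity between the key and every memory row.
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps
    )
    # Numerically stable softmax with temperature.
    logits = sims / temperature
    logits = logits - logits.max()
    w = np.exp(logits)
    return w / w.sum()
```

Tuning the temperature (or making it a learned parameter, as in weight sharpening) is the avenue we suggest may improve performance.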
Additionally, strings allow for future exploration of “hierarchical classes,” wherein subclasses of a particular high-order class share common string elements (e.g., “aa___”).

Kernel. For generating the regression data, we used the same RBF-like kernel with noise that the GP baseline used for its predictions. It is not clear whether our model implicitly learned the kernel, since it only outputs parameters (mu and sigma) for a predictive Gaussian distribution.

Regression. This is also an episode-based setup. The model outputs mu and sigma values for a predictive Gaussian distribution, with the sigma value plotted as the confidence band in the figures. The log likelihood is computed from these mu and sigma values and the “true” value of the underlying function.

Other. “No further learning” refers to a cessation of gradient updates. The LSTM and feedforward networks do show small signs of “educated guessing”: since the correct labels are provided as time-offset inputs, these networks may do elementary book-keeping of already-seen labels within an episode.
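For clarity on the regression evaluation described above, the log likelihood is the standard Gaussian log density of the true function values under the model’s per-point predictive mu and sigma. The helper name `gaussian_log_likelihood` and the small sigma floor are ours for this sketch.

```python
import numpy as np

def gaussian_log_likelihood(y_true, mu, sigma):
    """Summed log likelihood of true values under per-point
    predictive Gaussians N(mu, sigma^2) (illustrative sketch)."""
    sigma = np.maximum(sigma, 1e-6)  # guard against degenerate sigma
    return np.sum(
        -0.5 * np.log(2.0 * np.pi * sigma ** 2)
        - 0.5 * ((y_true - mu) / sigma) ** 2
    )
```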