Paper ID: 208
Title: Learning Simple Algorithms from Examples

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper looks at an application of reinforcement-learning techniques to acquire generalizing sequential behavior from examples. At one level, that's precisely what all RL algorithms do. However, the level of abstraction used and the possibilities for generalization make the work different from mainstream RL applications.

Clarity - Justification:
Apart from some terminological and citation issues, the descriptions of the algorithms and motivations are clear.

Significance - Justification:
I'm very much of two minds about this paper. I think the way reinforcement learning is being used here is very interesting and important. In many ways, it's closer to what I thought RL would mean when I first started learning about it. From this perspective, I believe that the paper could be of great value in (re-)raising important problems in RL that have not been at the forefront of the community. On the other hand, the results and descriptions feel very anecdotal to me. It's very positive that the authors used a set of related digit-manipulation algorithms (instead of just one or two). However, the algorithm is sufficiently complex and parameter-laden (and the results sufficiently variable) that my intuition is that it is not possible to make compelling generalizations from them. For the kinds of results claimed in the paper, I'd want to see consistent behavior over maybe 20 or so problems. Thus, my support for the paper is mostly in terms of spurring additional work in the space of learning abstract procedures, not because I feel that the authors have made a strong case for their particular approach.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- "#possible automatas" -> "#possible automata". "Automatas" is not a word. "Automaton" is the singular; "automata" is the plural.
- "This helps to overcome a non-stationary environment. We are unaware of any prior work that uses Watkins Q(lambda) for this purpose." Please take a look at the work of Loch and Singh on SARSA(lambda), which makes a similar observation: Loch, John, and Satinder P. Singh. "Using Eligibility Traces to Find the Best Memoryless Policy in Partially Observable Markov Decision Processes." ICML, 1998.
- "a reserve task" -> "a reverse task".
- "does it indirectly by learning a Q-function." -> "does it by learning a Q-function." I suggest changing the wording, as "indirect" is a term of art in reinforcement learning that is used to contrast model-based RL with Q-learning. For example, see: Kearns, Michael, and Satinder Singh. "Finite-sample convergence rates for Q-learning and indirect algorithms." Advances in Neural Information Processing Systems (1999): 996-1002.
- "discount factor ... decreases, making the model greedier": I think the term you want is "more myopic". "Greedy" is used to describe action selection, not value determination. (A short sketch of the distinction appears at the end of this review.)
- "Reinforce" -> "REINFORCE". See: Williams, Ronald J. "On the use of backpropagation in associative reinforcement learning." Neural Networks, 1988, IEEE International Conference on. IEEE, 1988.
- "difference is performance" -> "difference in performance".
- "Coping" is used three times in figures where "Copying" is intended.
- "rnn encoder-decoder" -> "RNN encoder-decoder".
- "Double q-learning" -> "Double Q-learning".
- "Playing atari" -> "Playing Atari".
- "turing" -> "Turing".
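To make the terminology point concrete, here is a minimal sketch in standard RL notation (my notation, not the paper's). The discount factor enters the return,

    G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},

so decreasing \gamma down-weights future rewards relative to immediate ones and makes value estimates more myopic (short-sighted). "Greedy", by contrast, describes action selection,

    \pi(s) = \arg\max_a Q(s, a),

as in epsilon-greedy exploration, and is independent of \gamma.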
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
As clearly stated in the abstract, the paper presents experiments with RNNs learning to perform simple computational tasks.

Clarity - Justification:
The paper is easy to follow. Certain formatting improvements should be made for the camera-ready version, if any:
- Avoid ")(", e.g. use (#151): (GRU; Cho et al., 2014).
- Drop the extra parentheses at #667 -> (and in Joulin & Mikolov, 2015).
- Figures 1 and 2 contain tiny graphics, the subfigure labels can be aligned, and the numbers in Fig. 2 are hard to read.
- Align tables and captions; a table caption should be above the table.

Significance - Justification:
Good presentation, despite quibbles in Section 6.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The authors are missing some relevant work, some of it decades old. For example, almost 30 years ago, the following supervised RNN learned an algorithm that emulates a Turing machine that does a parenthesis-balancing problem. Please cite it (and emphasize differences to what the authors did): Williams, Ronald J., and David Zipser. "Experimental analysis of the real-time recurrent learning algorithm." Connection Science 1.1 (1989): 87-111.

Furthermore, I think the authors are actually not using the original LSTM (1997) but LSTM with forget gates: Gers, F. A., Schraudolph, N., and Schmidhuber, J. (2002). Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3:115-143. The authors write that they use two different sizes of 1-layer LSTM (1997), a gated recurrent unit (GRU) (2014), and a vanilla feed-forward network. But the forget gates actually are "gated recurrent units," and the GRU is a variant of such an LSTM. Please correct the references. (A short sketch of the distinction appears at the end of this review.)

The authors write: "However, most of these approaches use continuous interfaces that permit training via back-propagation of gradients." But there is all this work on RNNs that are reinforcement-learned or evolved, often outperforming previous RL machines, e.g.: F. Gomez, J. Schmidhuber, R. Miikkulainen. Accelerated Neural Evolution through Cooperatively Coevolved Synapses. Journal of Machine Learning Research (JMLR), 9:937-965, 2008.

The authors write: "Genetic algorithms (Holland, 1992; Goldberg, 1989) also can be considered a form of program induction, but are mostly based on a random search strategy rather than a learned one." I think they mean Genetic Programming: Cramer, N. L. (1985). A representation for the adaptive generation of simple sequential programs. Proc. of an Intl. Conference on Genetic Algorithms and Their Applications, CMU, July 24-26, 1985, Hillsdale, NJ: Lawrence Erlbaum Associates. And when they write "but are mostly based on a random search strategy rather than a learned one," do they mean that learning through evolution is somehow worse than learning by other means?

The authors write: "A more relevant work is (Schmidhuber, 2004) which learns an algorithms for the Hanoi tower problem, using a simple form of program induction and incremental learning components." This seems to be an understatement; after all, this is the only method among the cited ones that can incrementally reinforcement-learn general algorithms in an asymptotically optimal way.

Section 6 of Schmidhuber's Deep Learning Survey has lots of additional references on algorithm learning through RNNs trained by RL, evolution, and gradient descent - check it out!

The experiments are interesting.
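To make the LSTM/GRU point concrete, here is a minimal sketch in standard notation (mine, not taken from the paper). With forget gates (Gers et al.), the cell and hidden state are updated as

    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,    h_t = o_t \odot \tanh(c_t),

where f_t, i_t, o_t are sigmoid gates of (x_t, h_{t-1}). The original 1997 LSTM has no forget gate, i.e. effectively f_t = 1. The GRU folds the input/forget pair into a single update gate z_t,

    h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

(up to the sign convention for z_t), which is why it is naturally viewed as a variant of the forget-gate LSTM rather than of the 1997 original.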
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a method for learning algorithms via a deep net situated in a Turing-esque tape context. It presents experiments for learning a few different computations, using various different network structures.

Clarity - Justification:
The paper is well written and clear.

Significance - Justification:
The paper works on an area of interest and gives additional experimental insight into regimes where deep nets work. It fails to deliver lasting theories or abstractions.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
In general, the area of algorithm learning has seen a tremendous amount of interesting work in the past few years, and more work is welcome. The paper is an interesting contribution to the literature, but frustratingly doesn't place its work in the context of the existing deep-net literature on the topic. In the related work section, the authors cite the Neural Turing Machine, the Stack RNN, the Neural DeQue, and End-to-End Memory Networks as pieces of related work, but present comparative results for none. It is unclear to this reviewer whether they are omitted because, as the authors note, they have been shown to successfully learn simple operations such as "copying and sorting". Does this mean that when they are applied to the tape/grid problem they do not succeed? Or are they simply a subcase of a model explored in the paper?

Despite this, and even without complete context, the paper is quite interesting standing on its own, and while it leaves many questions open, it is compelling. As an example, the authors do not find a controller/Q-learning combination that is able to learn all of the algorithms, calling into question the nature of the discovered solution. Nonetheless, the exploration of continuous vs. discrete reward strategies is interesting.

Overall, the paper is interesting, exploring an area of promise and contributing greater understanding, even if it poorly contextualizes its results.

=====