We thank the reviewers for their thoughtful comments. We will address all suggestions in the final version.

Reviewer 1:

Q: Why does the PHOG not suffer from overfitting more than n-grams or PCFGs?
A: To avoid overfitting we add regularization that penalizes longer and more complicated programs (as mentioned in Section 4.3). As a result, our programs contain 2-4 write and 2-8 move instructions, a significantly shorter context than the one obtained from a 10-gram language model. Further, we use smoothing as in language models.

Q: Log-bilinear parameterization of n-grams is often better than count-based variants. It would be good to add this as a baseline and consider log-bilinear parameterizations for the PHOG.
A: Good point. Indeed, as shown by Maddison & Tarlow, 2014, such models slightly improve the log-probability over basic n-grams in the domain of modelling source code as well. Although they have higher training and query times, a log-bilinear variant of PHOG (e.g., one that combines multiple programs with learned weights) is an interesting future extension.

Reviewer 2:

Q: Given that the technical challenge in this work is learning the conditioning functions, it would be helpful to provide more details in the supplementary material.
A: Thank you for the suggestion. We will include more details on the challenges of learning them.

Q: In what respect is the proposed grammar "higher order"? Does it refer to the fact that the context is constructed using functions that are themselves defined in terms of a grammar?
A: This is correct -- we use the term higher-order to mean that the prediction function takes as input another (TCond) program/function that we learned. We are happy to change the name if it is confusing.

Q: Def. 3.3: The notation α[γ] → β is ambiguous. Does it mean that α[γ] → β is a single production rule that results in children β, and that there may be multiple production rules α[γ] → β with different β in R?
A: Yes, this is correct.

Reviewer 3:

Q: Although the new probabilistic framework is incredibly general (a HOG can simulate a Turing machine), I’m concerned that it would be less useful in future work if it does not restrict the class of languages in some way.
A: Restrictions on the DSL can make the grammar less expressive than a full Turing machine; e.g., our DSL has no loops.

Q: What is the smoothing method used for the n-gram model, and were other methods tried?
A: Both the n-gram model and PHOG use Witten-Bell smoothing to allow a direct comparison. We tried other smoothing methods, including modified Kneser-Ney. Even though this method improved the log-probability slightly, it did not lead to an improvement in overall accuracy.

Q: Is the backoff simply in the order that was output by the TCond program?
A: Yes.

Q: What are the contexts learned by TCond programs?
A: The learned context often includes traversing to one or more positions in the AST that are similar to the position being predicted. For example:
-- for keys in a JSON object (i.e., a dictionary), it is the position of the key (denoting how many other keys were already defined) and the previously defined key;
-- for strings, the learned context includes the two previous string values seen in the program;
-- for array indices, it is the index used previously.
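For concreteness, the sketch below illustrates how such a program accumulates a conditioning context by walking the AST. This is our simplified illustration only: the Node representation and the instruction names (Up, Left, WriteType, WriteValue, WritePos) are assumptions made for this example and do not reproduce the exact TCond DSL or the learned programs.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)  # eq=False: compare nodes by identity (parent links would otherwise recurse)
class Node:
    type: str                        # non-terminal kind, e.g. 'Property'
    value: Optional[str] = None      # terminal value, if any
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

def run_tcond(program: List[str], node: Node) -> List[str]:
    """Interpret a sequence of move/write instructions starting at the node
    being predicted and return the accumulated conditioning context."""
    context: List[str] = []
    for instr in program:
        if instr == "Up" and node.parent is not None:
            node = node.parent                                   # move to the parent
        elif instr == "Left" and node.parent is not None:
            siblings = node.parent.children
            node = siblings[max(siblings.index(node) - 1, 0)]    # move to the left sibling
        elif instr == "WriteType":
            context.append(node.type)                            # record the non-terminal
        elif instr == "WriteValue" and node.value is not None:
            context.append(node.value)                           # record the terminal value
        elif instr == "WritePos" and node.parent is not None:
            context.append(str(node.parent.children.index(node)))  # record the child position
    return context

# Example: predicting the second key of an object; the program records the key's
# position (how many keys were already defined) and the previously defined key.
obj = Node("ObjectExpression")
k1 = Node("Key", value="name", parent=obj)
k2 = Node("Key", value="age", parent=obj)
obj.children = [k1, k2]
print(run_tcond(["WritePos", "Left", "WriteValue"], k2))  # prints ['1', 'name']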
Q: To what extent are TCond programs dependent on the structure of the tree? Can a small change to the code break the program?
A: The learned programs try to find a context that works well for the most common tree structures seen in the training data. For example, the program learned for predicting APIs works well for the most common method invocation type ‘object.method(...)’ but worse for the chained invocation ‘object.method().method(...)’, which is rare. To address this problem, an interesting direction for future work is to extend the TCond language with branches, allowing the appropriate program to be selected depending on the current context.

Q: What are the empty terminals and where do they come from?
A: These empty values are produced by the Acorn parser; other parsers might replace them with a small number of constant values denoting the type of the parent non-terminal, such as ‘if’, ‘for’ and others. That is, predicting them is mostly trivial for all of the models considered in our work. Exceptions are continue and break statements, where the value is mostly empty but can also include an optional jump target.

Q: It is typical to run a language model over the terminal symbols only; how does such a model compare to the AST models you present?
A: Based on your question, we tried such a model on our dataset. The model can be easily simulated by conditioning on the previous n terminal values, except that when a terminal value is empty, we condition on the parent non-terminal. Such a model slightly improves the overall accuracy over the n-gram models shown in Table 1 (by 2%). However, it still has worse precision than PHOG (by 8%).
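To make the simulation above concrete, here is a rough sketch of how such a terminal-only context can be extracted. It is our illustration only; the (terminal_value, parent_nonterminal) token representation and the terminal_context function are assumptions made for this example, not the code used in our experiments.

from typing import List, Tuple

def terminal_context(tokens: List[Tuple[str, str]], pos: int, n: int = 3) -> List[str]:
    """tokens is the program's terminals in left-to-right order as
    (terminal_value, parent_nonterminal) pairs; returns the context used to
    predict the terminal at position pos."""
    context: List[str] = []
    for value, parent in tokens[max(0, pos - n):pos]:
        # An empty terminal carries little information by itself, so fall back
        # to the type of its parent non-terminal (e.g. 'IfStatement').
        context.append(value if value else parent)
    return context

# Example: predicting the terminal at position 4 from the three preceding tokens.
tokens = [("x", "Identifier"), ("", "IfStatement"), ("y", "Identifier"),
          ("42", "Literal"), ("z", "Identifier")]
print(terminal_context(tokens, 4))  # prints ['IfStatement', 'y', '42']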