Paper ID: 1316
Title: PHOG: Probabilistic Model for Code

===== Review #1 =====
Summary of the paper (Summarize the main claims/contributions of the paper.):
see detailed comments

Clarity - Justification:
see detailed comments

Significance - Justification:
see detailed comments

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper describes a new probabilistic grammar that relaxes the independence assumptions made by a PCFG. It follows the lines of early work on probabilistic grammars for natural language (e.g., Michael Collins's thesis) in that the idea is to annotate the nonterminals with summaries of the tree that has been generated so far. It is applied to an interesting problem, language modelling for source code.

The framework is incredibly general, more so than even the authors seem to acknowledge. Although the authors point out that the class of HOGs is bigger than the class of context sensitive languages, it is clear that if p is unrestricted, a HOG can simulate a Turing machine fairly easily: p can just run the Turing machine and return the next token in its final output. Even if p is restricted to TCond, a HOG can simulate a TM by defining a grammar over TM execution traces, remembering the TM state and head location in the nonterminal. I would be concerned that the new probabilistic framework would be less useful in future work if it does not restrict the class of languages in some way.

What smoothing method is used for the n-gram model? Is it Witten-Bell, as for the PHOGs? Were any other smoothing methods tried? I am quite surprised that Witten-Bell was used, as Chen and Goodman found that it has "mediocre performance" compared to, e.g., modified Kneser-Ney. Also, I did not understand the smoothing section: because the context can contain multiple tokens, the backoff order matters, and this is not clear a priori for a non-n-gram model. This comes up clearly in the old lexicalized PCFG literature, in which baroque smoothing schemes with carefully tuned backoff orders were used. Is the backoff simply in the order that was output by the TCOND program?

It is unfortunate that no results are presented about what the learned contexts actually *are*. E.g., do they typically find other previous uses of the same nonterminal? What TCOND programs are actually learnt? What choices of context are learned by the model? What is their average length, etc.? This is especially a concern to me because it seems (perhaps unfairly) that TCOND programs would be very brittle to the structure of the tree: e.g., I can imagine a simple TCOND program that would connect the two uses of "defer.promise" in Figure 1, but only a small change to the code in the ellipses (e.g., using another identifier) could break this program. How do the inferred programs get around this?

The discussion of empty terminal symbols in the results is very confusing. Typically these are not produced by programming language parsers. As these account for 45% of terminals, this is worrying. What are these and where do they come from?

n-gram model baseline: It is typical to run a language model over the terminal symbols only, rather than over a linearization of the AST. This is what Hindle et al. and Allamanis et al. do. How would such a model compare to the AST models you present?
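To make the kind of baseline I have in mind concrete, here is a minimal sketch of a token-level n-gram model with Witten-Bell backoff (purely illustrative: the class, padding symbols, and corpus format are my own choices, not taken from the paper):

```
from collections import defaultdict


class WittenBellNgram:
    """Token-level n-gram LM with Witten-Bell backoff smoothing (sketch).

    Counts are kept for every history length from 0 up to order-1, and the
    probability of a word recursively interpolates the maximum-likelihood
    estimate with the estimate under the next shorter history.
    """

    def __init__(self, order=3):
        self.order = order
        self.counts = defaultdict(int)          # history + (word,) -> count
        self.history_counts = defaultdict(int)  # history -> count
        self.followers = defaultdict(set)       # history -> distinct next words
        self.vocab = set()

    def train(self, sentences):
        for tokens in sentences:
            padded = ["<s>"] * (self.order - 1) + tokens + ["</s>"]
            self.vocab.update(padded)
            for i in range(self.order - 1, len(padded)):
                w = padded[i]
                for k in range(self.order):
                    hist = tuple(padded[i - k:i])
                    self.counts[hist + (w,)] += 1
                    self.history_counts[hist] += 1
                    self.followers[hist].add(w)

    def prob(self, word, history):
        return self._wb(word, tuple(history[-(self.order - 1):]))

    def _wb(self, word, history):
        if len(history) == 0:
            # Unigram level: interpolate with a uniform distribution.
            c, t = self.history_counts[()], len(self.followers[()])
            uniform = 1.0 / max(len(self.vocab), 1)
            if c == 0:
                return uniform
            lam = c / (c + t)
            return lam * self.counts[(word,)] / c + (1 - lam) * uniform
        c, t = self.history_counts[history], len(self.followers[history])
        backoff = self._wb(word, history[1:])
        if c == 0:
            return backoff
        lam = c / (c + t)
        return lam * self.counts[history + (word,)] / c + (1 - lam) * backoff


# Toy usage over terminal tokens only (not a linearized AST).
lm = WittenBellNgram(order=3)
lm.train([["var", "x", "=", "1", ";"], ["var", "y", "=", "x", ";"]])
print(lm.prob("=", ["var", "y"]))
```

Even a rough comparison against a token-level model of this form would make the reported n-gram numbers easier to interpret, and would also make explicit which smoothing method and backoff order each baseline uses.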
Minor points:

line 772: "Naive Bayes and SVM do not return probabilities at query time": For Naive Bayes, this statement is false. For SVMs, the situation is more complicated: there are extensions of Platt scaling to multiclass problems, but I am not sure if these are commonly used.

line 270: I think the set "Trees" may be undefined?

line 255: Is this just preorder traversal?

Semantics of HOG: It might be a bit clearer if sigma were characterized informally as the label (terminal or nonterminal) of each node in the tree.

Section 2: I think that this illustration is a bit abstract unless the context in the example is explicitly defined.

===== Review #2 =====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper introduces a probabilistic higher order grammar (PHOG) for modeling JavaScript abstract syntax trees. The paper claims that this model is a significant improvement on the standard baselines for modeling code: n-grams and probabilistic context free grammars.

Clarity - Justification:
I found the paper very easy to follow.

Significance - Justification:
I found the paper to be a somewhat straightforward idea. Still, there's value in carving out the idea, and the experimental results are compelling.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- I found myself wondering why the PHOG does not suffer from overfitting more than n-grams or PCFGs. I would expect that conditioning on the context virtually guarantees that every production rule is unique, so that the PHOG would assign 0 probability to every tree in the test set. I think it would be valuable to address that issue directly and discuss why that is not the case.
- Log-bilinear parameterizations of n-grams (sometimes called neural language models) are often significantly better than smoothed count-based variants. It would be good to add this as a baseline (and possibly consider log-bilinear parameterizations for the PHOG); a sketch of the kind of model I mean follows.
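Here is a minimal sketch of a log-bilinear n-gram model in the style of Mnih and Hinton (illustrative only; the dimensions, learning rate, and plain-SGD training are my own choices, not the authors'):

```
import numpy as np


class LogBilinearLM:
    """Minimal log-bilinear n-gram language model (sketch).

    Each context position i has its own transform C_i; the predicted
    embedding r_hat = sum_i C_i r_{w_i} is scored against every word
    embedding by a dot product plus a bias, then normalised by a softmax.
    """

    def __init__(self, vocab_size, dim=50, context=2, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.R = 0.01 * rng.standard_normal((vocab_size, dim))    # word embeddings
        self.C = 0.01 * rng.standard_normal((context, dim, dim))  # per-position transforms
        self.b = np.zeros(vocab_size)                             # word biases
        self.lr = lr

    def _predict_embedding(self, ctx_ids):
        return sum(self.C[i] @ self.R[w] for i, w in enumerate(ctx_ids))

    def probs(self, ctx_ids):
        scores = self.R @ self._predict_embedding(ctx_ids) + self.b
        scores -= scores.max()  # numerical stability
        exp = np.exp(scores)
        return exp / exp.sum()

    def sgd_step(self, ctx_ids, target):
        """One SGD step on -log p(target | ctx_ids); returns the loss."""
        p = self.probs(ctx_ids)
        r_hat = self._predict_embedding(ctx_ids)
        grad_scores = p.copy()
        grad_scores[target] -= 1.0                  # d(-log p)/d(scores)
        grad_rhat = self.R.T @ grad_scores          # gradient reaching r_hat
        grad_R = np.outer(grad_scores, r_hat)       # output-embedding gradient
        grad_C = [np.outer(grad_rhat, self.R[w]) for w in ctx_ids]
        grad_ctx = [self.C[i].T @ grad_rhat for i in range(len(ctx_ids))]
        self.b -= self.lr * grad_scores
        self.R -= self.lr * grad_R
        for i, w in enumerate(ctx_ids):
            self.C[i] -= self.lr * grad_C[i]
            self.R[w] -= self.lr * grad_ctx[i]
        return -np.log(p[target] + 1e-12)


# Toy usage: a vocabulary of 5 token ids with a two-token context.
lm = LogBilinearLM(vocab_size=5, dim=8, context=2)
loss = lm.sgd_step(ctx_ids=[0, 3], target=2)
```

A parameterization of this kind could in principle also be fed the PHOG's conditioning context, which would help separate the benefit of the richer context from the benefit of the parameterization.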
===== Review #3 =====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors present a generative model for program source code which they refer to as a probabilistic higher order grammar (PHOG). A PHOG generalizes a probabilistic context free grammar (PCFG). In a PCFG, a set of production rules is associated with each non-terminal symbol. In a HOG, the production rules depend on both the non-terminal symbol and a context, which is a function of the (partial) tree constructed so far. Just as in a PCFG, the probabilities associated with production rules in a PHOG can be estimated from data by simple counting. The challenging part in learning a PHOG from data is defining a good context function. To do so, the authors introduce TCOND, a language for programs that extract contextual information via a walk on the abstract syntax tree (AST) of the program. A genetic programming approach, which is unfortunately not discussed in much detail, is then used to perform program induction in the TCOND language to learn a function that extracts contextual information with high predictive power for each symbol in the PHOG. Evaluation considers the task of predicting deleted nodes in the AST using a large corpus of JavaScript examples scraped from GitHub. Results are compared against a PCFG, an n-gram model, and naive Bayes and SVM classifiers. PHOG generally performs well compared to the other methods.

Clarity - Justification:
This is a well-written paper that requires careful reading but is reasonably easy to follow. Clarity could be further improved on a few points (see below), but on the whole this is a strong paper.

Significance - Justification:
While I am not an expert on code prediction problems, the PHOG seems like a useful abstraction that admits PCFGs, PCSGs, and n-gram models as special cases. I am somewhat amazed that the authors were able to solve the program induction problem for the functions that construct the context, which would a priori appear to be a highly intractable problem. Empirically, the PHOG appears to have good predictive power relative to other models. This paper is in my opinion a clear accept.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- Given that the largest technical challenge in this work is learning the conditioning functions, it is unfortunate that the genetic programming approach used to perform this task is discussed in relatively little detail. While there is relatively little fat to trim in this paper, it would be helpful to provide more details in the supplementary material.
- It would be good if the authors could explain in the introduction in what respect the proposed grammar is "higher order". I suspect this refers to the fact that the context is constructed using functions that are themselves defined in terms of a grammar, though this use of "higher order" seems a bit loose to me.
- After a quick search I learned that "higher order grammar" is apparently an existing term of art in linguistics that has something to do with grammars defined in terms of higher-order logic:
```
https://en.wikipedia.org/wiki/Higher_order_grammar
http://www.ling.ohio-state.edu/~hana/hog/
```
This HOG, as far as I can tell, is not related to the one under consideration here.
- Definition 3.3: The notation α[γ] → β1...βn is ambiguous on a first reading, which can lead to confusion. Initially I took this notation to mean that α[γ] is a list of production rules for context γ (as could also be concluded from Fig 1c). However, it later becomes clear from context that α[γ] → β1...βn is in fact a single production rule that produces children β1...βn in the AST (and that there may be multiple production rules α[γ] → β with different β in R). A small sketch of the reading I settled on appears after the review.
=====
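Sketch referenced in the Definition 3.3 comment above (illustrative only; the rule names, the context representation, and the interpolation with the context-free estimate are my own assumptions, not the authors' estimator):

```
from collections import defaultdict


class PhogRuleTable:
    """Reading of Definition 3.3 used above (sketch, not the authors' code).

    Each production alpha[gamma] -> beta_1...beta_n is a single rule whose
    left-hand side is the pair (nonterminal alpha, context gamma); several
    rules may share the same (alpha, gamma) with different right-hand sides.
    Probabilities are estimated by counting and, to avoid assigning zero
    probability to unseen (alpha, gamma) pairs, interpolated with the plain
    PCFG estimate that ignores the context.
    """

    def __init__(self, backoff_weight=0.5):
        self.ctx_counts = defaultdict(int)   # (alpha, gamma, rhs) -> count
        self.ctx_totals = defaultdict(int)   # (alpha, gamma) -> count
        self.pcfg_counts = defaultdict(int)  # (alpha, rhs) -> count
        self.pcfg_totals = defaultdict(int)  # alpha -> count
        self.backoff_weight = backoff_weight

    def observe(self, alpha, gamma, rhs):
        """Record one application of the rule alpha[gamma] -> rhs."""
        rhs = tuple(rhs)
        self.ctx_counts[(alpha, gamma, rhs)] += 1
        self.ctx_totals[(alpha, gamma)] += 1
        self.pcfg_counts[(alpha, rhs)] += 1
        self.pcfg_totals[alpha] += 1

    def prob(self, alpha, gamma, rhs):
        """P(rhs | alpha, gamma), interpolated with P(rhs | alpha)."""
        rhs = tuple(rhs)
        pcfg_total = self.pcfg_totals[alpha]
        p_pcfg = self.pcfg_counts[(alpha, rhs)] / pcfg_total if pcfg_total else 0.0
        ctx_total = self.ctx_totals[(alpha, gamma)]
        if ctx_total == 0:
            return p_pcfg
        p_ctx = self.ctx_counts[(alpha, gamma, rhs)] / ctx_total
        w = self.backoff_weight
        return (1 - w) * p_ctx + w * p_pcfg


# Two rules that share the same left-hand side alpha[gamma] but differ in rhs.
table = PhogRuleTable()
table.observe("Expr", ("Call", "promise"), ["MemberExpr", "Args"])
table.observe("Expr", ("Call", "promise"), ["Identifier"])
print(table.prob("Expr", ("Call", "promise"), ["Identifier"]))
```

Interpolating with the context-free counts is only one possible way to avoid zero probabilities for unseen contexts; the Witten-Bell smoothing described in the paper presumably plays this role.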