Paper ID: 971
Title: A Convolutional Attention Network for Extreme Summarization of Source Code

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a convolutional attentional network to perform extreme summarization of source code into short, descriptive summaries. The authors demonstrate the effectiveness of the proposed method on popular Java projects and perform quantitative and qualitative experiments.

Clarity - Justification:
A well-organized and clearly written paper. I encourage the authors to consider sharing their source code for reproducibility.

Significance - Justification:
This is a theoretically sound paper, and it introduces a new model that contributes not only to the machine learning community but also to the software engineering community.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This is a well-written and interesting paper. The authors introduce a neural convolutional attention model for extreme summarization of source code. The experimental section is detailed and satisfying, including both quantitative and qualitative parts. Overall, this paper seems to contribute well to the machine learning community.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a deep learning model for creating “extreme summaries” of code (which are essentially just method names) based on the method body. The basic idea is to use an RNN architecture to generate the tokens (or subtokens) of the output summary one by one, based on embeddings of the input tokens. At each step, an attention mechanism is used to combine the embeddings (and, hopefully, to focus on the tokens that are most relevant). What makes this different from the “standard” attention model is that, instead of using a bidirectional RNN on the input embeddings, this paper uses convolutional features, which are translation-invariant. The proposed method ignores what the code actually does, but it is good enough to handle identifier names (and even to handle identifier names that are not already in the training-set vocabulary). Impressively, the authors apply this to real-life GitHub repositories and show that the convolutional attention mechanisms are indeed better than the standard approach.

Clarity - Justification:
Overall the paper is well written and was a pleasure to read.

Significance - Justification:
Dealing with out-of-vocabulary names (and even in-vocabulary names) is quite a hard challenge which many previous works tend to punt on (e.g., by anonymizing them or assigning `var1`, `var2`, etc.), but this paper takes a pretty nice approach. The results show a clear improvement over standard attention, but are modest compared to the tf-idf baseline. What is missing is a strong motivation for why we would expect convolutional attention to be better than the BRNN baseline, since BRNNs also have some translation invariance from weight sharing across time. And why would BRNNs be so much worse than tf-idf? I found the dataset very interesting, as well as some of the examples the authors showed; however, it is still unclear to me where I would actually use such an “extreme summarization” system in my day-to-day programming. It might be fun to try to learn something like this from disassembled code instead of code that was already written to be human-readable.
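To make the convolution-vs-BRNN question concrete, this is roughly the computation I have in mind for the convolutional attention features (a minimal numpy sketch under my own assumptions; the sizes, padding choices, and names are mine, not the paper's):

    import numpy as np

    def conv1d(x, W):
        """Valid 1-D convolution: x is (length, d_in), W is (k, d_in, d_out)."""
        k = W.shape[0]
        return np.stack([np.tensordot(x[i:i + k], W, axes=([0, 1], [0, 1]))
                         for i in range(x.shape[0] - k + 1)])

    rng = np.random.default_rng(0)
    L, d_emb, w1, w2, k = 12, 16, 8, 8, 3        # toy sizes, not the paper's
    E = rng.normal(size=(L, d_emb))              # embeddings of the body tokens
    K1 = rng.normal(size=(k, d_emb, w1))         # first convolution kernel
    K2 = rng.normal(size=(k, w1, w2))            # second convolution kernel

    def same_pad(x):
        return np.pad(x, ((k // 2, k // 2), (0, 0)))  # zero-pad to keep length

    feat = np.maximum(conv1d(same_pad(E), K1), 0)    # ReLU after the first conv
    feat = conv1d(same_pad(feat), K2)                # second conv (see my Line 244 question below)
    attn = np.exp(feat @ rng.normal(size=w2))
    attn /= attn.sum()                               # softmax over body positions
    context = attn @ E                               # attention-weighted embeddings

If this reading is right, the only translation invariance comes from sharing K1 and K2 across positions, which is the same flavor of weight sharing a BRNN gets across time; hence my question.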
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Some other miscellaneous comments:
+ Line 244: don't we need to zero-pad again at the second convolution? And is there any reason not to have a ReLU after the second convolution? I also wonder if batch normalization would help (instead of L2 normalization).
+ Line 310: should this call attention_features(…) here to get L_feat?
+ Line 355: "marginalizing over the copy probability" is somewhat confusing phrasing, because the probability is not itself what is being marginalized over. Related to this, what is the size of K_lambda (the convolution kernel for the lambda)?
+ Line 367: it is not clear to me why we need to divide by sum(kappa), since kappa is the output of a softmax layer.
+ A question in my mind while reading this paper was: how big does a project need to be for the proposed system to be effective? It sounds like one must train on projects independently of each other. However, it is also clear that if a project is too small, we will not be able to learn anything. It would be nice to know how many lines (for example) one needs to have enough signal.
+ Is the GRU dimension really only 8? This seems very small compared to many translation papers.
+ The authors cite their previous paper "Suggesting Accurate Method and Class Names", which (based on a superficial reading of its abstract and introduction) seems to do some things similar to this paper. Is there any reason why there was no experimental comparison?
+ I wonder if the authors have looked at character-level prediction (which could also help with the out-of-vocabulary issues). A relevant paper is "Visualizing and Understanding Recurrent Networks" by Karpathy et al., who model the source code of the Linux kernel character by character.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper describes a convolutional neural attention model for the task of extreme summarization of source code. The task aims to predict a short and descriptive name (e.g., a method name) from a given piece of source code (e.g., a method body). This is modeled as a sequence prediction task. The authors test the proposed approach on 10 popular Java projects and compare it with other baseline approaches.

Clarity - Justification:
Please see detailed comments below.

Significance - Justification:
Please see detailed comments below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is tackling an interesting problem: generating highly concise method names from given code snippets. A solution to this problem could bring potential benefits to the software engineering community, assisting with code understanding and code search. While the task itself is very meaningful, the described approach appears not to be novel: components for calculating the convolution and attention weights have been described in other works. Section 2.3 (Copy Convolutional Attention Model) does seem to be a contribution of this work. It describes a linear interpolation of two attention mechanisms: one is used to predict the next token, while the other aims to copy a token as-is into the summary. My concern with this approach is that using a simple balancing factor \lambda to control the contribution from both sides appears primitive. In addition, the motivation for using the two attention mechanisms and the corresponding objective seems unclear; they could be better justified.
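To make this concern concrete, my reading of the interpolation is roughly the following (a small numpy sketch; the function and variable names are my own, not the paper's):

    import numpy as np

    def next_token_probs(gen_probs, copy_attn, lam, body_tokens, vocab):
        """gen_probs: softmax over the vocabulary; copy_attn: softmax over
        body positions; lam: scalar in [0, 1] balancing copy vs. generate."""
        probs = (1.0 - lam) * gen_probs            # mass from the generation side
        oov = {}                                   # copy mass for out-of-vocab tokens
        for pos, tok in enumerate(body_tokens):
            if tok in vocab:                       # in-vocab: fold copy mass in
                probs[vocab[tok]] += lam * copy_attn[pos]
            else:                                  # OoV: only reachable by copying
                oov[tok] = oov.get(tok, 0.0) + lam * copy_attn[pos]
        return probs, oov

    vocab = {"get": 0, "name": 1, "UNK": 2}
    probs, oov = next_token_probs(np.array([0.5, 0.3, 0.2]), np.array([0.7, 0.3]),
                                  0.5, ["customerId", "name"], vocab)
    # probs plus the oov mass sums to 1

If this is what is happening, a single scalar \lambda per step has to arbitrate globally between the two distributions, which is what strikes me as primitive.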
The methodology section is unclear in that some variables are referred to before they are formally introduced (e.g., L_2, E, \lambda). The description contains a mix of pseudocode and text descriptions, which makes it somewhat difficult to follow. Given that there is ample space left in the experiments section, I'd suggest removing Table 3 (or cutting it short) and using the space to clarify details of the proposed methodology.

=====