We thank the reviewers for their time and feedback. The writing and citation suggestions are appreciated and will be incorporated into a revised version.

We have fixed a minor typo in the objective function, pointed out by Reviewer 5, removing a needless normalization: \lambda \sum_i \kappa_i \mathbb{I}_{c_i=m_t} + (1-\lambda)\mu r_{m_t}

===Assigned_Reviewer_1===

* Novelty: The reviewer says that the convolutional attention mechanism is not novel. We strongly disagree. To our knowledge, no one has previously used a convolutional network to compute attention weights. It is entirely normal for a publication to describe a novel combination of existing ideas when it can be shown that the combination has value. We show that the convolutional attention mechanism leads to a significant improvement in practice over RNN-based attention models.

* Lambda: It may seem that \lambda is a simple linear interpolation, but it is much more flexible and powerful than that. The \lambda value is itself the output of a convolutional network (see line 4 of the copy_attention pseudocode), so the weighting factor is computed from complex learned features of the input.

* The two mechanisms (\alpha and \kappa) are motivated by the need to capture different kinds of features to predict copied vs. non-copied tokens. Indeed, they learn completely different features, as can be seen in Figures 2 and 3.

===Assigned_Reviewer_4===

* We plan to release the code once the paper is de-anonymized.

===Assigned_Reviewer_5===

* BiRNN vs. convolutional attention: This is an interesting question. Although we do not have a definite answer, we believe the performance difference between the BiRNN and our model can be attributed to several potential causes, some of which are mentioned in Section 3.1. These include the much longer input sequences and relatively smaller datasets in our problem, both of which would hurt an RNN/LSTM.
Another point: although the weight sharing in an RNN does provide some translation invariance, the entire point of having a time-series model is to avoid translation invariance: we want the hidden state to represent the entire sequence up to that point. After all, if this did not happen in practice, sequence-to-sequence encoders would not work.

* Use cases: The reviewer asks where an extreme summarization system would be used in "day-to-day programming". There are many applications within software engineering, especially because our trained network can summarize an arbitrary snippet of code (it need not be a method body). These include: (i) tools for code review, i.e., suggesting to a programmer methods that could be named more clearly (our studies in software engineering show that professional developers care very much about this); (ii) code search: our work produces a code representation that could be used within a search engine; (iii) better deobfuscation tools: as the reviewer suggests, there is already some work in this area, namely "Predicting Program Properties from Big Code" by Raychev et al., POPL 2015.

* Padding: Zero padding is added only in the first layer, in such a way that the final layers \alpha and \kappa have exactly as many components as the input code tokens; each subsequent layer gradually trims the input. That said, we could have used "smaller" zero padding before each convolutional layer instead of a "larger" padding only before the first layer.

* We do not compare to "Suggesting Accurate Method and Class Names" since that work used information from the code (e.g., the method signature) that is not available for this task. Adding those features would improve the performance of the system, but solves a less general problem.

* We will add a figure showing how system performance varies with the size of the training data.

* The dimension of the GRU is k_2=16.
This is indeed small, but necessary for our medium-sized dataset; larger sizes did not work as well. K_\lambda has the same size as K_att and K_copy. We will clarify this in the text. Note that k_2 was tuned during our hyperparameter optimization.

* Using a character-level model might be impractical in a software engineering context, because incorporating knowledge about tokens into the model can be seen as a way of adding prior information about the task. This might also be why NL MT models do not widely use character-level models. However, we believe it would be interesting to combine a character-level model with our current approach in the future, to assist with novel identifier names.
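The padding scheme described in our answer to Reviewer 5 can be sketched with a few lines of arithmetic: each "valid" 1-D convolution with kernel width w trims w-1 positions, so padding the input once by the sum of those trims leaves the final \alpha and \kappa layers with exactly one component per input token. The kernel widths and sequence length below are hypothetical placeholders, not the values used in the paper.

```python
# Sketch of the padding arithmetic: zero padding is applied only before
# the first convolutional layer, sized so that a stack of "valid"
# convolutions ends with one output position per input token.

def valid_conv_len(n, w):
    """Output length of a 'valid' 1-D convolution with kernel width w."""
    return n - w + 1

def first_layer_padding(kernel_widths):
    """Total zeros to add to the input so the final layer has exactly
    as many outputs as input tokens: each layer trims (w - 1)."""
    return sum(w - 1 for w in kernel_widths)

kernel_widths = [8, 8, 10]   # hypothetical kernel widths for three layers
n_tokens = 50                # hypothetical input token sequence length

length = n_tokens + first_layer_padding(kernel_widths)  # 50 + 23 = 73
for w in kernel_widths:
    length = valid_conv_len(length, w)  # 73 -> 66 -> 59 -> 50

assert length == n_tokens  # final alpha/kappa length matches the input
```

The alternative mentioned above (padding each layer separately) would instead pad every layer by its own w-1 before convolving, keeping the length fixed at n_tokens throughout.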