Dear respected reviewers,

Thank you very much for the constructive and inspiring reviews. Thanks to reviewer #4 for identifying the key technical contributions that make this system work well -- we will do a better job in the final version of separating these out. The main theme in the reviews is that it is unclear how the model is end-to-end (the removal of phonemes is not properly motivated), and there are questions about the comparisons to human transcribers. (Thanks to reviewers #5 and #3 for thoughtful comments on these issues.)

"End to end" refers to the training pipeline more than the transcription task itself. Much prior work handles graphemes within the HMM framework and thus needs to solve many of the same engineering challenges as standard phoneme-based systems (e.g., modeling context-dependent states as "polygraphs"). The "end to end" vision is not merely about avoiding phoneme representations but about removing the bootstrapping/alignment/clustering/HMM machinery often used to build HMM-based ASR models, and thereby simplifying the training process significantly. (An illustrative sketch of such a grapheme-level CTC setup is included at the end of this response.)

By predicting graphemes directly (without an explicit pronunciation dictionary), the model has some trouble transcribing unseen pronunciations. The effect of this is shown in table 4, where we do noticeably worse on the accented English datasets for which training data is unavailable. On the other hand, in the presence of training data matching the distribution of the test traffic, we do remarkably well without a pronunciation dictionary. The Mandarin results in section 6.2 highlight this: our model performs even better than a committee of native speakers in transcribing the utterances.

Every clip in table 4 is transcribed by 2 people, which we think of as a reasonable competing "ASR wizard-of-Oz" baseline that we should strive to outperform (though we realize this will not be as accurate as dedicated, trained transcriptionists). Many of the errors we noticed in the WSJ-Eval92 set were misspellings of proper nouns, such as "noboru takeshita" (ground truth) vs. "nuburu takashida" (human 1) or "noburu takasheida" (human 2), and a few transcription ambiguities such as "we've" vs. "we have". Our engine tends to make similar errors, so the same bias may be present in both sets of numbers. We will do some more analysis and clarify these results in the final version.

Some notes on details:

“b. Is this really an RNN, or an LSTM, that should be clarified.” Except when specified otherwise, all recurrent layers are vanilla RNNs with ReLU activations. There are no experiments in the paper using LSTM cells. We will update references to "recurrent layers" to be explicit about this.

The convolution experiments are done on the same training and dev set (which we refer to as the Regular dev set in table 2). The first model in table 2, which achieves 9.52 WER on the dev set, is the same as the fourth row in table 1. Thus table 2 explores a slice of table 1 along the "convolutions" dimension. We will make this more explicit in the description of table 2.

Thank you very much for pointing us to the additional references; we found them all highly relevant, and we will add them to the final version and discuss them thoroughly. We will make sure to update the final version according to the valuable and detailed suggestions such as “a. the CTC reference that belongs is either Alex Graves or Hasim Sak's papers, not Miao (that was a much newer paper based off the previous two) b.
You should give the \% rel improvement in the intro, don't just say ‘better’” and “In my opinion this over-claiming of significance detracts from a good an interesting paper.” Thank you for pointing this out; we will make sure we fix it in the final version. We will also make more implementation details available through a public arXiv paper.
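For concreteness, below is a minimal, purely illustrative sketch of the grapheme-level setup discussed above. It is written in PyTorch with hypothetical layer sizes and alphabet size and is not our actual training code; it is only meant to show the shape of the pipeline: vanilla ReLU recurrent layers trained with a CTC loss directly over graphemes, with no pronunciation dictionary, forced alignment, state clustering, or HMM bootstrapping anywhere in the loop.

```python
# Illustrative sketch only: a grapheme-level CTC model with bidirectional
# vanilla-RNN (ReLU) layers. All sizes and names here are hypothetical and
# are NOT the configuration from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphemeCTCModel(nn.Module):
    def __init__(self, n_features=161, n_hidden=256, n_layers=3, n_graphemes=29):
        super().__init__()
        # Vanilla (non-LSTM) recurrent layers with ReLU activations.
        self.rnn = nn.RNN(n_features, n_hidden, num_layers=n_layers,
                          nonlinearity='relu', bidirectional=True)
        # Output layer over graphemes plus the CTC blank symbol (index 0).
        self.fc = nn.Linear(2 * n_hidden, n_graphemes + 1)

    def forward(self, spectrograms):           # (time, batch, n_features)
        hidden, _ = self.rnn(spectrograms)
        logits = self.fc(hidden)                # (time, batch, n_graphemes + 1)
        return F.log_softmax(logits, dim=-1)

# Toy training step on random inputs: CTC marginalizes over alignments of the
# grapheme targets to the acoustic frames, so no bootstrapping or forced
# alignment stage is needed before training starts.
model = GraphemeCTCModel()
ctc_loss = nn.CTCLoss(blank=0)

specs = torch.randn(200, 4, 161)                         # 200 frames, batch of 4
targets = torch.randint(1, 30, (4, 50))                  # grapheme indices 1..29
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 50, dtype=torch.long)

log_probs = model(specs)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```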