We thank all reviewers for their constructive feedback. Our answers and explanations are below.

Answers to Reviewer #1

Q1: The authors should cite and include comparison numbers in Table 1 (even though they are prior offline work, a point of comparison is important): [1] Bahdanau et al., 2015 (End-to-end attention-based large vocabulary speech recognition); [2] Graves et al., 2014 (Towards End-to-End Speech Recognition with Recurrent Neural Networks).

A1_1: The work of Bahdanau et al. [1] is not about CTC training, so its results are not closely related to our paper. They use an encoder-decoder model, which shows better WER than CTC models only when no external language model (LM) is used. When an external LM is incorporated, which is the usual setting for speech recognition, our CTC model achieves a better WER (8.88%) than theirs (9.3%).

A1_2: We already include the result of Graves et al. [2] in the text. Moreover, our model achieves a WER (8.88%) comparable to theirs (8.7%) once an external LM is incorporated, even though our model is online at inference time while theirs is offline.

Q2: Both of these papers had significantly lower WER without LMs as well, 18.6 (Bahdanau) and 30.1 (Graves), compared to 38 in the authors' work. However, once again, these models are not online.

A2: This WER comparison is not fair. Bahdanau et al. use an encoder-decoder model, not CTC, and it is well known that encoder-decoder models outperform CTC models when no external LM is used; once an LM is incorporated, our model shows a better WER, as stated in A1_1. Likewise, with an external LM, which is the usual setting for speech recognition, our CTC model achieves a WER (8.88%) comparable to Graves' model (8.7%). The comparison with Graves' model is also unfair because it uses a 5-layer bidirectional LSTM and a regularization technique during training. If we apply regularization (dropout) for a fairer comparison, we achieve 32.5% WER, even though our model is online and uses a smaller 3-layer unidirectional LSTM RNN.

Q3: Similarly, on TIMIT, compared to online attention models (which the authors did not cite or compare against), the PER is 20.1 (authors) vs. 18.2 (Jaitly et al., 2015).

A3: As explained above, comparing against an encoder-decoder model (Jaitly et al.) without an external LM is unfair and outside the scope of our paper. It is well known that encoder-decoder models learn an implicit LM better than CTC models do. What matters is that CTC models show better WER in the practical setting, i.e., with an external LM.

Q4: The ability to train online has questionable value, as speech utterances are not so long that they cannot fit onto a modern GPU. The authors argued wasted memory or compute; a counter-argument is that one can use bucketing (see Chan et al., 2015; Listen, Attend and Spell).

A4: This is not true. The main bottleneck of CTC training is GPU memory, even on the most high-end GPUs. Bucketing only reduces wasted (padded) frames; it does not increase parallelism. Our method speeds up CTC training by more than 5x by increasing the number of parallel streams; see the illustrative memory sketch at the end of our answers to Reviewer #1.

Q5: The results of the online CTC models in terms of WER and PER are quite poor, and still an order of magnitude behind HMM hybrid models (WSJ is at 3.5 WER for DNN-HMM, and TIMIT is at around 16.5 PER).

A5: Comparison with HMM hybrid models is also unfair. Moreover, the quoted TIMIT result is measured with LMs. End-to-end models have many advantages (e.g., no lexicon) and show better results when the training data exceeds roughly 1,000 hours.
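To make the memory argument in A4 concrete, the following back-of-the-envelope sketch is a rough illustration only; the layer sizes, frame counts, and label counts are hypothetical and are not taken from our paper. It shows why holding whole utterances limits the number of parallel CTC streams, while truncated unrolling lets many more streams share the same GPU memory:

    def lstm_activation_bytes(num_frames, num_layers=3, hidden=768, bytes_per_val=4):
        # Activations kept for backpropagation through time: hidden and cell
        # states per layer per frame (gate pre-activations ignored for brevity).
        return num_frames * num_layers * 2 * hidden * bytes_per_val

    def ctc_alpha_beta_bytes(num_frames, num_labels, bytes_per_val=4):
        # CTC forward and backward tables are each of size T x (2L + 1).
        return 2 * num_frames * (2 * num_labels + 1) * bytes_per_val

    # Offline CTC: each stream must hold a full utterance (say ~1500 frames),
    # so only a few streams fit on the GPU regardless of how well bucketing
    # packs utterances of similar length together.
    full = lstm_activation_bytes(1500) + ctc_alpha_beta_bytes(1500, 100)

    # Online training with truncated unrolling (say 256 frames per chunk):
    # each stream needs far less memory, so many more streams run in parallel.
    chunk = lstm_activation_bytes(256) + ctc_alpha_beta_bytes(256, 100)

    print("per-stream memory, full utterance : %.1f MiB" % (full / 2.0**20))
    print("per-stream memory, 256-frame chunk: %.1f MiB" % (chunk / 2.0**20))
    print("roughly %.1fx more streams in the same memory" % (full / float(chunk)))

The exact ratio depends on the network and the utterance statistics; the point is only that per-stream memory, not the number of padded frames, is what limits parallelism.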
Answers to Reviewer #4

Q6: It seems to produce worse results for both WSJ and TIMIT, and the paper does not provide speed-improvement metrics over engineering solutions that would justify the use of this method.

A6: The results are discussed above (A2, A5). Also, the first row of Table 1 is the baseline (for both accuracy and speed), and the existing bucketing technique is addressed in A4.

Q7: It is not surprising in Table 1 that a larger number of streams is processed faster.

A7: Our contribution is precisely to increase that number of streams, which is challenging for CTC training because of GPU memory.

Q8: Consider m=0, tau=T-2, and T=10: maximizing a sum which includes the term p(z_0 | x_{1..T-2}) is not very useful, since this partial alignment will not give rise to ANY full alignment of the output sequence; the remaining output tokens (10-1 = 9) cannot be produced in 2 frames, so maximizing over those cannot be useful.

A8: This is correct. However, the situation occurs very rarely and its effect is negligible; a small feasibility check illustrating the reviewer's example is given at the end of this response.

Q9: The authors show no results for CTC-EM from scratch; does it train at all?

A9: Training with CTC-EM from scratch is possible when the unroll amount is 256 or more; only the initial training is slightly slower.

Answers to Reviewer #5

Thank you for your constructive feedback. We will add the missing references and also emphasize "efficient GPU training" more.
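For Q8/A8, the condition under which a partial term is a dead end reduces to counting remaining frames against remaining labels. The short check below is in our own illustrative notation (not code from the paper) and ignores the extra blanks CTC requires between repeated labels; it reproduces the reviewer's example:

    def completable(T, L, tau, m):
        # A partial alignment that emits labels 0..m within frames 1..tau can
        # be extended to a full alignment of all L labels only if enough
        # frames remain for the leftover labels: T - tau >= L - (m + 1).
        return (T - tau) >= (L - (m + 1))

    # The reviewer's example: T = 10 frames, L = 10 labels, m = 0, tau = T - 2.
    # Nine labels remain but only two frames do, so this term is a dead end.
    print(completable(T=10, L=10, tau=8, m=0))   # -> False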