Paper ID: 998
Title: Sequence to Sequence Training of CTC-RNNs with Partial Windowing

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose an algorithm for online training of CTC. The main idea of the paper is to use truncated backpropagation through RNNs to compute gradients. However, truncated backpropagation is not possible for the CTC loss function itself, because the gradient of the loss cannot be computed until the full output sequence is seen. So the authors propose to maximize equation (7) instead. This equation maximizes the expected log probability over all left alignments of the target sequence to the partial input. In this process it ignores the contributions of the betas from the future, and leads to a loss function that can be optimized. The authors apply this method and show that, with some pretraining, this loss function can be used for online training. However, the resulting models perform worse than the full models.

Clarity - Justification:
The figures were not very accessible. Figure 4, for example, gives no indication of where the h, h', tau_n, and tau'_n variables are for a given sequence, and no caption is provided that would help in reading the figure. Figures 2 and 4 are more or less the same, with slight differences.

Significance - Justification:
It is not clear to me why online training is useful in the context of CTC. The authors propose that one of the reasons is parallel training, which makes sense to me. However, it is possible to do parallel training in many other engineering ways, such as bucketing sequences by length, which get the job done without sacrificing any properties of the models. The authors should have presented evidence that the current method does much better (speed-wise) than the simple bucketing strategies that could be applied (see the bucketing sketch below).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
As mentioned above, I am not sold on the need for online training. It seems to produce worse results on both WSJ and TIMIT, and the paper does not provide speed-improvement metrics over engineering solutions that would justify the use of this method. It is not surprising in Table 1 that larger numbers of streams are processed faster: GPUs are used more efficiently in the matrix multiplies of the recurrent computations, since the batch size is larger. However, it is nice that the loss of accuracy is not that great compared to row 1 (it is a lot worse compared to the non-online baselines, right?).

The authors do not really provide much motivation for why equation (7) is a good thing to optimize. Regarding line 440: I think the assertion that alpha(tau, m) is a posterior probability p(z_{1:m} | x_{1:tau}) is not very useful, because it is neither a posterior over the partial output sequence given the full data, nor a posterior over the full output sequence given partial data. Consider m = 0, tau = T-2, and T = 10: maximizing a sum which includes the term p(z_0 | x_{1:T-2}) is not very useful, since this partial alignment cannot give rise to ANY full alignment of the output sequence; the remaining 10 - 1 = 9 output tokens cannot be produced in the 2 remaining frames. So maximizing over those terms cannot be useful (I sketch this counting argument below).

The authors show no results for CTC-EM trained from scratch; does it train at all? Moreover, CTC-TR-2048 seems to be the best network, and the CTC-TR + CTC-EM networks seem worse (Figure 7).
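To make the counting argument concrete, here is a minimal sketch in my own notation, which may not match the paper's exactly. Reading alpha(tau, m) the way the paper suggests,

\[ \alpha(\tau, m) = p(z_{1:m} \mid x_{1:\tau}), \]

a prefix alignment that accounts for the first m labels within the first tau frames can be completed to a full CTC alignment only if the remaining frames can accommodate the remaining labels:

\[ T - \tau \;\ge\; (M - m) + r(z_{m+1:M}), \]

where T is the number of input frames, M is the length of the target sequence z, and r(z_{m+1:M}) counts the blanks that CTC forces between adjacent repeated labels. Terms of equation (7) that violate this inequality correspond to prefixes with zero probability of completion, so including them in the objective cannot help.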
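Returning to the bucketing alternative raised under significance: what I have in mind is the standard trick of grouping utterances of similar length into the same minibatch so that padding (and wasted GPU compute) stays small. A minimal sketch, with made-up bucket boundaries and names, not from the paper:

from collections import defaultdict

def make_buckets(utterances, boundaries=(200, 400, 800, 1600)):
    """utterances: list of (features, labels); features is a sequence of frames."""
    buckets = defaultdict(list)
    for utt in utterances:
        n_frames = len(utt[0])
        # place the utterance in the smallest bucket whose boundary fits it
        key = next((b for b in boundaries if n_frames <= b), boundaries[-1])
        buckets[key].append(utt)
    return buckets

def minibatches(buckets, batch_size=32):
    # each yielded batch only needs padding up to its bucket boundary
    for key, utts in sorted(buckets.items()):
        for i in range(0, len(utts), batch_size):
            yield key, utts[i:i + batch_size]

This keeps full sequences intact (so ordinary offline CTC applies) while still filling the GPU; the paper should compare against something like it.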
Given that CTC-TR is nothing more than ignoring frames that have gone out of context (please correct me if I am wrong in this assessment), and that it achieves the parallelism over examples, it appears that the novelty of CTC-EM is not very useful.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors presented a variation of CTC (Graves et al., 2006) that can be trained online and, perhaps more importantly, whose inference can be run online. Two main techniques were presented: Truncation and Expectation-Maximization (EM). The experiments were run on WSJ and TIMIT. On WSJ, the authors should cite and include numbers for comparison in Table 1 (even though they are prior offline work, a point of comparison is important):

[1] Bahdanau et al., 2015 (End-to-end attention-based large vocabulary speech recognition)
[2] Graves et al., 2014 (Towards End-to-End Speech Recognition with Recurrent Neural Networks)

Both of these papers had significantly lower WER without LMs as well: 18.6 (Bahdanau) and 30.1 (Graves), compared to 38 in the authors' work. However, once again, these models are not online.

Clarity - Justification:
The paper is clear enough.

Significance - Justification:
The authors presented an online variation of the CTC algorithm introduced by Graves et al. (2006). The performance of the model is quite poor compared to the original WSJ CTC paper (Graves et al., 2014), at 38 WER vs. 30.1, or compared to offline attention models at 18.6 WER (Bahdanau et al., 2015). Similarly, on TIMIT, compared to online attention models (which the authors did not cite or compare against), the PER is 20.1 (authors) vs. 18.2 (Jaitly et al., 2015). The ability to train online has questionable value, as speech utterances are not so long that they cannot fit onto a modern GPU. The authors argued wasted memory and compute; as a counter-argument, one can use bucketing (see Chan et al., 2015, Listen, Attend and Spell). However, there is significant value in online inference (i.e., real-life production ASR models do need to be online, per the latency requirements of many applications). Still, the results of the online-CTC models in terms of WER and PER are quite poor, and remain an order of magnitude behind hybrid HMM models (WSJ is at 3.5 WER for DNN-HMM, and TIMIT is at around 16.5 PER).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The empirical evaluation (WSJ/TIMIT) is not very convincing: we are sacrificing a great deal of performance to switch from an offline model to an online model. While the online aspect is enticing if we really did care about online training, this reviewer feels the contribution is a bit lacking. We take the existing CTC algorithm and make it online; however, the model suffers an 8% absolute drop in WER on the WSJ task, and compared to online attention models (Jaitly et al., 2015) on TIMIT, the model is 2% absolute PER behind.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Truncated backpropagation through time, BPTT(h), of Williams & Peng (1990), with unroll size h, is combined with CTC (ICML 2006) using an EM algorithm. The authors claim that the modification allows CTC training in online settings. On the WSJ corpus, under constrained memory capacity, this achieves a significant speedup on a GPU without much loss of performance.

Clarity - Justification:
The paper is clearly written and indicates the novel contributions.
However, the experiments section is a bit hard to follow, and it might not be so easy to reproduce the results.

Significance - Justification:
A very interesting combination of BPTT(h) and CTC, with very interesting results. However, although the sequence can be longer than the unroll size (potentially infinite input length), the alignment is already reasonably initialised during pre-training with CTC-TR, which uses the same amount of unrolling. In this setting, the length of the utterances is not entirely unknown. Figure 7 shows that CTC-EM helps to train with smaller unroll lengths; reaching similar performance with CTC-TR alone needs more frames.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is interesting. However, it should probably not emphasise the online learning, but rather the possibility of using shorter unroll lengths without decreasing performance. This will be much more efficient on GPUs and will reduce memory requirements. In the experimental setting chosen by the authors, the length of the utterances is not entirely unknown. What happens with different unroll sizes for pre-training (with CTC-TR) and fine-tuning (with CTC-EM)? (A sketch of the BPTT(h) mechanics appears at the end of this review.)

A few comments on missing relevant work. The authors are actually not using the original LSTM (1997) but LSTM with forget gates:

Gers, F. A., Schraudolph, N., and Schmidhuber, J. (2002). Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3:115–143.

And the first substantial application of LSTM and CTC to speech recognition was actually published much earlier:

Fernandez, S., Graves, A., and Schmidhuber, J. (2007). An application of recurrent neural networks to discriminative keyword spotting. In Proc. ICANN (2), pages 220–229.
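For reference, a minimal sketch of the BPTT(h) mechanics as I understand them. This is my own illustration in PyTorch, with made-up dimensions and a per-frame surrogate loss, not the authors' code; note that the plain CTC loss does not decompose over chunks this way, which is precisely the gap that CTC-TR and CTC-EM target.

import torch
import torch.nn as nn

torch.manual_seed(0)
T, B, D, H, C = 1000, 8, 40, 128, 30   # frames, batch, input dim, hidden, classes
h_unroll = 64                          # BPTT unroll size h

rnn = nn.LSTM(D, H)
clf = nn.Linear(H, C)
opt = torch.optim.SGD(list(rnn.parameters()) + list(clf.parameters()), lr=1e-3)

x = torch.randn(T, B, D)               # stand-in for a (very long) utterance batch
y = torch.randint(0, C, (T, B))        # stand-in per-frame targets

state = None
for t0 in range(0, T, h_unroll):
    chunk, target = x[t0:t0 + h_unroll], y[t0:t0 + h_unroll]
    out, state = rnn(chunk, state)
    # Detach the carried-over (h, c) state: gradients never flow past the
    # chunk boundary, so memory stays O(h) instead of O(T).
    state = tuple(s.detach() for s in state)
    loss = nn.functional.cross_entropy(clf(out).reshape(-1, C), target.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

=====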