Paper ID: 76
Title: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper looks at using CTC for speech recognition to predict characters. The authors also introduce an HPC technique to speed up training.

Clarity - Justification:
My main issue with the paper is that a lot of it draws on existing work, but much of this work is cited superficially. The authors need to make clearer what the past work is and how their approach differs. Also, why do you call this end-to-end learning if you have a separate LM and AM? End-to-end is really acoustics to words in one model.

Significance - Justification:
This paper is very interesting but provides very little in terms of novelty. Most of the techniques explored (CTC, BatchNorm, SortaGrad, GRUs, frequency convolution) have all been tried before in the literature. Furthermore, the authors should do a better job of citing this related work and clearly explaining the novelty of their own contributions. The one novelty I see is the lookahead convolution. Right now the paper just reads as sticking together a bunch of previous techniques that have already worked. In addition, the results compared to human listeners seem biased and skewed - for example, human intelligibility on most clean speech is around 1% WER.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

I. Intro
a. The CTC reference that belongs here is either Alex Graves's or Hasim Sak's papers, not Miao (that was a much newer paper based on the previous two).
b. You should give the % relative improvement in the intro; don't just say "better".

II. Related work
a. In the speedups for training, there are two papers worth mentioning that have also looked at speedups:
* T. N. Sainath, I. Chung, B. Ramabhadran, M. Picheny, J. Gunnels, B. Kingsbury, G. Saon, V. Austel and U. Chaudhari, "Parallel Deep Neural Network Training for LVCSR using Blue Gene/Q," in Proc. Interspeech, September 2014.
* Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu, "1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs," in Proc. Interspeech, September 2014.
b. H. Sak's CTC paper from Interspeech 2015 also looks at data augmentation to get CTC to work and should be mentioned as well.

III. Model architecture
a. How is your architecture different from H. Sak's CTC papers, except that you now predict characters rather than phones?
b. Is this really an RNN, or an LSTM? That should be clarified.
c. It's not surprising your initial BatchNorm idea didn't work; in speech we estimate statistics over long windows such as an utterance. Why would estimating mean/variance statistics on a small mini-batch necessarily help? Your sequence-wise idea is much more like utterance-level mean/variance normalization in speech, so you should reference that as motivation (see the sketch at the end of this review).
d. I believe SortaGrad-type techniques (giving the network easier instances before harder ones) have been tried before, or at least in other learning algorithms. Please cite the necessary references.
e. Why do you report convolution experiments on a different dataset? This makes it really difficult to follow.
f. If your model predicts characters, you learn pronunciations from the training data only. How does your model work on unseen pronunciations, particularly for live data and not toy datasets?

IV.
a. Have you tried 1-bit SGD to parallelize across GPUs (see the F. Seide reference above)?
b. There has been a lot of GPU parallelization work in the literature; the authors should do a better job of clearly detailing how their approach is novel and different from what already exists.

V.
a. It's not surprising that increasing training data helps for CTC (this was shown in H. Sak's paper). I also think that since you are learning pronunciations this would help a lot, so some more discussion and analysis around scaling the data would be nice.
b. The results compared to human listeners seem skewed, especially because there are just two listeners. A comparison to other machine learning techniques in the literature (for example a non-CTC model, to see how much benefit you are getting) would have been nice.
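[Editor's note] To make item III.c above concrete, here is a minimal, illustrative sketch of classical utterance-level mean/variance normalization, the technique the reviewer suggests citing as motivation for the paper's sequence-wise normalization. This is not the authors' code; the function name, array shapes, and example values are assumptions made purely for illustration.

# Minimal sketch (not from the paper): utterance-level mean/variance
# normalization, where statistics are estimated over the whole utterance
# rather than over a mini-batch, which is the distinction the reviewer draws.
import numpy as np

def utterance_mvn(features, eps=1e-8):
    """Normalize one utterance's features to zero mean and unit variance.

    features: array of shape (num_frames, num_bins), e.g. log-spectrogram frames.
    """
    mean = features.mean(axis=0, keepdims=True)  # per-bin mean over time
    std = features.std(axis=0, keepdims=True)    # per-bin std over time
    return (features - mean) / (std + eps)

# Hypothetical usage: a 300-frame utterance with 161 spectrogram bins.
utt = np.random.randn(300, 161) * 3.0 + 5.0
norm = utterance_mvn(utt)
print(norm.mean(axis=0)[:3], norm.std(axis=0)[:3])  # roughly 0 and 1 per bin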
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper builds on the "Deep Speech" work at Baidu. Compared to the previous work, this paper has several novelties or advances. It demonstrates that the end-to-end paradigm can be effectively applied to Chinese as well as to English. It shows that a variant of batch normalization for recurrent neural networks first proposed in (Laurent et al., 2015) accelerates training of deep RNNs and, in these tests, also improves generalization. It proposes a form of curriculum learning for CTC-trained networks that improves training behavior in the early stages. It shows that using gated recurrent units provides improvements over vanilla RNNs. It shows that having several layers of frequency-domain convolutions substantially improves performance on noisy speech. It proposes a "lookahead convolution" layer that allows unidirectional RNNs to make predictions based on a limited amount of lookahead (see the sketch at the end of this review). It describes a number of optimizations to speed up training of the deep RNNs, including an optimized All-Reduce operation, custom memory allocation routines, and an optimized GPU implementation of the CTC criterion. Tests of the models and comparisons against crowdsourced human transcripts show that under certain circumstances the trained models can outperform non-expert annotators. The paper also describes the software infrastructure needed to deploy such models for large-scale speech applications.

Clarity - Justification:
The paper is admirably clear and well written. The only clarity issue I see is that the captions for Tables 1, 2, 3, and 5 should identify the test material. At a bare minimum, the captions should identify the language, and if the material comes from a standard collection such as CHiME, this should also be stated.

Significance - Justification:
This is an excellent applications paper which brings together many ideas useful for designing, training, and deploying large-scale end-to-end speech recognition systems. However, there is not a lot of scientific novelty to the paper.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This is more of an applications paper than a scientific paper, but it is clear, well written, and documents a large number of methods that are useful for training large-scale end-to-end speech recognition systems. It definitely belongs in ICML. In addition to making the captions clearer about what data is being used, there are a few more changes that should be made to the paper to improve it.

- The ICML paper template should be used.
- "larger datasets than what is typically used to train" -> "larger datasets than what are typically used to train"
- In Section 3.1, when you discuss the alternate sequence-wise batch normalization, you should make it clearer that this form of batch normalization was actually proposed in (Laurent et al., 2015).
- In Section 3.4, you need a pointer to Table 2 such as "In Table 2 we report results on two datasets...".
- "to three layers of 2D convolution improves WER by 23.9% on the noisy development set." -> "to three layers of 2D convolution improves WER by 23.9% relative on the noisy development set."
- "out of vocab characters" -> "out of vocabulary characters"
- There are problems in the references: the Barker et al. paper is no longer submitted to ASRU 2015, it has been accepted; the Chetlur et al. paper needs a year; several references have words in their titles that should be capitalized but are not (you need to protect those in your BibTeX file with curly braces); and several references have garbage characters that were probably caused by accented characters being pasted into a file. Please fix them.
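[Editor's note] Since both Review #1 and Review #2 single out the lookahead convolution as the paper's main novelty, here is a small illustrative sketch of one plausible reading of that layer: a per-feature linear combination of the current RNN output and a short window of future timesteps, so a unidirectional RNN still sees a limited amount of right context. The function name, shapes, and zero-padding at the sequence end are my assumptions, not taken from the paper.

# Illustrative sketch only: lookahead as a per-feature weighted sum over
# the current step and a few future steps of a unidirectional RNN's output.
import numpy as np

def lookahead_convolution(h, W):
    """h: (T, d) hidden activations; W: (context, d) learned weights,
    where context = 1 + number of future steps.
    Returns r with r[t, i] = sum_j W[j, i] * h[t + j, i] (zero-padded at the end)."""
    T, d = h.shape
    context = W.shape[0]
    h_padded = np.vstack([h, np.zeros((context - 1, d))])  # pad the future edge
    r = np.zeros_like(h)
    for j in range(context):
        r += W[j] * h_padded[j:j + T]  # per-feature weight applied to step t + j
    return r

# Hypothetical shapes: 50 timesteps, 8 features, 2 steps of lookahead.
h = np.random.randn(50, 8)
W = np.random.randn(3, 8)
print(lookahead_convolution(h, W).shape)  # (50, 8)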
"larger datasets than what is typically used to train" -> "larger datasets than what are typically used to train" In Section 3.1, when you discuss the alternate sequence-wise batch normalization, you should make it clearer that this form of batch normalization was actually proposed in (Laurent et al., 2015). In Section 3.4, you need a pointer to Table 2 such as "In Table 2 we report results on two datasets...". "to three layers of 2D convolution improves WER by 23.9% on the noisy development set." -> "to three layers of 2D convolution improves WER by 23.9% relative on the noisy development set." "out of vocab characters" -> "out of vocabulary characters" There are problems in the references: The Barker et al. paper is no longer submitted to ASRU 2015. It has been accepted. The Chetlur et al. paper needs a year. Several references have words in their titles that should be capitalized but are not. You need to protect those in your BiBTeX file with curly brace. Several references have garbage characters that are probably caused by accented characters being pasted into a file. Please fix them. ===== Review #3 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): This paper presents a large-scale implementation of recurrent network based speech recognition, using a deep system that maps from log spectral features to characters. The reported accuracies are very good, although direct comparisons with the state-of-the-art are not really possible since the training data is not known. The implementation details are presented quite thoroughly. The novelty of the paper is through scaling up and implementation rather than any methodological advance. Clarity - Justification: The paper is clearly written, and the implementation details are good. The experimental details are much weaker. No real information on the training set is given; language models are mentioned in passing, but no information whatsoever is given. Significance - Justification: This paper is significant in the way it scales up RNN-based speech recognition, and the reported accuracies are very good. One of the major claims to significance in this paper is the comparison with human performance. I do not find this convincing. Lippmann (R Lippmann (1997) "Speech recognition by machines and humans", Speech Communication, 22:1-15) reported in detail on comparison between human and machine performance for speech recognition. Human error rates on the various WSJ test sets varied between 1-2% word error rate, with results of less than 0.5% if a committee was used. The human error rates for WSJ in this paper, collected through a crowdsourcing approach are much higher (5-8%). If the paper is published the abstract, experiments, and conclusions should be modified to take this into account. I am not able to judge if the Mandarin human performance results are accurate or not. The second claim for significance which I believe is over-emphasised starts from the opening sentence "Decades worth of hand-engineered domain knowledge has gone into current state-of-the-art automatic speech recognition (ASR) pipelines". So far as I can tell the only domain knowledge not used in the current paper, compared with the state-of-the-art NN/HMM systems is the pronunciation lexicon and (for Mandarin) specifically dealing with tone. Both of these have been heavily explored in the speech recognition literature, most recently in the IARPA Babel programme. 
In my opinion this over-claiming of significance detracts from a good and interesting paper.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
In the related work section I think it would be fair to cite Lu et al., who reported a word-based end-to-end RNN system evaluated on Switchboard: L. Lu et al. (2015), "A Study of the Recurrent Neural Network Encoder-Decoder for Large Vocabulary Speech Recognition", Interspeech 2015.

Data augmentation has been well used in robust speech recognition; multi-condition training is very well studied, for example as part of the various Aurora evaluations: Hirsch et al. (2000), "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions", Proc. ISCA workshop ASR2000. The current work is at much greater scale, but the idea of augmenting data through additive noise at different SNRs is very well studied and should be referenced (an illustrative sketch of this kind of SNR-based mixing follows after the reviews).

As mentioned above, the experimental section could have many more details:
- I assume a language model was used, but it is only mentioned in passing - please give some details of the models used and the training data.
- Since much larger training sets are being used than is normal for these test sets, and it is not clear where the data comes from, it would be appropriate to outline what was done to ensure that there is no inadvertent leakage between train and test sets. (This is important since I suppose the training data may include material from after the test set (1987).)

If you are using a language model, is this really an end-to-end system? How is it "more end-to-end" than a grapheme-based HMM/NN system?

=====
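[Editor's note] As referenced in Review #3's comment on data augmentation, here is a minimal sketch of the standard kind of additive-noise augmentation at a target signal-to-noise ratio. It is my illustration, not the paper's pipeline; the function name, sample rate, and SNR values are assumptions.

# Minimal sketch (illustration only): mix a noise signal into a clean
# waveform so that the clean/noise power ratio equals a target SNR in dB.
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """clean, noise: 1-D float arrays of equal length; returns the noisy mixture."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)  # rescale noise to hit the SNR
    return clean + scale * noise

# Hypothetical usage: augment one clean utterance at several SNRs.
clean = np.random.randn(16000)  # stand-in for 1 s of 16 kHz audio
noise = np.random.randn(16000)
augmented = [add_noise_at_snr(clean, noise, snr) for snr in (20, 10, 5, 0)]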