Paper ID: 945
Title: Persistent RNNs: Stashing Recurrent Weights On-Chip

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper is about the development of computationally efficient computing kernels for GPU implementation of RNNs, in a speech recognition context. Large speed-ups are reported over a baseline implementation using GEMM kernels.

Clarity - Justification: The paper gives a good exposition of implementing an RNN on a modern GPU (Titan X) and gives a clear analysis of how the various design decisions were arrived at.

Significance - Justification: I'm quite uncertain about the significance because of the experimental results, which are confusing to me (see below). The computational architecture and the process adopted are interesting and clearly reported, so this work would be of interest to the many people involved in implementing NNs on GPUs. On the other hand, there isn't a detailed comparison with the state of the art: comparing with GEMM kernels is a reasonable starting baseline, but there are a lot of groups working on efficient GPU implementations and it would be good to compare with some of that work.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The experimental results are rather confusing. The test data is "Word Error Rate (WER) on an English speaker held out development set which is an internal dataset containing 2048 utterances of primarily read speech.", and the training set is several hundred hours. It is hard to know what state-of-the-art accuracy would be expected given such limited information, but the lowest WER reported is 27%, which seems extremely high. This makes it difficult to understand the experiments and results properly. I have no idea if the efficient computational techniques have an impact on the accuracy. As discussed above, to clearly situate this work with respect to the state of the art there need to be comparisons with other published approaches (including using various NN toolkits as baselines).

In related work it should be 'pose', not 'poise'. Also, when talking about NNs for various applications it is probably better to say NN (rather than DNN), since for NLP many of the NNs used are not deep.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper presents a novel approach to computing RNNs on GPUs which takes advantage of weight reuse across time.

Clarity - Justification: Very readable paper, although the subject matter might be quite foreign to most ML experts.

Significance - Justification: This paper presents reasonably convincing speed improvement results by reusing weights in shared memory and computing a long sequence of RNN steps in one kernel at a small batch size. This type of computation is very tricky to implement well, and critical to many applications of RNNs which are compute-bound. Generalizable improvements in that space are very important to anyone working on sequence models.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
* I would like to see a comparison between this approach and the cuDNN v5 EA library, which also has its own RNN implementation.
* Around line 375: I would like to see whether the authors attempted to auto-tune with instruction packing and scheduling.
* Around line 411: is this the optimal tile size independent of all filter and activation shapes, or is it for a particular group of shapes?
* It is interesting that the authors used an analytical performance model instead of real autotuning, or a hybrid approach. It would be helpful to elaborate a bit more on that choice. Can the accuracy of the analytical model be verified, given the depth of the GPU pipelines?
* Around line 488: a Titan X GPU has less than 6 MB of shared memory; I would like to see a solid example of how to factor a layer a few times that size into this framework and achieve similarly high performance. A naive interpretation of the algorithm suggests that this is not trivial.
* The global synchronization is an interesting system approach for its purpose. I would like to see more details on how it enforces correctness, given the weakly-ordered memory model CUDA has. Does it use fence operations? How does that impact performance? (A sketch of the kind of construction meant is given below.)
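To make the question concrete: one common way to build such a barrier (an illustrative sketch only, not the authors' implementation; `global_barrier`, `g_arrived`, and `num_blocks` are made-up names) is an atomic arrival counter combined with __threadfence():

```cuda
// Sketch of a single-use inter-block barrier (not the paper's code).
// Assumes the grid is small enough that all blocks are co-resident on the GPU.
__device__ volatile int g_arrived = 0;   // hypothetical arrival counter in global memory

__device__ void global_barrier(int num_blocks)
{
    __threadfence();   // make this thread's prior global-memory writes visible device-wide
    __syncthreads();   // wait until every thread in the block has fenced

    if (threadIdx.x == 0) {
        atomicAdd((int *)&g_arrived, 1);        // signal that this block has arrived
        while (g_arrived < num_blocks) { }      // spin until all blocks have arrived
    }
    __syncthreads();   // release the remaining threads of the block
}
```

A reusable barrier additionally has to reset the counter (e.g., via sense reversal), and the consuming side may need its own fence before reading other blocks' data; which of these mechanisms the paper relies on, and what they cost per timestep, is exactly what should be spelled out.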
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The authors present a GPU algorithm to compute RNN sequences which can be accelerated by up to 30x at a minibatch size of 4. The key idea is to cache the recurrent weights in on-chip memory (the CUDA shared cache), thus avoiding trips to device memory (GPU RAM) and obtaining higher bandwidth / lower latency. One argument is that lower minibatch sizes are needed to explore larger models (assuming you are using a GPU, and not a CPU or some other custom hardware for training). The authors also experimented with deep RNNs with residual skip connections (He et al., 2015). The results show that for equally deep RNNs (48 layers), the skip connections help. However (assuming this is the same train/dev/test split), in the Deep Speech 2 paper they reported much lower WER numbers with a much shallower architecture (suggesting such deep RNNs are not needed).
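For clarity (this is the reviewer's reading; the exact formulation is not restated here and the notation below is not the paper's): the residual variant presumably adds each layer's input back to its output, along the lines of

```latex
% Assumed form of a residual (skip-connection) recurrent layer l at time t:
h_t^{(l)} = f\!\left( W^{(l)} h_t^{(l-1)} + U^{(l)} h_{t-1}^{(l)} + b^{(l)} \right) + h_t^{(l-1)}
```

where W^{(l)}, U^{(l)}, and b^{(l)} are the input, recurrent, and bias parameters of layer l and f is the nonlinearity.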
Clarity - Justification: The paper is clear enough.

Significance - Justification: The authors claim a significant speedup with their algorithm; however, it requires using a smaller minibatch (e.g., 4). However, in the authors' Figure 5, it appears larger minibatches are required to achieve lower error (64 performs better than 32; the authors did not place minibatch size 4 in this figure). The figure also suggests that minibatch size 256 can achieve the same lowest error. First, this suggests that if we do use smaller minibatches, we may have to sacrifice task performance. Secondly, the speedup is drastically reduced if we do use larger minibatch sizes. Quoting the authors in Section 5.5.1, the benefit of this algorithm is when models are "deeper and thinner"; it remains to be seen whether deeper and thinner models are needed (see next paragraph).

One of the authors' arguments is that this algorithm allows us to scale up experiments and explore much deeper models (i.e., the bigger the model, the more memory it consumes, so using smaller minibatches allows us to fit bigger/deeper models). Assuming the same train/dev/test split is used as in the Deep Speech 2 paper, the authors' results on the very deep model (48 layers) achieved a WER of 27.44, compared to a WER of 15 using a model with only 7 recurrent layers (see Table 6 of the Deep Speech 2 paper). It remains unclear whether such deep RNNs are useful at all (see Sak et al., 2014/2015 for LSTM acoustic models; these tend to use 2-3 recurrent layers).

Minor: The authors also evaluated on RNNs instead of LSTMs/GRUs. This reviewer would have liked the authors to use GRUs or LSTMs, as they seem to perform quite a bit better (see Sak et al., 2014 for acoustic modelling in speech recognition).

Minor: The experimental results are hard to reproduce for anybody without access to the DeepSpeech datasets. Would have loved to see the residual experiments run on a more readily available speech dataset (Fisher / SWBD / WSJ / TIMIT, etc.). On the other hand, such datasets probably don't need speedups, since models on them can be trained quite easily in a few days.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Figure 5 should be completed (or at least the 256 line). Figure 6 should be completed (scale it up to 2048).

1] This reviewer is unconvinced smaller minibatches (e.g., 4) are useful:
* It appears we can get equal if not slightly better task performance using larger minibatches.
* It is not entirely true that we can't run models that are very deep/large without using small minibatches (i.e., you can run them on CPUs, and CPUs are actually really fast for GEMM-type ops with the AVX2 instruction set).
* The authors keep claiming how much faster the algorithm is at minibatch size 4; however, they themselves used minibatch sizes in the range [64, 512] for their residual network experiments according to Section 5.3.

2] On whether we want to explore very deep residual RNNs:
* Based on the experimental results presented in this paper (and assuming they can be compared to the DS2 paper), it remains very unclear whether we want to use such deep RNNs, since the shallow RNNs perform better in terms of WER.

However, this algorithm is very simple and can lead to massive improvements in performance. On the other hand, the implementation is probably very custom (and hard to reproduce without considerable engineering effort), as it will likely change from chip to chip (the authors targeted the Titan X).

Minor: it would be very nice if the authors open-sourced a version implementable on Theano / TensorFlow / Torch, etc., for others to reproduce the experiments.

=====