Response to Reviewer 1

Thank you for reviewing our work. It seems we were not clear enough in motivating the case for making small mini-batch sizes efficient, and this feedback is helpful because we want to make that point clear. We want to state up front that this work targets high performance clusters of tens to thousands of high-end GPUs, and is focused on improving computational throughput beyond the state of the art. The issues addressed in this paper may not apply to systems with slower processors or with fewer GPUs, but those systems are explicitly not the focus of this work.

Changing the mini-batch size has several effects, and it may help to consider them separately:

1) From a hardware perspective, increasing the mini-batch size (per GPU) increases GEMM efficiency.
2) From a hardware perspective, increasing the mini-batch size (per GPU) increases the amount of memory required to train a model.
3) From an optimization perspective, using too large a mini-batch (the combined total across all GPUs) can increase the total amount of computational work required to train a model.

We can view these effects as constraints on the mini-batch size that minimizes training time for a given model. Effect 1 places a lower bound on the per-GPU mini-batch size: if it is too low, the GPU runs inefficiently (see the illustrative GEMM timing sketch below). Effect 3 places an upper bound on the combined mini-batch size: if it is too high, further increases add computational work without reducing the number of iterations needed for convergence, thereby increasing training time. Effect 2 places an upper bound on the maximum model size.

In Figure 5 we quantify the impact of effect 3. A mini-batch size of 4 is too small for this dataset, but the ability to run efficiently with a mini-batch size of 4 *per GPU* allows the use of more GPUs. We are not advocating training at an algorithmic mini-batch size of 4 in Figure 5. The idea is instead to use an algorithmic mini-batch size between 128 and 1024 (as you suggest), but a per-GPU mini-batch size of 4, so that the model can be scaled across 32 to 256 GPUs.

Regarding the question of whether very deep residual RNNs are useful, we agree that their utility compared to the state of the art is unclear and that further experiments are required. We do not claim that these models achieve state-of-the-art results. Our motivation for including them is that persistent kernels enabled us to explore bigger and deeper models on current-generation hardware, which would otherwise be too slow or too memory constrained. We do see some evidence that accuracy improves with depth up to 88 layers when using batch normalization and residual connections on our 17,000 hour dataset, which we think the community would be interested to know about. These results are not directly comparable with other papers that use different models (e.g. bidirectional or GRU/LSTM networks vs. forward-only simple RNNs) or a different training dataset. We plan to continue experimenting with model architectures at these scales of data and model size in future work.

Regarding the evaluation of simple RNNs instead of LSTMs/GRUs, we agree it would be interesting to study these architectures. Writing high performance GPU kernels like these requires a significant amount of effort, and we have not completed GRU/LSTM implementations; we are actively working on them as future work.
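As a concrete illustration of effect 1, here is a minimal sketch (not the persistent kernels or the benchmark code from the paper) that uses cuBLAS to time the recurrent GEMM of a simple RNN layer at several per-GPU mini-batch sizes. The hidden size of 1152 and the list of batch sizes are illustrative assumptions of ours. On a TitanX-class GPU, the measured throughput at a mini-batch size of 4 should fall far below the 6.1 TFLOP/s peak, which is the inefficiency that effect 1 describes and that persistent kernels are designed to avoid.

```cuda
// gemm_batch_sweep.cu -- illustrative only: times an RNN-style recurrent GEMM
// (hidden x hidden) * (hidden x batch) with cuBLAS at several per-GPU batch sizes.
// The hidden size and batch list are assumptions, not values from the paper.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int hidden = 1152;                              // assumed hidden layer width
    const std::vector<int> batches = {1, 2, 4, 8, 16, 32, 64, 128};
    const int iters = 100;
    const int max_batch = 128;

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Recurrent weight matrix W and activation buffers X (input), Y (output).
    float *W, *X, *Y;
    cudaMalloc(&W, sizeof(float) * hidden * hidden);
    cudaMalloc(&X, sizeof(float) * hidden * max_batch);
    cudaMalloc(&Y, sizeof(float) * hidden * max_batch);
    cudaMemset(W, 0, sizeof(float) * hidden * hidden);    // zeros avoid denormal slowdowns
    cudaMemset(X, 0, sizeof(float) * hidden * max_batch);

    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int b : batches) {
        // Warm up once, then time `iters` back-to-back GEMMs.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, hidden, b, hidden,
                    &alpha, W, hidden, X, hidden, &beta, Y, hidden);
        cudaEventRecord(start);
        for (int i = 0; i < iters; ++i) {
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, hidden, b, hidden,
                        &alpha, W, hidden, X, hidden, &beta, Y, hidden);
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // 2 * m * n * k floating point operations per GEMM.
        double tflops = 2.0 * hidden * hidden * b * iters / (ms * 1e-3) / 1e12;
        printf("per-GPU batch %4d : %6.2f TFLOP/s\n", b, tflops);
    }

    cudaFree(W); cudaFree(X); cudaFree(Y);
    cublasDestroy(handle);
    return 0;
}
```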
Regarding CPU vs. GPU performance, Intel's fastest Xeon processor is currently the E5-2699 v3, which has a peak single-precision floating-point throughput of approximately 1.3 TFLOP/s. This is approximately 5x lower than a single TitanX GPU's 6.1 TFLOP/s, and the gap is even larger for dense GPU servers with 8 GPUs attached to up to 2 CPUs. Additionally, CPU GEMM implementations are also sensitive to small batch sizes, and although CPUs are less memory-capacity constrained, requiring a large batch size per CPU would still limit the maximum amount of data-parallel scaling.

We plan to complete Figure 5 for the final submission; the runs have since finished and the conclusions are unchanged. We chose not to extend Figure 6 for the submission because doing so obscured the scaling trend over the 1-16 range, but we see the value in showing the complete trend. If accepted, we will augment the figure to cover the complete range as well as zoom in on the smaller batch sizes.

Finally, the comment about the desire to open source our implementation is useful feedback for us. The assembly-level implementation is difficult to integrate into existing frameworks given its use of nonstandard assembler tools, but we are working on a higher-level CUDA implementation that we hope will achieve a similar level of performance and be much easier to open source.
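For reference, the arithmetic behind the CPU/GPU comparison and the data-parallel scaling argument above is summarized below; the symbols N, B, and b are our own notation, not the paper's.

```latex
\frac{6.1\ \text{TFLOP/s (TitanX peak)}}{1.3\ \text{TFLOP/s (Xeon E5\text{-}2699\,v3 peak)}} \approx 4.7 \approx 5\times,
\qquad
N_{\text{workers}} = \frac{B_{\text{algorithmic}}}{b_{\text{per-worker}}}
\;\Rightarrow\;
\frac{128}{4} = 32 \ \text{to}\ \frac{1024}{4} = 256 \ \text{GPUs at } b_{\text{per-GPU}} = 4.
```

The same relation shows why a processor that only reaches good GEMM efficiency at a large per-worker mini-batch size caps the number of data-parallel workers well below this range for a fixed algorithmic mini-batch size.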