Keywords: [ Algorithms ] [ Architectures ] [ Systems and Software ] [ Other ] [ Deep Learning - General ]
Using Deep Neural Networks (DNNs) in machine learning tasks is promising in delivering high-quality results but challenging to meet stringent latency requirements and energy constraints because of the memory-bound and the compute-bound execution pattern of DNNs. We propose a big-little dual-module inference to dynamically skip unnecessary memory access and computation to speedup DNN inference. Leveraging the error-resilient feature of nonlinear activation functions used in DNNs, we propose to use a lightweight little module that approximates the original DNN layer, which is referred to as the big module, to compute activations of the insensitive region that are more error-resilient. The expensive memory access and computation of the big module can be reduced as the results are only used in the sensitive region. For memory-bound models, our method can reduce the overall memory access by 40% on average and achieve 1.54x to 1.75x speedup on a commodity CPU-based server platform with a negligible impact on model quality. In addition, our method can reduce the operations of the compute-bound ResNet model by 3.02x, with only a 0.5% accuracy drop.