Boosting Deep Neural Network Efficiency with Dual-Module Inference

Liu Liu, Lei Deng, Zhaodong Chen, yuke wang, Shuangchen Li, Jingwei Zhang, Yihua Yang, Zhenyu Gu, Yufei Ding, Yuan Xie,

Abstract Paper

Wed Jul 15 8 a.m. PDT [iCal] [ Join Zoom ]
Wed Jul 15 7 p.m. PDT [iCal] [ Join Zoom ]
Please do not share or post zoom links


Using Deep Neural Networks (DNNs) in machine learning tasks is promising in delivering high-quality results but challenging to meet stringent latency requirements and energy constraints because of the memory-bound and the compute-bound execution pattern of DNNs. We propose a big-little dual-module inference to dynamically skip unnecessary memory access and computation to speedup DNN inference. Leveraging the error-resilient feature of nonlinear activation functions used in DNNs, we propose to use a lightweight little module that approximates the original DNN layer, which is referred to as the big module, to compute activations of the insensitive region that are more error-resilient. The expensive memory access and computation of the big module can be reduced as the results are only used in the sensitive region. For memory-bound models, our method can reduce the overall memory access by 40% on average and achieve 1.54x to 1.75x speedup on a commodity CPU-based server platform with a negligible impact on model quality. In addition, our method can reduce the operations of the compute-bound ResNet model by 3.02x, with only a 0.5% accuracy drop.

Chat is not available.