

Boosting Deep Neural Network Efficiency with Dual-Module Inference

Liu Liu · Lei Deng · Zhaodong Chen · Yuke Wang · Shuangchen Li · Jingwei Zhang · Yihua Yang · Zhenyu Gu · Yufei Ding · Yuan Xie

Keywords: [ Deep Learning - General ] [ Other ] [ Algorithms ] [ Systems and Software ] [ Architectures ]


Deep Neural Networks (DNNs) deliver high-quality results in machine learning tasks, but their memory-bound and compute-bound execution patterns make it challenging to meet stringent latency requirements and energy constraints. We propose big-little dual-module inference, which dynamically skips unnecessary memory accesses and computation to speed up DNN inference. Leveraging the error resilience of the nonlinear activation functions used in DNNs, we use a lightweight little module that approximates the original DNN layer, referred to as the big module, to compute activations in the insensitive region, which is more error-resilient. The expensive memory accesses and computation of the big module are reduced because its results are needed only in the sensitive region. For memory-bound models, our method reduces overall memory access by 40% on average and achieves 1.54x to 1.75x speedup on a commodity CPU-based server platform with negligible impact on model quality. In addition, our method reduces the operations of the compute-bound ResNet model by 3.02x, with only a 0.5% accuracy drop.
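The big-little idea can be illustrated with a minimal sketch, which is not the authors' implementation. It makes several assumptions that the abstract does not state: the little module is a coarsely quantized copy of the layer's weights, the activation is tanh, and the insensitive region is the saturated part of tanh beyond a hypothetical `threshold`. All function names and parameter values here are illustrative.

```python
import math
import random

def quantize(weights, step):
    """Hypothetical little module: round every weight to a coarse grid."""
    return [[round(v / step) * step for v in row] for row in weights]

def dot(row, x):
    return sum(a * b for a, b in zip(row, x))

def dual_module_layer(w_big, w_little, x, threshold=2.0):
    """Return (activations, number of output rows that needed the big module)."""
    out = []
    big_rows = 0
    for row_big, row_little in zip(w_big, w_little):
        approx = dot(row_little, x)  # cheap approximate pre-activation
        if abs(approx) > threshold:
            # Assumed insensitive region: tanh is nearly flat here, so the
            # approximation error is squashed and the big module is skipped.
            out.append(math.tanh(approx))
        else:
            # Assumed sensitive region: recompute with the full-precision weights.
            out.append(math.tanh(dot(row_big, x)))
            big_rows += 1
    return out, big_rows

random.seed(0)
n_in, n_out, step = 64, 32, 0.05
w_big = [[random.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
w_little = quantize(w_big, step)
x = [random.gauss(0, 1) for _ in range(n_in)]

approx_out, big_rows = dual_module_layer(w_big, w_little, x)
exact_out = [math.tanh(dot(row, x)) for row in w_big]
max_err = max(abs(a - b) for a, b in zip(approx_out, exact_out))
print(f"big-module rows: {big_rows}/{n_out}, max activation error: {max_err:.4f}")
```

In this sketch the big module's weights are read only for outputs that the little module places in the sensitive region, which is where the memory-access savings would come from. Because tanh is 1-Lipschitz, the activation error in the skipped rows can be no larger than the little module's pre-activation error, and it shrinks further the deeper the value sits in the saturated region.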
