PsumQuant: In-line Post-training Partial Sum Quantizer for Energy Efficient NPU Inference
Sangwoo Hwang ⋅ Yeeun Hong ⋅ Jaeha Kung
Abstract
The rapid growth of deep neural networks (DNNs) has intensified the demand for efficient hardware acceleration under quantization. While prior research has successfully reduced weight and activation precision, the partial sums (psums) generated during accumulation often retain high precision, resulting in significant energy overhead. In this work, we analyze psum distributions in tiled architectures and reveal that within-tile outliers are input-dependent. We propose PsumQuant, a post-training, input-aware quantization method that predicts psum scales on the fly. By leveraging the crest factor of input activations, our learnable scale predictor effectively bounds the psum bit-width while handling extreme outliers in DNNs. Experimental results on a $128 \times 128$ systolic array demonstrate that PsumQuant compresses psum precision down to 8 bits with only a 1\% accuracy drop on ResNet-18 and a marginal 0.04 perplexity increase on Llama-3.1. Furthermore, bit-width reduction with PsumQuant yields a 45\% reduction in total energy with minimal accuracy loss, demonstrating that PsumQuant provides a highly efficient solution for real NPU architectures.
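The abstract mentions predicting psum quantization scales from the crest factor (peak-to-RMS ratio) of input activations. As a minimal illustrative sketch only, and not the paper's actual method, the code below assumes a hypothetical affine predictor `PsumScalePredictor` together with helper functions `crest_factor` and `quantize_psum`; the real predictor architecture, tiling granularity, and quantizer details are not given in the abstract.

```python
import torch


def crest_factor(x: torch.Tensor) -> torch.Tensor:
    """Peak-to-RMS ratio of an input activation tensor."""
    peak = x.abs().amax()
    rms = x.pow(2).mean().sqrt().clamp_min(1e-8)
    return peak / rms


class PsumScalePredictor(torch.nn.Module):
    """Hypothetical learnable scale predictor: maps the input crest
    factor to a psum quantization scale (illustration only)."""

    def __init__(self):
        super().__init__()
        self.a = torch.nn.Parameter(torch.ones(1))
        self.b = torch.nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        cf = crest_factor(x)
        # softplus keeps the predicted scale strictly positive
        return torch.nn.functional.softplus(self.a * cf + self.b)


def quantize_psum(psum: torch.Tensor, scale: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric uniform quantization of partial sums to `bits` precision."""
    qmax = 2 ** (bits - 1) - 1
    q = torch.clamp(torch.round(psum / scale), -qmax - 1, qmax)
    return q * scale
```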