Transformer-based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for efficient inference at the edge, and even in the data center. While quantization can be a viable solution to this, previous work on quantizing Transformer-based models uses floating-point arithmetic during inference, which cannot efficiently utilize integer-only logical units such as the recent Turing Tensor Cores or traditional integer-only ARM processors. In this work, we propose I-BERT, a novel quantization scheme for Transformer-based models that quantizes the entire inference with integer-only arithmetic. Based on lightweight integer-only approximation methods for nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs end-to-end integer-only BERT inference without any floating-point calculation. We evaluate our approach on the GLUE downstream tasks using RoBERTa-Base/Large. We show that in both cases, I-BERT achieves similar (and slightly higher) accuracy compared to the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4-4.0x for INT8 inference on a T4 GPU system as compared to FP32 inference. The framework has been developed in PyTorch and has been open-sourced.
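The key idea in the abstract, approximating nonlinear operations with lightweight integer-only kernels, reduces to evaluating low-order polynomials directly on quantized integers. Below is a minimal Python sketch of an integer-only second-order polynomial evaluation of this kind; the helper name int_poly, its signature, and the example constants are illustrative assumptions, not the open-sourced I-BERT API.

    import math

    def int_poly(q: int, scale: float, a: float, b: float, c: float):
        """Integer-only evaluation of a*(x + b)**2 + c for x ~= scale * q.

        The two scale-dependent constants below would be precomputed offline;
        the per-element arithmetic at inference time is pure integer.
        """
        q_b = math.floor(b / scale)            # integer bias (precomputed offline)
        q_c = math.floor(c / (a * scale ** 2)) # integer offset (precomputed offline)
        q_out = (q + q_b) ** 2 + q_c           # integer-only at inference time
        scale_out = a * scale ** 2             # output scale, carried as metadata
        return q_out, scale_out

    # Illustrative usage on a single quantized activation; the constants are a
    # hypothetical second-order fit chosen only to demonstrate the mechanics.
    q_x, S = 57, 0.031
    q_y, S_y = int_poly(q_x, S, a=-0.2888, b=-1.769, c=1.0)
    print(q_y * S_y)  # close to -0.2888 * (S*q_x - 1.769)**2 + 1.0

Kernels of this form are what allow GELU- and Softmax-like nonlinearities to run on integer-only hardware: the expensive per-element work is additions, multiplications, and shifts on integers, while any floating-point constants are folded into the scales ahead of time.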
Author Information
Sehoon Kim (University of California, Berkeley)
Amir Gholaminejad (University of California, Berkeley)
Zhewei Yao (University of California, Berkeley)
Michael Mahoney (UC Berkeley)
Kurt Keutzer (EECS, UC Berkeley)
Related Events (a corresponding poster, oral, or spotlight)
- 2021 Oral: I-BERT: Integer-only BERT Quantization
  Thu. Jul 22nd, 01:00 -- 01:20 PM
More from the Same Authors
- 2023 Poster: Understanding Int4 Quantization for Language Models: Latency Speedup, Composability, and Failure Cases
  Xiaoxia Wu · Cheng Li · Reza Yazdani Aminabadi · Zhewei Yao · Yuxiong He
- 2021 Workshop: Beyond first-order methods in machine learning systems
  Albert S Berahas · Anastasios Kyrillidis · Fred Roosta · Amir Gholaminejad · Michael Mahoney · Rachael Tappenden · Raghu Bollapragada · Rixon Crane · J. Lyle Kim
- 2021 Poster: HAWQ-V3: Dyadic Neural Network Quantization
  Zhewei Yao · Zhen Dong · Zhangcheng Zheng · Amir Gholaminejad · Jiali Yu · Eric Tan · Leyuan Wang · Qijing Huang · Yida Wang · Michael Mahoney · Kurt Keutzer
- 2021 Spotlight: HAWQ-V3: Dyadic Neural Network Quantization
  Zhewei Yao · Zhen Dong · Zhangcheng Zheng · Amir Gholaminejad · Jiali Yu · Eric Tan · Leyuan Wang · Qijing Huang · Yida Wang · Michael Mahoney · Kurt Keutzer
- 2021 Poster: ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
  Jianfei Chen · Lianmin Zheng · Zhewei Yao · Dequan Wang · Ion Stoica · Michael Mahoney · Joseph E Gonzalez
- 2021 Oral: ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
  Jianfei Chen · Lianmin Zheng · Zhewei Yao · Dequan Wang · Ion Stoica · Michael Mahoney · Joseph E Gonzalez
- 2020: Spotlight talk 3 - PyHessian: Neural Networks Through the Lens of the Hessian
  Amir Gholaminejad
- 2020 Workshop: Beyond first order methods in machine learning systems
  Albert S Berahas · Amir Gholaminejad · Anastasios Kyrillidis · Michael Mahoney · Fred Roosta
- 2020 Poster: PowerNorm: Rethinking Batch Normalization in Transformers
  Sheng Shen · Zhewei Yao · Amir Gholaminejad · Michael Mahoney · Kurt Keutzer
- 2018 Poster: Regret Minimization for Partially Observable Deep Reinforcement Learning
  Peter Jin · Kurt Keutzer · Sergey Levine
- 2018 Oral: Regret Minimization for Partially Observable Deep Reinforcement Learning
  Peter Jin · Kurt Keutzer · Sergey Levine