Poster in Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models
PQV-Mobile: A Combined Pruning and Quantization Toolkit to Optimize Vision Transformers for Mobile Applications
Kshitij Bhardwaj
While Vision Transformers (ViTs) are extremely effective at computer vision tasks and are replacing convolutional neural networks as the new state of the art, they are complex and memory-intensive models. To run these models effectively on resource-constrained mobile/edge systems, they must not only be compressed but also optimized and converted into deployment-friendly formats. To this end, this paper presents a combined pruning and quantization tool, called PQV-Mobile, to optimize vision transformers for mobile applications. The tool supports different types of structured pruning based on magnitude importance, Taylor importance, and Hessian importance. It also supports quantization from FP32 to FP16 and int8, targeting different mobile hardware backends. We demonstrate the capabilities of our tool and show important latency-memory-accuracy trade-offs for different amounts of pruning and int8 quantization with Facebook Data Efficient Image Transformer (DeiT) models. Our results show that pruning a DeiT model by just 9.375%, quantizing it to int8 from FP32, and then optimizing it for mobile applications reduces latency by 7.18× with a small accuracy loss of 2.24%. We plan to open-source this tool.
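The abstract only summarizes the prune-then-quantize-then-export flow, but the same pipeline can be sketched with standard PyTorch tooling. The snippet below is a minimal illustration, not the authors' released toolkit: it assumes the timm and third-party torch-pruning packages, prunes a DeiT model with magnitude importance (Taylor or Hessian importance objects could be swapped in), applies post-training dynamic int8 quantization, and exports a mobile-optimized lite-interpreter bundle; the model name, PRUNE_RATIO value, ignored layers, and output filename are illustrative placeholders.

```python
import torch
import timm
import torch_pruning as tp
from torch.utils.mobile_optimizer import optimize_for_mobile

PRUNE_RATIO = 0.09375  # fraction of channels to remove (illustrative setting)

# 1. Load a pretrained DeiT model and create dummy inputs for tracing/pruning.
model = timm.create_model("deit_small_patch16_224", pretrained=True).eval()
example_inputs = torch.randn(1, 3, 224, 224)

# 2. Structured pruning driven by L2 magnitude importance; attention and head
#    layers are skipped in this simplified sketch to avoid reshaping issues.
importance = tp.importance.MagnitudeImportance(p=2)
ignored = [model.head] + [blk.attn for blk in model.blocks]
pruner = tp.pruner.MetaPruner(
    model,
    example_inputs,
    importance=importance,
    pruning_ratio=PRUNE_RATIO,
    ignored_layers=ignored,
)
pruner.step()

# 3. Post-training dynamic int8 quantization of the remaining Linear layers.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# 4. Trace, optimize for mobile, and save a lite-interpreter bundle for
#    deployment with the PyTorch mobile runtime.
scripted = torch.jit.trace(quantized, example_inputs)
mobile_model = optimize_for_mobile(scripted)
mobile_model._save_for_lite_interpreter("deit_pruned_int8_mobile.ptl")
```

In this kind of flow, dynamic int8 quantization targets the Linear layers that dominate ViT compute on CPU backends, while optimize_for_mobile fuses and freezes the traced graph into a deployment-friendly format.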