Poster Mon, Jul 6, 2026 • 6:30 PM – 8:15 PM PDT HALL A #1600

Toward Safe Quantization-Aware Fine-tuning: Understanding and Mitigating Safety Alignment Degradation

Yuning Yang ⋅ Guowei Peng ⋅ Xiurui Xie ⋅ Minrui Jiang ⋅ Shuang Liang ⋅ Guisong Liu

Abstract

Large language models (LLMs) are increasingly adapted to downstream tasks in resource-constrained scenarios, making quantization-aware fine-tuning (QAF) a common practice for practical deployment. However, interpretability analyses show that quantized LLMs are substantially more vulnerable to safety alignment degradation during fine-tuning than full-precision models. In this paper, we first show theoretically that this vulnerability is driven by quantization errors, which manifest as an initial safety shift followed by a distorted optimization path. Based on this insight, we propose Explicit-Safety Quantization-Aware Fine-tuning (ExSQF), which restores model safety while preserving downstream performance. ExSQF initializes adapters by combining quantization error with a safety matrix projection to mitigate the early safety shift, then applies post-training refinement to correct deviations in the optimization path. Extensive experiments show that ExSQF achieves state-of-the-art safety alignment recovery, even surpassing existing full-precision safety-aware fine-tuning baselines, while preserving model performance.
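
To make the adapter-initialization idea concrete, the sketch below shows one generic way to initialize low-rank adapters from quantization error: quantize a weight matrix, take the residual, and fit the adapters to its truncated SVD. This is an illustration under stated assumptions, not the ExSQF procedure. The function names (`quantize_uniform`, `init_adapters_from_error`), the symmetric per-tensor quantizer, and the SVD-based fit are all hypothetical choices made here. The safety matrix projection is left out because the abstract does not define it.

```python
import numpy as np


def quantize_uniform(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantization, returned dequantized.

    Assumption: a simple stand-in quantizer, not the scheme used in the paper.
    """
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / levels
    return np.round(w / scale) * scale


def init_adapters_from_error(w, bits=4, rank=4):
    """Fit rank-`rank` adapters (b, a) to the quantization error w - q.

    Assumption: truncated SVD of the error is used as the low-rank fit.
    The safety projection mentioned in the abstract is NOT modeled here.
    """
    q = quantize_uniform(w, bits)
    err = w - q  # quantization error to be absorbed by the adapters
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    root = np.sqrt(s[:rank])
    b = u[:, :rank] * root          # shape (out_dim, rank)
    a = root[:, None] * vt[:rank]   # shape (rank, in_dim)
    return q, b, a


rng = np.random.default_rng(0)
w = rng.normal(size=(16, 12))
q, b, a = init_adapters_from_error(w, bits=4, rank=4)

# The adapter-corrected weights q + b @ a should sit closer to w than q alone.
err_before = np.linalg.norm(w - q)
err_after = np.linalg.norm(w - (q + b @ a))
print(err_after < err_before)
```

In this sketch the adapters start from the best rank-limited approximation of the quantization error instead of from zero, so the initial effective weights are closer to the full-precision ones. How ExSQF combines this with its safety projection is not specified in the abstract.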

Lay Summary

Large language models (LLMs) are commonly adapted with quantization-aware fine-tuning (QAF) for efficient deployment, but we find that quantized LLMs are substantially more vulnerable to safety alignment degradation during fine-tuning than full-precision models. We propose ExSQF, a safety-aware QAF method that explicitly restores model safety through safety-guided adapter initialization and post-training refinement, correcting the initial safety shift and distorted optimization path caused by quantization errors. Extensive experiments on multiple models and datasets demonstrate that ExSQF effectively mitigates safety degradation while preserving downstream task performance.