DLLMQuant: A Post-Training Quantization Framework Tailored for Diffusion-Based Large Language Models
Abstract
Diffusion-based large language models (DLLMs) have shown promise for non-autoregressive text generation, but their deployment is constrained by large model sizes and heavy computational costs. Post-training quantization (PTQ), a widely used method for compressing and accelerating large language models (LLMs), suffers severe accuracy degradation and reduced generalization when applied directly to DLLMs (e.g., AWQ incurs a 16% accuracy drop on LLADA under W4A4). This paper examines how the unique mechanisms of DLLMs conflict with quantization and identifies three core issues: 1) the iterative generation process of DLLMs inherently involves dynamic masking ratios, which produce markedly different token distributions across decoding steps, and these distributions are not sufficiently captured by existing PTQ calibration methods; 2) quantization errors propagate and accumulate across iterations, so the performance of quantized models degrades progressively as decoding advances; 3) the stability of unmasked tokens, combined with the probabilistic nature of masked tokens, yields an uncoordinated overall feature distribution that is poorly suited to PTQ. To address these issues, we propose DLLMQuant, a PTQ framework tailored for DLLMs that incorporates three novel techniques: 1) Temporal-Mask Adaptive Sampling (TMAS), a calibration method that accounts for both time and mask factors and captures distributions across timesteps; 2) Interaction-Aware Activation Quantization (IA-AQ), which uses bidirectional attention scores to identify important tokens and prioritizes them when minimizing quantization error; 3) Certainty-Guided Quantization (CGQ), which takes mask status and token scores as the core weighting criteria for error compensation, enabling PTQ to better align with the unique weight distribution of DLLMs. Experiments show that DLLMQuant achieves significant performance gains (e.g., over a 10-point accuracy improvement on GSM8K for LLADA under 4-bit quantization) while improving efficiency.
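To make the weighting idea behind IA-AQ concrete, the sketch below chooses an activation quantization scale by grid search while penalizing reconstruction error more heavily on highly attended tokens. This is a minimal, hypothetical reconstruction from the abstract alone, assuming per-tensor symmetric quantization: the function name, the grid search, and the column-sum importance measure are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of attention-weighted activation quantization (IA-AQ idea).
# Assumption: importance of a token = total attention it receives from all
# positions (column sum of the bidirectional attention matrix).
import torch


def attention_weighted_scale(x: torch.Tensor,
                             attn: torch.Tensor,
                             n_bits: int = 4,
                             n_grid: int = 80) -> torch.Tensor:
    """Pick a per-tensor scale minimizing attention-weighted error.

    x    : activations, shape (seq_len, hidden)
    attn : bidirectional attention scores, shape (seq_len, seq_len)
    """
    qmax = 2 ** (n_bits - 1) - 1
    # Per-token importance: attention received by each token, normalized.
    importance = attn.sum(dim=0)
    importance = importance / importance.sum()

    max_val = x.abs().max().clamp(min=1e-8)  # guard against all-zero input
    best_scale, best_err = None, float("inf")
    for i in range(1, n_grid + 1):
        scale = max_val * i / n_grid / qmax
        x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
        # Attention-weighted reconstruction error: errors on highly
        # attended tokens count more toward the objective.
        per_token_err = (x - x_q).pow(2).mean(dim=1)
        err = (importance * per_token_err).sum()
        if err < best_err:
            best_err, best_scale = err, scale
    return best_scale


# Example usage with random data standing in for real calibration samples.
x = torch.randn(128, 768)
attn = torch.softmax(torch.randn(128, 128), dim=-1)
scale = attention_weighted_scale(x, attn)
```

Compared with a plain min-MSE grid search, the only change is the importance weighting of the per-token error, which shifts the chosen scale toward preserving tokens that interact strongly with the rest of the sequence.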