MixFP4: Extending NVFP4 to Mixed Micro-Format via Scale-Bit Reuse and Tensor Core Co-design
Abstract
As large language models continue to scale, fine-grained, block-scaled low-precision formats such as NVFP4 and MXFP4 are increasingly adopted for their substantial throughput and memory benefits. In this regime, floating-point and integer quantizers exhibit complementary strengths: each matches the block-level data distribution better on some blocks than the other. However, tensor-core–accelerated matrix multiplications typically require all operands (weights and activations in the forward pass; weights, activations, and gradients in the backward pass) to share a single quantization format. This one-size-fits-all constraint leaves some blocks poorly represented, which can destabilize training and degrade inference quality. To address this limitation, we introduce MixFP4, a tensor-core–co-designed quantization scheme that evaluates two candidate scale factors for each block, one inducing FP-style and the other INT-style quantization behavior, and selects the one that minimizes quantization error, thereby combining the benefits of both representations while preserving efficient GEMM execution.
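
To make the selection rule concrete, the following is a minimal NumPy sketch of per-block scale selection, assuming the FP4 (E2M1) magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}, a symmetric 4-bit integer grid, and NVFP4's 16-element block size. The function names (select_block_format, _fake_quantize) and the rule of mapping the block maximum onto each grid's endpoint are our illustrative assumptions, not the paper's reference implementation.

    import numpy as np

    # Representable magnitudes of the FP4 (E2M1) grid and of a symmetric
    # 4-bit integer grid (illustrative; block size 16 matches NVFP4).
    FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    INT_GRID = np.arange(8.0)  # magnitudes 0..7

    def _fake_quantize(block, scale, grid):
        # Scale the block, snap each magnitude to the nearest grid point,
        # then dequantize so reconstruction error can be measured.
        if scale == 0.0:
            return np.zeros_like(block)
        mags = np.abs(block) / scale
        idx = np.abs(mags[:, None] - grid[None, :]).argmin(axis=1)
        return np.sign(block) * grid[idx] * scale

    def select_block_format(block):
        # Hypothetical per-block selection: try an FP-style and an INT-style
        # scale factor, keep whichever yields the smaller squared error.
        amax = np.abs(block).max()
        candidates = {
            "fp":  (amax / FP4_GRID[-1], FP4_GRID),   # map amax to 6.0
            "int": (amax / INT_GRID[-1], INT_GRID),   # map amax to 7
        }
        best_style, best_scale, best_err = None, None, np.inf
        for style, (scale, grid) in candidates.items():
            err = np.sum((block - _fake_quantize(block, scale, grid)) ** 2)
            if err < best_err:
                best_style, best_scale, best_err = style, scale, err
        return best_style, best_scale, best_err

    rng = np.random.default_rng(0)
    block = rng.normal(size=16).astype(np.float32)  # one 16-element block
    print(select_block_format(block))

In the actual scheme, the chosen behavior must also be encoded so the tensor core can decode which grid to apply at GEMM time; the title's "scale-bit reuse" suggests this signal is carried in the stored scale-factor bits, a detail the sketch above omits.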