Poster Thu, Jul 9, 2026 • 1:00 AM – 2:45 AM PDT HALL A #4603

Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

Haocheng Xi ⋅ Shuo Yang ⋅ Yilong Zhao ⋅ Muyang Li ⋅ Han Cai ⋅ Xingyang Li ⋅ Yujun Lin ⋅ Zhuoyang Zhang ⋅ Jintao Zhang ⋅ Xiuyu Li ⋅ Zhiying Xu ⋅ Jun Wu ⋅ Chenfeng Xu ⋅ Ion Stoica ⋅ Song Han ⋅ Kurt Keutzer

Abstract

Despite rapid progress in auto-regressive video diffusion, we identify an emerging system–algorithm bottleneck that limits both deployability and generation quality: KV-cache memory. In auto-regressive video generation models, the KV-cache grows with generation history and quickly dominates GPU memory (often ≥30 GB), preventing deployment on widely available hardware. More critically, memory-bounded KV budgets constrain the effective working memory, directly degrading long-horizon consistency in identity, layout, and motion. To address this challenge, we present Quant VideoGen (QVG), a training-free KV-cache quantization framework for auto-regressive video diffusion models. QVG exploits video’s inherent spatiotemporal redundancy via Semantic-Aware Smoothing, producing low-magnitude, quantization-friendly residuals. Building on this, QVG introduces Progressive Residual Quantization, a coarse-to-fine multi-stage scheme that further reduces quantization error while enabling a smooth quality–memory trade-off. Across LongCat-Video, HY-WorldPlay, and Self-Forcing, QVG establishes a new Pareto frontier between quality and memory efficiency, reducing KV memory by up to 7.0× with less than 4% end-to-end latency overhead, while delivering significantly better generation quality than existing baselines.
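The two ideas named in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation. It assumes that the smoothing step can be approximated by subtracting a per-cluster centroid from each token (clusters found with a small k-means), and that the progressive step re-quantizes the remaining error at each stage. The function names, the group size of 16, the min-max quantizer, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def quantize_dequantize(x, bits=2, group=16):
    """Per-group asymmetric min-max quantization, returned in dequantized form."""
    g = x.reshape(-1, group)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1                      # 2 bits -> codes 0..3
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.clip(np.round((g - lo) / scale), 0, levels)
    return (codes * scale + lo).reshape(x.shape)

def cluster_centroids(x, k, iters=5):
    """Small k-means with farthest-point init (an assumed stand-in for smoothing)."""
    centers = [x[0]]
    for _ in range(k - 1):
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[int(np.argmax(d))])
    centers = np.stack(centers)
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = x[assign == j].mean(axis=0)
    dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dist.argmin(axis=1)

def progressive_residual_quantize(x, stages=2, bits=2, group=16):
    """Coarse-to-fine: each stage quantizes the error left by the previous stages."""
    approx = np.zeros_like(x)
    for _ in range(stages):
        approx = approx + quantize_dequantize(x - approx, bits, group)
    return approx

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Synthetic "keys": 256 tokens of dimension 64 drawn around 4 shared patterns,
# a crude stand-in for the redundancy between similar video tokens.
rng = np.random.default_rng(0)
true_centers = rng.normal(size=(4, 64))
labels = rng.integers(0, 4, size=256)
keys = true_centers[labels] + 0.1 * rng.normal(size=(256, 64))

# Baseline: quantize the raw keys directly to 2 bits.
mse_raw = mse(keys, quantize_dequantize(keys))

# Smoothing (assumed form): subtract each token's centroid, quantize the residual.
centers, assign = cluster_centroids(keys, k=4)
residual = keys - centers[assign]
mse_smooth = mse(keys, centers[assign] + quantize_dequantize(residual))

# Progressive: a second 2-bit stage on what the first stage missed.
mse_prog = mse(keys, centers[assign] + progressive_residual_quantize(residual, stages=2))

print(mse_raw, mse_smooth, mse_prog)
```

In this sketch the centroids are kept in full precision and every extra stage stores another set of 2-bit codes and per-group scales, so adding stages lowers error at the cost of memory. That is one way to read the quality–memory trade-off the abstract describes, under the assumptions stated above.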