Mosaic: Unlocking Over 30$\times$ Context Length for Diffusion LLMs Inference via Global Memory Planning and Dynamic Peak Taming
Zheng Liang ⋅ Bowen Shi ⋅ Yitao Hu ⋅ Jiawei Zhang ⋅ Ruofan Li ⋅ Guotao Yang ⋅ Zhixin Zhao ⋅ Zhengchao Wang ⋅ Sheng Chen ⋅ Wenxin Li ⋅ Dezhi Ran ⋅ Tao Xie ⋅ Keqiu Li
Abstract
Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive models, leveraging simultaneous denoising to enable global planning and iterative refinement. These properties make dLLMs particularly attractive for long-context generation. However, deploying dLLMs faces a prohibitive memory capacity barrier, because existing inference systems are misaligned with the diffusion paradigm. Unlike autoregressive models, whose memory footprint is dominated by the cumulative KV cache, dLLMs are bottlenecked by transient activations that are rematerialized at every denoising step. Moreover, generic memory reuse mechanisms lack the global visibility needed to handle the dynamic memory peaks of dLLMs, which alternate between the logits and the feed-forward networks. To address these challenges, we present Mosaic, a memory-efficient inference system that shifts dLLM execution from local, static memory management to a global, dynamic paradigm. Mosaic integrates (i) a mask-only logits kernel that eliminates redundant activation materialization, (ii) a lazy chunking optimizer that uses online heuristics to adaptively tame dynamic memory peaks, and (iii) a global memory manager that leverages virtual addressing to mitigate memory fragmentation. Extensive evaluations show that Mosaic reduces the memory peak-to-average ratio by 2.71$\times$ on average and increases the maximum supportable inference sequence length on identical hardware by 15.30--32.34$\times$. Crucially, Mosaic is training-free and preserves exact model outputs while reducing end-to-end latency by 2.5\%--55.4\%.
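The abstract names the mask-only logits kernel only at a high level. As an illustration of the underlying idea (not the paper's actual kernel), the PyTorch sketch below assumes that logits are computed solely for positions that are still masked in the current denoising step, so the full [seq_len, vocab_size] logits tensor is never materialized; the function and argument names are hypothetical.

```python
import torch

def logits_at_masked_positions(hidden, lm_head_weight, mask_positions):
    """Compute output logits only for still-masked token positions.

    hidden:          [seq_len, d_model] final hidden states of one sequence
    lm_head_weight:  [vocab_size, d_model] output projection weights
    mask_positions:  [num_masked] int64 indices of tokens still masked this step
    """
    # Gather only the rows that need logits in this denoising step, so the
    # transient [seq_len, vocab_size] activation is replaced by a much
    # smaller [num_masked, vocab_size] tensor.
    selected = hidden.index_select(0, mask_positions)   # [num_masked, d_model]
    return selected @ lm_head_weight.t()                # [num_masked, vocab_size]
```

Under this assumption, the transient logits footprint shrinks in proportion to the fraction of positions still masked, which is why the logits peak can dominate memory for long sequences with large vocabularies.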