CONTINUUM: Restoring the Contiguous Tensor Abstraction Efficiently for Dynamic AI Workloads via Hardware Virtualization
Abstract
Emerging LLM workloads demand extreme memory agility. However, state-of-the-art inference systems (e.g., vLLM) rely on software-defined paging, which sacrifices the contiguous tensor abstraction. This rigid interface exposes fragmentation complexity to developers, imposing a severe engineering burden that stifles algorithmic innovation. We introduce CONTINUUM, a tensor memory virtualization subsystem implemented as a PyTorch extension. By bypassing serialized OS bottlenecks via a lightweight GPU driver extension, CONTINUUM reduces mapping costs by orders of magnitude, from milliseconds to microseconds. Built atop this low-latency API, CONTINUUM provides Elastic Tensor, a set of flexible tensor operations that natively supports complex memory dynamics and zero-copy topological aliasing. Evaluations demonstrate that CONTINUUM achieves significantly higher throughput across diverse dynamic scenarios, effectively democratizing the implementation of next-generation LLM applications.
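To illustrate the core idea behind the abstract, the following is a conceptual sketch, not the CONTINUUM API: a toy page table that presents a contiguous virtual buffer backed by scattered physical pages, showing how virtualization can restore the contiguous abstraction while memory grows elastically without copies. All names here (`ElasticBuffer`, `grow`, the page size) are hypothetical.

```python
PAGE = 4  # toy page size, in elements (illustrative only)

class ElasticBuffer:
    """Hypothetical stand-in for an elastic tensor's backing store:
    a contiguous virtual index range mapped onto scattered pages."""

    def __init__(self, pool):
        self.pool = pool          # free "physical" pages (bytearrays)
        self.page_table = []      # virtual page index -> physical page
        self.length = 0           # logical (virtual) length in elements

    def grow(self, n):
        """Extend the contiguous virtual range by n elements, mapping
        new physical pages on demand -- no existing data is moved."""
        self.length += n
        while len(self.page_table) * PAGE < self.length:
            self.page_table.append(self.pool.pop())  # map a free page

    def __setitem__(self, i, v):
        # Translate a virtual index to (page, offset) via the page table.
        self.page_table[i // PAGE][i % PAGE] = v

    def __getitem__(self, i):
        return self.page_table[i // PAGE][i % PAGE]

pool = [bytearray(PAGE) for _ in range(8)]
buf = ElasticBuffer(pool)
buf.grow(6)          # maps 2 pages for 6 elements
buf[5] = 42
buf.grow(3)          # maps 1 more page; existing data stays in place
assert buf[5] == 42 and len(buf.page_table) == 3
```

In a real system the page table lives in the GPU's MMU rather than in software, which is why driver-level mapping latency (milliseconds versus microseconds, as the abstract notes) dominates the cost of such elastic growth.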