CONTINUUM: Restoring the Contiguous Tensor Abstraction Efficiently for Dynamic AI Workloads via Hardware Virtualization
Abstract
Emerging LLM workloads demand extreme memory agility. However, state-of-the-art inference systems (e.g., vLLM) rely on software-defined paging, which sacrifices the contiguous tensor abstraction. This rigid interface exposes fragmentation complexity to developers, imposing a severe engineering burden that stifles algorithmic innovation. We introduce CONTINUUM, a tensor memory virtualization subsystem implemented as a PyTorch extension. By bypassing serialized OS bottlenecks via a lightweight GPU driver extension, CONTINUUM reduces mapping costs by orders of magnitude, from milliseconds to microseconds. Built atop this low-latency API, CONTINUUM provides Elastic Tensor, a set of flexible tensor operations that natively supports complex memory dynamics and zero-copy topological aliasing. Evaluations demonstrate that CONTINUUM achieves significantly higher throughput across diverse dynamic scenarios, effectively democratizing the implementation of next-generation LLM applications.
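To illustrate the core idea behind the abstract, the following is a conceptual sketch, not the CONTINUUM API: a toy page table that presents a contiguous virtual buffer backed by scattered physical pages, showing how virtualization can restore the contiguous abstraction while memory grows elastically without copies. All names here (`ElasticBuffer`, `grow`, the page size) are hypothetical.

```python
PAGE = 4  # toy page size, in elements (illustrative only)

class ElasticBuffer:
    """Hypothetical stand-in for an elastic tensor's backing store:
    a contiguous virtual index range mapped onto scattered pages."""

    def __init__(self, pool):
        self.pool = pool          # free "physical" pages (bytearrays)
        self.page_table = []      # virtual page index -> physical page
        self.length = 0           # logical (virtual) length in elements

    def grow(self, n):
        """Extend the contiguous virtual range by n elements, mapping
        new physical pages on demand -- no existing data is moved."""
        self.length += n
        while len(self.page_table) * PAGE < self.length:
            self.page_table.append(self.pool.pop())  # map a free page

    def __setitem__(self, i, v):
        # Translate a virtual index to (page, offset) via the page table.
        self.page_table[i // PAGE][i % PAGE] = v

    def __getitem__(self, i):
        return self.page_table[i // PAGE][i % PAGE]

pool = [bytearray(PAGE) for _ in range(8)]
buf = ElasticBuffer(pool)
buf.grow(6)          # maps 2 pages for 6 elements
buf[5] = 42
buf.grow(3)          # maps 1 more page; existing data stays in place
assert buf[5] == 42 and len(buf.page_table) == 3
```

In a real system the page table lives in the GPU's MMU rather than in software, which is why driver-level mapping latency (milliseconds versus microseconds, as the abstract notes) dominates the cost of such elastic growth.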