RTInfer: Exploiting Concurrency for Multiple Real-Time DNN Inference on Edge GPUs
Abstract
While edge GPUs are increasingly used for latency-critical DNN tasks, their limited resources often fail to meet strict real-time (RT) requirements under concurrent workloads. Existing preemption and early-exit mechanisms underutilize GPU resources through single-task queuing and sacrifice excessive accuracy during task bursts. To address this, we propose RTInfer, a novel system that enables concurrent RT task execution while balancing throughput and accuracy. RTInfer integrates accuracy-calibrated co-optimization of lightweight model variants to generate efficient models, a memory-layout-aware scheduler that mitigates fragmentation during preemption, and an on-demand loading strategy that minimizes host-to-GPU transfer latency. Extensive evaluations demonstrate that RTInfer outperforms state-of-the-art methods, reducing the deadline miss rate (DMR) by up to 98.2\% and improving accuracy by up to 58.0\%.