RTInfer: Exploiting Concurrency for Multiple Real-Time DNN Inference on Edge GPUs
Abstract
While edge GPUs are increasingly used for latency-critical DNN tasks, their limited resources often fail to meet strict real-time (RT) requirements under concurrent workloads. Existing preemption and early-exit mechanisms underutilize GPU resources through single-task queuing and sacrifice excessive accuracy during task bursts. To address this, we propose RTInfer, a novel system that enables concurrent RT task execution while balancing throughput and accuracy. RTInfer integrates accuracy-calibrated co-optimization of lightweight model variants to generate efficient models, a memory-layout-aware scheduler that mitigates fragmentation during preemption, and an on-demand loading strategy that minimizes host-to-GPU transfer latency. Extensive evaluations demonstrate that RTInfer outperforms state-of-the-art methods, reducing the deadline miss rate (DMR) by up to 98.2\% and improving accuracy by up to 58.0\%.