HybridFlow: Resource-Adaptive Subtask Routing for Efficient Edge-Cloud LLM Inference
Abstract
Edge-cloud collaborative inference is crucial for LLM-powered edge devices: on-device models often lack the required reasoning capability, while cloud-only inference can be costly and slow under strict latency and token/API budgets. However, existing edge-cloud collaboration methods typically route entire input tasks based on estimated difficulty; these static, coarse heuristics overlook subtask dependencies, missing opportunities for parallel execution and budget-adaptive routing. To address this, we propose HybridFlow, a resource-adaptive edge-cloud inference framework that enables parallel execution of interdependent subtasks. Specifically, we decompose each input task into a dependency-aware directed acyclic graph (DAG) of subtasks, so that each subtask can execute concurrently as soon as its dependencies are resolved, reducing end-to-end latency. In addition, we introduce a dynamic benefit–cost utility model that routes each subtask to the edge or the cloud by trading off accuracy, token/API cost, and latency in real time; this dynamic routing minimizes unnecessary cloud usage while preserving reasoning quality. Across GPQA, MMLU-Pro, AIME24, and LiveBench-Reasoning, HybridFlow improves the cost–accuracy trade-off, reducing latency and cloud API usage while maintaining competitive accuracy.
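To make the two mechanisms named above concrete, the following is a minimal Python sketch of dependency-aware DAG scheduling combined with per-subtask benefit–cost routing. All names, weights, and the accuracy/cost/latency estimates are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import concurrent.futures as cf
from dataclasses import dataclass, field

@dataclass
class Subtask:
    name: str
    deps: list = field(default_factory=list)  # names of prerequisite subtasks
    est_difficulty: float = 0.5               # hypothetical difficulty estimate in [0, 1]

def utility(acc, cost, latency, lam=1.0, mu=0.5):
    # Hypothetical benefit-cost utility: reward expected accuracy,
    # penalize token/API cost and latency (weights lam, mu are assumptions).
    return acc - lam * cost - mu * latency

def route(task: Subtask):
    # Compare the utility of serving this subtask on-device vs. in the cloud.
    # The accuracy/cost/latency numbers below are illustrative placeholders.
    u_edge = utility(acc=1.0 - task.est_difficulty, cost=0.0, latency=0.2)
    u_cloud = utility(acc=0.95, cost=0.3, latency=0.8)
    return "cloud" if u_cloud > u_edge else "edge"

def run_dag(subtasks: dict):
    # Execute subtasks as soon as their dependencies resolve (DAG scheduling),
    # running all currently ready subtasks in parallel.
    done, results = set(), {}
    pending = dict(subtasks)
    with cf.ThreadPoolExecutor() as pool:
        while pending:
            ready = [t for t in pending.values() if all(d in done for d in t.deps)]
            if not ready:
                raise ValueError("dependency cycle detected")
            futures = {pool.submit(route, t): t.name for t in ready}
            for fut in cf.as_completed(futures):
                name = futures[fut]
                results[name] = fut.result()
                done.add(name)
                del pending[name]
    return results

if __name__ == "__main__":
    dag = {
        "parse":  Subtask("parse", est_difficulty=0.2),
        "reason": Subtask("reason", deps=["parse"], est_difficulty=0.9),
        "verify": Subtask("verify", deps=["parse"], est_difficulty=0.4),
        "answer": Subtask("answer", deps=["reason", "verify"], est_difficulty=0.3),
    }
    print(run_dag(dag))  # e.g. {'parse': 'edge', 'reason': 'cloud', 'verify': 'edge', 'answer': 'edge'}
```

Under these placeholder estimates, only the difficult "reason" subtask is routed to the cloud, while "reason" and "verify" execute concurrently once "parse" completes, illustrating how DAG-level parallelism and utility-based routing interact.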