SAGE: A Dataflow-Native Framework for Modular, Controllable, and Transparent LLM-Augmented Reasoning
Jun Liu ⋅ Peilin Liu ⋅ Ruicheng Zhang ⋅ Senlei Zhang ⋅ Yanbo Chen ⋅ Ziao Wang ⋅ Jinyun Yang ⋅ Mingqi Wang ⋅ Shuhao Zhang ⋅ Xiaofei Liao ⋅ Hai Jin
Abstract
LLM applications increasingly execute as end-to-end inference pipelines that couple generation with retrieval, stateful memory, context refinement, and tool use under strict tail-latency and SLO constraints. Today, these stages are often stitched together as RPC-connected services, which obscures cross-stage queueing and interference and limits pipeline-level compilation and resource sharing. We present SAGE (Streaming-Augmented Generative Execution), a full-stack system that treats inference pipelines as first-class compilation targets. SAGE exposes pipelines as declarative dataflows and compiles them into distributed execution plans with bounded-queue backpressure. It integrates vector search, streaming semantic state, structured memory, and context refinement as operators with explicit resource/state contracts, enabling operator-level diagnosis of tail behavior. SAGE supports pluggable generation and embedding backends and provides a unified control plane for engine management, batching, and admission under mixed workloads. On a 16-node cluster, SAGE sustains 16 requests/s at $>700$ tokens/request with 1 ms median scheduling overhead, achieves near-linear scale-out (11.4$\times$ throughput at 16 nodes), and reduces p99 latency by 57\% under multi-pipeline contention versus simultaneous admission.
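The bounded-queue backpressure named in the abstract can be made concrete with a minimal sketch. The Python snippet below is an illustration under assumptions, not SAGE's actual API: the stage functions (retrieve, refine, generate), the queue capacity, and the threading model are all hypothetical. It shows only the core mechanism: put() on a bounded queue blocks the upstream stage when a downstream stage lags, so queueing pressure propagates explicitly through the pipeline instead of accumulating in hidden RPC-side buffers.

```python
import queue
import threading

# Hypothetical three-stage pipeline (retrieve -> refine -> generate).
# Stage bodies are stand-ins; only the queueing structure matters here.
QUEUE_CAPACITY = 8  # bounded capacity: a full queue blocks its producer

def retrieve(request):
    return f"docs-for({request})"

def refine(docs):
    return f"refined({docs})"

def generate(context):
    return f"answer({context})"

SENTINEL = object()  # signals end-of-stream to downstream stages

def stage(fn, inbox, outbox):
    """Pull items from inbox, apply fn, push results downstream.

    outbox.put() blocks when the downstream consumer lags, so
    backpressure propagates upstream stage by stage rather than
    letting any buffer grow without bound.
    """
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(item))

def run_pipeline(requests):
    q0, q1, q2, q3 = (queue.Queue(maxsize=QUEUE_CAPACITY) for _ in range(4))
    workers = [
        threading.Thread(target=stage, args=(retrieve, q0, q1)),
        threading.Thread(target=stage, args=(refine, q1, q2)),
        threading.Thread(target=stage, args=(generate, q2, q3)),
    ]
    for w in workers:
        w.start()
    for r in requests:
        q0.put(r)  # blocks if the pipeline is saturated (admission control)
    q0.put(SENTINEL)
    results = []
    while (out := q3.get()) is not SENTINEL:
        results.append(out)
    for w in workers:
        w.join()
    return results

if __name__ == "__main__":
    print(run_pipeline([f"req-{i}" for i in range(20)]))
```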