Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage
Abstract
End-to-end speech-in, speech-out dialogue systems are emerging as a powerful alternative to traditional ASR–LLM–TTS pipelines but remain prone to hallucinations due to limited factual grounding. While text-based dialogue models have effectively mitigated this issue through tools such as web search APIs, extending such capabilities to speech-in, speech-out systems remains underexplored. A key challenge is that tool integration increases latency, disrupting conversational flow. To mitigate this, we propose Streaming Retrieval-Augmented Generation (Stream RAG), a novel framework that reduces latency by predicting and issuing tool queries in parallel with user speech, before the user has finished speaking. Specifically, we develop a post-training pipeline that teaches the model when to issue tool calls and how to generate spoken summaries from retrieved text results, thereby improving both accuracy and responsiveness. To evaluate our approach, we construct AudioCRAG, a benchmark created by converting queries from the publicly available CRAG dataset into speech form. Experimental results show that Stream RAG improves QA accuracy by over 20.0% absolute on AudioCRAG and achieves state-of-the-art performance, including outperforming cascaded systems, on the SLUE-SQA benchmark, while reducing latency by up to 57%. Stream RAG is modality-agnostic and can be applied equally to typed input, paving the way for more agentic, real-time AI assistants.
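To make the core idea concrete, the sketch below simulates issuing a retrieval call from partial ASR hypotheses so that search latency overlaps the user's remaining speech. This is an illustrative assumption, not the paper's implementation: `fake_search` stands in for a web search API, and `looks_query_worthy` is a hypothetical heuristic where the actual system uses a trained policy to decide when to fire a tool call.

```python
import concurrent.futures
import time

def fake_search(query):
    # Stand-in for a web search API call; sleep mimics network latency.
    time.sleep(0.05)
    return f"results for: {query}"

def looks_query_worthy(partial):
    # Hypothetical heuristic: fire once the partial transcript starts with
    # an interrogative and is long enough. The paper's post-trained model
    # would replace this decision rule.
    words = partial.lower().split()
    return len(words) >= 4 and words[0] in {"who", "what", "when", "where", "which", "how"}

def streaming_rag(partials):
    """Issue the retrieval call as soon as a partial transcript looks
    query-worthy, so retrieval runs in parallel with the rest of the
    user's utterance instead of starting after end-of-speech."""
    future = None
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        for partial in partials:  # partial ASR hypotheses, oldest first
            if future is None and looks_query_worthy(partial):
                future = pool.submit(fake_search, partial)  # speculative call
        final = partials[-1]
        if future is None:  # fallback: query only after the user finishes
            future = pool.submit(fake_search, final)
        return final, future.result()

# Growing partial hypotheses for one spoken question.
utterance = [
    "who won",
    "who won the",
    "who won the world cup",
    "who won the world cup in 2022",
]
question, evidence = streaming_rag(utterance)
```

Here the speculative query fires at the third partial hypothesis, so by the time the final words arrive the (simulated) search is already in flight; a cascaded pipeline would only begin retrieval after the full utterance.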