ProactiveLLM: Learning Active Interaction for Streaming Large Language Models
Abstract
Standard Large Language Models (LLMs) operate on a "read-then-generate" paradigm, incurring avoidable latency and computational redundancy. Recently, streaming LLMs have attempted to overcome these bottlenecks by allowing input and output to unfold synchronously. However, this introduces a critical challenge: how should the LLM determine the optimal timing to interact with the input and output streams? Existing approaches remain confined to passive adaptation, relying on static or content-irrelevant interaction rules. In this paper, we propose ProactiveLLM, which achieves active interaction by treating "when to generate" and "what to generate" as decoupled objectives. Through masked streaming modeling and self-distillation, the model actively learns to perceive semantic sufficiency from partial inputs. This yields endogenous cues serving as a versatile interface for the plug-and-play integration of diverse decision heads customized for specific latency-accuracy trade-offs. Extensive evaluation across text and speech streaming tasks confirms that ProactiveLLM significantly reduces interaction latency while maintaining quality, validating its capacity for dynamic and active interaction.
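To make the "when to generate" decision concrete, the following is a minimal illustrative sketch, not the paper's actual method: it assumes a hypothetical per-step "endogenous cue" (a scalar logit standing in for the model's learned sufficiency signal) and a threshold-based decision head that triggers generation once the partial input appears semantically sufficient. All names (`sufficiency`, `first_emit_step`) and the toy cue values are invented for illustration.

```python
import math


def sufficiency(cue_logit: float) -> float:
    # Sigmoid of an endogenous cue; a stand-in for a learned
    # semantic-sufficiency score over the partial input.
    return 1.0 / (1.0 + math.exp(-cue_logit))


def first_emit_step(cue_logits, threshold: float = 0.5) -> int:
    """Return the 1-based input step at which a threshold decision head
    would start generating; falls back to the final step (i.e., the
    read-then-generate baseline) if the cue never crosses the threshold."""
    for step, logit in enumerate(cue_logits, start=1):
        if sufficiency(logit) >= threshold:
            return step
    return len(cue_logits)


# Toy stream: cues grow as the partial input becomes more sufficient.
cues = [-2.0, -1.0, 0.3, 1.5, 2.0]
print(first_emit_step(cues))  # emits at step 3 (sigmoid(0.3) ≈ 0.57)
```

Lowering the threshold trades accuracy for latency (earlier emission on weaker evidence), which is one way a family of plug-and-play decision heads could realize different latency-accuracy operating points.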