Unifying Stacking and Cascading for Efficient Ensemble Inference
Abstract
We introduce LazyStack, a method for efficient model ensemble inference. The core idea is intuitive: after each model executes, we check whether the accumulated evidence is sufficient to exit confidently. Sometimes one model suffices; other times we aggregate predictions from several models via trained meta-learners before a confident prediction is reached. Two insights make this work. First, most inputs follow only 3 to 8 distinct execution trajectories. This reduces the training problem from exponential to linear: we learn aggregators only for these common paths, not for all possible model combinations. Second, we formulate trajectory selection as an MDP and use value iteration to compute the optimal routing policy, which reveals counterintuitive model orderings. On intrusion detection, starting with a moderately expensive model outperforms starting with the cheapest one, because its higher confidence enables an earlier overall exit. Across vision, text, tabular, and LLM tasks, we achieve up to a 38x speedup while retaining 97%+ of the full ensemble's accuracy. The result: ensemble-quality predictions at cascade-level cost.
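To make the execution model concrete, the following is a minimal sketch, not the paper's implementation, of confidence-gated inference as the abstract describes it: models run in a fixed order produced by some routing policy, a per-prefix aggregator fuses the predictions seen so far, and inference stops as soon as the fused confidence clears a threshold. All names here (`lazy_inference`, `run_order`, `aggregators`, `threshold`) are illustrative assumptions.

```python
# Sketch of confidence-gated ensemble inference with per-prefix aggregators.
# Assumptions: each model maps an input to class probabilities; aggregators are
# trained meta-learners keyed by the executed prefix (a plain average stands in
# for them here); run_order comes from an offline routing policy.

import numpy as np


def lazy_inference(x, models, aggregators, run_order, threshold=0.9):
    """Run models along `run_order`, exiting early once confident."""
    outputs, executed = [], []
    fused = None
    for idx in run_order:
        outputs.append(models[idx](x))            # pay the cost of one more model
        executed.append(idx)
        # Meta-learner trained for this common prefix; fall back to averaging.
        agg = aggregators.get(tuple(executed), lambda ps: np.mean(ps, axis=0))
        fused = agg(outputs)
        if fused.max() >= threshold:              # confident enough: exit early
            return fused, executed
    return fused, executed                        # fell through: used every model


# Toy usage with three fake "models" of increasing prediction sharpness.
if __name__ == "__main__":
    rng = np.random.default_rng(0)

    def make_model(temperature):
        def model(x):
            logits = rng.normal(size=4) / temperature
            p = np.exp(logits - logits.max())
            return p / p.sum()
        return model

    models = [make_model(t) for t in (1.0, 0.5, 0.25)]
    aggregators = {}  # trained per-trajectory meta-learners would go here
    probs, used = lazy_inference(x=None, models=models, aggregators=aggregators,
                                 run_order=[1, 0, 2], threshold=0.6)
    print("exited after models", used, "with confidence", round(float(probs.max()), 3))
```

In this sketch the routing policy is simply a precomputed ordering; the abstract's MDP/value-iteration machinery would be the offline procedure that chooses that ordering and decides which prefixes deserve trained aggregators.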