ICML Expo Talk Panel

MiMo-V2.5 Series: Efficient Intelligence via Architecture–Training–Inference Co-Design

Fuli Luo ⋅ Wenhan Ma ⋅ Weimin Xiong ⋅ Lei Li ⋅ Shijie Cao

HALL C

Sun 5 Jul 7:30 p.m. PDT — 8:30 p.m. PDT

Abstract:

The competitive frontier of foundation models has shifted from single-turn reasoning to sustained autonomous execution. The central question is no longer how well a model thinks, but whether it can operate as a reliable agent — maintaining coherence across thousands of decision steps, coordinating multimodal perception and action, and doing so within practical cost budgets. This talk distills three transferable design principles from our experience building and deploying the MiMo-V2.5 open-source model family.

Long-horizon stability requires architectural guarantees. A hybrid sliding-window / global attention scheme (6:1 ratio) compresses the KV cache by ~7× and enables native million-token context. MiMo-V2.5-Pro (1.02T total parameters / 42B active) sustains coherent trajectories over nearly 2,000 tool calls, autonomously completing a full SysY compiler in Rust (4.3 h) and an 8,192-line video editor (11.5 h), both passing all tests on first submission.
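The ~7× figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes that the 6:1 ratio means six sliding-window layers per global layer, that every layer stores the same number of KV bytes per cached token, and that a sliding-window layer caches at most `window` tokens. The window size of 128 is a hypothetical value chosen for the example; the abstract does not state it.

```python
def kv_cache_tokens(context_len, window, local_layers=6, global_layers=1):
    """Cached token positions summed over one repeating group of layers.

    Assumption: a sliding-window layer keeps at most `window` tokens,
    while a global layer keeps the whole context.
    """
    local = local_layers * min(window, context_len)
    global_ = global_layers * context_len
    return local + global_


def compression_ratio(context_len, window, local_layers=6, global_layers=1):
    """KV-cache size of an all-global stack divided by the hybrid stack."""
    full = (local_layers + global_layers) * context_len
    return full / kv_cache_tokens(context_len, window, local_layers, global_layers)


if __name__ == "__main__":
    window = 128  # hypothetical window size, not taken from the abstract
    for n in (8_192, 131_072, 1_000_000):
        print(f"context={n:>9,}  ratio={compression_ratio(n, window):.2f}x")
```

Under these assumptions the ratio is 7N / (6·min(W, N) + N), which approaches 7 as the context length N grows well past the window W. That is consistent with the "~7×" stated above, though the actual accounting in the models may differ.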

Token efficiency is the binding constraint on deployability. Architectural compression, 3-layer multi-token prediction, and Multi-Teacher On-Policy Distillation (MOPD) jointly yield 40–60% token savings over frontier models at matched performance (SWE-Bench Verified 78.9, TerminalBench 2.0 68.4).

Omnimodality closes the perception-expression loop. MiMo-V2.5 (310B / 15B active, 48T training tokens) unifies vision, audio, and language in a single sparse MoE. MiMo-V2.5-TTS enables instruction-steered emotion and timbre control with zero-shot voice design and few-second cloning. MiMo-V2.5-ASR achieves state-of-the-art recognition across dialects, code-switching, and noisy conditions via RL-augmented training.
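To make the sparsity of the two models concrete, the short sketch below computes the fraction of parameters that is active, using only the total and active parameter counts quoted in this abstract. Reading "active" as "used per token" is the usual mixture-of-experts convention and is an assumption here; the function and variable names are hypothetical.

```python
def active_fraction(total_params_b, active_params_b):
    """Share of parameters that are active, given counts in billions."""
    return active_params_b / total_params_b


# (total, active) parameter counts in billions, as quoted in the abstract.
models = {
    "MiMo-V2.5-Pro": (1020.0, 42.0),
    "MiMo-V2.5": (310.0, 15.0),
}

if __name__ == "__main__":
    for name, (total, active) in models.items():
        print(f"{name}: {active_fraction(total, active):.1%} of parameters active")
```

Both models activate roughly 4 to 5% of their parameters, which is one reason that per-token compute can stay far below what the total parameter count would suggest.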

All models are MIT-licensed. We share the trade-offs, failure modes, and scaling lessons behind each principle, and conclude with open questions toward agents capable of narrative-level planning and closed-loop embodied action.