Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning
Abstract
Multi-modal large language models (MLLMs) depend on in-context learning (ICL) for rapid task adaptation, but their scalability is severely limited by finite context windows and the growing cost of key–value (KV) caches over long multi-modal sequences. Existing memory compression approaches typically rely on rigid token removal or sample-dependent importance estimation, choices that introduce bias, disrupt semantic structure (particularly for visual representations), and yield static memories that cannot adapt to new queries. We introduce TASM (Task-Aware Structured Memory), a training-free framework that addresses these limitations through task-aware, structure-preserving, and dynamically accessible memory construction. TASM employs Task-Vector Guided Compression to replace sample-specific importance signals with a task-level direction that captures the relevance shared across demonstrations. To preserve the underlying information manifold, it further applies Semantics-Aware Token Merging, formulating compression as a bipartite graph matching problem that merges redundant tokens rather than destructively pruning them. Finally, TASM organizes the compressed representations into a multi-resolution hierarchy consisting of a compact Core Memory and a Latent Bank, enabling Query-Adaptive Dynamic Activation and Dynamic Retrieval at inference time. Empirical evaluations show that TASM sustains strong multi-modal ICL performance under high compression ratios, demonstrating an effective balance among efficiency, adaptability, and semantic fidelity.
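To ground the pipeline described above, the sketch below walks through the three stages under explicit assumptions: the task vector is taken as a mean-pooled demonstration embedding, token merging follows ToMe-style bipartite soft matching, and memory activation uses cosine-similarity retrieval. The function names (task_vector, bipartite_merge, build_memory, activate) and these specific scoring and merging rules are illustrative choices, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F


def task_vector(demo_tokens: torch.Tensor) -> torch.Tensor:
    """Collapse all demonstration embeddings (n, d) into one task-level
    direction. Mean-pooling is an illustrative choice, not the paper's rule."""
    return F.normalize(demo_tokens.mean(dim=0), dim=-1)


def bipartite_merge(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """ToMe-style bipartite soft matching: split (n, d) tokens into two
    partitions, find the r most similar cross-partition pairs, and average
    each pair instead of pruning. Returns (n - r, d)."""
    a, b = tokens[0::2], tokens[1::2]
    r = min(r, a.size(0))
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T     # (|a|, |b|)
    best_sim, best_dst = sim.max(dim=-1)        # closest partner in b per a-token
    order = best_sim.argsort(descending=True)   # most redundant a-tokens first
    src, dst = order[:r], best_dst[order[:r]]   # the r pairs to merge
    merged = b.clone()
    counts = torch.ones(b.size(0), 1, device=tokens.device)
    merged.index_add_(0, dst, a[src])           # accumulate sources onto targets
    counts.index_add_(0, dst, torch.ones(r, 1, device=tokens.device))
    merged = merged / counts                    # average every merged slot
    return torch.cat([a[order[r:]], merged], dim=0)


def build_memory(tokens: torch.Tensor, t_vec: torch.Tensor, core_size: int):
    """Rank merged tokens by alignment with the task vector: the top
    core_size form the Core Memory, the remainder the Latent Bank."""
    scores = F.normalize(tokens, dim=-1) @ t_vec
    order = scores.argsort(descending=True)
    return tokens[order[:core_size]], tokens[order[core_size:]]


def activate(query: torch.Tensor, core: torch.Tensor, bank: torch.Tensor, k: int):
    """Query-adaptive activation: keep the Core Memory and dynamically
    retrieve the k Latent-Bank tokens most similar to the current query."""
    sims = F.normalize(bank, dim=-1) @ F.normalize(query, dim=-1)
    picked = bank[sims.topk(min(k, bank.size(0))).indices]
    return torch.cat([core, picked], dim=0)


if __name__ == "__main__":
    demos = torch.randn(512, 64)                # pooled demonstration tokens
    t = task_vector(demos)
    compressed = bipartite_merge(demos, r=256)  # 2x token reduction, no pruning
    core, bank = build_memory(compressed, t, core_size=64)
    context = activate(torch.randn(64), core, bank, k=32)
    print(context.shape)                        # torch.Size([96, 64])
```

The split between a small always-active Core Memory and a larger query-gated Latent Bank is what allows the memory to stay compact while remaining adaptive: the expensive part of the context is fetched per query rather than carried in full.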