Towards Multimodal Large Language Models with Both Training and Inference Efficiency
Abstract
Multimodal Large Language Models (MLLMs) mainly adopt one of two architectures, each involving a trade-off between training and inference efficiency: embedding space alignment (e.g., the LLaVA series) is inefficient during inference, while cross-attention space alignment (e.g., Flamingo) is inefficient during training. A primary difference between them lies in whether each visual token attends to other tokens within the LLM backbone. To investigate whether this form of attention is essential for MLLMs, we propose NAEViT (No AttEntion from Visual Tokens), an attention mechanism that eliminates such interactions. Our pilot experiment shows that attention from visual tokens is highly redundant. We then introduce SAISA (Self-Attention Input Space Alignment), a novel architecture that enhances both training and inference efficiency. SAISA directly aligns visual features with the input spaces of NAEViT attention blocks, reducing computational overhead in both self-attention and feed-forward networks (FFNs). We conduct experiments across various baseline models, model sizes, and training datasets. SAISA achieves superior performance compared to the baselines while significantly reducing computational costs. Further ablation studies validate the effectiveness of SAISA across different LLMs and visual encoders.
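To make the idea of removing attention from visual tokens concrete, the following is a minimal single-head sketch, assuming a visual-prefix-plus-causal-text token layout; the function name, shapes, and mask construction are illustrative assumptions, not the paper's implementation. Text tokens act as queries over the full sequence, while visual tokens issue no queries, so their attention (and output) computation in the block can be skipped entirely.

```python
import torch
import torch.nn.functional as F

def attention_without_visual_queries(x_visual, x_text):
    """Hypothetical sketch: x_visual [B, Nv, D], x_text [B, Nt, D].

    Only text tokens act as queries; keys/values cover visual + text tokens,
    so no attention originates from the visual tokens.
    """
    q = x_text                                   # queries: text tokens only
    kv = torch.cat([x_visual, x_text], dim=1)    # keys/values: visual prefix + text
    Nv, Nt = x_visual.size(1), x_text.size(1)

    # Boolean mask (True = attend): every text token sees all visual tokens
    # and earlier text tokens (causal within the text segment).
    mask = torch.zeros(Nt, Nv + Nt, dtype=torch.bool)
    mask[:, :Nv] = True
    mask[:, Nv:] = torch.tril(torch.ones(Nt, Nt, dtype=torch.bool))

    out_text = F.scaled_dot_product_attention(q, kv, kv, attn_mask=mask)
    return out_text  # visual tokens produce no attention outputs in this block

if __name__ == "__main__":
    B, Nv, Nt, D = 1, 4, 6, 32
    out = attention_without_visual_queries(torch.randn(B, Nv, D), torch.randn(B, Nt, D))
    print(out.shape)  # torch.Size([1, 6, 32])
```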