Expo Talk Panel
OmniLong: A Resource-Effective Context Scaling Framework for Multimodal LLM Fine-tuning
Yin Song · Chen Wu
West Ballroom C
This presentation introduces OmniLong, a computational framework that addresses the fundamental challenge of scaling context length in multimodal large language models (MLLMs). Extended contextual understanding across high-frame-rate videos and lengthy documents is a critical frontier for practical applications, yet current approaches require substantial computational infrastructure, creating significant barriers to entry. OmniLong offers a shift from this paradigm: a cohesive architecture that extends context across textual and visual modalities simultaneously while reducing computational requirements. Through sequence parallelism and strategic CPU-GPU memory management, OmniLong fine-tunes models on high-density video content of up to 2048 sampled frames using only 8 A100 GPUs. Empirical evaluation shows that OmniLong-enhanced models consistently outperform their foundational counterparts on established benchmarks, with OmniLong-Qwen2.5-VL-7B achieving particularly strong results on the VideoMME leaderboard for video analysis tasks. The talk presents a comprehensive analysis of OmniLong's technical architecture and optimization methodology, and discusses its broader implications for democratizing access to state-of-the-art multimodal AI capabilities across research institutions and industrial applications with diverse resource constraints.
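The abstract does not include implementation details, so the following is only a minimal, hypothetical PyTorch sketch of the two ingredients it names, sequence parallelism and CPU-GPU memory management: each GPU rank processes a contiguous slice of a long multimodal token sequence, and activations saved for the backward pass are offloaded to pinned host memory via torch.autograd.graph.save_on_cpu. The helper names shard_sequence and forward_with_cpu_offload are illustrative assumptions, not OmniLong APIs.

    # Hypothetical sketch, not OmniLong's actual code.
    import torch
    import torch.distributed as dist

    def shard_sequence(tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [batch, seq_len, hidden], e.g. embeddings for thousands of
        # sampled video frames plus text, too long for a single GPU.
        # Each rank keeps one contiguous slice of the sequence dimension.
        rank, world = dist.get_rank(), dist.get_world_size()
        return tokens.chunk(world, dim=1)[rank].contiguous()

    def forward_with_cpu_offload(model: torch.nn.Module, shard: torch.Tensor) -> torch.Tensor:
        # save_on_cpu moves tensors saved for backward to pinned CPU memory,
        # trading PCIe transfer time for GPU memory headroom.
        with torch.autograd.graph.save_on_cpu(pin_memory=True):
            return model(shard)

In a full training loop, attention layers would additionally need to exchange key/value shards across ranks (e.g. via all-gather or ring-style communication) so that tokens can attend across slice boundaries; that exchange is the core of any sequence-parallel scheme but is omitted here for brevity.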