Position: Deciphering the Functions of DNAs, RNAs, and Proteins Should Consider Multi-Modal Large Language Models
Abstract
Understanding the functions of DNAs, RNAs, and proteins is fundamental to advancing life science research and to translational applications such as drug discovery and precision medicine. While deep learning methods have shown promise in biomolecular function prediction, they typically constrain outputs to predefined categories and require a separate model to be trained for each task. Existing multi-task learning methods operate on a fixed set of tasks and must be retrained whenever a new task arises. Moreover, current approaches produce one-shot, static outputs, with no capacity for iterative refinement or deeper exploration of a prediction. This position paper argues that multi-modal large language models (LLMs) are essential for enabling free-form, interactive prediction of biomolecular functions and zero-shot generalization to new tasks without retraining. Such models can generate coherent, context-aware textual descriptions that capture the complexity and nuance of diverse functional roles. Importantly, they can generalize to novel biomolecules whose functions are unknown or poorly characterized, and they adapt to new tasks through prompting alone, eliminating the need for task-specific retraining. Multi-modal LLMs also support interactive, multi-turn dialogue, allowing users to iteratively refine queries, clarify context, and explore hypotheses. By leveraging these capabilities, multi-modal LLMs offer a scalable, adaptable, and generalizable framework for advancing biomolecular function prediction and accelerating biological discovery.