Towards A Generative Protein Evolution Machine with DPLM-Evo
Abstract
Proteins are shaped by gradual evolution under biophysical and functional constraints. Protein language models learn rich evolutionary constraints from large-scale sequence data, and discrete diffusion–based protein language models (e.g., DPLMs) have emerged as a promising framework for both understanding and generation. However, existing DPLMs typically rely on masking-based absorbing diffusion, which conflicts with a basic biological intuition: proteins evolve through accumulated edits rather than emerging from masked tokens. As a result, these frameworks lack explicit pretraining objectives for substitution and insertion/deletion (indel) operations, which in turn limits both optimization-style post-editing and flexible guided generation. To address these limitations, we present DPLM-Evo, an evolutionary discrete diffusion framework that explicitly predicts substitution, insertion, and deletion operations during denoising. DPLM-Evo decouples a fixed-length latent alignment space from the variable-length observed sequence space, making indel-aware generation tractable and enabling adaptive scaffold growth throughout the process with negligible computational overhead. To further align substitutions with real evolutionary dynamics, we introduce a contextual evolutionary noising kernel that induces biologically informed, context-dependent mutation patterns. Across tasks, DPLM-Evo improves sequence understanding and achieves state-of-the-art performance on ProteinGym in the single-sequence setting, while also enabling variable-length simulated evolution, guided generation, and post-editing or optimization of existing proteins via explicit edit trajectories.
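
To make the decoupling of latent and observed spaces concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a hypothetical gap symbol `-` that marks an empty slot in the fixed-length latent alignment; the helper names `to_observed` and `apply_edit` are likewise invented for illustration. Under these assumptions, the observed sequence is the latent alignment with gaps removed, so a single token change at one latent position can be read as a substitution, an insertion, or a deletion while the latent length stays constant.

```python
GAP = "-"  # hypothetical gap symbol marking an empty latent slot (an assumption)


def to_observed(latent):
    """Project a fixed-length latent alignment to the variable-length sequence."""
    return "".join(tok for tok in latent if tok != GAP)


def apply_edit(latent, pos, new_tok):
    """Write new_tok at latent position pos and name the resulting edit type."""
    old_tok = latent[pos]
    if old_tok == new_tok:
        kind = "keep"
    elif old_tok == GAP:
        kind = "insertion"      # empty slot becomes a residue
    elif new_tok == GAP:
        kind = "deletion"       # residue becomes an empty slot
    else:
        kind = "substitution"   # residue replaced by another residue
    edited = latent[:pos] + [new_tok] + latent[pos + 1:]
    return edited, kind


latent = list("MK-LV-A")        # 7 latent slots, 5 observed residues
print(to_observed(latent))      # MKLVA

for pos, tok in [(2, "G"), (0, GAP), (3, "I")]:
    edited, kind = apply_edit(latent, pos, tok)
    print(kind, len(edited), to_observed(edited))
```

In this toy setting every edit is a per-position token change over a fixed number of slots, which is one way to see why indel-aware denoising can remain tractable even though the observed sequence length varies.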