Poster

Diffusion Language Models Are Versatile Protein Learners

Xinyou Wang · Zaixiang Zheng · Fei YE · Dongyu Xue · Shujian Huang · Quanquan Gu

Hall C 4-9 #110
[ Project Page ] [ Paper PDF ]
Tue 23 Jul 2:30 a.m. PDT — 4 a.m. PDT

Abstract:

This paper introduces the diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences. We first pre-train scalable DPLMs on evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework, which generalizes language modeling for proteins in a principled way. After pre-training, DPLM is able to generate structurally plausible, novel, and diverse protein sequences in unconditional generation. We further demonstrate that the proposed diffusion generative pre-training gives DPLM a better understanding of proteins, making it a superior representation learner that can be fine-tuned for various predictive tasks, comparing favorably to ESM2. Moreover, DPLM can be tailored to various needs, showcasing its prowess in conditional generation in several ways: (1) conditioning on partial peptide sequences, e.g., generating scaffolds for functional motifs with a high success rate; (2) incorporating other modalities as conditioners, e.g., structure-conditioned generation for inverse folding; and (3) steering sequence generation towards desired properties, e.g., satisfying specified secondary structures, through plug-and-play classifier guidance.
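To make the generative framework concrete, below is a minimal sketch of one common instantiation of discrete diffusion sampling (absorbing/mask-based iterative denoising), of the kind the abstract's "discrete diffusion probabilistic framework" refers to. The `model` and `vocab` objects, the greedy confidence-based unmasking schedule, and all names here are illustrative assumptions, not the released DPLM API.

```python
# Illustrative sketch: mask-based (absorbing-state) discrete diffusion sampling.
# `model` and `vocab` are hypothetical placeholders, not the actual DPLM interface.
import torch

def sample_sequence(model, vocab, length=200, num_steps=100, device="cpu"):
    """Iteratively denoise a fully masked sequence into amino-acid tokens."""
    mask_id = vocab["<mask>"]
    # Start from an all-masked sequence (the absorbing state).
    tokens = torch.full((1, length), mask_id, dtype=torch.long, device=device)

    for step in range(num_steps):
        # Predict a distribution over amino acids at every position.
        logits = model(tokens)                      # (1, length, vocab_size)
        probs = torch.softmax(logits, dim=-1)
        confidence, prediction = probs.max(dim=-1)  # greedy per-position choice

        # Unmask a growing fraction of positions, highest-confidence first.
        still_masked = tokens == mask_id
        target_unmasked = int(length * (step + 1) / num_steps)
        num_to_unmask = target_unmasked - int((~still_masked).sum())
        if num_to_unmask > 0:
            conf_masked = confidence.masked_fill(~still_masked, -1.0)
            idx = conf_masked.topk(num_to_unmask, dim=-1).indices
            tokens.scatter_(1, idx, prediction.gather(1, idx))

    return tokens
```

Conditional use cases described above (motif scaffolding, inverse folding, classifier guidance) would modify this loop, e.g. by fixing known positions as unmaskable context or by re-ranking per-position predictions with an external predictor, but those details are beyond this sketch.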
