

Poster

Diffusion Protein Language Model for Protein Generation and Representation Learning

Xinyou Wang · Zaixiang Zheng · Fei YE · Dongyu Xue · Shujian Huang · Quanquan Gu


Abstract:

This paper introduces the diffusion protein language model (DPLM), a versatile protein language model built on a discrete diffusion probabilistic framework that demonstrates strong generative and predictive capabilities for protein sequences. We first pretrain scalable DPLM on evolutionary-scale protein sequences with a generative self-supervised diffusion objective, which generalizes masked language modeling in a principled way. After pretraining, DPLM can generate novel, diverse, and structurally plausible protein sequences unconditionally. We further show that diffusion generative pretraining gives DPLM a better understanding of proteins, making it a superior representation learner that can be fine-tuned for various downstream tasks and compares favorably to ESM-2 (Lin et al., 2022). DPLM can also be tailored to various needs, showcasing its prowess at conditional generation in several ways: (1) generating scaffolds for functional motifs with a high success rate given only the motifs' peptide sequences; (2) incorporating other modalities as conditioners, such as structure-conditioned generation for inverse folding; and (3) generating protein sequences that satisfy desired properties through classifier guidance.
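
To illustrate the claim that the diffusion objective generalizes masked language modeling, here is a minimal sketch of an absorbing-state discrete diffusion training step: rather than masking a fixed 15% of tokens, a sampled timestep t sets the corruption rate t/T, and the masked-token cross-entropy is reweighted accordingly. The names `model`, `tokens`, and `mask_id`, the linear schedule, and the 1/t weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def diffusion_lm_loss(model, tokens, mask_id, T=500):
    """One training step of absorbing-state discrete diffusion.

    Generalizes masked language modeling: a timestep t ~ U(1, T) sets the
    masking rate t/T, and the cross-entropy on masked positions is
    reweighted by 1/t (an ELBO-style weighting; illustrative only).
    `model` is any transformer over token ids, `tokens` is a (B, L) batch,
    and `mask_id` is the [MASK]/absorbing-state token id.
    """
    B, L = tokens.shape
    t = torch.randint(1, T + 1, (B,), device=tokens.device)   # timestep per sequence
    mask_prob = (t.float() / T).unsqueeze(-1)                 # corruption rate t/T
    is_masked = torch.rand(B, L, device=tokens.device) < mask_prob
    noisy = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)

    logits = model(noisy)                                     # (B, L, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    ce = (ce * is_masked).sum(-1) / is_masked.sum(-1).clamp(min=1)

    weight = 1.0 / t.float()                                  # down-weight heavily corrupted steps
    return (weight * ce).mean()
```

At t = T the whole sequence is masked and generation proceeds by iteratively denoising from all-mask, which is what enables the unconditional and conditional sampling described above; at small t the setup reduces to familiar low-rate masked-token prediction.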
