Current protein language models (PLMs) learn protein representations mainly from sequences, and thus capture co-evolutionary information well, but they cannot explicitly acquire protein functions, which are the end goal of protein representation learning. Fortunately, textual descriptions of properties, including functions, are available for many proteins. Motivated by this fact, we first build the ProtDescribe dataset, which augments protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding with biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment, and multimodal mask prediction, which inject protein property information of different granularities into a PLM while preserving its original representational power. On downstream tasks, ProtST supports both supervised learning and zero-shot prediction. We verify the superiority of ProtST-induced PLMs over previous ones on diverse representation learning benchmarks. Under the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein classification, and ProtST also enables functional protein retrieval from a large-scale database without any function annotations.
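The multimodal representation alignment task described above pairs each protein-sequence embedding with the embedding of its text description. A common way to implement such an alignment is a CLIP-style symmetric InfoNCE objective; the NumPy sketch below is illustrative only, and the function name, temperature value, and use of NumPy (rather than the paper's actual training code) are assumptions, not ProtST's exact implementation:

```python
import numpy as np

def info_nce_alignment(protein_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired protein and text embeddings.

    protein_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    Returns the average of the protein-to-text and text-to-protein losses.
    """
    # L2-normalize so dot products become cosine similarities.
    p = protein_emb / np.linalg.norm(protein_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (p @ t.T) / temperature      # (batch, batch); matched pairs on diagonal
    labels = np.arange(len(p))

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)           # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Symmetrize: pull matched pairs together in both directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each sequence embedding toward its own description and pushes it away from the other descriptions in the batch, which is also what makes zero-shot classification and text-based retrieval possible at inference time (score a protein against the embeddings of candidate function descriptions).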
Author Information
Minghao Xu (Mila - Quebec AI Institute)
Xinyu Yuan (Mila / UdeM)
Santiago Miret (Intel Labs)
Jian Tang (Mila)
Related Events (a corresponding poster, oral, or spotlight)
- 2023 Poster: ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts
  Wed. Jul 26th, 09:00 -- 10:30 PM, Exhibit Hall 1 #506
More from the Same Authors
- 2022: Evaluating Self-Supervised Learned Molecular Graphs
  Hanchen Wang · Shengchao Liu · Jean Kaddour · Qi Liu · Jian Tang · Matt Kusner · Joan Lasenby
- 2022: GAUCHE: A Library for Gaussian Processes in Chemistry
  Ryan-Rhys Griffiths · Leo Klarner · Henry Moss · Aditya Ravuri · Sang Truong · Yuanqi Du · Arian Jamasb · Julius Schwartz · Austin Tripp · Bojana Ranković · Philippe Schwaller · Gregory Kell · Anthony Bourached · Alexander Chan · Jacob Moss · Chengzhi Guo · Alpha Lee · Jian Tang
- 2022: Flaky Performances when Pre-Training on Relational Databases with a Plan for Future Characterization Efforts
  Shengchao Liu · David Vazquez · Jian Tang · Pierre-André Noël
- 2022: Protein Representation Learning by Geometric Structure Pretraining
  Zuobai Zhang · Minghao Xu · Arian Jamasb · Vijil Chenthamarakshan · Aurelie Lozano · Payel Das · Jian Tang
- 2022: Evaluating Self-Supervised Learned Molecular Graphs
  Hanchen Wang · Shengchao Liu · Jean Kaddour · Qi Liu · Jian Tang · Matt Kusner · Joan Lasenby
- 2023: A*Net: A Scalable Path-based Reasoning Approach for Knowledge Graphs
  Zhaocheng Zhu · Xinyu Yuan · Mikhail Galkin · Louis-Pascal Xhonneux · Ming Zhang · Maxime Gazeau · Jian Tang
- 2023: Unsupervised Discovery of Steerable Factors in Graphs
  Shengchao Liu · Chengpeng Wang · Weili Nie · Hanchen Wang · Jiarui Lu · Bolei Zhou · Jian Tang
- 2023: Score-based Enhanced Sampling for Protein Molecular Dynamics
  Jiarui Lu · Bozitao Zhong · Jian Tang
- 2023: Using Multiple Vector Channels Improves $E(n)$-Equivariant Graph Neural Networks
  Daniel Levy · Sékou-Oumar Kaba · Carmelo Gonzales · Santiago Miret · Siamak Ravanbakhsh
- 2023: Evolving Computation Graphs
  Andreea Deac · Jian Tang
- 2023 Poster: A Group Symmetric Stochastic Differential Equation Model for Molecule Multi-modal Pretraining
  Shengchao Liu · Weitao Du · Zhiming Ma · Hongyu Guo · Jian Tang
- 2023 Poster: FusionRetro: Molecule Representation Fusion via In-Context Learning for Retrosynthetic Planning
  Songtao Liu · Zhengkai Tu · Minkai Xu · Zuobai Zhang · Lu Lin · Zhitao Ying · Jian Tang · Peilin Zhao · Dinghao Wu
- 2023 Poster: FAENet: Frame Averaging Equivariant GNN for Materials Modeling
  Alexandre Duval · Victor Schmidt · Alex Hernandez-Garcia · Santiago Miret · Fragkiskos Malliaros · Yoshua Bengio · David Rolnick
- 2023 Poster: Multi-Objective GFlowNets
  Moksh Jain · Sharath Chandra Raparthy · Alex Hernandez-Garcia · Jarrid Rector-Brooks · Yoshua Bengio · Santiago Miret · Emmanuel Bengio
- 2022 Workshop: The First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward
  Huaxiu Yao · Hugo Larochelle · Percy Liang · Colin Raffel · Jian Tang · Ying Wei · Saining Xie · Eric Xing · Chelsea Finn
- 2022 Poster: Generative Coarse-Graining of Molecular Conformations
  Wujie Wang · Minkai Xu · Chen Cai · Benjamin Kurt Miller · Tess Smidt · Yusu Wang · Jian Tang · Rafael Gomez-Bombarelli
- 2022 Spotlight: Generative Coarse-Graining of Molecular Conformations
  Wujie Wang · Minkai Xu · Chen Cai · Benjamin Kurt Miller · Tess Smidt · Yusu Wang · Jian Tang · Rafael Gomez-Bombarelli
- 2020 Poster: Evolutionary Reinforcement Learning for Sample-Efficient Multiagent Coordination
  Somdeb Majumdar · Shauharda Khadka · Santiago Miret · Stephen Mcaleer · Kagan Tumer
- 2019 Poster: Collaborative Evolutionary Reinforcement Learning
  Shauharda Khadka · Somdeb Majumdar · Tarek Nassar · Zach Dwiel · Evren Tumer · Santiago Miret · Yinyin Liu · Kagan Tumer
- 2019 Oral: Collaborative Evolutionary Reinforcement Learning
  Shauharda Khadka · Somdeb Majumdar · Tarek Nassar · Zach Dwiel · Evren Tumer · Santiago Miret · Yinyin Liu · Kagan Tumer