

Poster in Workshop: AI for Science: Scaling in AI for Scientific Discovery

PAIR: Boosting the Predictive Power of Protein Representations with a Corpus of Text Annotations

Haonan Duan · Marta Skreta · Leonardo Cotta · Ella Rajaonson · Nikita Dhawan · Alan Aspuru-Guzik · Chris Maddison

Keywords: [ Multimodal Learning ] [ protein function predictions ] [ Protein Language Models ]


Abstract:

Protein language models trained on raw amino acid sequences have demonstrated impressive success in various protein function prediction tasks. One explanation for this success is that language modeling for amino acid sequences captures the local evolutionary fitness landscape and, therefore, encourages the models to extract rich information about the structure and function of a protein. Yet detecting distant evolutionary relationships from sequences alone is a challenge. In this work, we conduct a comprehensive study examining the effects of training protein models on nineteen types of expertly curated function annotations in Swiss-Prot. We find that different annotation types have varying effects on the quality of the learned representations, with some even degrading the model's performance. However, by incorporating a carefully selected subset of annotation types, we are able to improve the model's function prediction performance. Notably, unlike existing protein models, our approach either matches or outperforms the widely used bioinformatics tool BLAST in annotating previously uncharacterized proteins.
