Spotlight in Workshop: Accessible and Efficient Foundation Models for Biological Discovery
Training Compute-Optimal Protein Language Models
Xingyi Cheng · Bo Chen · Pan Li · Jing Gong · Jie Tang · Le Song
Keywords: [ Protein Language Models ] [ Scaling Law ]
We explore the optimal training of protein language models, an area crucial to biological research where guidance is limited. Most models are trained with extensive compute resources, emphasizing increases in model size over efficient use of compute. Our research uses a large dataset of 939 million protein sequences, training over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion tokens to examine the relationships between model size, token count, and training objective.

Initial findings show diminishing returns for Causal Language Models (CLM) and a tendency to overfit for Masked Language Models (MLM) when training on the Uniref database. To address this, we incorporated metagenomic sequences to diversify the training set and mitigate the plateauing and overfitting. From these experiments, we derived scaling laws for CLM and MLM on Transformers, tailored to the characteristics of protein sequence data.

To validate these scaling laws, we compared against the large-scale ESM-2 and PROGEN2 models on downstream tasks, including protein generation and structure- and function-related evaluations, within comparable pre-training compute budgets.
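The abstract does not state the functional form fitted in the paper, so the sketch below only illustrates the general procedure: assume a Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta, fit it to a sweep of (model size, token count, loss) observations, and solve for the compute-optimal parameter/token split under a fixed FLOP budget. The data, parameter values, the 6·N·D FLOP estimate, and the `parametric_loss` helper are all illustrative assumptions, not the paper's actual measurements or fitted law.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Synthetic sweep over model sizes N (parameters) and token counts D.
# All numbers are illustrative placeholders, not results from the paper.
N = np.repeat([3.5e6, 3.0e7, 1.5e8, 7.0e8, 3.0e9, 1.0e10], 4)
D = np.tile([5e9, 2e10, 8e10, 2e11], 6)


def parametric_loss(X, E, A, B, alpha, beta):
    """Assumed Chinchilla-style loss surface: L(N, D) = E + A/N^alpha + B/D^beta."""
    n, d = X
    return E + A / n**alpha + B / d**beta


# Ground-truth parameters used only to generate the synthetic losses.
true_params = (1.7, 400.0, 900.0, 0.32, 0.28)
L = parametric_loss((N, D), *true_params) + rng.normal(0.0, 0.01, N.size)

# Fit the five parameters of the assumed loss surface to the observations.
popt, _ = curve_fit(parametric_loss, (N, D), L,
                    p0=[1.5, 300.0, 800.0, 0.3, 0.3], maxfev=50000)
E, A, B, alpha, beta = popt

# Compute-optimal allocation under the common FLOP estimate C ~ 6*N*D:
# minimizing L subject to 6*N*D = C gives N* = G*(C/6)^a and D* = (C/6)/N*,
# with a = beta/(alpha+beta) and G = (alpha*A / (beta*B))**(1/(alpha+beta)).
C = 1e21
a = beta / (alpha + beta)
G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
N_opt = G * (C / 6.0) ** a
D_opt = (C / 6.0) / N_opt
print(f"fitted exponents: alpha={alpha:.2f}, beta={beta:.2f}")
print(f"compute-optimal at C=1e21 FLOPs: N*~{N_opt:.2e} params, D*~{D_opt:.2e} tokens")
```

In this setup, the fitted exponents alpha and beta determine how a fixed compute budget should be split between model size and data; the paper's contribution is estimating such relationships specifically for CLM and MLM objectives on protein sequences, which the sketch does not reproduce.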