Poster in Workshop: AI for Science: Scaling in AI for Scientific Discovery
Exploring Neural Scaling Laws in Molecular Pretraining with Synthetic Tasks
Rodrigo S. Hormazabal · Seung Woo Ko · Inwan Yoo · Sehui Han · Paul Bertens
Keywords: [ foundation models ] [ molecular representation learning ] [ Neural Scaling Laws ] [ Synthetic Pretraining ]
Acquiring large-scale annotated molecular datasets has been one of the main bottlenecks in scaling foundation models for chemistry applications. Proprietary restrictions, cost, and complex experimental setups have kept labeled datasets for some predictive tasks restrictively small. Unlike in language models, where pre-training has yielded significant improvements on downstream tasks, molecular pre-training has yet to demonstrate a comparable impact. Inspired by the success of neural scaling laws, we explore how synthetic pre-training affects the downstream performance of molecular machine learning models. We pre-train models of three sizes (12M, 45M, and 230M parameters) on 12 synthetic tasks in an encoder-decoder setup and evaluate them across 10 downstream tasks. Our findings reveal a general correlation between pre-training perplexity and downstream task performance, although the strength of this relationship varies across tasks. These insights suggest that pre-training metrics can provide useful estimates of model performance after fine-tuning. We pre-train models on over 10B tokens and observe that performance saturates, indicating potential for further parameter scaling. This study is a preliminary exploration of synthetic-task pre-training for molecular models, which can complement other pre-training approaches such as multi-task learning with labeled data and multimodal learning.
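As a concrete illustration of what synthetic pre-training targets can look like, the sketch below derives label-free regression targets directly from SMILES strings using RDKit descriptors. The abstract does not enumerate the 12 synthetic tasks actually used, so the specific descriptors, the `synthetic_targets` helper, and its return format are illustrative assumptions rather than the paper's setup.

```python
# Illustrative sketch: deriving synthetic regression targets from SMILES strings.
# The 12 synthetic tasks used in the paper are not listed in the abstract;
# the RDKit descriptors below are stand-ins chosen for illustration only.
from typing import Optional

from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors


def synthetic_targets(smiles: str) -> Optional[dict]:
    """Compute cheap, label-free targets for a single molecule.

    Returns None if the SMILES string cannot be parsed.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        "mol_wt": Descriptors.MolWt(mol),              # molecular weight
        "logp": Descriptors.MolLogP(mol),              # Crippen logP estimate
        "tpsa": Descriptors.TPSA(mol),                 # topological polar surface area
        "num_rings": rdMolDescriptors.CalcNumRings(mol),
        "num_h_donors": Descriptors.NumHDonors(mol),
        "num_h_acceptors": Descriptors.NumHAcceptors(mol),
        "num_rotatable_bonds": Descriptors.NumRotatableBonds(mol),
    }


if __name__ == "__main__":
    # Caffeine as a quick smoke test.
    print(synthetic_targets("Cn1cnc2c1c(=O)n(C)c(=O)n2C"))
```

Because targets of this kind are computed deterministically from the molecular graph, they can be generated at arbitrary scale without experimental annotation, which is what makes synthetic pre-training attractive when labeled data is scarce.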