

Poster in Workshop: DMLR Workshop: Data-centric Machine Learning Research

Uncovering Neural Scaling Law in Molecular Representation Learning

Dingshuo Chen · Yanqiao Zhu · Jieyu Zhang · Yuanqi Du · Zhixun Li · Qiang Liu · Shu Wu · Liang Wang


Abstract:

Molecular Representation Learning (MRL) has demonstrated great potential in a variety of tasks such as virtual screening for drug and materials discovery. Despite widespread interest in advancing model-centric techniques, how the quantity and quality of molecular data affect the learned representations remains an open question in this field. In light of this, we investigate the neural scaling behaviors of MRL from a data-centric perspective across various dimensions, including (1) data modality, (2) data distribution, (3) pre-training intervention, and (4) model capacity. Our empirical studies confirm that the performance of MRL exhibits a power-law relationship with data quantity across the aforementioned four dimensions. Moreover, our fine-grained analysis uncovers factors that can improve learning efficiency. To explore the possibility of beating the scaling law, we adapt seven popular data pruning strategies to molecular data and benchmark their performance. Drawing on our experimental findings, we underscore the importance of data-centric MRL and discuss its potential for future research.
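To make the power-law claim concrete, the sketch below fits the standard scaling form, error ≈ a · N^(−b), to a learning curve. The data points and fitted values here are hypothetical placeholders for illustration, not results from the paper; the functional form is the generic neural-scaling-law ansatz the abstract refers to.

```python
# Minimal sketch: fitting a power law, error ~= a * N**(-b), to a
# learning curve. N is the training-set size; b is the scaling exponent.
# The (N, error) pairs below are hypothetical, not from the paper.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b):
    """Validation error as a power-law function of dataset size n."""
    return a * np.power(n, -b)

# Hypothetical (dataset size, validation error) observations.
n_samples = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
val_error = np.array([0.42, 0.33, 0.25, 0.20, 0.15])

(a, b), _ = curve_fit(power_law, n_samples, val_error, p0=(1.0, 0.3))
print(f"fitted scaling exponent b = {b:.3f}")  # larger b => faster gains from more data
```

The abstract also mentions benchmarking data pruning strategies. As a generic illustration of one common family of such strategies (hardness- or loss-based selection), the sketch below keeps the highest-loss fraction of a training set under a reference model. This is a simplified stand-in, assuming per-sample losses are already available; it does not reproduce any of the seven specific methods the paper benchmarks.

```python
# Generic loss-based data pruning: retain the hardest examples, i.e.
# those with the highest per-sample loss under a reference model.
# A simplified illustration, not one of the paper's seven strategies.
import numpy as np

def prune_by_loss(per_sample_loss: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Return indices of the highest-loss examples to keep."""
    n_keep = int(len(per_sample_loss) * keep_fraction)
    # argsort ascends, so the tail holds the hardest examples.
    return np.argsort(per_sample_loss)[-n_keep:]

# Hypothetical per-sample losses for a 10-molecule training set.
losses = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.05, 0.6, 0.4, 0.5])
kept = prune_by_loss(losses, keep_fraction=0.5)
print(sorted(kept.tolist()))  # indices of the 5 hardest examples
```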
