Learning Compressed Shape-Aware Molecular Representations for Virtual Screening
Robin Winter ⋅ Julian Cremer ⋅ Djork-Arné Clevert
Abstract
Virtual screening of billion-scale molecular libraries based on 3D shape similarity remains computationally prohibitive because it requires expensive conformational sampling and alignment, as done by established tools like *ROCS*. Here, we introduce *SAND* (**S**hape-**A**ware **N**eural **D**escriptor), a method that retrieves shape-similar molecules using only their 2D graphs. Our approach makes two key contributions: (1) a rank-preserving contrastive learning framework using differentiable Spearman correlation, which yields representations whose similarity strongly correlates with 3D molecular shape overlap (R=0.86), and (2) an end-to-end quantization-aware training scheme that jointly optimizes the encoder with a two-level IVF-PQ discretization step, achieving approximately $4\times$ better compression than post-hoc quantization at equivalent retrieval quality. We demonstrate that *SAND* enables searching over 10 billion molecules in less than a second on a single GPU node, a speedup of $>10^{8} \times$ compared to traditional methods. We release open-source code and trained weights to facilitate adoption.
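To illustrate the rank-preserving objective mentioned in contribution (1), the following is a minimal sketch of one common way to make Spearman correlation differentiable: hard ranks are replaced by smooth ranks built from pairwise sigmoids, and the Pearson correlation of those soft ranks is computed. The abstract does not specify which relaxation *SAND* uses, so the pairwise-sigmoid relaxation, the function names (`soft_rank`, `soft_spearman`), and the `temperature` parameter are all assumptions made for illustration. The sketch is written in plain Python for readability; an actual training loop would use an autodiff framework.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_rank(values, temperature=0.1):
    """Smooth surrogate for ascending ranks (assumed relaxation).

    rank_i = 1 + sum over j != i of sigmoid((v_i - v_j) / temperature).
    As temperature approaches 0 this approaches the hard rank.
    """
    ranks = []
    for i, vi in enumerate(values):
        r = 1.0
        for j, vj in enumerate(values):
            if i != j:
                r += _sigmoid((vi - vj) / temperature)
        ranks.append(r)
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def soft_spearman(pred, target, temperature=0.1):
    """Spearman correlation computed on soft ranks instead of hard ranks."""
    return pearson(soft_rank(pred, temperature), soft_rank(target, temperature))


# Hypothetical toy data: embedding similarities of four library molecules
# to a query, and their (made-up) 3D shape-overlap scores to the same query.
embedding_sim = [0.9, 0.2, 0.5, 0.7]
shape_overlap = [0.8, 0.1, 0.4, 0.6]
rho = soft_spearman(embedding_sim, shape_overlap)
loss = 1.0 - rho  # a training loss would push rho toward 1
```

In a setup like this, minimizing `1 - rho` rewards the encoder for ordering molecules the same way the 3D shape overlap does, without requiring the embedding similarity to match the overlap values themselves.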