Poster in Workshop: Exploration in AI Today (EXAIT)
Stabilizing protein fitness predictors via the PCS framework
Omer Ronen · Alex Zhao · Ron Boger · Chengzhong Ye · Bin Yu
Keywords: [ uncertainty quantification ] [ protein fitness prediction ] [ bayesian optimization ]
Abstract:
We improve protein fitness prediction by addressing an often-overlooked source of instability in machine learning models: the choice of data representation. Guided by the Predictability–Computability–Stability (PCS) framework for veridical (truthful) data science, we construct $\textit{SP}$ (Stable and Pred-checked) predictors by applying a prediction-based screening procedure (pred-check in PCS) to select predictive representations, followed by ensembling models trained on each, thereby leveraging representation-level diversity. This approach improves predictive accuracy, out-of-distribution generalization, and uncertainty quantification across a range of model classes. Our SP variant of the recently introduced kernel regression method, Kermut, achieves state-of-the-art performance on the ProteinGym supervised fitness prediction benchmark: it reduces mean squared error by up to 20% and improves Spearman correlation by up to 10%, with the largest improvements on splits representing a distribution shift. We further demonstrate that SP predictors yield statistically significant improvements in in-silico protein design tasks. Our results highlight the critical role of representation-level variability in fitness prediction and, more broadly, underscore the need to address instability throughout the entire data science lifecycle to advance protein design.
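The screen-then-ensemble idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: ridge regression stands in for the base model (the abstract's flagship result uses Kermut), the data are synthetic, and the names `sp_predict` and `min_spearman` as well as the 0.3 threshold are hypothetical choices made only for this example. The standard deviation across ensemble members is used here as a crude proxy for uncertainty, which may differ from how the paper quantifies it.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge fit on [X, 1]; the bias is regularized too, for brevity."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict_ridge(w, X):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w

def spearman(a, b):
    """Spearman correlation computed from ranks (assumes no ties)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def sp_predict(reps_train, y_train, reps_val, y_val, reps_test, min_spearman=0.3):
    """Hypothetical helper: screen representations by validation Spearman,
    then average the predictions of the models that pass the screen."""
    kept, preds = [], []
    for name in reps_train:
        w = fit_ridge(reps_train[name], y_train)
        score = spearman(predict_ridge(w, reps_val[name]), y_val)
        if score >= min_spearman:  # prediction-based screening step
            kept.append(name)
            preds.append(predict_ridge(w, reps_test[name]))
    if not kept:
        raise ValueError("no representation passed the screening threshold")
    preds = np.stack(preds)
    # Ensemble mean as the prediction, member spread as an uncertainty proxy.
    return kept, preds.mean(axis=0), preds.std(axis=0)

# Synthetic demo: two informative representations and one that is pure noise.
rng = np.random.default_rng(0)
n = 600
z = rng.normal(size=(n, 5))  # latent factors driving "fitness"
y = z @ rng.normal(size=5) + 0.1 * rng.normal(size=n)
reps = {
    "informative_a": z @ rng.normal(size=(5, 8)),
    "informative_b": np.tanh(z) + 0.05 * rng.normal(size=(n, 5)),
    "noise_only": rng.normal(size=(n, 8)),
}
tr, va, te = slice(0, 300), slice(300, 450), slice(450, 600)

def split(s):
    return {k: v[s] for k, v in reps.items()}

kept, mean_pred, spread = sp_predict(split(tr), y[tr], split(va), y[va], split(te))
print("kept representations:", kept)
```

In this toy setup the screening step is what keeps an uninformative representation from diluting the ensemble; the averaging step then combines whatever complementary signal the surviving representations carry.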