Position: Genomic Model Research Must Move Beyond Anecdotal Evaluation of Interpretability Methods
Abstract
Advances in machine learning and computational power have unlocked the predictive potential of the human genome, yet biologists increasingly demand that these models also elucidate the underlying biological mechanisms. While interpretable machine learning (IML) techniques have been increasingly applied to bridge this gap, the field remains pervasively reliant on anecdotal validation: the vast majority of research employs a single IML method and reports only isolated successful instances. Through a benchmarking study on transcription factor binding, we demonstrate the risks of current practices. We show that different IML methods often (1) yield contradictory explanations for identical predictions, (2) fail to localize known regulatory motifs, and (3) do not faithfully reflect the model's internal decision process. In light of this, we argue for a validation framework analogous to clinical trials. Just as trials require rigorous design and the reporting of adverse events, genomic interpretability must move beyond cherry-picked plausibility toward systematic assessment of consistency, faithfulness, and biological validity. To facilitate this, we propose a tiered framework to guide the rigorous evaluation and reporting of genomic IML methods.