Biological data is inherently heterogeneous and high-dimensional. Single-cell sequencing of transcripts in a tissue sample generates data for thousands of cells, each of which is characterized by upwards of tens of thousands of genes. How to identify the subsets of cells and genes that are associated with a label of interest remains an open question. In this paper, we integrate a signal-extractive neural network architecture with axiomatic feature attribution to classify tissue samples based on single-cell gene expression profiles. This approach is both interpretable and robust to noise: in an in silico tissue sample, only 5% of genes and 23% of cells need to encode signal for it to distinguish signal from noise with greater than 70% accuracy. We demonstrate its applicability in two real-world settings for discovering cell type-specific chemokine correlates: predicting response to immune checkpoint inhibitors in multiple tissue types and predicting DNA mismatch repair deficiency in colorectal cancer. Our approach not only significantly outperforms traditional machine learning classifiers but also yields actionable biological hypotheses of chemokine-mediated tumor immunogenicity.
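To make the idea of axiomatic feature attribution concrete, the sketch below shows integrated gradients, one well-known attribution method of this kind. The abstract does not name the specific method used, so this choice is an assumption, and the logistic toy model, the five-feature "gene" vector, and every function name here are hypothetical stand-ins rather than the authors' architecture. It illustrates the completeness axiom: per-feature attributions sum to the difference between the model output at the input and at a baseline.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w, b):
    # Hypothetical toy classifier: logistic regression over gene features.
    return sigmoid(x @ w + b)

def model_grad(x, w, b):
    # Analytic gradient of the toy model output with respect to the input x.
    p = model(x, w, b)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, w, b, steps=200):
    # Midpoint Riemann approximation of the path integral of gradients
    # along the straight line from the baseline to the input.
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.stack([model_grad(p, w, b) for p in path])
    return (x - baseline) * grads.mean(axis=0)

# Illustrative data only: five made-up "genes" for a single sample.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
b = 0.1
x = rng.random(5)
baseline = np.zeros(5)

attributions = integrated_gradients(x, baseline, w, b)
output_gap = model(x, w, b) - model(baseline, w, b)

# Completeness: the attributions should sum (approximately) to the output gap.
print(attributions.sum(), output_gap)
```

In a real setting the gradient would come from automatic differentiation of the trained network rather than a hand-written formula, and the choice of baseline (all zeros here) would itself need justification.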
Sherry Chao (Harvard)
Michael Brenner (Harvard/Google)