Workshop

ICML 2021 Workshop on Computational Biology

Yubin Xie, Cassandra Burdziak, Amine Remita, Elham Azizi, Abdoulaye Baniré Diallo, Sandhya Prabhakaran, Debora Marks, Dana Pe'er, Wesley Tansey, Julia Vogt, Engelbert MEPHU NGUIFO, Jaan Altosaar, Anshul Kundaje, Sabeur Aridhi, Bishnu Sarker, Wajdi Dhifli, Alexander Anderson

Abstract:

The ICML Workshop on Computational Biology will highlight how machine learning approaches can be tailored to making discoveries with biological data. Practitioners at the intersection of computation, machine learning, and biology are in a unique position to frame problems in biomedicine, from drug discovery to vaccination risk scores, and the Workshop will showcase such recent research. Commodity lab techniques lead to the proliferation of large complex datasets, and require new methods to interpret these collections of high-dimensional biological data, such as genetic sequences, cellular features or protein structures, and imaging datasets. These data can be used to make new predictions towards clinical response, to uncover new biology, or to aid in drug discovery.
This workshop aims to bring together interdisciplinary researchers working at the intersection of machine learning and biology to present recent advances and open questions to the machine learning community. Relevant areas include computational genomics; neuroscience; metabolomics; proteomics; bioinformatics; cheminformatics; pathology; radiology; evolutionary biology; population genomics; phenomics; ecology; cancer biology; causality; and representation learning and disentanglement.
The workshop is a sequel to the WCB workshops we organized in the last five years at ICML, which had excellent line-ups of talks and were well-received by the community. Every year, we received 60+ submissions. After multiple rounds of rigorous reviewing, around 50 submissions were selected from which the best set of papers were chosen for Contributed talks and Spotlights and the rest were invited for Poster presentations. We have a steadfast and growing base of reviewers making up the Program Committee. For two of the previous editions, a special issue of Journal of Computational Biology has been released with extended versions of a selected set of accepted papers.

Schedule

Sat 5:45 a.m. - 5:50 a.m.
Opening Remarks
Sat 5:50 a.m. - 6:25 a.m.
Invited talk 1 - Lessons from the Pandemic for Machine Learning and Medical Imaging (Talk)   
Workshop CompBio, Carola-Bibiane Schönlieb, Mike Roberts
Sat 6:25 a.m. - 6:30 a.m.
Invited Talk 1 Q&A (Q&A)
Sat 6:30 a.m. - 6:45 a.m.
Contributed Talk 1 - Multigrate: single-cell multi-omic data integration (Contributed Talk)   
Workshop CompBio, Mohammad Lotfollahi
Sat 6:45 a.m. - 6:50 a.m.
Contributed Talk 1 Q&A (Q&A)
Sat 6:50 a.m. - 6:55 a.m.
Spotlight Set 1-1 | Statistical correction of input gradients for black box models trained with categorical input features (Spotlight)   
Workshop CompBio, Antonio Majdandzic
Sat 6:55 a.m. - 7:00 a.m.
Spotlight Set 1-2 | Opportunities and Challenges in Designing Genomic Sequences (Spotlight)   
Workshop CompBio, Mengyan Zhang
Sat 7:00 a.m. - 7:05 a.m.
Spotlight Set 1-3 | pmVAE: Learning Interpretable Single-Cell Representations with Pathway Modules (Spotlight)   
Workshop CompBio, Stefan Stark
Sat 7:05 a.m. - 7:10 a.m.
Spotlight Set 1-5 | Deep Contextual Learners for Protein Networks (Spotlight)   
Workshop CompBio, Michelle Li
Sat 7:10 a.m. - 7:15 a.m.
Spotlight Set 1-4 | Multimodal data visualization, denoising and clustering with integrated diffusion (Spotlight)   
Workshop CompBio, MANIK KUCHROO
Sat 7:15 a.m. - 7:30 a.m.
Break 1 (Break)
Sat 7:30 a.m. - 7:31 a.m.
Introduction for Session 2 (Introduction)
Sat 7:31 a.m. - 7:56 a.m.
Invited talk 2 - Anomaly detection to find rare phenotypes (Talk)   
Workshop CompBio, Quaid Morris
Sat 7:56 a.m. - 8:00 a.m.
Invited Talk 2 Q&A (Q&A)
Sat 8:00 a.m. - 8:15 a.m.
Contributed Talk 2 - Light Attention Predicts Protein Location from the Language of Life (Contributed Talk)   
Workshop CompBio, Hannes Stärk
Sat 8:15 a.m. - 8:20 a.m.
Contributed Talk 2 Q&A (Q&A)
Sat 8:20 a.m. - 8:25 a.m.
Highlight 1 | Representation of Features as Images with Neighborhood Dependencies for Compatibility with Convolutional Neural Networks (Paper Highlight)   
Workshop CompBio, Omid Bazgir
Sat 8:25 a.m. - 8:30 a.m.
Highlight 2 | VoroCNN: Deep Convolutional Neural Network Built on 3D Voronoi Tessellation of Protein Structures (Paper Highlight)   
Workshop CompBio, Ilia Igashov
Sat 8:30 a.m. - 8:35 a.m.
Highlight 3 | DIVERSE: Bayesian Data IntegratiVE learning for precise drug ResponSE prediction (Paper Highlight)   
Workshop CompBio, Betul Guvenc Paltun
Sat 8:35 a.m. - 8:40 a.m.
Highlight 4 | Spherical Convolutions on Molecular Graphs for Protein Model Quality Assessment (Paper Highlight)   
Workshop CompBio, Nikita Pavlichenko
Sat 8:40 a.m. - 8:45 a.m.
Highlight 5 | Data-driven Experimental Prioritization via Imputation and Submodular Optimization (Paper Highlight)   
Workshop CompBio, Jacob Schreiber
Sat 8:45 a.m. - 8:50 a.m.
Highlight 6 | Data Inequality, Machine Learning and Health Disparity (Paper Highlight)   
Workshop CompBio, Yan Gao
Sat 8:50 a.m. - 8:55 a.m.
Highlight 7 | Deep neural networks identify sequence context features predictive of transcription factor binding (Paper Highlight)   
Workshop CompBio, AN ZHENG
Sat 9:00 a.m. - 10:00 a.m.

Please join Gather town Room 1 to explore posters of Session 1 and interact with authors and other attendees. Link: https://eventhosts.gather.town/oTvLwqGTGVzInPP8/compbio-w-poster-room-1

Sat 10:00 a.m. - 11:00 a.m.

Please join Gather town Room 2 to explore posters of Session 2 and interact with authors and other attendees. Link: https://eventhosts.gather.town/qSNgxsEtPEkhMjXk/compbio-w-poster-room-2

Sat 11:00 a.m. - 11:01 a.m.
Introduction for Session 3 (Introduction)
Sat 11:01 a.m. - 11:26 a.m.
Invited talk 3 - Every Patient Deserves Their Own Equation (Talk)   
Workshop CompBio
Sat 11:26 a.m. - 11:30 a.m.
Invited Talk 3 Q&A (Q&A)
Sat 11:30 a.m. - 11:45 a.m.
Contributed Talk 3 - Reconstructing unobserved cellular states from paired single-cell lineage tracing and transcriptomics data (Contributed Talk)   
Workshop CompBio, Khalil Ouardini
Sat 11:45 a.m. - 11:50 a.m.
Contributed Talk 3 Q&A (Q&A)
Sat 11:50 a.m. - 11:55 a.m.
Spotlight Set 2-1 | Equivariant Graph Neural Networks for 3D Macromolecular Structure (Spotlight)   
Workshop CompBio, Bowen Jing
Sat 11:55 a.m. - 12:00 p.m.
Spotlight Set 2-2 | Viral Evolution and Antibody Escape Mutations using Deep Generative Models (Spotlight)   
Workshop CompBio, Nicole Thadani
Sat 12:00 p.m. - 12:05 p.m.
Spotlight Set 2-3 | Multi-Scale Representation Learning on Proteins (Spotlight)   
Workshop CompBio, Charlotte Bunne
Sat 12:05 p.m. - 12:10 p.m.
Spotlight Set 2-4 | Immuno-mimetic Deep Neural Networks (Immuno-Net) (Spotlight)   
Workshop CompBio, Ren Wang
Sat 12:10 p.m. - 12:15 p.m.
Spotlight Set 2-5 | Gene expression evolution across species, organs and sexes in Drosophila (Spotlight)   
Workshop CompBio, Soumitra Pal
Sat 12:15 p.m. - 1:15 p.m.

Please join Gather town Room 3 to explore posters of Session 3 and interact with authors and other attendees. Link: https://eventhosts.gather.town/lSxhSe2i0AEPMotS/compbio-w-poster-room-3

Sat 1:15 p.m. - 1:16 p.m.
Introduction for Session 4 (Introduction)
Sat 1:16 p.m. - 1:41 p.m.
Invited talk 4 - Learning from evolution (Talk)   
Workshop CompBio, blanchem
Sat 1:41 p.m. - 1:45 p.m.
Invited Talk 4 Q&A (Q&A)
Sat 1:45 p.m. - 2:00 p.m.
Contributed Talk 4 - A Bayesian Mutation-Selection Model of Evolutionary Constraints on Coding Sequences (Contributed Talk)   
Workshop CompBio, Berk Alpay
Sat 2:00 p.m. - 2:05 p.m.
Contributed Talk 4 Q&A (Q&A)
Sat 2:05 p.m. - 2:20 p.m.
Closing Remarks & Awards Ceremony (Closing Remarks)   
-
[ Visit Poster at Spot A6 in Virtual World ]

Biomimetics has played a key role in the evolution of artificial neural networks. Thus far, in silico metaphors have been dominated by concepts from neuroscience and cognitive psychology. In this paper we introduce a different type of biomimetic model, one that borrows concepts from the immune system, for designing robust deep neural networks. This immuno-mimetic model leads to a new computational biology framework for robustification of deep neural networks against covariate shifts and adversarial attacks. Within this Immuno-Net framework we define a robust adaptive immune-inspired learning system (Immuno-Net RAILS) that emulates in silico the adaptive biological mechanisms of B-cells that are used to defend a mammalian host against pathogenic attacks. When applied to image classification tasks on benchmark datasets, we demonstrate that Immuno-Net RAILS improves the adversarial accuracy of a baseline method, the DkNN-robustified CNN, by as much as 12.5%, without appreciable loss of accuracy on clean data.

Ren Wang
-
[ Visit Poster at Spot A5 in Virtual World ]

It is known that more than 100,000 detectable peptide species elute in single shotgun proteomics runs. The mass spectrometer, however, selects only a small subset of the most abundant peptides for sequencing at each survey scan in a run. This compromises consistent quantification of peptides across runs, leading to the prevalent problem of missing values. When a peptide is identified by sequencing, its MS1 measurements are known to the experimenter. Therefore, peptide identities can be transferred between runs based on similarity of MS1 attributes. The accuracy of the existing approaches to peptide identity propagation (PIP) is limited by the selection of runs used as reference for information propagation, and by the specified tolerance in deviation of MS1 measurements. These approaches are also inherently limited by the lack of a probability measure to assign confidence and filter likely false positive results. We propose to learn the identity of query peptides by mapping them to a latent space of peptide MS1 representations. We then use this embedding space to propagate sequence information between runs. We observed that peptide sequences can have very few occurrences, so the embedding network had to be trained with few-shot learning frameworks. We also observed that the same peptide can occur at different retention gradient times in different studies, hampering the correct identification of peptides. We addressed this challenge by modifying the loss function of prototypical networks. We demonstrated that embedding MS1 attributes of the peptides and propagating sequence information on the embedding space can improve recovery of low-abundance peptides in a small cancer dataset.

Soroor Hediyeh-zadeh
-
[ Visit Poster at Spot B2 in Virtual World ]

Spatial transcriptomic profiling allows studying the heterogeneity of cell types and their spatial distribution in the context of the tissue microenvironment. However, current high-throughput spatial transcriptomic technologies are low resolution, i.e. measurement from each capture location involves a mixture of multiple cells. This problem hinders downstream analysis of intercellular interactions especially in complex tissues such as tumors. We propose two approaches for decomposing spatial transcriptomics without the need for paired single-cell RNA-seq data: non-negative matrix factorization for unsupervised discovery of major cell types, and a semi-supervised autoencoder model for further separation of cell states with incorporation of known marker gene-sets as prior knowledge regularization. We present preliminary insights into tumor-immune interactions in breast cancer tumors and benchmark performance on spatial data simulated from single-cell peripheral blood cells.

xueer chen
-
[ Visit Poster at Spot A6 in Virtual World ]

Effective use of evolutionary information has recently led to tremendous progress in computational prediction of three-dimensional (3D) structures of proteins and their complexes. Despite the progress, the accuracy of predicted structures tends to vary considerably from case to case. Since the utility of computational models depends on their accuracy, reliable estimates of deviation between predicted and native structures are of utmost importance. For the first time, we present a deep convolutional neural network (CNN) constructed on a Voronoi tessellation of 3D molecular structures. Despite the irregular data domain, our data representation allows us to efficiently introduce both convolution and pooling operations and train the network in an end-to-end fashion without precomputed descriptors. The resultant model, VoroCNN, predicts local qualities of 3D protein folds. The prediction results are competitive with the state of the art and superior to the previous 3D CNN architectures built for the same task. We also discuss practical applications of VoroCNN, for example, in recognition of protein binding interfaces. In the recent blind protein structure prediction challenge CASP14, VoroCNN ranked second among more than 70 protein model quality assessment methods according to several evaluation metrics. The code is available at https://team.inria.fr/nano-d/software/vorocnn/.

Ilia Igashov
-
[ Visit Poster at Spot B0 in Virtual World ]

We propose Epiphany, a light-weight neural network to predict the Hi-C contact map from five commonly generated epigenomic tracks: DNase I hypersensitive sites and CTCF, H3K27ac, H3K27me3, and H3K4me3 ChIP-seq. Epiphany uses 1D convolutional layers to learn local representations from the input tracks as well as bidirectional Long Short-Term Memory (Bi-LSTM) layers to capture long-term dependencies along the epigenome. To improve the usability of predicted contact matrices, we perform statistically principled preprocessing of Hi-C data using HiC-DC+ and train Epiphany using an adversarial loss, enhancing its ability to produce realistic Hi-C contact maps for downstream analysis. We show that Epiphany generalizes to held-out chromosomes within and across cell types, and that Epiphany's predicted contact matrices yield accurate TAD and significant interaction calls.

Rui Yang
-
[ Visit Poster at Spot B1 in Virtual World ]

In biomedical applications, patients are often profiled with multiple technologies or assays to produce a multiomics or multiview biological dataset. A challenge in collecting these datasets is that entire views or individual features are often missing, which can significantly limit the accuracy of downstream tasks such as predicting a patient phenotype. Here, we propose a multiview-based deep generative adversarial data imputation model (MultImp). MultImp improves imputation quality and disease subtype classification accuracy in comparison to several baseline methods across two multiomics datasets.

Yining Jiao
-
[ Visit Poster at Spot B2 in Virtual World ]

Processing information on 3D objects requires methods stable to rigid-body transformations, in particular rotations, of the input data. In image processing tasks, convolutional neural networks achieve this property using rotation-equivariant operations. However, contrary to images, graphs generally have irregular topology. This makes it challenging to define a rotation-equivariant convolution operation on these structures. In this work, we propose Spherical Graph Convolutional Network (S-GCN) that processes 3D models of proteins represented as molecular graphs. In a protein molecule, individual amino acids have common topological elements. This allows us to unambiguously associate each amino acid with a local coordinate system and construct rotation-equivariant spherical filters that operate on angular information between graph nodes. Within the framework of the protein model quality assessment problem, we demonstrate that the proposed spherical convolution method significantly improves the quality of model assessment compared to the standard message-passing approach. It is also comparable to state-of-the-art methods, as we demonstrate on Critical Assessment of Structure Prediction (CASP) benchmarks. The proposed technique operates only on geometric features of protein 3D models. This makes it universal and applicable to any other geometric-learning task where the graph structure allows constructing local coordinate systems. The method is available at https://team.inria.fr/nano-d/software/s-gcn/.

Nikita Pavlichenko
-
[ Visit Poster at Spot B3 in Virtual World ]

Single-cell multimodal omics technologies provide a holistic approach to study cellular decision making. Yet, learning from multimodal data is complicated because of missing and incomplete reference samples, nonoverlapping features and batch effects between datasets. To integrate and provide a unified view of multi-modal datasets, we propose Multigrate. Multigrate is a generative multi-view neural network to build multimodal reference atlases. In contrast to existing methods, Multigrate is not limited to specific paired assays while comparing favorably to existing data-specific methods on both integration and imputation tasks. We further show that Multigrate equipped with transfer learning enables mapping a query multimodal dataset into an existing reference atlas.

Nastja Litinetskaya
-
[ Visit Poster at Spot B4 in Virtual World ]

T cells play a pivotal role in the adaptive immune system, recognizing foreign antigens through their T-cell receptor (TCR). Although the specificity and affinity of the TCR to its cognate antigen determine the functionality, the phenotypic differentiation and thereby also the fate of the T cell remain poorly understood. Therefore, studying the transcriptional changes of T cells in the context of their TCRs is key to deeper insights into T-cell biology. To this end, we developed a multi-view Variational Autoencoder (mvTCR) to jointly embed transcriptomic and TCR sequence information at a single-cell level to better capture the phenotypic behavior of T cells. We evaluated mvTCR on two datasets, showing a clear separation of the cell states and their functionality, thus providing a more biologically informative representation than models using each modality individually.

Felix Drost
-
[ Visit Poster at Spot B5 in Virtual World ]

Cellular and tissue context is central to understanding health and disease, yet it is often inaccessible to machine learning analysis. In particular, protein-protein interaction (PPI) networks have facilitated the discovery of disease mechanisms and candidate therapeutic targets; however, they are constructed by experiments in which much of the cellular and tissue context is removed. Given the advancements in single-cell sequencing technology, the limitations of homogeneous and static PPI networks are becoming more apparent. Here, we have developed a deep graph representation learning framework, AWARE, to inject cellular and tissue context into protein embeddings. AWARE optimizes for a multi-scale embedding space, whose structure reflects both intricate PPI connectivity patterns and cellular and tissue organization. We construct a multi-scale data representation of the Human Cell Atlas, and apply AWARE to learn protein, cell type, and tissue embeddings that uphold biological cell type and tissue hierarchies. We demonstrate the utility of such embeddings on the novel task of elucidating cell-type-specific disease-gene associations. For predicting the contributions of genes to diseases in different cell types, our AWARE protein embeddings outperform global PPI network embeddings by at least 12.5%, highlighting the importance of contextual embeddings for biomedicine.

Michelle Li
-
[ Visit Poster at Spot C0 in Virtual World ]

Single-cell multi-omics technology is able to measure multiple data modalities at single cell resolution, such as gene expression level (using single cell RNA-sequencing) and chromatin accessibility (using single cell ATAC-sequencing). Integrating scRNA-seq and scATAC-seq data profiled from different cells is a challenging problem. Existing methods often require that the scRNA-seq and scATAC-seq data cover the same cell types, that is, the same clusters. However, this is often not true for many existing datasets. Here we propose a joint matrix tri-factorization algorithm scJMT that is capable of integrating and clustering cells from both modalities of data in the case where the two data modalities do not share exactly the same cell types. The tri-factorization framework also allows us to obtain clusters of genes and chromatin regions, and the association matrices between cell clusters and gene or region clusters. We show that scJMT is superior to a state-of-the-art method under both scenarios where the two modalities have the same or different cluster compositions.

Ziqi Zhang
-
[ Visit Poster at Spot B3 in Virtual World ]
We present low-dimensional latent representations learnt by $\beta$-VAEs from the graph-topological structures encoding pharmacophoric features. The controlled information compression of these molecular fingerprints effectively removes the ambiguous redundancies and consequently results in encoding the chemically semantic latents. This latent molecular semantics allows for various tasks, from molecular similarity assessment to better-targeted search of the chemical space and drug discovery. We investigate the performance of the learnt latents of various dimensions on the ligand-based virtual screening task.
Andrea Karlova
-
[ Visit Poster at Spot C1 in Virtual World ]

Existing single-cell (SC) datasets are very limited by their number of donors (individuals). As a result, most of the current research in SC genomics focuses on studying biological processes that are broadly conserved across individuals, such as cellular organization and tissue development. While studying such biology from a limited number of donors is possible in principle due to the expected high consistency across donors, advancing our understanding of heterogeneous conditions that demonstrate molecular variation across individuals requires population-level data. Particularly, probing the etiology of complex and heterogeneous variation that may be inconsistent across individuals owing to molecular variation is expected to require population-level information. We developed "kernel of integrated single cells" (Keris), a novel framework to inform the analysis of SC gene expression data with population-level variation. By inferring cell-type-specific moments and their variation with conditions using large tissue-level bulk data representing a population, Keris allows us to generate testable hypotheses at the SC level that would normally require collecting SC data from a large number of donors. Here, we demonstrate how such integration of low-resolution but large bulk data with small but high-resolution SC data enables the identification and study of systematic gradients of variation in gene-gene interactions across cells.

Elior Rahmani
-
[ Visit Poster at Spot B0 in Virtual World ]
Link prediction, which is to predict the existence of a link/edge between two vertices in a graph, is a classical problem in machine learning. Intuitively, if it takes a long distance to walk from $u$ to $v$ along the existing edges, there should not be a link between them, and vice versa. This motivates us to explicitly combine the distance information with graph neural networks (GNNs) to improve link prediction. Calculating the distances between any two vertices (e.g., shortest path, expectation of random walk) in training is time consuming. To overcome this difficulty, we propose an anchor-based distance: First, we randomly select $K$ anchor vertices from the graph and then calculate the shortest distances of all vertices in the graph to them. The distance between vertices $u$ and $v$ is estimated as the average of their distances to the $K$ anchor vertices. After that, we feed the distance into the GNN module. Our method brings significant improvement for link prediction with few additional parameters. We achieved state-of-the-art results on the drug-drug-interaction (i.e., DDI) and protein-protein-association (i.e., PPA) tasks of OGB. Our code is available at https://github.com/lbn187/DLGNN.
Yingce Xia
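
The anchor-based distance described in the abstract above can be illustrated with a small, standard-library sketch. It is not the authors' implementation (see their repository for that). The function names are hypothetical, the graph is assumed unweighted and stored as an adjacency dictionary, and "the average of their distances to the K anchor vertices" is read here as the mean over anchors of d(u, a) + d(a, v), which by the triangle inequality is an upper bound on the true shortest-path distance.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Shortest hop distance from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def anchor_distance_table(adj, k, seed=0):
    """Pick k random anchor vertices and run one BFS per anchor."""
    rng = random.Random(seed)
    anchors = rng.sample(sorted(adj), k)
    return [bfs_distances(adj, a) for a in anchors]


def estimate_distance(table, u, v, unreachable=float("inf")):
    """Mean over anchors of d(u, anchor) + d(anchor, v) (assumed reading)."""
    total = 0.0
    for dist in table:
        total += dist.get(u, unreachable) + dist.get(v, unreachable)
    return total / len(table)


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 - 3 - 4, using every vertex as an anchor.
    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    table = anchor_distance_table(adj, k=5)
    print(estimate_distance(table, 0, 4))
```

The cost is K breadth-first searches up front, after which any pair can be scored in O(K) time, which is the saving over all-pairs shortest paths that the abstract alludes to.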
-
[ Visit Poster at Spot B1 in Virtual World ]

Mutations in viruses can result in zoonosis, immune escape, and changes in pathology. To control evolving pandemics, we wish to predict likely trajectories of virus evolution. Here we predict the probability of SARS-CoV-2 protein variants by using deep generative models to capture constraints on broader evolution of coronavirus sequences. We validate against lab measurements of mutant effects on replication and molecular function (e.g. receptor binding). We then apply our predictor to evaluate the potential of mutational escape from known antibodies, a strategy which can facilitate the development of antibody therapeutics and vaccines to mitigate immune evasion.

Nicole Thadani
-
[ Visit Poster at Spot C3 in Virtual World ]

Gradients of a model's prediction with respect to the inputs are used in a variety of downstream analyses for deep neural networks (DNNs). Examples include post hoc explanations with attribution methods. In many tasks, DNNs are trained on categorical input features subject to value constraints - a notable example is DNA sequences, where input values are subject to a probabilistic simplex constraint from the 1-hot encoded data. Here we observe that outside of this simplex, where no data points anchor the function during training, the learned function can exhibit erratic behaviors. Thus, the gradients can have arbitrary directions away from the data simplex, which manifests as noise in gradients. This can introduce significant errors to downstream applications that rely on input gradients, such as attribution maps. We introduce a simple correction for this off-simplex-derived noise and demonstrate its effectiveness quantitatively and qualitatively for DNNs trained on regulatory genomics data. We find that our correction consistently leads to a small, but significant improvement in gradient-based attribution scores, especially when the direction of the gradients deviates significantly from the simplex.

Antonio Majdandzic
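
The abstract above does not spell out the correction, so the following is only a sketch under an assumption: if one-hot inputs at each position lie on a probability simplex, the direction orthogonal to that simplex is the all-ones vector, and removing it from the gradient amounts to subtracting the per-position mean across categories. The function name is hypothetical and the gradient is assumed to be a list of positions, each holding one value per category (e.g. A, C, G, T).

```python
def project_onto_simplex_tangent(grad):
    """Remove the off-simplex component of a per-position gradient.

    grad: list of positions, each a list of per-category gradient values.
    Returns a new list in which every position sums to zero.
    """
    corrected = []
    for position in grad:
        mean = sum(position) / len(position)
        corrected.append([g - mean for g in position])
    return corrected


if __name__ == "__main__":
    raw = [[1.0, 2.0, 3.0, 6.0], [0.5, 0.5, 0.5, 0.5]]
    print(project_onto_simplex_tangent(raw))
```

A position whose gradient is identical across all categories carries no information along the simplex, so it is mapped to zeros.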
-
[ Visit Poster at Spot B1 in Virtual World ]

Although knowing where a protein functions in a cell is important to characterize biological processes, this information remains unavailable for most known proteins. Machine learning narrows the gap through predictions from expertly chosen input features leveraging evolutionary information that is resource expensive to generate. We showcase using embeddings from protein language models for competitive localization predictions not relying on evolutionary information. Our lightweight deep neural network architecture uses a softmax weighted aggregation mechanism with linear complexity in sequence length referred to as light attention (LA). The method significantly outperformed the state-of-the-art for ten localization classes by about eight percentage points (Q10). The novel models are available as a web-service and as a stand-alone application at http://embed.protein.properties.

Hannes Stärk
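
The softmax-weighted aggregation mentioned in the abstract above can be sketched in a few lines. This is a simplified illustration, not the published architecture: the per-residue `values` and attention `scores` are taken as given inputs (how they are computed from the language-model embeddings is left out), and the function names are hypothetical. For each feature, the scores are normalised with a softmax over sequence positions and used to average the values, so the cost grows linearly with sequence length.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(values, scores):
    """Pool a length-L sequence of d-dim vectors into one d-dim vector.

    values, scores: lists of length L, each a list of d floats.
    For every feature j, weights are softmax(scores[:, j]) over positions.
    """
    length, dim = len(values), len(values[0])
    pooled = []
    for j in range(dim):
        weights = softmax([scores[i][j] for i in range(length)])
        pooled.append(sum(weights[i] * values[i][j] for i in range(length)))
    return pooled


if __name__ == "__main__":
    values = [[1.0, 10.0], [3.0, 30.0]]
    uniform = [[0.0, 0.0], [0.0, 0.0]]
    print(attention_pool(values, uniform))  # equal scores give a plain mean
```

With equal scores the pooling reduces to a mean over positions; strongly peaked scores make it select the values at a single position.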
-
[ Visit Poster at Spot B6 in Virtual World ]

An unfortunate reality is that modern science is often limited by the number of experiments that one can afford to perform. When faced with budget constraints, choosing the most informative set of experiments sometimes requires intuition and guess-work. Here, we describe a data-driven method for prioritizing experimentation given a fixed budget. This method involves first predicting the readout for each hypothetical experiment and, second, using submodular optimization to choose a minimally redundant set of hypothetical experiments based on these predictions. This approach has several strengths, including the ability to incorporate soft and hard constraints into the optimization, account for experiments that have already been performed, and weight each experiment based on anticipated usefulness or actual cost. Software for this system applied to the ENCODE Compendium can be found at https://github.com/jmschrei/kiwano.

Jacob Schreiber
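
The second step in the abstract above, choosing a minimally redundant set under a budget, can be sketched with a greedy submodular selection. The objective used here, facility location over a similarity matrix between predicted readouts, is an assumption for illustration; the abstract only says "submodular optimization", and the function names are hypothetical. The linked repository contains the actual software.

```python
def facility_location_gain(sim, best, candidate):
    """Marginal gain of adding `candidate` given current coverage `best`."""
    return sum(max(s - b, 0.0) for s, b in zip(sim[candidate], best))


def greedy_select(sim, budget):
    """Greedily pick up to `budget` experiments maximising facility location.

    sim: square list-of-lists; sim[j][i] is the similarity between the
    predicted readouts of experiments j and i (non-negative).
    """
    n = len(sim)
    best = [0.0] * n  # how well each experiment is covered so far
    selected = []
    for _ in range(min(budget, n)):
        remaining = [j for j in range(n) if j not in selected]
        pick = max(remaining, key=lambda j: facility_location_gain(sim, best, j))
        selected.append(pick)
        best = [max(b, s) for b, s in zip(best, sim[pick])]
    return selected


if __name__ == "__main__":
    # Experiments 0 and 1 are near-duplicates, as are 2 and 3.
    sim = [
        [1.0, 0.9, 0.1, 0.1],
        [0.9, 1.0, 0.1, 0.1],
        [0.1, 0.1, 1.0, 0.8],
        [0.1, 0.1, 0.8, 1.0],
    ]
    print(greedy_select(sim, budget=2))
```

Because a near-duplicate of an already chosen experiment adds little coverage, the greedy rule spends a budget of two on one experiment from each redundant pair.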
-
[ Visit Poster at Spot A1 in Virtual World ]

The experimental design based on black-box optimization and batch recommendation has been increasingly used for the design of genetic sequences. We briefly outline our recent results on using Bayesian optimization to maximise gene expression in bacteria, where machine learning enabled us to discover a strong regulatory element. Using the Design-Build-Test-Learn (DBTL) workflow as a case study of how to effectively use machine learning in genomic sequence design, we argue that machine learning has tremendous potential in this area. Based on our experience, we discuss several opportunities and challenges that we have identified, and conclude with a call to action for more collaborations.

Mengyan Zhang
-
[ Visit Poster at Spot A2 in Virtual World ]

Over 80% of clinical genetics and omics data were collected from individuals of European ancestry (EA), which comprise approximately 16% of the world’s population. This severe data disadvantage for the non-EA populations is set to generate new health disparities as machine learning powered biomedical research and health care become increasingly common. The new health disparity arising from data inequality can potentially impact all data-disadvantaged ethnic groups in all diseases where data inequality exists. Thus, its negative impact is not limited to the diseases for which significant racial/ethnic disparities have already been evident. In a recent work, we showed that the current prevalent scheme for machine learning with multiethnic data, the mixture learning scheme, and its main alternative, the independent learning scheme, are prone to generating machine learning models with relatively low performance for data-disadvantaged ethnic groups due to inadequate training data and data distribution discrepancies among ethnic groups. We found that transfer learning can provide improved machine learning models for data-disadvantaged ethnic groups by leveraging knowledge learned from other groups having more abundant data. These results indicate that transfer learning can provide an effective approach to reduce health care disparities arising from data inequality among ethnic groups.

Yan Gao
-
[ Visit Poster at Spot A3 in Virtual World ]

Genetic mutations can cause disease by disrupting normal gene function. Identifying the disease-causing mutations from millions of genetic variants within an individual patient is a challenging problem. Computational methods which can prioritize disease-causing mutations have, therefore, enormous applications. It is well-known that genes function through a complex regulatory network. However, existing variant effect prediction models only consider a variant in isolation. In contrast, we propose VEGN, which models variant effect prediction using a graph neural network (GNN) that operates on a heterogeneous graph with genes and variants. The graph is created by assigning variants to genes and connecting genes with a gene-gene interaction network. In this context, we explore an approach where a gene-gene graph is given and another where VEGN learns the gene-gene graph and therefore operates both on given and learnt edges. The graph neural network is trained to aggregate information between genes, and between genes and variants. Variants can exchange information via the genes they connect to. This approach improves the performance of existing state-of-the-art models.

Carolin Lawrence
-
[ Visit Poster at Spot A4 in Virtual World ]

Current neural decoding methods typically aim at explaining behavior based on neural activity via supervised learning. However, since generally there is a strong connection between learning of subjects and their expectations on long-term rewards, we hypothesize that extracting an intrinsic reward function as an intermediate step will lead to better generalization and improved decoding performance. We use inverse reinforcement learning to infer an intrinsic reward function underlying a behavior in closed form, and associate it with neural activity in an approach we call NeuRL. We study the behavior of rats in a response-preparation task and evaluate the performance of NeuRL within simulated inhibition and per-trial behavior prediction. By assigning clear functional roles to defined neuronal populations our approach offers a new interpretation tool for complex neuronal data with testable predictions. In per-trial behavior prediction, our approach furthermore improves accuracy by up to 15% compared to traditional methods.

Joschka Boedecker
-
[ Visit Poster at Spot A0 in Virtual World ]

The accumulation of sequencing data has enabled researchers to predict the interactions between RNA sequences and RNA-binding proteins (RBPs) using novel machine learning techniques. However, existing models are often difficult to interpret and require information in addition to sequences. Bidirectional Encoder Representations from Transformers (BERT) is a language-based deep learning model that is highly interpretable; therefore, a model based on the BERT architecture can potentially overcome such limitations. Here, we propose BERT-RBP as a model to predict RNA-RBP interactions by adapting the BERT architecture pre-trained on a human reference genome. Our model outperformed state-of-the-art prediction models using the eCLIP-seq data of 154 RBPs. The detailed analysis further revealed that BERT-RBP could recognize the transcript region type from sequence information alone. Overall, the results provide insights into the mechanism of BERT in biological contexts and provide evidence of the applicability of the model to other RNA-related problems.

Keisuke Yamada
-
[ Visit Poster at Spot A2 in Virtual World ]

We introduce a new dataset called Synthetic COVID-19 Chest X-ray Dataset for training machine learning models. The dataset consists of 21,295 synthetic COVID-19 chest X-ray images to be used for computer-aided diagnosis. These images, generated via an unsupervised domain adaptation approach, are of high quality. We find that the synthetic images not only improve performance of various deep learning architectures when used as additional training data under heavy imbalance conditions (skew>90), but also detect the target class with high confidence. We also find that comparable performance can be achieved when training only on synthetic images. Further, salient features of the synthetic COVID-19 images indicate that the distribution is significantly different from Non-COVID-19 classes, enabling a proper decision boundary. We hope the availability of such high-fidelity chest X-ray images of COVID-19 will encourage advances in the development of diagnostic and/or management tools.

Hasib Zunair
-
[ Visit Poster at Spot A1 in Virtual World ]

Representing and reasoning about 3D structures of macromolecules is emerging as a distinct challenge in machine learning. Here, we extend recent work on geometric vector perceptrons and apply equivariant graph neural networks to a wide range of tasks from structural biology. Our method outperforms all reference architectures on three out of eight tasks in the ATOM3D benchmark, is tied for first on two others, and is competitive with equivariant networks using higher-order representations and spherical harmonic convolutions. In addition, we demonstrate that transfer learning can further improve performance on certain downstream tasks. Code is available at https://github.com/drorlab/gvp-pytorch.

Bowen Jing
-
[ Visit Poster at Spot A5 in Virtual World ]

In recent years the widespread availability of single-cell RNA sequencing (scRNA-seq) technology has led to the curation and distribution of diverse and comprehensive cell atlases and reference datasets, where cell identities are annotated by expert biologists. Many probabilistic and non-probabilistic solutions have been developed that utilise the existing annotated datasets as a “reference” to characterise cells in newly acquired datasets in supervised or unsupervised ways. What these methods are unable to do, however, is to characterise molecular phenotypes that are best studied by bulk RNA sequencing or microarray technologies that precede scRNA-seq. Examples include breast cancer subtypes that are determined using a defined set of fifty genes, the PAM50 gene signatures. In fact, several geneset and cell marker databases have emerged from published bulk and single-cell RNA sequencing studies, where phenotypes, transcriptional programs (e.g. signaling pathways) and cell types are only described by a list of genes, with no numerical attributes. We are interested in the problem of mapping phenotype and cell type similarities in single-cell RNA-seq from a collection of genesets or markers. We developed scDECAF, which uses a vector space model to label cells in a dataset. Gene lists are mapped to a common, shared latent space with single-cell gene expression profiles where the correlation between the expression profiles of the cells and the pattern defined by the genesets is maximised. The latent spaces are determined using Canonical Correlation Analysis (CCA). The association between the cells and genesets is determined by the proximity of their representations in the CCA space and the transcriptome embedding space, resulting in annotation of the cells and associations with phenotypes. We have additionally developed a framework for selection of biologically relevant genesets when large geneset collections are examined.
Our results suggest that scDECAF has comparable performance to reference-based cell type annotation methods, and is able to recover the known transcriptional programs in scRNA-seq datasets.

Soroor Hediyeh-zadeh
-
[ Visit Poster at Spot A2 in Virtual World ]

Combining different modalities of data from human tissues has been critical in advancing biomedical research and personalised medical care. In this study, we leverage a graph embedding model (i.e., VGAE) to perform link prediction on tissue-specific gene-gene interaction (GGI) networks. Through ablation experiments, we show that the combination of multiple biological modalities (i.e., multi-omics) leads to powerful embeddings and better link prediction performance. Our evaluation shows that the integration of gene methylation profiles and RNA-sequencing data significantly improves the link prediction performance. Overall, the combination of RNA-sequencing and gene methylation data leads to a link prediction accuracy of 71% on the GGI networks. By harnessing graph representation learning on multi-omics data, our work brings novel insights to the current literature on multi-omics integration in bioinformatics.

Amine Amor
-
[ Visit Poster at Spot A3 in Virtual World ]

We propose a resampling-based fast variable selection technique for selecting important Single Nucleotide Polymorphisms (SNPs) in multi-marker mixed effect models used in twin studies. Due to computational complexity, current practice includes testing the effect of one SNP at a time, commonly termed 'single SNP association analysis'. Joint modeling of genetic variants within a gene or pathway may have better power to detect the relevant genetic variants, hence we adapt our recently proposed framework of $e$-values to address this. In this paper, we propose a computationally efficient approach for single SNP detection in families while utilizing information on multiple SNPs simultaneously. We achieve this through improvements in two aspects. First, unlike other model selection techniques, our method only requires training a model with all possible predictors. Second, we utilize a fast and scalable bootstrap procedure that only requires Monte-Carlo sampling to obtain bootstrapped copies of the estimated vector of coefficients. Using this bootstrap sample, we obtain the $e$-value for each SNP, and select SNPs having $e$-values below a threshold. We illustrate through numerical studies that our method is more effective in detecting SNPs associated with a trait than either single-marker analysis using family data or model selection methods that ignore the familial dependency structure. We also use the $e$-values to perform gene-level analysis in nuclear families and detect several SNPs that have been implicated in alcohol consumption.

Subho Majumdar
-
[ Visit Poster at Spot A3 in Virtual World ]

Transcription factors bind DNA by recognizing specific sequence motifs, which are typically 6–12 bp long. A motif can occur many thousands of times in the human genome, but only a subset of those sites are actually bound. Here we present a machine-learning framework leveraging existing convolutional neural network architectures and model interpretation techniques to identify and interpret sequence context features most important for predicting whether a particular motif instance will be bound. We apply our framework to predict binding at motifs for 38 transcription factors in a lymphoblastoid cell line, score the importance of context sequences at base-pair resolution and characterize context features most predictive of binding. We find that the choice of training data heavily influences classification accuracy and the relative importance of features such as open chromatin. Overall, our framework enables novel insights into features predictive of transcription factor binding and is likely to inform future deep learning applications to interpret non-coding genetic variants.

AN ZHENG
-
[ Visit Poster at Spot A4 in Virtual World ]

Antibodies are proteins in the immune system which bind to antigens to mark them out for destruction. The binding sites in an antibody-antigen interaction are known as the paratope and epitope, respectively, and the prediction of these regions is key to vaccine and synthetic antibody development. Contrary to prior art, we argue that paratope and epitope predictors require asymmetric treatment, and propose distinct neural message passing architectures that are geared towards the specific aspects of paratope and epitope prediction, respectively. We obtain significant improvements on both tasks, setting the new state-of-the-art and recovering favourable qualitative predictions on antigens of relevance to COVID-19.

Alice Del Vecchio
-
[ Visit Poster at Spot A5 in Virtual World ]

Hybrid networks that build upon convolutional layers with attention mechanisms have demonstrated improved performance relative to pure convolutional networks across many regulatory genomic prediction tasks. Their inductive bias to learn long-range interactions provides an avenue to identify learned motif-motif interactions. For attention maps to be interpretable, the convolutional layer(s) must learn identifiable motifs. Here we systematically investigate the extent to which architectural choices in convolution-based hybrid networks influence learned motif representations in first-layer filters, as well as the reliability of their attribution maps generated by saliency analysis. We find that design principles previously identified in standard convolutional networks also generalize to hybrid networks. This work provides an avenue to narrow the spectrum of architectural choices when designing hybrid networks such that they are amenable to commonly used interpretability methods in genomics.

Rohan Ghotra
-
[ Visit Poster at Spot A4 in Virtual World ]

Protein-ligand complex structures have been utilised to design benchmark machine learning methods that perform important tasks related to drug design such as receptor binding site detection, small molecule docking and binding affinity prediction. However, these methods are usually trained only on ligand-bound (or holo) conformations of the protein and therefore are not guaranteed to perform well when the protein structure is in its native unbound conformation (or apo), which is usually the conformation available for a newly identified receptor. A primary reason for this is that the local structure of the binding site usually changes upon ligand binding. To facilitate solutions for this problem, we propose a dataset called APObind that aims to provide apo conformations of proteins present in the PDBbind dataset, a popular dataset used in drug design. Furthermore, we explore the performance of methods specific to three use cases on this dataset, demonstrating the importance of validating them on the APObind dataset.

Rishal Aggarwal
-
[ Visit Poster at Spot B2 in Virtual World ]

Recent advances in single-cell sequencing and CRISPR/Cas9-based genome engineering have enabled the simultaneous profiling of single-cell lineage and transcriptomic state. Together, these simultaneous assays allow researchers to build comprehensive phylogenetic models relating all cells and infer transcriptomic determinants of subclonal behavior. Yet, these assays are limited by the fact that researchers only have access to direct observations at the leaves of these phylogenies and thus cannot rigorously form hypotheses about unobserved, or ancestral, states that gave rise to the observed population. Here, we introduce TreeVAE: a framework that jointly models the observed transcriptomic states using a variational autoencoder (VAE) and the correlations between observations specified by the tree. Using simulations, we demonstrate that TreeVAE outperforms benchmarks in reconstructing ancestral states on several metrics. Moreover, using real data from lung cancer metastasis single-cell lineage tracing, we show that TreeVAE outperforms state-of-the-art models for scRNA-seq data in terms of goodness of fit. TreeVAE appears to be a promising model for taking into account correlations between samples within the framework of deep generative models for transcriptomics data, and produces rigorous reconstructions of unobserved cellular states.

Khalil Ouardini
-
[ Visit Poster at Spot B5 in Virtual World ]

While DNA sequence evolution is well studied, the evolution of gene expression is currently less understood. Though there exist theoretical models to study the evolution of continuous traits such as gene expression, these methods often cannot confidently distinguish alternative evolutionary scenarios, probably due to the modest numbers of species studied. We hypothesized that biological replicates could increase the predictive power of these models and accordingly developed EvoGeneX, a computationally efficient method based on the Ornstein-Uhlenbeck process to infer the mode of expression evolution. Furthermore, we used the Michaelis-Menten equation to compare the evolutionary dynamics across groups of genes in terms of asymptotic level and rate of divergence. We applied these new tools to perform the first-ever analysis of expression evolution across body parts, species, and sexes of the Drosophila genus. Our analysis revealed that neutral expression evolution can be confidently rejected in favor of stabilizing selection in nearly half of the genes. In addition, neutrally evolving genes evolve fastest in male gonads and faster in female gonads than other body parts. The gonads also have the largest sets of unique adaptive genes. We also detected interesting examples of adaptive genes including Glutamine Synthases and odor binding proteins.
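
For readers unfamiliar with the two models named in the abstract above, their standard textbook forms are sketched below. The symbols are generic choices made here for illustration, and the mapping of the Michaelis-Menten parameters onto "asymptotic level" and "rate of divergence" is an assumption; the abstract does not give the exact parameterization used by EvoGeneX.

```latex
% Ornstein-Uhlenbeck process for a continuous trait X_t (e.g. an expression level).
% \theta: optimal trait value, \alpha: strength of pull toward \theta,
% \sigma: intensity of random drift, W_t: standard Brownian motion.
dX_t = \alpha\,(\theta - X_t)\,dt + \sigma\,dW_t
% With \alpha = 0 the pull vanishes and the process reduces to Brownian motion,
% the usual model of neutral evolution.

% Michaelis-Menten saturation curve (assumed usage: divergence y as a function
% of evolutionary time t). y_{\max}: asymptotic level; K: time at which half of
% y_{\max} is reached, so a smaller K means faster divergence.
y(t) = \frac{y_{\max}\, t}{K + t}
```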

Soumitra Pal
-
[ Visit Poster at Spot B5 in Virtual World ]

Bayesian optimization, which uses a probabilistic surrogate for an expensive black-box function, provides a framework for protein design that requires a small amount of labeled data. In this paper, we compare three approaches to constructing surrogate models for protein design on synthetic benchmarks. We find that neural network ensembles trained directly on primary sequences outperform string kernel Gaussian processes and models built on pretrained embeddings. We show that this superior performance is likely due to improved robustness on out-of-distribution data. Transferring these insights into practice, we apply our approach to optimizing the Stokes shift of green fluorescent protein, discovering and synthesizing novel variants with improved functional properties.
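
To make the role of an ensemble surrogate concrete, the sketch below shows one selection step of Bayesian optimization: the spread of predictions across ensemble members serves as the uncertainty estimate. The upper-confidence-bound acquisition rule, the `beta` parameter, the function names, and the toy sequences and scores are all assumptions made for illustration; the abstract does not state which acquisition function the authors used.

```python
import statistics


def ucb_scores(ensemble_preds, beta=1.0):
    """Upper-confidence-bound score per candidate.

    ensemble_preds[m][i] is the prediction of ensemble member m for candidate i.
    The ensemble mean estimates the objective; the population standard deviation
    across members is used as a stand-in for predictive uncertainty.
    """
    n_candidates = len(ensemble_preds[0])
    scores = []
    for i in range(n_candidates):
        member_preds = [preds[i] for preds in ensemble_preds]
        mean = statistics.fmean(member_preds)
        std = statistics.pstdev(member_preds)
        scores.append(mean + beta * std)
    return scores


def select_next(candidates, ensemble_preds, beta=1.0):
    """Return the candidate with the highest UCB score, and that score."""
    scores = ucb_scores(ensemble_preds, beta)
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


# Hypothetical candidate sequences and made-up predictions from three models.
candidates = ["MKA", "MKV", "MRA"]
ensemble_preds = [
    [0.65, 0.40, 0.10],
    [0.65, 0.80, 0.10],
    [0.65, 0.60, 0.10],
]

# beta = 0 is pure exploitation; a larger beta rewards disagreement between models.
greedy_choice, _ = select_next(candidates, ensemble_preds, beta=0.0)
exploring_choice, _ = select_next(candidates, ensemble_preds, beta=1.0)
```

In this toy setting the greedy rule picks the candidate the models agree on, while the exploratory rule prefers the candidate with a slightly lower mean but higher disagreement. In a full loop, the selected sequence would be measured and the ensemble retrained before the next step.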

Nate Gruver
-
[ Visit Poster at Spot C0 in Virtual World ]

The active global SARS-CoV-2 pandemic has caused more than 167 million cases and 3.4 million deaths worldwide. As mentioned by Ye et al. (2021), the development of completely new drugs for such a novel disease is a challenging, time-intensive process, and despite researchers around the world working on this task, no effective treatments have been developed yet. This emphasizes the importance of drug repurposing, where treatments are found among existing drugs meant for different diseases. A common approach to this is based on knowledge graphs, which condense relationships between entities like drugs, diseases and genes. Graph neural networks (GNNs) can then be used for the task at hand by predicting links in such knowledge graphs. Expanding on state-of-the-art GNN research, Doshi & Chepuri (2020) originally presented the model DR-COVID. We further extend their work using additional output interpretation strategies. The best aggregation strategy derives a top-100 ranking of candidate drugs, 32 of which are currently in COVID-19-related clinical trials. Moreover, we present an alternative application for the model: the generation of additional candidates based on a given pre-selection of drug candidates using collaborative filtering. In addition, we improved the implementation of the model by Doshi & Chepuri (2020), significantly shortening the inference and pre-processing time by exploiting data parallelism.

Martin Taraz
-
[ Visit Poster at Spot B6 in Virtual World ]

Deep generative models have shown promising results in protein sequence modeling given their ability to learn distributions over complex high-dimensional spaces. However, tools for analyzing the rich representations they learn remain limited. We present a methodology for analyzing the latent representations of such models, and show how this analysis can be used to make predictions about ligand interactions and downstream signalling for a clinically important and functionally diverse family of membrane proteins, the G-protein coupled receptors.

Lood van Niekerk
-
[ Visit Poster at Spot C1 in Virtual World ]

Multimodal data is rapidly growing in single-cell biology and other fields of science and engineering. We introduce MultiMAP, an approach for dimensionality reduction and integration of multiple datasets. MultiMAP is a nonlinear manifold learning technique that recovers a single manifold on which all datasets reside and then projects the data into a single low-dimensional space so as to preserve the manifold structure. MultiMAP has several advantages over existing integration strategies for single-cell data, including that it can integrate any number of datasets, leverages features that are not present in all datasets (i.e. datasets can be of different dimensionalities), is not restricted to a linear mapping, allows the user to specify the influence of each dataset on the embedding, and is extremely scalable to large datasets. We apply MultiMAP to the integration of a variety of single-cell transcriptomics, chromatin accessibility, methylation, and spatial data, and show that it outperforms current approaches in preservation of high-dimensional structure, alignment of datasets, visual separation of clusters, transfer learning, and runtime.

Mika Jain
-
[ Visit Poster at Spot B6 in Virtual World ]

Recently emerging single-cell spatial RNA-seq techniques make it possible to profile transcriptomic data at single-cell resolution without losing spatial information. However, it is still a challenge to measure epigenomic profiles at the spatial level. In this project, we developed an autoencoder-based multi-omics integration method and applied it to spatial mouse fetal brain data to reconstruct the spatial epigenomic profiles. We compared our method with LIGER and showed that it performs better on a public dataset, as measured by latent mixing metrics. We further developed a CNN model to predict autism risk genes based on the spatial RNA-seq data. Our model is able to prioritize autism risk genes at the whole-genome level.

Guojie Zhong
-
[ Visit Poster at Spot A0 in Virtual World ]

Deep learning with Convolutional Neural Networks has shown great promise in image-based classification and enhancement but is often unsuitable for predictive modeling using features without spatial correlations. We present a feature representation approach termed REFINED (REpresentation of Features as Images with NEighborhood Dependencies) to arrange high-dimensional vectors in a compact image form conducive to CNN-based deep learning. We consider the similarities between features to generate a concise feature map in the form of a two-dimensional image by minimizing the pairwise distance values following a Bayesian Metric Multidimensional Scaling approach. We hypothesize that this approach enables embedded feature extraction and, integrated with CNN-based deep learning, can boost the predictive accuracy. We illustrate the superior predictive capabilities of the proposed framework as compared to state-of-the-art methodologies in drug sensitivity prediction scenarios using synthetic datasets, drug chemical descriptors as predictors from NCI60, and both transcriptomic information and drug descriptors as predictors from GDSC.

Omid Bazgir
-
[ Visit Poster at Spot A6 in Virtual World ]

Deep learning techniques have revolutionized the field of computational biology; however, it is often difficult to assign biological meaning to their results. To improve interpretability, methods have incorporated biological priors, like pathway definitions, directly into the learning task. However, due to the correlated and redundant structure of pathways, it is difficult to determine an appropriate computational representation. Here, we present the pathway module Variational Autoencoder (pmVAE). Our method utilizes pathway information by restricting the structure of our VAE to mirror gene-pathway memberships. Its architecture is composed of a set of subnetworks, referred to as pathway modules, that learn interpretable multi-dimensional latent representations by factorizing the latent space according to pathway gene sets. We directly address correlations between pathways by balancing a module-specific local loss and a global reconstruction loss. We demonstrate that these representations are directly interpretable and reveal underlying biology, such as perturbation effects and cell type interactions. We compare pmVAE against two other state-of-the-art methods on a single-cell RNA-seq case-control dataset, and show that our representations are both more discriminative and more specific in detecting the perturbed pathways.

Stefan Stark
-
[ Visit Poster at Spot B0 in Virtual World ]

The morphology and dynamic behavior of cells are highly predictive of their function and pathology. However, automated analysis of morphodynamic states remains challenging for human cells, where genetic labeling may not be feasible. We developed DynaMorph, a computational framework that combines quantitative live cell imaging with self-supervised learning, and applied it to microglia derived from developing human brain tissue. Our model generates interpretable and generalizable morphological representations for microglia, and we found that microglia adopt distinct morphodynamic states upon exposure to disease-relevant perturbations.

Zhenqin Wu
-
[ Visit Poster at Spot B4 in Virtual World ]

Macromolecules, such as naturally occurring and synthetic proteins and glycans, have diverse chemical structures, varying in monomer composition, connecting bonds and topology. In addition to the chemical diversity, macromolecules usually have opaque structure-activity relationships, making activity prediction and model attribution hard tasks. Recently, we proposed macromolecule graph representation learning, achieving state-of-the-art results in the immunogenicity classification of glycans. Here, we extend this framework to include attribution methods for graph neural networks. We evaluated the performance of 2 attribution methods over 3 model architectures, and an attention attribution for the attention-based model, and demonstrated it for an immunogenic glycan. Our work has two implications: it (1) provides attribution-backed chemical insights at the monomer and chemical substructure level, and (2) informs further in silico and wet-lab experiments.

Somesh M Mohapatra
-
[ Visit Poster at Spot C2 in Virtual World ]

Proteins perform various functions in living organisms. The task of automatic protein function annotation is defined as finding appropriate associations between proteins and functional labels such as Gene Ontology (GO) terms. In this paper, we present Prot-A-GAN: an automatic protein function annotation framework using GAN-like adversarial training for knowledge graph embedding. Following GAN terminology: 1) we train a discriminator using domain-adaptive negative sampling to discriminate positive and negative triples, and 2) we train a generator to guide a random walk over the knowledge graph that identifies paths between proteins and GO annotations. We evaluate the method by performing protein function annotation using GO terms on human disease proteins from UniProtKB/SwissProt. As a proof of concept, the conducted experiments show promising results and open up a new avenue for further exploration, specifically for protein function annotation.

Bishnu Sarker
-
[ Visit Poster at Spot B3 in Virtual World ]

Convolutional neural networks (CNNs) trained to predict regulatory functions from genomic sequence often learn partial or distributed representations of sequence motifs across many first-layer filters, making it challenging to interpret the biological relevance of these models’ learned features. Here we present Genomic Representations with Information Maximization (GRIM), an unsupervised learning method based on the Infomax principle that enables more comprehensive identification of whole sequence motifs learned by CNNs. By performing systematic experiments, we empirically demonstrate that GRIM is able to discover motifs in genomic sequences in situations where supervised learning struggles.

Nicholas Lee
-
[ Visit Poster at Spot B4 in Virtual World ]

TCR-epitope binding is the key mechanism for T cell regulation. Computational prediction of whether a given pair binds is of great interest for understanding the underlying science as well as for various clinical applications. Previously developed methods do not account for interrelationships between amino acids and suffer from poor out-of-sample performance. Our model uses the multi-head self-attention mechanism to capture biological contextual information and to improve generalization performance. We show that our model outperforms other models, and we also demonstrate that the use of attention matrices can improve out-of-sample performance on recent SARS-CoV-2 data.

Michael Cai
-
[ Visit Poster at Spot A1 in Virtual World ]

Graph generative models have been utilized for discovering novel molecules with desired chemical properties. Several of these methods use neural networks to map the latent space to chemical properties and explore the latent space efficiently. We propose using multi-target networks to jointly predict several molecular properties and learn better representations by exploiting auxiliary information. Our joint model outperforms existing methods in property prediction and molecular optimization tasks. We also propose a new benchmark to compare generative models for drug discovery.

Anirudh Jain
-
[ Visit Poster at Spot A0 in Virtual World ]

We propose a method called integrated diffusion for combining multimodal datasets, or data gathered via several different measurements on the same system, to create a joint data diffusion operator. As real world data suffers from both local and global noise, we introduce mechanisms to optimally calculate a diffusion operator that reflects the combined information from both modalities. We show the utility of this joint operator in data denoising, visualization and clustering, performing better than other methods when applied to multi-omic data generated from peripheral blood mononuclear cells. Our approach better visualizes the geometry of the joint data, captures known cross-modality associations and identifies known cellular populations. More generally, integrated diffusion is broadly applicable to multimodal datasets generated in many medical and biological systems.
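
As a rough illustration of what a joint data diffusion operator can look like, the sketch below builds a row-stochastic diffusion operator per modality from a Gaussian kernel and combines the two by multiplying their powers. The combination rule, the fixed bandwidth `sigma`, the powers `t1` and `t2`, and the toy data are all assumptions made here; the abstract does not specify how the operators are combined or how the noise-handling mechanisms are tuned.

```python
import math


def diffusion_operator(points, sigma=1.0):
    """Row-stochastic Markov matrix built from a Gaussian kernel on distances."""
    operator = []
    for p in points:
        row = [math.exp(-math.dist(p, q) ** 2 / (2 * sigma ** 2)) for q in points]
        total = sum(row)
        operator.append([w / total for w in row])
    return operator


def matmul(a, b):
    """Plain-Python matrix product of two square matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matpow(p, t):
    """p raised to the integer power t (t >= 1), i.e. t steps of diffusion."""
    result = p
    for _ in range(t - 1):
        result = matmul(result, p)
    return result


def joint_operator(p1, p2, t1=1, t2=1):
    """Assumed combination rule: diffuse t2 steps in modality 2, then t1 in modality 1."""
    return matmul(matpow(p1, t1), matpow(p2, t2))


# Two hypothetical modalities measured on the same four cells.
rna = [(0.0, 0.0), (0.1, 0.0), (2.0, 2.0), (2.1, 2.0)]
protein = [(1.0,), (1.1,), (5.0,), (5.2,)]

p_joint = joint_operator(diffusion_operator(rna), diffusion_operator(protein), t1=2, t2=2)
```

Because each per-modality operator is row-stochastic, their product is too, so `p_joint` can still be read as transition probabilities between cells. Multiplying a data matrix by such an operator is one common way to use it for denoising.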

Abhinav Godavarthi