Geometric deep learning has broad applications in biology, a domain where relational structure in data is often intrinsic to modelling the underlying phenomena. Currently, efforts in both geometric deep learning and, more broadly, deep learning applied to biomolecular tasks have been hampered by a scarcity of appropriate datasets accessible to domain specialists and machine learning researchers alike. To address this, we introduce Graphein as a turn-key tool for transforming raw data from widely used bioinformatics databases into machine learning-ready datasets in a high-throughput and flexible manner. Graphein is a Python library for constructing graph and surface-mesh representations of biomolecular structures, such as proteins, nucleic acids and small molecules, and of biological interaction networks, for computational analysis and machine learning. Graphein provides utilities for retrieving structural data from the Protein Data Bank and the AlphaFold Structure Database, chemical data from ZINC and ChEMBL, and biomolecular interaction networks from STRINGdb, BioGrid, TRRUST and RegNetwork. The library interfaces with popular geometric deep learning libraries (DGL, Jraph, PyTorch Geometric and PyTorch3D) but remains framework-agnostic, as it is built on top of the PyData ecosystem to enable interoperability with scientific computing tools and libraries. Graphein is designed to be highly flexible, allowing the user to specify each step of data preparation, and scalable, to facilitate working with large protein complexes and interaction graphs; it also contains pre-processing tools for preparing experimental files. Graphein facilitates network-based, graph-theoretic and topological analyses of structural and interaction datasets in a high-throughput manner. We envision that Graphein will facilitate developments in computational biology, graph representation learning and drug discovery.
Availability and implementation: Graphein is written in Python. Source code, example usage and tutorials, datasets, and documentation are made freely available under the MIT License at the following URL: https://anonymous.4open.science/r/graphein-3472/README.md
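To make the idea of a graph representation of a biomolecular structure concrete, here is a minimal sketch in plain Python. It is not Graphein's API. It assumes a residue-level graph in which two residues are joined when their alpha-carbon coordinates lie within a distance cutoff. The function name `build_distance_graph`, the 8 Å threshold and the toy coordinates are all hypothetical choices made for illustration.

```python
import math
from itertools import combinations


def build_distance_graph(coords, threshold=8.0):
    """Return an adjacency dict linking nodes whose coordinates lie within `threshold`.

    `coords` maps a node label to an (x, y, z) tuple. This is an illustrative
    helper, not a function from Graphein.
    """
    graph = {name: set() for name in coords}
    for (a, pos_a), (b, pos_b) in combinations(coords.items(), 2):
        # Undirected edge when the Euclidean distance is within the cutoff.
        if math.dist(pos_a, pos_b) <= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical alpha-carbon coordinates (in angstroms) for four residues.
toy_residues = {
    "A:MET:1": (0.0, 0.0, 0.0),
    "A:LYS:2": (3.8, 0.0, 0.0),
    "A:VAL:3": (7.6, 0.0, 0.0),
    "A:GLY:4": (30.0, 0.0, 0.0),
}

contact_graph = build_distance_graph(toy_residues, threshold=8.0)

# Node degree is one of the simplest graph-theoretic summaries of the structure.
degrees = {node: len(neighbours) for node, neighbours in contact_graph.items()}
print(degrees)
```

An adjacency dictionary of this kind can be handed to general-purpose graph tooling for further analysis. The library described above additionally handles data retrieval, featurisation and conversion to deep learning frameworks, none of which this sketch attempts.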
Author Information
Arian Jamasb (University of Cambridge)
Ramon Viñas Torné (University of Cambridge)
Eric Ma (PyMC Labs)
Eric is a Principal Data Scientist at Moderna supporting research data science. Prior to Moderna, he was at the Novartis Institutes for Biomedical Research conducting biomedical data science research with a focus on using Bayesian statistical methods in the service of making medicines for patients. Prior to Novartis, he was an Insight Health Data Fellow in the summer of 2017 and defended his doctoral thesis in the Department of Biological Engineering at MIT in the spring of 2017. Eric is also an open-source software developer and has led the development of pyjanitor, a clean API for cleaning data in Python, and nxviz, a visualization package for NetworkX. In addition, he gives back to the open-source community through code contributions to multiple projects. His personal life motto is found in the Gospel of Luke 12:48.
Yuanqi Du (Cornell University)
Charles Harris (University of Cambridge)
Kexin Huang (Harvard University)
Dominic Hall (University of Cambridge)
Pietro Lió (University of Cambridge)
Tom Blundell
More from the Same Authors
-
2020 : (#95 / Sess. 2) Graphein - a Python Library for Geometric Deep Learning and Network Analysis on Protein Structures »
Arian Jamasb -
2021 : α-VAEs : Optimising variational inference by learning data-dependent divergence skew »
Jacob Deasy · Tom McIver · Nikola Simidjievski · Pietro Lió -
2021 : High Frequency EEG Artifact Detection with Uncertainty via Early Exit Paradigm »
Lorena Qendro · Alex Campbell · Pietro Lió · Cecilia Mascolo -
2021 : GCExplainer: Human-in-the-Loop Concept-based Explanations for Graph Neural Networks »
Lucie Charlotte Magister · Dmitry Kazhdan · Vikash Singh · Pietro Lió -
2022 : Path Integral Stochastic Optimal Control for Sampling Transition Paths »
Lars Holdijk · Yuanqi Du · Priyank Jaini · Ferry Hooft · Bernd Ensing · Max Welling -
2022 : Featurizations Matter: A Multiview Contrastive Learning Approach to Molecular Pretraining »
Yanqiao Zhu · Dingshuo Chen · Yuanqi Du · Yingze Wang · Qiang Liu · Shu Wu -
2022 : Pre-training Graph Neural Networks for Molecular Representations: Retrospect and Prospect »
Jun Xia · Yanqiao Zhu · Yuanqi Du · Stan Z. Li -
2022 : GAUCHE: A Library for Gaussian Processes in Chemistry »
Ryan-Rhys Griffiths · Leo Klarner · Henry Moss · Aditya Ravuri · Sang Truong · Yuanqi Du · Arian Jamasb · Julius Schwartz · Austin Tripp · Bojana Ranković · Philippe Schwaller · Gregory Kell · Anthony Bourached · Alexander Chan · Jacob Moss · Chengzhi Guo · Alpha Lee · Jian Tang -
2022 : Protein Representation Learning by Geometric Structure Pretraining »
Zuobai Zhang · Minghao Xu · Arian Jamasb · Vijil Chenthamarakshan · Aurelie Lozano · Payel Das · Jian Tang -
2023 Workshop: Structured Probabilistic Inference and Generative Modeling »
Dinghuai Zhang · Yuanqi Du · Chenlin Meng · Shawn Tan · Yingzhen Li · Max Welling · Yoshua Bengio -
2022 Workshop: AI for Science »
Yuanqi Du · Tianfan Fu · Wenhao Gao · Kexin Huang · Shengchao Liu · Ziming Liu · Hanchen Wang · Connor Coley · Le Song · Linfeng Zhang · Marinka Zitnik -
2022 : Sheaf Neural Networks with Connection Laplacians »
Federico Barbero · Cristian Bodnar · Haitz Sáez de Ocáriz Borde · Michael Bronstein · Petar Veličković · Pietro Lió -
2022 : Approximate Equivariance SO(3) Needlet Convolution »
Kai Yi · · Bingxin Zhou · Pietro Lió · Yanan Fan · Jan Hamann -
2022 Poster: Attentional Meta-learners for Few-shot Polythetic Classification »
Ben Day · Ramon Viñas Torné · Nikola Simidjievski · Pietro Lió -
2022 Spotlight: Attentional Meta-learners for Few-shot Polythetic Classification »
Ben Day · Ramon Viñas Torné · Nikola Simidjievski · Pietro Lió -
2022 Poster: 3D Infomax improves GNNs for Molecular Property Prediction »
Hannes Stärk · Dominique Beaini · Gabriele Corso · Prudencio Tossou · Christian Dallago · Stephan Günnemann · Pietro Lió -
2022 Spotlight: 3D Infomax improves GNNs for Molecular Property Prediction »
Hannes Stärk · Dominique Beaini · Gabriele Corso · Prudencio Tossou · Christian Dallago · Stephan Günnemann · Pietro Lió -
2021 Poster: How Framelets Enhance Graph Neural Networks »
Xuebin Zheng · Bingxin Zhou · Junbin Gao · Yuguang Wang · Pietro Lió · Ming Li · Guido Montufar -
2021 Poster: Directional Graph Networks »
Dominique Beaini · Saro Passaro · Vincent Létourneau · Will Hamilton · Gabriele Corso · Pietro Lió -
2021 Poster: Weisfeiler and Lehman Go Topological: Message Passing Simplicial Networks »
Cristian Bodnar · Fabrizio Frasca · Yuguang Wang · Nina Otter · Guido Montufar · Pietro Lió · Michael Bronstein -
2021 Oral: Directional Graph Networks »
Dominique Beaini · Saro Passaro · Vincent Létourneau · Will Hamilton · Gabriele Corso · Pietro Lió -
2021 Spotlight: Weisfeiler and Lehman Go Topological: Message Passing Simplicial Networks »
Cristian Bodnar · Fabrizio Frasca · Yuguang Wang · Nina Otter · Guido Montufar · Pietro Lió · Michael Bronstein -
2021 Spotlight: How Framelets Enhance Graph Neural Networks »
Xuebin Zheng · Bingxin Zhou · Junbin Gao · Yuguang Wang · Pietro Lió · Ming Li · Guido Montufar