Spotlight in Workshop: Accessible and Efficient Foundation Models for Biological Discovery
ProtMamba: a homology-aware but alignment-free protein state space model
Damiano Sgarbossa · Cyril Malbranke · Anne-Florence Bitbol
Keywords: [ protein fitness prediction ] [ protein design ] [ foundation model ] [ state space model ]
Protein design has important implications for drug discovery, personalized medicine, and biotechnology. Models based on multiple sequence alignments (MSAs) efficiently capture the evolutionary information in homologous protein sequences, but MSA construction is imperfect. We present ProtMamba, a homology-aware but alignment-free protein language model based on the Mamba architecture. In contrast with attention-based models, ProtMamba efficiently handles very long contexts, comprising hundreds of protein sequences. Trained on a large dataset of concatenated homologous sequences, ProtMamba combines autoregressive and masked language modeling through a fill-in-the-middle objective. We demonstrate ProtMamba’s usefulness for generating novel sequences and for predicting fitness. ProtMamba reaches competitive performance with other protein language models despite its smaller size, which highlights the importance of long-context conditioning.
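
The abstract gives no implementation details, but the idea of a fill-in-the-middle objective over concatenated homologs can be illustrated with a minimal sketch. Everything below is a hypothetical illustration under assumed conventions: the token names <cls>, <eos>, and <mask_k>, the helper make_fim_example, and the span-selection scheme are not ProtMamba's actual vocabulary or code.

# Minimal sketch of building one fill-in-the-middle (FIM) training example
# from a set of homologous sequences. Token names and masking scheme are
# illustrative assumptions, not ProtMamba's actual implementation.
import random

def make_fim_example(homologs, n_spans=2, rng=random):
    """Concatenate homologs into one long context, then move a few spans of
    the final sequence to the end, each marked by a numbered mask token."""
    context = "".join(f"<cls>{seq}<eos>" for seq in homologs[:-1])
    target = homologs[-1]

    # Choose non-overlapping span boundaries inside the target sequence.
    cuts = sorted(rng.sample(range(1, len(target)), 2 * n_spans))
    spans = list(zip(cuts[0::2], cuts[1::2]))

    prompt, filled, prev = [], [], 0
    for k, (start, end) in enumerate(spans, 1):
        prompt.append(target[prev:start] + f"<mask_{k}>")  # hole in the prompt
        filled.append(f"<mask_{k}>" + target[start:end])   # answer at the end
        prev = end
    prompt.append(target[prev:])

    # Train autoregressively on: homolog context + holed target + answers.
    return context + "<cls>" + "".join(prompt) + "".join(filled) + "<eos>"

example = make_fim_example(["MKTAYIAKQR", "MKTAFIAKHR", "MKSAYIVKQR"], n_spans=1)
print(example)

Under an objective of this kind, a single autoregressive model can support both of the uses the abstract mentions: conditional generation, by sampling the answer spans given a context of homologs, and fitness prediction, by scoring a variant through its likelihood conditioned on its homologs.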