Discrete Diffusion with Physical Mass Constraints for \emph{De Novo} Peptide Sequencing
Zeyu An ⋅ Wanyu LIN
Abstract
{\em De novo} peptide sequencing is a pivotal technique that reconstructs amino acid sequences directly from tandem mass spectrometry (MS/MS) data, enabling the identification of novel proteins and variants absent from reference databases. Previous methods are typically based on autoregressive (AR) decoding or one-shot generation. AR-based methods conflict with the bidirectional and globally constrained nature of MS/MS evidence and inevitably accumulate errors, while one-shot generation does not explicitly enforce physical constraints and therefore fails to produce chemically valid and reliable peptides in a single pass. Accurate sequencing requires reasoning over the entire peptide simultaneously, with iterative self-correction under global constraints. To this end, we introduce $\textbf{PhysNovo}$, a novel paradigm that harnesses discrete diffusion to enable simultaneous global reasoning and iterative refinement. Specifically, PhysNovo reformulates sequencing as a $\textbf{phys}$ically mass-constrained reasoning process by embedding a knapsack-based feasibility kernel that enforces exact precursor mass consistency. By conditioning the diffusion process on global spectral context, PhysNovo supports abductive reasoning in which bidirectional evidence is exploited to iteratively resolve local inconsistencies and ensure physically valid predictions. PhysNovo achieves state-of-the-art performance, exceeding baselines by over 2\% in precision, with larger gains on out-of-distribution data.
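To make the idea of a knapsack-based mass feasibility check concrete, the following is a minimal sketch and not the PhysNovo implementation, whose details are not given in this abstract. It assumes a five-residue alphabet, a hypothetical discretization of 0.01 Da, and hypothetical helper names (`to_units`, `reachable_table`, `allowed_residues`). The dynamic program records which integer masses can be composed from exactly k residues, and the mask keeps only residues that leave the remaining mass exactly fillable.

```python
from typing import Dict, List

# Monoisotopic residue masses in Da for a small illustrative subset of amino acids.
RESIDUE_MASS_DA: Dict[str, float] = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "P": 97.05276,
    "V": 99.06841,
}

# Assumption: masses are discretized to integer units of 0.01 Da.
RESOLUTION_DA = 0.01


def to_units(mass_da: float) -> int:
    """Convert a mass in Da to integer units at the assumed resolution."""
    return round(mass_da / RESOLUTION_DA)


RESIDUE_UNITS: Dict[str, int] = {aa: to_units(m) for aa, m in RESIDUE_MASS_DA.items()}


def reachable_table(max_units: int, max_len: int) -> List[List[bool]]:
    """Knapsack-style DP: reach[k][m] is True if exactly k residues sum to m units."""
    reach = [[False] * (max_units + 1) for _ in range(max_len + 1)]
    reach[0][0] = True  # zero residues compose zero mass
    for k in range(1, max_len + 1):
        for m in range(max_units + 1):
            reach[k][m] = any(
                m >= u and reach[k - 1][m - u] for u in RESIDUE_UNITS.values()
            )
    return reach


def allowed_residues(
    reach: List[List[bool]], remaining_units: int, remaining_positions: int
) -> List[str]:
    """Residues whose choice leaves a mass that the other positions can fill exactly."""
    if remaining_positions < 1:
        return []
    return [
        aa
        for aa, u in RESIDUE_UNITS.items()
        if remaining_units >= u and reach[remaining_positions - 1][remaining_units - u]
    ]


if __name__ == "__main__":
    # Residue mass budget for a three-residue peptide built from G, A and S.
    target = sum(RESIDUE_UNITS[aa] for aa in "GAS")
    table = reachable_table(target, max_len=3)
    print(allowed_residues(table, target, remaining_positions=3))
```

In a sampler, a mask of this kind could be used to zero out the probability of residues that make the precursor mass unreachable. Two caveats apply to the sketch: integer rounding errors accumulate with peptide length, so a practical version would need a finer resolution or a tolerance window, and the precursor mass would first have to be converted to a residue mass budget by subtracting the water and charge-carrier contributions.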