Explicit representation of germline and non-germline residues improves antibody language modeling
Abstract
Antibodies originate from germline immunoglobulin gene templates and are diversified by somatic hypermutation during affinity maturation. This process produces sequences in which conserved germline residues provide structural scaffolding, complemented by rare non-germline (NGL) substitutions that refine antigen binding and functional properties. Despite this biological organization, current antibody language models (ALMs) treat all residues equivalently, blurring the distinction between germline-encoded conservation and adaptive non-germline variation. When applied to germline-dominated repertoires, which reflect the shared germline starting points from which diversity arises, this uniform treatment induces a germline bias in which rare but functionally important NGL mutations are systematically down-weighted as statistical noise. Here, we introduce PRISM, a germline-aware antibody language model that explicitly represents germline and non-germline residues as distinct token types. This biologically informed representation achieves state-of-the-art pseudo-perplexity in hypervariable complementarity-determining regions (CDRs) and enables controllable sequence generation for precise paratope engineering while preserving germline-mediated framework stability. PRISM further exhibits improved zero-shot prediction of binding affinity relative to existing ALMs. Together, these results demonstrate that incorporating biologically grounded sequence representations substantially improves antibody language modeling.
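
To make the idea of distinct token types concrete, the following is a minimal illustrative sketch, not the PRISM implementation. It assumes an antibody sequence already aligned position by position to its inferred germline (equal length, no gaps) and assumes a hypothetical 40-token vocabulary in which uppercase letters denote germline residues and lowercase letters denote non-germline residues. The function names `tag_residues` and `encode`, the vocabulary layout, and the example sequences are all assumptions made for illustration.

```python
# Illustrative sketch only: the actual PRISM tokenization is not specified here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical vocabulary: ids 0-19 for germline residues (uppercase),
# ids 20-39 for non-germline residues (lowercase).
VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
VOCAB.update({aa.lower(): i + len(AMINO_ACIDS) for i, aa in enumerate(AMINO_ACIDS)})


def tag_residues(sequence: str, germline: str) -> list[str]:
    """Mark each residue as germline (uppercase) or non-germline (lowercase).

    Assumes both strings are pre-aligned and of equal length.
    """
    if len(sequence) != len(germline):
        raise ValueError("sequence and germline must be aligned to equal length")
    return [
        res.upper() if res.upper() == ref.upper() else res.lower()
        for res, ref in zip(sequence, germline)
    ]


def encode(tokens: list[str]) -> list[int]:
    """Map tagged residue tokens to integer ids in the hypothetical vocabulary."""
    return [VOCAB[tok] for tok in tokens]


# Made-up example: one substitution (Q -> E) at position 6 relative to germline.
tokens = tag_residues("EVQLVESGG", "EVQLVQSGG")
ids = encode(tokens)
```

Under this scheme the same amino acid maps to different ids depending on whether it matches the germline, so a model can assign separate statistics to conserved and mutated positions rather than pooling them.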