Nested birth-death processes are competitive with parameter-heavy neural networks as time-dependent models of protein evolution
Abstract
Despite the success of large transformers at modeling variable-length protein sequences, most statistical phylogenetics analyses use relatively simple continuous-time finite-state Markov models of point substitution to describe molecular evolution, keeping sequence length fixed and ignoring insertions and deletions (indels) entirely. The simplistic assumptions of these models limit the realism of such analyses. We extend the TKF92 model - the canonical hierarchical model combining an outer birth-death process for indels with an inner finite-state Markov chain for substitutions - by introducing additional nesting and latent states. We compare these TKF92 extensions (which are exactly solvable, and in which evolutionary time naturally appears as a matrix exponential coefficient) to two classes of neural seq2seq models that take evolutionary time as an input feature: the first constrained to enforce a TKF92-like structure, and the second lacking any such constraint. We evaluate the per-character perplexities of all models on splits of the Pfam database of aligned protein domains. A nested TKF-based model with only 32,000 parameters is highly competitive with neural networks containing tens of millions of parameters, outperforming all but two of the neural architectures tested. Our results indicate that approaches grounded in molecular evolutionary theory may be more parameter-efficient and provide a better fit to real alignments than unconstrained alternatives, supporting the incorporation of classical model structure within future neural phylogenetic approaches.
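To illustrate the remark that evolutionary time appears as a matrix exponential coefficient in these models, here is a minimal sketch (not from the paper) of how a continuous-time finite-state substitution model yields transition probabilities P(t) = exp(Qt) for a rate matrix Q; the 2-state Q below is a hypothetical toy example.

```python
import numpy as np

# Hypothetical 2-state rate matrix for illustration: off-diagonal entries
# are nonnegative rates, and each row sums to zero.
Q = np.array([[-1.0,  1.0],
              [ 2.0, -2.0]])

def transition_probs(Q, t):
    """Transition probability matrix P(t) = exp(Q * t).

    Computed via eigendecomposition, which is valid when Q is
    diagonalizable; evolutionary time t enters only as a scalar
    coefficient on Q inside the exponential.
    """
    w, V = np.linalg.eig(Q * t)
    return (V @ np.diag(np.exp(w)) @ np.linalg.inv(V)).real

P = transition_probs(Q, 0.5)
# Each row of P is a probability distribution over descendant states,
# so rows sum to 1, and P(0) is the identity matrix.
```

Because each row of Q sums to zero, each row of P(t) sums to one for any t, and t can be fit directly as a branch-length parameter.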