PRICE-RL: Selection–Transmission Decomposed Reinforcement Learning for Sequential Biological Design
Abstract
We introduce PRICE-RL, a reinforcement-learning algorithm for biological sequence design whose update decomposes exactly, sample-wise and on every batch, into the two components that the Price equation assigns to evolutionary change: a selection term that reweights variants already supported by the policy, and a transmission term that shifts the policy’s support into new regions of sequence space. A PI controller drives the empirical Price ratio, ρₜ = |gS| / (|gS| + |gT|), toward a target derived from the reward landscape’s autocorrelation length, and the same ratio serves as a per-round reward-hacking diagnostic. The cosine identity ĝS + ĝT = ĝpool holds to floating-point precision across 1,600 NK round-seeds without exception. On the 149,361-variant GB1 four-mutation landscape, PRICE-RL ties AdaLead at 500 queries and outperforms it by 12.4% at 8,000 queries, with non-overlapping 95% CIs and Mann–Whitney p < 10⁻³. PRICE-RL also discovers 2.5× more unique top-1% variants at matched fitness. On a deliberately weak proxy reward, ρₜ fires before proxy variance crosses its alarm threshold on all 16 seeds, by a mean of 18.6 rounds, and remains robust to surrogate noise up to σₙ = 0.40. On the deceptive Trap-K landscape, PRICE-RL reaches the global optimum on every seed at every scale up to N = 120, whereas AdaLead misses it up to 20% of the time. The framework extends to multi-objective reward and to token-level autoregressive RLHF-style policies, with the cosine identity preserved at 1.0000 in each case.
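For readers unfamiliar with the Price equation, the selection–transmission split can be illustrated with a minimal sketch on a toy scalar-trait population. This uses the classical Price equation, not the paper's gradient-level construction. The function names, the toy data, the controller gains, and the idea that the controller outputs a "transmission weight" in [0, 1] are all assumptions made for illustration; the abstract does not specify what quantity the PI controller actuates.

```python
import numpy as np


def price_decomposition(z_parent, z_offspring, w):
    """Classical Price equation on a toy population (illustrative only).

    z_parent[i]    trait value of parent i
    z_offspring[i] mean trait value of parent i's offspring
    w[i]           fitness (offspring count) of parent i
    Returns (selection, transmission, total change in mean trait).
    """
    z_parent = np.asarray(z_parent, dtype=float)
    z_offspring = np.asarray(z_offspring, dtype=float)
    w = np.asarray(w, dtype=float)
    w_bar = w.mean()
    # Selection term: Cov(w, z) / w_bar (population covariance).
    selection = np.mean((w - w_bar) * (z_parent - z_parent.mean())) / w_bar
    # Transmission term: E[w * (z' - z)] / w_bar.
    transmission = np.mean(w * (z_offspring - z_parent)) / w_bar
    # Total change: fitness-weighted offspring mean minus parent mean.
    total = np.sum(w * z_offspring) / np.sum(w) - z_parent.mean()
    return selection, transmission, total


def price_ratio(selection, transmission, eps=1e-12):
    """rho = |S| / (|S| + |T|); eps guards against a zero denominator."""
    s, t = abs(selection), abs(transmission)
    return s / (s + t + eps)


class PIController:
    """Generic discrete PI controller (hypothetical gains and output).

    The output is an assumed 'transmission weight' clipped to [lo, hi]:
    it rises when rho exceeds the target (selection dominating).
    """

    def __init__(self, target, kp, ki, u0=0.5, lo=0.0, hi=1.0):
        self.target, self.kp, self.ki = target, kp, ki
        self.u0, self.lo, self.hi = u0, lo, hi
        self.integral = 0.0

    def step(self, rho):
        err = rho - self.target
        self.integral += err
        u = self.u0 + self.kp * err + self.ki * self.integral
        return float(np.clip(u, self.lo, self.hi))


# Toy usage: four parents, fitter parents have larger trait values.
sel, trans, total = price_decomposition(
    z_parent=[1.0, 2.0, 3.0, 4.0],
    z_offspring=[1.0, 2.5, 3.0, 3.5],
    w=[0.0, 1.0, 1.0, 2.0],
)
rho = price_ratio(sel, trans)
controller = PIController(target=0.5, kp=1.0, ki=0.1)
weight = controller.step(rho)
```

In this toy case the selection term is 0.75, the transmission term is -0.125, and they sum to the total change of 0.625, which is the scalar analogue of the additive identity stated above. The ratio is about 0.857, so a controller targeting 0.5 would push toward more transmission under the assumed sign convention.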