Instruction Bleed: A Theory-Anchored Benchmark for Cross-Module Interference in Prompt-Composed Agents
Abstract
Transformer self-attention computes global pairwise interactions across its input, so concatenated prompt modules have no architectural isolation from one another. Three architectural inductive biases (proactive interference, coverage-bounded compositional generalization, and format sensitivity) jointly predict cross-module behavioral interference that per-module testing cannot reveal, yet no current agent benchmark measures it. We contribute a theory-anchored benchmark protocol whose three perturbation channels (volume, content, form) each isolate one of the predicted mechanisms, with paired effect sizes and bootstrap CIs as the calibrated readout. On a deployed job-evaluation agent (Claude Sonnet 4.6, 144 trials), only the content channel produces a detectable effect (Cohen's d = 0.63, bootstrap 95% CI [+0.03, +0.31], excluding zero); the volume and form CIs include zero, which localizes the interference to coverage-bounded composition. We formalize compositional behavioral leakage (CBL) and derive falsifiable predictions that frame a multi-system replication program.
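As a rough illustration of the readout described above, the following sketch computes a paired effect size and a percentile-bootstrap interval for one perturbation channel. It is a sketch under stated assumptions, not the protocol's implementation: the function name `paired_effect`, the choice of the paired-differences form of Cohen's d (mean difference divided by the standard deviation of the differences), the resample count, and the example scores are all hypothetical. The interval here is computed on the mean paired difference, which is also an assumption; the abstract does not state which statistic its interval covers.

```python
import random
import statistics


def paired_effect(baseline, perturbed, n_boot=2000, seed=0):
    """Return (d_z, mean_diff, (ci_lo, ci_hi)) for paired trial scores.

    Assumptions (not taken from the paper): d_z is the paired-differences
    variant of Cohen's d, and the 95% interval is a percentile bootstrap
    on the mean paired difference.
    """
    if len(baseline) != len(perturbed):
        raise ValueError("paired samples must have equal length")

    # One difference per matched pair: perturbed condition minus baseline.
    diffs = [p - b for b, p in zip(baseline, perturbed)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    d_z = mean_diff / sd if sd > 0 else float("nan")

    # Percentile bootstrap: resample the differences with replacement
    # and record the mean of each resample.
    rng = random.Random(seed)
    boot = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_boot)
    )
    ci_lo = boot[int(0.025 * n_boot)]
    ci_hi = boot[int(0.975 * n_boot) - 1]
    return d_z, mean_diff, (ci_lo, ci_hi)


if __name__ == "__main__":
    # Hypothetical scores for illustration only; not data from the study.
    base = [0.5, 0.6, 0.4, 0.7, 0.5, 0.6]
    pert = [0.6, 0.8, 0.5, 0.8, 0.7, 0.6]
    d_z, mean_diff, (lo, hi) = paired_effect(base, pert)
    print(f"d_z={d_z:.2f} mean_diff={mean_diff:+.3f} CI=[{lo:+.3f}, {hi:+.3f}]")
```

Under this reading, a channel counts as showing a detectable effect when its interval excludes zero. An interval on the raw mean difference is on a different scale from a standardized d, so the two need not bracket each other.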