Align Forward, Adapt Backward: Closing the Discretization Gap in Logic Gate Networks
Abstract
Differentiable discrete selection uses soft mixtures during training but hard selection at deployment, resulting in a training-inference gap. We decompose this gap into a selection gap (method-dependent, reducible) and a computation gap (input-dependent, irreducible). Our key finding: the selection gap is determined by forward-pass structure, not by backward-pass gradients. Methods that use hard selection during training achieve zero selection gap by construction, while mixture methods exhibit gaps even with identical gradient estimators. This occurs because mixtures reward hedging across options, while deployment requires committing to one. We propose CAGE (Confidence-Adaptive Gate Exploration), which operates entirely in the backward pass, adapting temperature based on selection confidence. We also identify a critical failure mode: Gumbel-ST suffers a 40--50 percentage-point accuracy collapse at low temperatures, which CAGE prevents. Experiments on logic gate networks validate the theory: hard selection achieves 98% accuracy with zero selection gap across all temperatures.
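The mechanism the abstract describes can be sketched concretely: a hard one-hot selection in the forward pass (so training and deployment compute the same function, giving zero selection gap by construction), with a soft distribution retained only for the backward pass, whose temperature is adapted from selection confidence. This is a minimal illustrative sketch, not the paper's implementation; the temperature rule, its bounds (`tau_min`, `tau_max`), and the confidence measure are all assumptions introduced here for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def adaptive_temperature(logits, tau_min=0.5, tau_max=4.0):
    """Hypothetical confidence-adaptive rule: confidence is the max
    softmax probability; low confidence maps to a higher temperature
    (more exploration), high confidence to a lower one."""
    confidence = softmax(logits).max()
    return tau_max - confidence * (tau_max - tau_min)

def hard_select(logits):
    """Forward pass: commit to one option via a hard one-hot selection,
    matching what deployment computes. The tempered soft distribution
    is returned only as the quantity a backward pass would use."""
    tau = adaptive_temperature(logits)
    soft = softmax(logits / tau)          # backward-pass surrogate
    hard = np.zeros_like(soft)
    hard[np.argmax(soft)] = 1.0           # forward-pass selection
    return hard, soft, tau

# Example: confident logits yield a low temperature and a committed choice.
hard, soft, tau = hard_select(np.array([2.0, 0.5, -1.0]))
```

Because the forward pass always emits the same one-hot selection used at deployment, the gap between training and inference outputs is zero regardless of how the backward-pass temperature is scheduled.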