The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning,
function approximation, and bootstrapping simultaneously.
In this paper,
we investigate the target network as a tool for breaking the deadly triad,
providing theoretical support for the conventional wisdom that a target network stabilizes training.
We first propose and analyze a novel target network update rule which augments the commonly used Polyak-averaging style update with two projections.
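As an illustrative sketch (the notation here is ours, not the paper's, and the concrete projection sets are specified in the paper), such an update can be written with online weights $\theta_t$, target weights $\theta^-_t$, averaging rate $\tau \in (0, 1]$, and projections $\Pi_{B_1}, \Pi_{B_2}$ onto bounded convex sets as
$$\theta^-_{t+1} = \Pi_{B_2}\!\left((1-\tau)\,\theta^-_t + \tau\,\Pi_{B_1}(\theta_t)\right),$$
which reduces to the standard Polyak-averaging update when both projections are identities.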
We then apply the target network and ridge regularization to several divergent algorithms and show their convergence to regularized TD fixed points.
Those algorithms are off-policy with linear function approximation and bootstrapping,
spanning both policy evaluation and control, as well as
both discounted and average-reward settings.
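For concreteness, in the discounted linear policy-evaluation case a ridge-regularized TD fixed point can be characterized (standard notation assumed here: features $x, x'$ of successive states, reward $r$, discount $\gamma$, ridge weight $\eta > 0$, expectations under the behavior policy's stationary distribution) as the solution $\theta$ of
$$(A + \eta I)\,\theta = b, \qquad A \doteq \mathbb{E}\big[x\,(x - \gamma x')^\top\big], \quad b \doteq \mathbb{E}[r\,x],$$
which differs from the ordinary TD fixed point $A\theta = b$ only through the ridge term $\eta I$.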
Notably, we provide the first convergent linear $Q$-learning algorithms under nonrestrictive and changing behavior policies without bi-level optimization.
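Below is a minimal, hedged sketch of what linear $Q$-learning with a projected, Polyak-averaged target network and ridge regularization can look like; the function names and hyperparameters (phi, alpha, tau, eta, radius) are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def project_ball(w, radius):
    """Project w onto the L2 ball of the given radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def q_learning_step(w, w_target, phi_sa, reward, phi_next_greedy, *,
                    gamma=0.99, alpha=0.1, tau=0.01, eta=0.01, radius=10.0):
    """One ridge-regularized semi-gradient step; bootstraps from the target weights."""
    td_error = reward + gamma * phi_next_greedy @ w_target - phi_sa @ w
    # Ridge regularization shrinks the online weights toward zero.
    w = w + alpha * (td_error * phi_sa - eta * w)
    # Polyak-averaging style target update with two projections onto a bounded set.
    w_target = project_ball((1 - tau) * w_target + tau * project_ball(w, radius), radius)
    return w, w_target

# Tiny usage example with random features (illustrative only).
rng = np.random.default_rng(0)
d = 8
w = np.zeros(d)
w_target = np.zeros(d)
for _ in range(100):
    phi_sa = rng.normal(size=d)           # features of the current state-action pair
    phi_next_greedy = rng.normal(size=d)  # features of the greedy action at the next state
    w, w_target = q_learning_step(w, w_target, phi_sa, 1.0, phi_next_greedy)
```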