

Poster

Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation

Fengdi Che · Chenjun Xiao · Jincheng Mei · Bo Dai · Ramki Gummadi · Oscar Ramirez · Christopher Harris · Rupam Mahmood · Dale Schuurmans


Abstract:

We prove that the combination of a target network and over-parameterized linear function approximation ensures the convergence of bootstrapped value estimation, even with off-policy data. That is, the deadly triad can be eliminated by two simple augmentations. To establish the main result, we first consider temporal difference estimation for prediction. Here, we show that over-parameterized target TD (OTTD) is convergent on Baird's counterexample and, more generally, that OTTD is convergent on off-policy trajectories collected from episodic Markov decision processes. We also provide high probability bounds on value estimation error. Next, we consider the control scenario and investigate the behaviour of Q-learning. Here, we show that over-parameterized target Q-learning (OTQ) is also convergent on off-policy trajectories collected from episodic Markov decision processes. These results can be extended to continuing tasks with minor modifications.
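The two ingredients can be illustrated with a short sketch: bootstrap targets are computed from a frozen, periodically synced target parameter vector, and the linear feature dimension exceeds the number of states. Everything below (the randomly generated MDP, the over-complete random features, the uniform off-policy state weighting, and the hyperparameters) is an assumption made for illustration, not the paper's construction or the authors' implementation.

```python
import numpy as np

# Minimal, illustrative sketch of target TD with over-parameterized linear
# features (in the spirit of OTTD). The MDP, features, off-policy weighting,
# and hyperparameters are all assumptions made for this demo.

rng = np.random.default_rng(0)

n_states, feat_dim, gamma = 7, 32, 0.9               # feat_dim > n_states: over-parameterized
Phi = rng.normal(size=(n_states, feat_dim)) / np.sqrt(feat_dim)  # over-complete features

# Arbitrary target-policy dynamics and rewards (assumed), plus the true values.
P = rng.dirichlet(np.ones(n_states), size=n_states)
r = rng.normal(size=n_states)
v_true = np.linalg.solve(np.eye(n_states) - gamma * P, r)

mu = np.full(n_states, 1.0 / n_states)   # off-policy state weighting (uniform here)
D = np.diag(mu)

theta = np.zeros(feat_dim)               # online parameters
theta_target = theta.copy()              # frozen target-network parameters
alpha = 0.5

for sync in range(100):                  # outer loop: periodic target-network syncs
    # Bootstrap targets come from the *frozen* target parameters.
    y = r + gamma * P @ (Phi @ theta_target)
    for _ in range(2000):                # inner loop: expected semi-gradient TD updates
        theta += alpha * Phi.T @ D @ (y - Phi @ theta)
    theta_target = theta.copy()          # hard sync of the target network

print("max value-estimation error:", np.abs(Phi @ theta - v_true).max())
```

In this sketch, because the over-complete features can represent any value function exactly, each inner loop fits the frozen bootstrap targets, so the outer iteration contracts toward the true values despite the off-policy state weighting; the paper's analysis of sample-based OTTD and OTQ is, of course, more general than this deterministic toy.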
