Keywords: [ RL: Function Approximation ]
The Q-learning algorithm is a simple, fundamental, and practically very effective reinforcement learning algorithm. However, the basic protocol can exhibit unstable behavior even when implemented with simple linear function approximation. While tools like target networks and experience replay are often implemented to stabilize the learning process, the individual contribution of each of these mechanisms is not well understood theoretically. This work proposes an exploration variant of the basic Q-learning protocol with linear function approximation. Our modular analysis illustrates the role played by each algorithmic tool that we adopt: a second-order update rule, a set of target networks, and a mechanism akin to experience replay. Together, they enable state-of-the-art regret bounds on linear MDPs while preserving the most prominent feature of the algorithm, namely a space complexity independent of the number of steps elapsed. Furthermore, we show that the performance of the algorithm degrades very gracefully under a new, more permissive notion of approximation error. Finally, the algorithm partially inherits problem-dependent regret bounds that are a function of the number of "effective" feature dimensions.