Intentional Updates for Streaming Reinforcement Learning
Abstract
In gradient-based learning, a step size chosen in parameter units does not produce a predictable per-step change in the function output. This can cause instability in the streaming setting (i.e., batch size of one), where stochasticity is not averaged out and update magnitudes can momentarily become arbitrarily large or small. We instead propose \emph{intentional updates}: first specify the \emph{intended outcome} of an update, then solve for the step size that approximately achieves it. This strategy has precedent in online supervised linear regression via normalized least mean squares (NLMS), which selects the step size so that the function output changes by a specified fraction of the current error. We extend this principle to streaming reinforcement learning by defining appropriate intended outcomes: \emph{Intentional TD} aims for a fixed fractional reduction of the current TD error relative to the momentary bootstrap target, and \emph{Intentional Policy Gradient} aims for a bounded per-step change in the policy, limiting the local KL divergence. We develop practical implementations that integrate eligibility traces and diagonal scaling; our experiments show that these methods achieve state-of-the-art streaming performance, often comparable to batch and replay-buffer learning.
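To make the principle concrete, the following is a minimal sketch of the NLMS precedent and of an intentional-TD-style step-size computation for a linear value function. The function names, the linear parameterization, and the semi-gradient TD(0) update direction are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

def nlms_step(w, x, y, eta=0.5, eps=1e-8):
    """Normalized LMS: pick the step size so the prediction moves toward
    the target by a fixed fraction eta of the current error."""
    error = y - w @ x
    # For w' = w + alpha * error * x, the output change is
    # (w' - w) @ x = alpha * error * (x @ x); setting this to eta * error:
    alpha = eta / (x @ x + eps)
    return w + alpha * error * x

def intentional_td_step(w, phi, phi_next, r, gamma=0.99, kappa=0.1, eps=1e-8):
    """Hypothetical intentional-TD step for a linear value v(s) = w @ phi(s):
    solve for the step size that shrinks the current TD error by a fixed
    fraction kappa along the semi-gradient TD(0) direction. Illustrative
    only; the paper's method integrates eligibility traces and scaling."""
    delta = r + gamma * (w @ phi_next) - w @ phi
    # After w' = w + alpha * delta * phi, the new TD error equals
    # delta * (1 + alpha * (gamma * phi @ phi_next - phi @ phi)).
    # Requiring it to equal (1 - kappa) * delta and solving for alpha:
    denom = phi @ phi - gamma * (phi @ phi_next)
    alpha = kappa / max(denom, eps)  # guard against a non-positive denominator
    return w + alpha * delta * phi
```

In both cases the free parameter (eta or kappa) is stated in output units, a fraction of the current error, so the magnitude of each update is predictable regardless of the scale of the features or parameters.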