Reusing Trajectories in Policy Gradients Enables Fast Convergence
Alessandro Montenegro ⋅ Federico Mansutti ⋅ Marco Mussi ⋅ Matteo Papini ⋅ Alberto Maria Metelli
Abstract
*Policy gradient* (PG) methods are a class of effective *reinforcement learning* algorithms, particularly when dealing with continuous control problems. They rely on fresh *on-policy* data, making them sample-inefficient and requiring $\mathcal{O}(\epsilon^{-2})$ trajectories to reach an $\epsilon$-approximate stationary point. A common strategy to improve efficiency is to *reuse* information from past iterations, such as previous *gradients* or *trajectories*, leading to *off-policy* PG methods. While gradient reuse has received substantial attention, leading to improved rates up to $\mathcal{O}(\epsilon^{-3/2})$, the reuse of past trajectories, although intuitive, remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that reusing past off-policy trajectories can significantly accelerate PG convergence. We propose RT-PG (Reusing Trajectories - Policy Gradient), a novel algorithm that leverages a *power mean*-corrected multiple importance weighting estimator to effectively combine on-policy and off-policy data from the most recent $\omega$ iterations. Through a new analysis, we prove that RT-PG achieves a sample complexity of $\widetilde{\mathcal{O}}(\epsilon^{-2}\omega^{-1})$. When reusing *all* available past trajectories, this leads to a rate of $\widetilde{\mathcal{O}}(\epsilon^{-1})$, the best rate known in the literature for PG methods. We further validate our approach empirically, demonstrating its effectiveness against baselines with state-of-the-art rates.
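To make the idea concrete, the following is a minimal sketch of a *multiple importance weighting* policy-gradient estimate, assuming a standard balance-heuristic scheme in which $N$ trajectories are stored from each of the last $\omega$ behavioral policies $\theta_{t-\omega+1}, \ldots, \theta_t$; the symbols $\mathcal{D}_j$, $p_{\theta}(\tau)$, and $R(\tau)$ are introduced here for illustration and the exact estimator used by RT-PG (in particular its power-mean correction) is defined in the paper, not in this abstract:

$$
\widehat{\nabla}_\theta J(\theta_t) \;=\; \sum_{j = t-\omega+1}^{t} \, \sum_{\tau \in \mathcal{D}_j} \frac{p_{\theta_t}(\tau)}{\sum_{k = t-\omega+1}^{t} N \, p_{\theta_k}(\tau)} \, \nabla_\theta \log p_{\theta_t}(\tau) \, R(\tau),
$$

where $\mathcal{D}_j$ is the batch of trajectories collected under policy $\theta_j$, $p_{\theta}(\tau)$ is the trajectory density under policy $\theta$, and $R(\tau)$ is the (discounted) return. The balance-heuristic weight reuses all $\omega N$ stored trajectories while keeping the estimate unbiased for $\nabla_\theta J(\theta_t)$; RT-PG additionally corrects these weights via a power mean to control their variance.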