From Generative to Episodic: Sample-Efficient Replicable RL
Max Hopkins ⋅ Sihan Liu ⋅ Christopher Ye ⋅ Yuichi Yoshida
Abstract
The epidemic failure of replicability across empirical science and machine learning has recently motivated the formal study of replicable learning algorithms [Impagliazzo et al. (2022)]. In contrast to batch settings (i.e., data drawn from a fixed i.i.d. source), where the cost of replicability is relatively well understood, there remain significant gaps in our knowledge for control settings like reinforcement learning, where an agent must interact directly with a shifting environment. Indeed, there is a large gap between the best upper bound of $\tilde{O}(S^7 A^7)$ [Eaton et al. (2023)] for RL with exploration, and $\tilde{O}(S^2 A^2)$ [Karbasi et al. (2023)] for the RL `batch' setting. This gap raises a key question in the broader theory of replicability: Is replicable exploration inherently more expensive than batch learning? Is sample-efficient replicable RL even possible? In this work, we (nearly) resolve this problem (for low-horizon tabular MDPs): exploration is not a significant barrier to replicable learning! Our main result is a replicable RL algorithm using $\tilde{O}(S^2A)$ samples, bridging the gap between the generative and episodic settings. We complement this with an $\tilde{\Omega}(S^2)$ lower bound in the episodic setting, showcasing the near-optimality of our algorithm with respect to the state space $S$.
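For intuition, the replicability notion of Impagliazzo et al. (2022) requires that two runs of an algorithm on independent samples, but with the same internal randomness, produce the identical output with high probability. The toy sketch below (not the paper's algorithm; `replicable_mean`, the grid width, and the seed handling are illustrative assumptions) shows a standard way to achieve this for mean estimation: round the empirical mean to a grid whose random offset is drawn from the shared internal randomness, so small sampling fluctuations rarely change the output.

```python
import random

def replicable_mean(samples, shared_seed, grid_width=0.1):
    """Toy replicable mean estimator via randomized rounding.

    Two runs on independent samples that share `shared_seed` output
    the same value whenever both empirical means fall on the same
    side of the randomly shifted grid boundaries, which happens with
    high probability when the means are close.
    """
    rng = random.Random(shared_seed)       # internal randomness, shared across runs
    offset = rng.uniform(0, grid_width)    # random grid offset
    emp_mean = sum(samples) / len(samples)
    # Snap the empirical mean to the nearest point of the shifted grid.
    k = round((emp_mean - offset) / grid_width)
    return offset + k * grid_width
```

Because the output is always a nearest grid point, it deviates from the empirical mean by at most half the grid width, trading a little accuracy for stability of the output across runs.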