ICML Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

Poster in Workshop: Workshop on Reinforcement Learning Theory

Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

Tengyang Xie · Nan Jiang · Huan Wang · Caiming Xiong · Yu Bai

Abstract: This paper initiates the theoretical study of \emph{policy finetuning} towards bridging the gap between (sample-efficient) online RL and offline RL. In this problem, the online RL learner has additional access to a "reference policy" $\mu$ close to the optimal policy $\pi_\star$ in a certain sense. We first design a sharp \emph{offline reduction} algorithm, which simply executes $\mu$ and runs offline policy optimization on the collected dataset, that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes, where $C^\star$ is the single-policy concentrability coefficient between $\mu$ and $\pi_\star$. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an $\Omega(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for \emph{any} policy finetuning algorithm, including those that can adaptively explore the environment. This implies that, perhaps surprisingly, the optimal policy finetuning algorithm is either offline reduction or a purely online RL algorithm that does not use $\mu$. Finally, we design a new hybrid offline/online algorithm for policy finetuning that achieves better sample complexity than both vanilla offline reduction and purely online RL algorithms, in a relaxed setting where $\mu$ only satisfies concentrability partially up to a certain time step. Overall, our results offer a quantitative understanding of the benefit of a good reference policy, and make a step towards bridging offline and online RL.
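To make the comparison of rates in the abstract concrete, the following is a minimal Python sketch that evaluates the stated bounds with all constants and logarithmic factors dropped. The function names are hypothetical, and the purely online rate $H^3SA/\varepsilon^2$ is an assumption (the abstract does not state it explicitly; it is the rate that would match the $A$ branch of the lower bound). The numbers produced are orders of magnitude only, not episode counts an actual algorithm would need.

```python
def offline_reduction_rate(H, S, C_star, eps):
    """Upper bound from the abstract, H^3 * S * C_star / eps^2 (constants and logs dropped)."""
    return H**3 * S * C_star / eps**2


def online_rate(H, S, A, eps):
    """ASSUMED rate for a purely online algorithm, H^3 * S * A / eps^2 (not stated in the abstract)."""
    return H**3 * S * A / eps**2


def lower_bound_rate(H, S, A, C_star, eps):
    """Lower bound from the abstract, H^3 * S * min(C_star, A) / eps^2."""
    return H**3 * S * min(C_star, A) / eps**2


def better_baseline(C_star, A):
    """Which of the two baselines attains the min{C_star, A} term in the lower bound."""
    return "offline reduction" if C_star <= A else "purely online RL"


if __name__ == "__main__":
    # Hypothetical problem sizes chosen only for illustration.
    H, S, A, eps = 10, 5, 4, 0.1
    for C_star in (2.0, 8.0):
        print(
            C_star,
            better_baseline(C_star, A),
            offline_reduction_rate(H, S, C_star, eps),
            online_rate(H, S, A, eps),
            lower_bound_rate(H, S, A, C_star, eps),
        )
```

Under these assumptions, whenever $C^\star \le A$ the offline reduction rate coincides with the lower bound, and otherwise the assumed online rate does, which is the dichotomy the abstract describes.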
