ICML Poster Near Optimal Reward-Free Reinforcement Learning

Poster

Near Optimal Reward-Free Reinforcement Learning

Zhang Zihan · Simon Du · Xiangyang Ji

Virtual

Keywords: [ RL, Decisions and Control Theory ]

[ Abstract ] [ Paper PDF ]

[ Slides]

[ Paper ]

[ Visit Poster at Spot A5 in Virtual World ]

Abstract: We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions. This framework has two phases: in the exploration phase, the agent collects trajectories by interacting with the environment without using any reward signal; in the planning phase, the agent needs to return a near-optimal policy for arbitrary reward functions. %This framework is suitable for batch RL setting and the setting where there are multiple reward functions of interes We give a new efficient algorithm, \textbf{S}taged \textbf{S}ampling + \textbf{T}runcated \textbf{P}lanning (\algoname), which interacts with the environment at most

$O\left( \frac{S^2A}{\epsilon^2}\poly\log\left(\frac{SAH}{\epsilon}\right) \right)$ episodes in the exploration phase, and guarantees to output a near-optimal policy for arbitrary reward functions in the planning phase, where

$S$ is the size of state space,

$A$ is the size of action space,

$H$ is the planning horizon, and

$\epsilon$ is the target accuracy relative to the total reward. Notably, our sample complexity scales only \emph{logarithmically} with

$H$ , in contrast to all existing results which scale \emph{polynomially} with

$H$ . Furthermore, this bound matches the minimax lower bound

$\Omega\left(\frac{S^2A}{\epsilon^2}\right)$ up to logarithmic factors. Our results rely on three new techniques : 1) A new sufficient condition for the dataset to plan for an

$\epsilon$ -suboptimal policy % for any totally bounded reward function ; 2) A new way to plan efficiently under the proposed condition using soft-truncated planning; 3) Constructing extended MDP to maximize the truncated accumulative rewards efficiently.

Chat is not available.