Timezone: »

Reward-Free Exploration for Reinforcement Learning
Chi Jin · Akshay Krishnamurthy · Max Simchowitz · Tiancheng Yu

Thu Jul 16 06:00 AM -- 06:45 AM & Thu Jul 16 05:00 PM -- 05:45 PM (PDT) @
Exploration is widely regarded as one of the most challenging aspects of reinforcement learning (RL), with many naive approaches succumbing to exponential sample complexity. To isolate the challenges of exploration, we propose the following ``reward-free RL'' framework. In the exploration phase, the agent first collects trajectories from an MDP $M$ without a pre-specified reward function. After exploration, it is tasked with computing a near-policies under the transitions of $\mathcal{M}$ for a collection of given reward functions. This framework is particularly suitable where there are many reward functions of interest, or where the reward function is shaped by an external agent to elicit desired behavior. We give an efficient algorithm that conducts $\widetilde{O}(S^2A\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration, and returns $\epsilon$-suboptimal policies for an arbitrary number of reward functions. We achieve this by finding exploratory policies that jointly visit each ``significant'' state with probability proportional to its maximum visitation probability under any possible policy. Moreover, our planning procedure can be instantiated by any black-box approximate planner, such as value iteration or natural policy gradient. Finally, we give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.

Author Information

Chi Jin (Princeton University)
Akshay Krishnamurthy (Microsoft Research)
Max Simchowitz (UC Berkeley)
Tiancheng Yu (MIT)

I am quant researcher at Two Sigma. I completed my PhD in MIT EECS in 2023.

More from the Same Authors