Reward-Mixing MDPs with Few Latent Contexts are Learnable
Jeongyeol Kwon · Yonathan Efroni · Constantine Caramanis · Shie Mannor

Tue Jul 25 05:00 PM -- 06:30 PM (PDT) @ Exhibit Hall 1 #434
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode, nature randomly picks one of $M$ candidate latent reward models, and the agent then interacts with the MDP for $H$ time steps. Our goal is to learn a policy that nearly maximizes the $H$-step cumulative reward in such a model. Prior work established an upper bound only for RMMDPs with $M=2$. In this work, we resolve several open questions for the general RMMDP setting. For arbitrary $M \ge 2$, we provide a sample-efficient algorithm, $\mathrm{EM}^2$, that outputs an $\epsilon$-optimal policy using $O\left(\epsilon^{-2} \cdot S^d A^d \cdot \text{poly}(H, Z)^d\right)$ episodes, where $S$ and $A$ are the numbers of states and actions respectively, $H$ is the time horizon, $Z$ is the support size of the reward distributions, and $d = O(\min(M, H))$. We also provide a $(SA)^{\Omega(\sqrt{M})} / \epsilon^{2}$ lower bound, showing that super-polynomial sample complexity in $M$ is necessary.
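To make the interaction protocol concrete, here is a minimal sketch of an RMMDP environment in Python. The class name, the Bernoulli reward parameterization, and the specific toy instance are illustrative assumptions, not the paper's construction; the sketch only reproduces the protocol described above: transitions are shared across contexts, nature samples one of $M$ latent reward models at the start of each episode, and the agent rolls out $H$ steps without ever observing the context.

```python
import random

class RewardMixingMDP:
    """Hypothetical toy RMMDP: shared transitions, M latent reward models.

    At the start of each episode, nature samples one of the M reward
    models uniformly at random; it stays fixed for all H steps and is
    never revealed to the agent.
    """

    def __init__(self, S, A, M, H, seed=0):
        rng = random.Random(seed)
        self.S, self.A, self.M, self.H = S, A, M, H
        # Shared transition kernel: a next-state distribution per (s, a).
        self.P = {(s, a): [rng.random() for _ in range(S)]
                  for s in range(S) for a in range(A)}
        for key in self.P:
            z = sum(self.P[key])
            self.P[key] = [p / z for p in self.P[key]]
        # M candidate reward models: a Bernoulli mean per (s, a) for each
        # latent context (so the reward support size here is Z = 2).
        self.R = [{(s, a): rng.random() for s in range(S) for a in range(A)}
                  for _ in range(M)]
        self.rng = rng

    def run_episode(self, policy):
        """Sample a latent context, roll out H steps, return the cumulative reward."""
        m = self.rng.randrange(self.M)  # latent context, hidden from the agent
        s, total = 0, 0
        for _ in range(self.H):
            a = policy(s)
            total += int(self.rng.random() < self.R[m][(s, a)])  # Bernoulli reward
            s = self.rng.choices(range(self.S), weights=self.P[(s, a)])[0]
        return total
```

For example, rolling out a fixed policy such as `env.run_episode(lambda s: 0)` returns an integer cumulative reward between $0$ and $H$; a learner would estimate value from many such episodes, since the per-episode return mixes over the $M$ contexts.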

Author Information

Jeongyeol Kwon (University of Wisconsin-Madison)

I am currently a postdoc at the University of Wisconsin-Madison, working with Prof. Robert Nowak. Prior to joining UW-Madison, I received my Ph.D. from the ECE department at UT Austin, where I had wonderful years of learning and working with my advisor, Prof. Constantine Caramanis. I received my Bachelor's degree in Electrical and Computer Engineering from Seoul National University (SNU) in 2016.

Yonathan Efroni (Meta)
Constantine Caramanis (University of Texas)
Shie Mannor (Technion)