
Batch Reinforcement Learning with Hyperparameter Gradients
Byung-Jun Lee · Jongmin Lee · Peter Vrancx · Dongho Kim · Kee-Eung Kim

Tue Jul 14 08:00 AM -- 08:45 AM & Tue Jul 14 07:00 PM -- 07:45 PM (PDT)

We consider the batch reinforcement learning problem, where the agent must learn from a fixed batch of data without further interaction with the environment. In such a scenario, we want to prevent the optimized policy from deviating too much from the data collection policy: otherwise, estimation becomes highly unstable due to the off-policy nature of the problem. However, imposing this requirement too strongly yields a policy that merely mimics the data collection policy. Unlike prior work, where this trade-off is controlled by hand-tuned hyperparameters, we propose a novel batch reinforcement learning approach, batch optimization of policy and hyperparameter (BOPAH), that optimizes the hyperparameter by gradient ascent using held-out data. We show that BOPAH outperforms other batch reinforcement learning algorithms in tabular and continuous control tasks by striking a good balance in the trade-off between adhering to the data collection policy and pursuing possible policy improvement.
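The core idea described above — regularizing toward the data collection policy while tuning the regularization strength by gradient on held-out data — can be illustrated with a toy sketch. This is not the authors' BOPAH algorithm; it is a minimal stand-in on a 3-armed bandit, where the KL-penalty coefficient `alpha` is adjusted via a finite-difference gradient of a held-out importance-sampling value estimate. All names, the bandit setup, and the closed-form inner solution are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed bandit; the behavior (data collection) policy generated the batch.
true_reward = np.array([1.0, 0.5, 0.0])
behavior = np.array([0.2, 0.5, 0.3])

def sample_batch(n):
    a = rng.choice(3, size=n, p=behavior)
    r = true_reward[a] + rng.normal(0.0, 0.5, size=n)
    return a, r

train_a, train_r = sample_batch(200)      # batch used to fit the policy
heldout_a, heldout_r = sample_batch(200)  # batch used to tune the hyperparameter

def empirical_reward(a, r):
    # Per-arm mean reward estimated from a batch.
    return np.array([r[a == k].mean() if (a == k).any() else 0.0 for k in range(3)])

def inner_policy(alpha, q):
    # The KL-regularized objective  max_pi <pi, q> - alpha * KL(pi || behavior)
    # has the closed-form solution  pi ∝ behavior * exp(q / alpha).
    logits = np.log(behavior) + q / alpha
    p = np.exp(logits - logits.max())
    return p / p.sum()

q_train = empirical_reward(train_a, train_r)

def heldout_value(alpha):
    # Importance-sampling estimate of the policy's value on held-out data.
    pi = inner_policy(alpha, q_train)
    w = pi[heldout_a] / behavior[heldout_a]
    return np.mean(w * heldout_r)

# Gradient ascent on alpha, using a finite-difference gradient of the held-out
# value as a stand-in for an analytic hyperparameter gradient.
alpha, lr, eps = 1.0, 0.05, 1e-4
for _ in range(100):
    g = (heldout_value(alpha + eps) - heldout_value(alpha - eps)) / (2 * eps)
    alpha = max(1e-3, alpha + lr * g)
```

Large `alpha` pins the policy to the behavior policy; small `alpha` trusts the noisy batch estimate of rewards. Tuning `alpha` on held-out data, rather than by hand, is the trade-off the abstract refers to.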

Author Information

Byung-Jun Lee (KAIST)
Jongmin Lee (KAIST)
Peter Vrancx (PROWLER.io)
Dongho Kim (PROWLER.io)
Kee-Eung Kim (KAIST)
