

Poster

Information-Directed Pessimism for Offline Reinforcement Learning

Alec Koppel · Sujay Bhatt · Jiacheng Guo · Joe Eappen · Mengdi Wang · Sumitra Ganesh


Abstract:

Policy optimization from batch data, i.e., offline reinforcement learning (RL), is important when collecting data from the current policy is not possible. This setting incurs a distribution mismatch between the batch training data and trajectories from the current policy. Pessimistic offsets estimate this mismatch using concentration bounds, which offer strong theoretical guarantees and simplicity of implementation. Prior offsets hypothesize a sub-Gaussian representation of the mismatch, which may be overly conservative in sparse data regions and less so otherwise, and can therefore under-perform their no-penalty variants in practice. We derive a new pessimistic penalty as the distance between the data and the true distribution using an evaluable one-sample test known as the Stein Discrepancy, which requires only minimal smoothness conditions and, notably, allows for non-Gaussianity when the mismatch is interpreted as a distribution over next states. This quantity serves as a measure of the information in the offline data, which justifies calling this approach \emph{information-directed pessimism} (IDP) for offline RL. We establish that this new penalty yields practical gains in performance while generalizing the regret guarantees of prior art to non-Gaussian mismatch.
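The penalty described above is built on a one-sample Stein test evaluated directly on the batch data. The sketch below shows how a kernelized Stein discrepancy (KSD) penalty could be computed in principle; it is not the authors' implementation, and the RBF kernel, the bandwidth, the function names (`ksd_penalty`, `score_fn`), and the Gaussian score in the toy usage are all illustrative assumptions.

```python
import numpy as np

def ksd_penalty(samples, score_fn, bandwidth=1.0):
    """Illustrative kernelized Stein discrepancy (V-statistic) with an RBF kernel.

    samples  : (n, d) array of observed next-state samples from the batch.
    score_fn : callable returning grad log p(x) of the model distribution, shape (n, d).
    """
    x = np.asarray(samples, dtype=float)
    n, d = x.shape
    s = score_fn(x)                                   # score of each sample, (n, d)

    diffs = x[:, None, :] - x[None, :, :]             # pairwise differences, (n, n, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)            # squared distances, (n, n)
    h2 = bandwidth ** 2
    k = np.exp(-sq_dists / (2.0 * h2))                # RBF kernel matrix, (n, n)

    # Stein kernel: u(x_i, x_j) = s_i^T k_ij s_j + s_i^T grad_j k_ij
    #                             + grad_i k_ij^T s_j + trace(grad_i grad_j k_ij)
    term1 = k * (s @ s.T)
    grad_j_k = k[..., None] * diffs / h2              # grad w.r.t. x_j of k(x_i, x_j)
    term2 = np.einsum('id,ijd->ij', s, grad_j_k)
    term3 = np.einsum('jd,ijd->ij', s, -grad_j_k)     # grad_i k = -grad_j k for the RBF kernel
    term4 = k * (d / h2 - sq_dists / h2 ** 2)

    return np.sqrt(max((term1 + term2 + term3 + term4).mean(), 0.0))


# Toy usage: penalize mismatch against a standard-normal next-state model (hypothetical).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = rng.normal(loc=0.5, scale=1.0, size=(256, 2))   # slightly shifted batch data
    standard_normal_score = lambda x: -x                    # grad log N(0, I)
    print("IDP-style penalty:", ksd_penalty(batch, standard_normal_score))
```

Because the Stein test only needs the score function of the model distribution, a penalty of this form avoids assuming a sub-Gaussian shape for the mismatch, which is the property the abstract highlights.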
