The cost of commitment in option-based hierarchical RL
Abstract
Empirically, option-based hierarchical reinforcement learning (HRL) often produces longer and more diverse options when a deliberation cost is charged at option boundaries. However, when options are executed for many steps under an approximate dynamics model, small model errors compound along the option, degrading the quality of the resulting plan. In this work, we introduce the commitment loss to formalize the tradeoff between deliberation cost and model error as a function of option duration. We characterize how optimal termination probabilities vary with this tradeoff under two model-error mechanisms. First, the model is learned from finite data via maximum-likelihood estimation, producing statistical error that interacts with option duration. Second, we consider an input-driven setting in which an exogenous input is observed only at option boundaries and evolves unobserved between them, creating a drift-induced mismatch between planned and realized dynamics. In both cases, we solve for the optimal termination behavior as a function of the deliberation cost and the error scale, clarifying the behavior of popular HRL algorithms that employ a deliberation cost as a heuristic.
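As a loose sketch of the tradeoff the commitment loss formalizes (the specific symbols and functional forms here are illustrative assumptions, not the paper's definitions): suppose a termination probability $\beta$ gives an expected option duration of $1/\beta$, each termination incurs a deliberation cost $c$, and model error of per-step scale $\epsilon$ compounds linearly along an option. The per-step loss is then roughly
\[
\mathcal{L}(\beta) \;\approx\; \underbrace{c\,\beta}_{\text{deliberation}} \;+\; \underbrace{\epsilon/\beta}_{\text{compounded model error}},
\qquad
\beta^{*} \;=\; \sqrt{\epsilon/c},
\]
so a larger deliberation cost $c$ favors a smaller termination probability (longer options), while a larger error scale $\epsilon$ favors a larger one (shorter options), matching the qualitative behavior described above.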