Poster in Workshop: Responsible Decision Making in Dynamic Environments
LPI: Learned Positional Invariances for Transfer of Task Structure and Zero-shot Planning
Tamas Madarasz
Real-world tasks often involve interactions with the environment in which our actions can drastically change the available or desirable long-term outcomes. One formulation of this in the reinforcement learning setting is in terms of non-Markovian rewards: the reward function, and thus the available rewards, are themselves history-dependent and change dynamically with the agent-environment interaction. An important challenge in navigating such environments is to capture the structure of this dynamic reward function in a way that is interpretable and allows for optimal planning. This structure, in conjunction with the particular task setting at hand, determines the optimal order in which actions should be executed or subtasks completed. Planning methods face a combinatorial explosion if all such orderings need to be evaluated; learning invariances inherent in the task structure, however, can alleviate this pressure. Here we propose a solution to this problem by allowing the planning method to recognise task segments where temporal ordering is irrelevant for predicting downstream reward outcomes. To facilitate this, our agent simultaneously learns to segment a task and to predict the changing reward function that results from its actions, while also learning the permutation invariances in its history that are relevant for this prediction. This dual approach can enable zero-shot or few-shot generalisation for complex, dynamic reinforcement learning tasks.
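To make the core idea concrete, the following is a minimal, purely illustrative Python sketch (not the authors' method or code) of a toy non-Markovian reward and of a permutation-invariant summary of a task segment. The keys-and-door task, the `reward` function, and the `invariant_summary` helper are all assumptions introduced here for illustration: the reward for the final action depends on the episode history, but only on which actions occurred in the preceding segment, not their order, so a planner that recognises this invariance can collapse all orderings into a single case.

```python
from collections import Counter
from itertools import permutations

# Toy non-Markovian reward (illustrative assumption, not from the paper):
# opening the door pays off only if all three keys were collected earlier
# in the episode; the order in which the keys were picked up is irrelevant.
KEYS = ("red_key", "blue_key", "green_key")

def reward(history):
    """History-dependent reward for the final 'open_door' action."""
    if history and history[-1] == "open_door":
        return 1.0 if all(k in history[:-1] for k in KEYS) else 0.0
    return 0.0

def invariant_summary(segment):
    """Permutation-invariant summary of a task segment: a multiset of actions.
    Every ordering of the same actions maps to the same summary, so a reward
    model conditioned on it has far fewer distinct inputs to evaluate."""
    return frozenset(Counter(segment).items())

# All 3! orderings of the key-collection segment yield the same summary and
# the same downstream reward, so a planner can treat them as a single case
# instead of enumerating every ordering.
summaries = {invariant_summary(p) for p in permutations(KEYS)}
rewards = {reward(list(p) + ["open_door"]) for p in permutations(KEYS)}
print(len(summaries), rewards)  # -> 1 {1.0}
```

In this toy setting the six possible key orderings collapse to one invariant summary, which is the kind of pressure relief the abstract describes: a planner that learns such invariances need not evaluate every ordering of an order-irrelevant segment.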