Paper ID: 829
Title: Opponent Modeling in Deep Reinforcement Learning

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper is about modeling opponents in reinforcement learning when the Q-value function is approximated with a Deep Q-Network (DQN). The core idea is to provide opponent features as input to the network. In addition, extra supervised signals related to the opponents (e.g., their actions, or their types obtained from some categorization of opponents) can be used to encourage the network's internal representations to distinguish between different kinds of opponents. Two different network architectures are proposed and compared in experiments on a grid-world soccer game and a trivia game (both 2-player games).

Clarity - Justification:
Well motivated and clearly explained, but the experiments would be hard to reproduce without more details.

Significance - Justification:
No particularly surprising or impressive results, but the approach clearly brings an improvement.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This is an interesting topic and the proposed algorithms are well motivated, as well as clearly explained. There is no doubt that in these experiments the opponent modeling approach outperforms a basic DQN unaware of the varying opponent strategies, and it seems to me this is already a meaningful contribution, even if unsurprising.

One thing I find a bit disappointing is the need for handcrafted opponent features. Before I explain what I mean, I want to point out that it is not clear to me on which data these features are computed in the experiments (e.g., in soccer, is there some kind of "reset" when the opponent may switch strategies, to focus on its recent behavior? In the trivia game, are the features computed once on the training set or are they built progressively as the agent plays more matches?). In any case, what I mean is that we are required to identify meaningful opponent features ourselves, while it would have been more interesting if the agent could do this job by itself, using observations of the opponent's actions to build internal representations of these agents. Part of the reinforcement learning task would then also be to take actions leading to observations that help identify the type of opponent reliably.

Besides the above, my main concern is related to the experimental results, which I do not find entirely convincing. First, the two tasks only involve one opponent, while it would have been interesting to show how this approach can be applied to a setting with multiple opponents. I am also worried by the high variations in performance over time shown in Fig. 4 and 9 (overall the proposed methods are more stable than the basic DQN but may still exhibit big swings, especially on the trivia game). Performance actually seems to "converge" pretty fast and it is not clear what we gain by running more than 20 epochs. In the end I wonder how much of the difference between models is due to chance (being "statistically significantly better" is meaningless if on the next epoch the algorithm becomes "statistically significantly worse").
Other remarks & questions:
- Eq. 1 is stated for a generic policy pi, but it seems to me it only holds for the optimal policy (or at least for a policy that always selects the action maximizing its true Q value); see the sketch after these remarks.
- The DQN algorithm is not clearly explained; if you have room for a summary algorithm, that would be nice.
- "Each expert network responds to a possible reward given the opponent's move": it is not clear what this means. It sounds like each expert is associated with one opponent move, which is not the case.
- Unless I missed it, \phi_o is not defined.
- There are mentions of a "development set" (l. 372) and a "test set" (l. 655) which do not seem to be well defined.
- "We define a player's move by four cases: approaching the agent, approaching the agent's goal, approaching self goal and standing still": why is there no "moving away from agent", which is part of the hand-crafted bot ("avoiding opponent") and may thus help recognize a defensive opponent?
- "we see no significant difference in varying the number of experts" (l. 536): isn't this expected, since there are only two opponent types?
- "We approximate the ground truth for (a) by min(1, t/buzz position)": why cap it at 1?
- "receives reward 10 for buzz and -10 for wait. When the answer prediction is incorrect, it receives reward -15 for buzz and 15 for wait": this looks somewhat arbitrary. Why use the DQN reinforcement learning framework in this situation (with gamma = 0 the learning target is just the immediate reward) instead of making it a basic cost-sensitive classification task?
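To make the Eq. 1 remark concrete, here are the two recursions as I would write them (my notation, not necessarily the paper's):

    \begin{align}
      % recursion for a generic policy \pi: the next action is drawn from \pi
      Q^{\pi}(s,a) &= \mathbb{E}_{s'}\Big[ r(s,a) + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s',a') \Big] \\
      % optimality recursion: the next action is the greedy one
      Q^{*}(s,a)   &= \mathbb{E}_{s'}\Big[ r(s,a) + \gamma \max_{a'} Q^{*}(s',a') \Big]
    \end{align}

A recursion with a max over a' is the second form; it holds for the optimal policy (or any policy that is greedy with respect to its own true Q values), not for an arbitrary pi.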
Typos:
- l. 78: "use" => "uses"
- l. 479: "DQNbaseline"
- l. 623: "concatinate"

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents an approach to agent modeling using deep neural networks in the context of reinforcement learning. In particular, it investigates the addition of an opponent modeling network to Deep Q-Networks, using either a concatenation or a mixture-of-experts / gating-network approach. In addition, the paper discusses the inclusion of additional information into the learning of the opponent network and investigates the effect of this information (in particular the opponent's action or type) on overall task learning performance. The paper evaluates the approaches on a simple, discretized soccer domain and a quiz bowl domain, showing improvement from using the opponent model and indicating differences between the concatenation and mixture-of-experts approaches in terms of the utility of additional information. The main contributions of the paper are the architecture for combining deep network opponent models with deep Q-networks for reinforcement learning and the evaluation of the benefits of different types of information.

Clarity - Justification:
The paper is well written and easy to follow. It is well structured and contains a good description and discussion of all the components.

Significance - Justification:
Allowing the learning of opponent models and their inclusion into a deep reinforcement learning framework is important for domains with opponents that have varying strategies. The paper presents two approaches to include these models, the concatenation and the mixture-of-experts approach. While opponent models in reinforcement learning have been used before, the main novelty of the paper lies in the application and study of deep network opponent models with DQN. The discussion and experiments are useful for indicating potential differences between the approaches, which would help other researchers choose the appropriate type of model for their application.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is well written and addresses an interesting problem. It presents an application of the idea of implicit opponent modeling to reinforcement learning and provides insight into different approaches/architectures for integrating these models with deep reinforcement learning networks. The results presented show a significant benefit of these opponent models in the quiz bowl domain (and to a lesser degree in the soccer domain). In addition, the experiments provide insight into the effects of different choices for combining the opponent and Q-value networks (in particular concatenation vs. mixture of experts). Overall, the paper provides valuable insights to researchers in RL for the design and treatment of opponent information and promises to help improve learning agent performance.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents an implicit opponent modelling method for deep Q-learning (DQN). Classic DQN approximates Q(s,a) with a neural network. Here, in a multi-agent setting (possibly with opponents/others, o), they model Q(s,a | \pi_o) as in DQN and add another neural network that approximates \pi_o. They call the combination of these two neural networks DRON. They first concatenate the two hidden representations (DRON-concat), as a baseline with opponent modelling, and also present a mixture-of-experts version (DRON-moe) that gates experts over the possible rewards given the opponent's move. They show how this improves over a baseline DQN (and DRON-moe over DRON-concat) in experiments on a simulated soccer game and a trivia game. Overall, this is a strong, substantial contribution and a well-written paper, which seems a bit lacking in experimental validation (hence the weak accept).
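For concreteness, this is roughly how I read the two variants (a minimal sketch; the layer sizes, names and single-hidden-layer choice are my own assumptions, not the paper's):

    # Rough sketch of the two DRON variants as I understand them;
    # dimensions and names are my own choices.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DRONConcat(nn.Module):
        def __init__(self, state_dim, opp_dim, hidden, n_actions):
            super().__init__()
            self.state_net = nn.Linear(state_dim, hidden)   # hidden rep. of the state
            self.opp_net = nn.Linear(opp_dim, hidden)       # hidden rep. of the opponent features
            self.q_head = nn.Linear(2 * hidden, n_actions)  # Q-values from the concatenation

        def forward(self, state, opp_feat):
            hs = F.relu(self.state_net(state))
            ho = F.relu(self.opp_net(opp_feat))
            return self.q_head(torch.cat([hs, ho], dim=-1))

    class DRONMoE(nn.Module):
        def __init__(self, state_dim, opp_dim, hidden, n_actions, n_experts):
            super().__init__()
            self.state_net = nn.Linear(state_dim, hidden)
            self.opp_net = nn.Linear(opp_dim, hidden)
            # each expert predicts a full set of Q-values from the state representation
            self.experts = nn.ModuleList(
                [nn.Linear(hidden, n_actions) for _ in range(n_experts)])
            # the opponent representation only chooses how to mix the experts
            self.gate = nn.Linear(hidden, n_experts)

        def forward(self, state, opp_feat):
            hs = F.relu(self.state_net(state))
            ho = F.relu(self.opp_net(opp_feat))
            expert_q = torch.stack([e(hs) for e in self.experts], dim=1)  # (B, K, A)
            w = F.softmax(self.gate(ho), dim=-1).unsqueeze(-1)            # (B, K, 1)
            return (w * expert_q).sum(dim=1)                              # (B, A)

Written this way, the difference is that in DRON-concat the opponent representation enters the Q computation directly, whereas in DRON-moe it only selects how to mix experts that each map the state to Q-values.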
Clarity - Justification:
The paper is mostly clear and easy to read. I think I could reproduce their results from the paper! I would suggest:
- In 3.2, being a bit clearer about the fact that DRON-concat does not necessarily respect Equation 1.
- Explaining (in words, both for soccer and for quiz bowl) why the opponents you use are good baselines (strong baselines are important for your evaluation method).

Form:
- Figure 9 is not that informative.
- Explain the setting of quiz bowl (4.2) with a picture (as in Figure 3).

I hesitated with "Excellent" for Clarity.

Significance - Justification:
The authors provide a principled way to do implicit "policy of other agents" modelling. While the multitask learning (explicit opponent modelling) results are a mixed bag, the methods introduced in the paper show a clear gain over classic DQN in the tested experimental setups.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
A few things that I think could be done better:
- The link between multitasking and explicit modelling. It seems to me that explicit modelling could be many other things, for example training a Q(s,a)^{opponent} in a supervised way (e.g., when the reward or the action is known, as in the multitask setting here) that shares most parameters with Q(s,a)^{our_player}. Why is it done this way? Why do you consider this a good way to do explicit opponent modelling? (See the sketch at the end of this review.)
- The contribution is clearly there, but the experiments seem a bit toy-ish (more complex games/settings could be used).
- Along the same lines, even for the two given setups there are no results of DQN vs. DRON-concat vs. DRON-moe playing each other (without self-play training, and maybe even with it), which could extend the "rule-based agent + classic DQN" results (and lead to insights?) without much more work.
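To clarify the kind of explicit modelling I have in mind in the first point above, here is one possible instantiation (entirely my own sketch, with made-up names and sizes, not something taken from the paper): a shared trunk, a Q head for our player trained with the usual DQN loss, and an opponent head trained in a supervised way on the opponent's observed actions.

    # One possible instantiation of the alternative suggested above (my own sketch,
    # not from the paper): a shared trunk with a DQN-trained Q head and a
    # supervised head that predicts the opponent's observed action.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedOpponentModel(nn.Module):
        def __init__(self, state_dim, hidden, n_actions, n_opp_actions):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
            self.q_head = nn.Linear(hidden, n_actions)        # Q(s, a) for our player
            self.opp_head = nn.Linear(hidden, n_opp_actions)  # prediction of the opponent's action

        def forward(self, state):
            h = self.trunk(state)
            return self.q_head(h), self.opp_head(h)

    def joint_loss(model, state, action, td_target, opp_action, beta=0.5):
        q, opp_logits = model(state)
        # standard DQN regression loss on the chosen action
        dqn_loss = F.smooth_l1_loss(q.gather(1, action.unsqueeze(1)).squeeze(1), td_target)
        # supervised loss on the opponent's observed action
        opp_loss = F.cross_entropy(opp_logits, opp_action)
        return dqn_loss + beta * opp_loss

=====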