We thank all reviewers for their insightful comments.

Reviewer 1 suggests several improvements to our experiments.

1. Other ways of explicit opponent modeling. We agree that there is a vast literature on explicit opponent modeling. Our main concern is that when a separate model is trained, it is not clear how to incorporate its output (e.g., the opponent's action) into the policy in a generalizable way (see the discussion in Section 5). Our multitasking formulation, by contrast, lets the "explicit model" influence the policy through the learned representation of the opponent, without much additional work. We think the reviewer's proposal of supervised training with parameter sharing is similar to multitasking, except that we did not use Q-values as the supervision signal; swapping in another form of supervision is easy in our model (a minimal sketch of this shared-representation setup is given below, after our responses to Reviewer 2).

2. Experiments on more complex games. The two games were chosen because they have clear strategies and it is easy to construct/collect the behavior of different types of opponents. Our next step is to work on real-time strategy games such as Dota. These games are challenging (even with a consistent/known opponent) given the large number of (hierarchical) actions, strategies, and states. We agree that they would be a good test bed for opponent modeling; however, we wanted to make sure the proposed method works, and that we understand it well, in relatively simple settings before moving on to more complex ones. In addition, we would like to point out that the trivia experiment is among the first deep RL works that learn to compete directly with humans.

3. Agents playing against each other (DQN vs. DRON-concat vs. DRON-MoE). If this is for testing only, both agents' policies are fixed, so opponent modeling is not needed. However, it would be interesting to see how they behave during training, where the opponent's policy improves over time; this corresponds to an online version of our algorithm.

Reviewer 2 is concerned with the handcrafted opponent features and the experimental results.

1. Opponent features. First, two clarifications on how the features are computed: (1) for soccer, they are computed progressively and reset at the beginning of each game; (2) for trivia, they are computed on the training set, since an opponent may be seen repeatedly and we go through the same question set multiple times. Learning from raw actions is an interesting idea. Our worry was that a sequence of actions in a time window may not provide enough history, whereas handcrafted features summarize the opponent's behavior easily. Using an RNN may alleviate this problem, and we will try it in future work.

2. Number of opponents. Our analysis in Section 3.1 assumes multiple opponents, so extending the framework is straightforward as long as the number of opponents/actions is not too large. One way is to treat all opponents as a joint agent (i.e., predict their joint action/policy), which lets us reuse the same network architecture as with a single opponent. Another way is to add more opponent networks (one per opponent). We will discuss this in the final version.

3. Variance of the results. The literature gives examples of large variance in DQN-based models in many settings (e.g., Narasimhan et al., 2015; Parisotto et al., 2016). For trivia, we have a dedicated dev set (see the clarification below) for model selection; the "statistically significantly better" model is selected based on its dev-set performance and is not better by chance.
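To make the multitasking point in our response to Reviewer 1 concrete, here is a minimal sketch (in PyTorch) of the shared-representation idea. It is not our exact architecture: the class name `OpponentAwareQNet`, the layer sizes, and the particular auxiliary target are illustrative placeholders. The opponent encoder feeds both the Q-value head and a supervised head, so replacing the supervision signal (e.g., with the Q-values Reviewer 1 proposes) only changes the auxiliary head and its loss term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpponentAwareQNet(nn.Module):
    """Illustrative sketch: a shared opponent encoder feeds both the Q-value
    head and an auxiliary supervised head (the multitasking formulation)."""

    def __init__(self, state_dim, opp_feat_dim, num_actions, num_opp_actions, hidden=64):
        super().__init__()
        self.state_enc = nn.Linear(state_dim, hidden)      # encodes the agent's observation
        self.opp_enc = nn.Linear(opp_feat_dim, hidden)     # encodes opponent features
        self.q_head = nn.Linear(2 * hidden, num_actions)   # Q(s, a) for every action
        # Auxiliary head: here it predicts the opponent's next action; swapping
        # the supervision signal only changes this head and the loss below.
        self.aux_head = nn.Linear(hidden, num_opp_actions)

    def forward(self, state, opp_feat):
        h_s = F.relu(self.state_enc(state))
        h_o = F.relu(self.opp_enc(opp_feat))               # shared opponent representation
        q_values = self.q_head(torch.cat([h_s, h_o], dim=-1))
        aux_logits = self.aux_head(h_o)                    # the supervised "explicit model"
        return q_values, aux_logits

def multitask_loss(q_values, aux_logits, q_target, taken_action, opp_action, aux_weight=0.5):
    """DQN-style TD loss plus an auxiliary supervision term on the opponent head."""
    q_taken = q_values.gather(1, taken_action.unsqueeze(1)).squeeze(1)
    td_loss = F.smooth_l1_loss(q_taken, q_target)
    aux_loss = F.cross_entropy(aux_logits, opp_action)     # replace to change the supervision
    return td_loss + aux_weight * aux_loss
```

Because the Q head reads the same opponent representation that the auxiliary loss shapes, the "explicit" prediction influences the policy without any hand-designed interface between the two components.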
Re other questions:
- You are right that Eq. 1 is for the optimal policy; we will correct this.
- "Each expert network responds to a possible reward given the opponent's move": this is motivated by Eq. 1, where the Q-value is averaged over the distribution of opponent actions (we paraphrase the relevant form at the end of this response).
- For quiz bowl, we split all questions (with human buzzes) randomly into train, dev, and test sets.
- Soccer actually includes an action for avoiding the opponent, but we forgot to mention it; we will add it.
- "We approximate the ground truth for (a) by min(1, t/buzz position)": once the opponent has buzzed, there is no point in predicting how far away we are from the buzz, since it no longer affects the reward; hence the cap at 1.
- The 10/15 rewards for DQN-self look arbitrary: these are the actual scoring rules used in quiz bowl. We used the DQN framework to keep hyper-parameters consistent and for ease of implementation; the task is essentially cost-sensitive classification (through a reduction to regression; Langford and Beygelzimer, 2009).

We thank Reviewer 3 for the positive review. We would like to further point out that the results for soccer are at least as good as those for trivia: in Table 1, all opponent models are significantly better than DQN, and the gap is even larger when we consider the mean reward.

Finally, we thank all reviewers for pointing out our typos and ambiguous statements. We will correct and clarify them in the final version.
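For reference, here is a paraphrase of the motivation behind the mixture-of-experts design discussed above; the notation is ours and only approximates Eq. 1 in the paper. Under a fixed opponent policy $\pi^o$, the Q-value is an expectation over the opponent's action,
$$Q^{\pi^o}(s, a) \;=\; \sum_{a^o} \pi^o(a^o \mid s)\, Q(s, a, a^o),$$
and DRON-MoE mirrors this structure: each expert network estimates the value under one hypothesized opponent behavior, and the gating weights, computed from the opponent representation $h^o$, play the role of $\pi^o$:
$$Q(s, a) \;\approx\; \sum_{i=1}^{K} w_i(h^o)\, Q_i(h^s, a), \qquad \sum_{i=1}^{K} w_i(h^o) = 1.$$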