Paper ID: 985
Title: Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

Review #1
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The work builds on the interesting "doubly robust" estimator work, which uses an approximate model along with data from a known policy to make predictions for the value of a distinct evaluation policy. The current work extends the state of the art by removing a dependence on having a finite horizon. The primary tool is the use of weighted importance sampling. The authors also show that their technique is a "strongly consistent estimator". They go on to propose another algorithm that builds on this one by blending model-based and importance-sampling estimates. Their algorithm puts weight on different j-step estimators in such a way that it minimizes MSE---an idea that could be of use in the TD literature as well. The resulting algorithm has good justification and good preliminary empirical results.

Clarity - Justification:

Minimal typos. Good development of ideas.

Significance - Justification:

These initial empirical results appear to be very promising. The overall design of the algorithms (WDR, COPE, and MAGIC) is well motivated and insightful. The problem itself of offline policy evaluation seems important, if somewhat understudied so far. (I believe that work like this is a very valuable stepping stone toward making it possible to evaluate RL algorithms on shared datasets, akin to what the supervised learning community has done.) Additional empirical evaluations (collected data on a real task) are in order. Also, it seems important that the test problems used be described, as they are likely to be used in future work in this direction.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

"model and guided importance sampling combining": I think "combined" works better.

"within an order of WDR's" -> "within an order of magnitude of WDR's"? (Happens a few times.)

What do ModelFail and ModelWin look like? I think it's pretty important to show these test problems.

"then (b) is not necessarily zero.": Might be worth noting that this idea has appeared in the RL literature a few times. One notable example: http://castlelab.princeton.edu/ORF569papers/Baird%201995%20residual.pdf .

"If the approximate model uses function approximation or if there is some partial observability, then the approximate model may not converge to the true MDP.": A similar argument has been used to justify the use of TD(lambda). For lambda=0, it's very much like model-based estimators and its accuracy is very dependent on Markovian transitions. For lambda=1, one could make an analogy with importance sampling estimators. The use of lambda between 0 and 1 can be motivated by non-Markovian domains that make depending on the model risky. Is there a deeper connection here? (Indeed, your j-step return looks an awful lot like the returns used in derivations of TD(lambda). Ah, ok, a connection is made a bit later.)

"So, as j increases, we expect the variance of the return to increase, but the bias to decrease.": That sounds like something worth demonstrating.

"dfirection." -> "direction."
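Since weighted importance sampling is the primary tool here, a minimal sketch of the trajectory-level ordinary (IS) versus weighted (WIS) estimators may help readers less familiar with the distinction; the function and variable names are illustrative and not the paper's:

    import numpy as np

    def is_and_wis(trajectories, pi_e, pi_b, gamma=1.0):
        """Ordinary (IS) and weighted (WIS) importance sampling estimates of the
        evaluation policy's value from trajectories generated by the behavior policy.
        trajectories: list of [(s, a, r), ...]; pi_e(a, s), pi_b(a, s): action probabilities."""
        weights, returns = [], []
        for traj in trajectories:
            rho, g = 1.0, 0.0
            for t, (s, a, r) in enumerate(traj):
                rho *= pi_e(a, s) / pi_b(a, s)   # cumulative importance weight
                g += (gamma ** t) * r            # discounted return of this trajectory
            weights.append(rho)
            returns.append(g)
        weights, returns = np.array(weights), np.array(returns)
        is_est = np.mean(weights * returns)                    # unbiased, high variance
        wis_est = np.sum(weights * returns) / np.sum(weights)  # biased, consistent, lower variance
        return is_est, wis_est

WDR stands to DR roughly as wis_est stands to is_est here: the same normalization of the importance weights is applied within the doubly robust estimator.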
=====
Review #2
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper provides two extensions to a recently proposed doubly robust (DR) estimator for off-policy evaluation. The first extension is Weighted DR, whose relationship to DR is similar to that of WIS to IS. The second extension is a more aggressive incorporation of model-based estimates to improve the accuracy of the point estimate, using ideas from the recently proposed Omega-return framework (which is an alternative to the lambda-return used in temporal difference learning). While taking model-based estimates as part of a plain average usually introduces bias into the estimate, the paper provides a mechanism to determine the weights in the average, which asymptotically assigns zero weight to biased estimates and guarantees consistency. A thorough empirical evaluation is done to show the advantage of the proposed algorithms over existing ones.

Clarity - Justification:

The overall presentation is clear; there are some minor issues with the use of notation.

Significance - Justification:

The paper contains enough novel and useful ideas. The thorough empirical evaluation is also valuable.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Overall I like this paper and recommend accepting it. Below are a few comments on how I think the paper could be improved.

First, the notion of consistency is somewhat over-emphasized throughout the paper. The one place where it is interesting and necessary is the analysis of MAGIC: since bias is introduced by the "unsafe" use of model-based estimates (as opposed to the way DR uses them), we do want to see that the bias vanishes in the limit. In contrast, the consistency of DR and WDR is largely expected, since the consistency of IS/WIS/PDIS/CWPDIS has already been established. Furthermore, off-policy evaluation is a hard problem, and we often find ourselves in the regime of data insufficiency, so consistency may not be that relevant to practice. I am not saying that establishing consistency is not valuable; I just mean that you could be more concise about it and leave space for more interesting content, e.g., theoretical justification of the algorithms' finite-sample performance.

That said, I understand that providing rigorous finite-sample analyses can be tricky for this paper. (For example, I am not aware of any bias/variance calculation for WIS, and I expect it to be similarly difficult/complicated for WDR, not to mention the difficulty of analyzing MAGIC given its complicated procedure.) Still, there are a few places where the author(s) might be able to better justify the finite-sample aspects of their design. One particular place is the resolution of the "chicken-and-egg" problem that arises in the estimate of b_n (line 796); in fact, a very similar problem appears in recent attempts at state representation discovery, and it is solved there by a very similar technique, interpreted as a hypothesis test: always assume that the bias is 0 (the null hypothesis) unless the empirical estimate of the bias exceeds the estimation error. As far as I know, the technique was first used by Hallak et al. (KDD 2013) to show consistency, and Jiang et al. (ICML 2015) showed that it also enjoys adaptivity in the region of insufficient data. Your experiments also show some sign of such "adaptivity", especially in Figure 2(d): in the small-sample region MAGIC is close to the model, and in the large-sample region MAGIC is close to WDR.
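To make the hypothesis-test interpretation concrete, here is a minimal sketch of a bias estimate of that flavor; the normal-approximation confidence interval and the names used are illustrative assumptions, not the construction in the paper:

    import numpy as np

    def hypothesis_test_bias(candidate_value, unbiased_samples, z=1.96):
        """Bias estimate in the 'assume zero unless the evidence says otherwise' style:
        the bias of a (possibly model-based) candidate estimate is taken to be zero
        unless it falls outside a confidence interval around the unbiased estimator."""
        samples = np.asarray(unbiased_samples, dtype=float)
        mean = samples.mean()
        half_width = z * samples.std(ddof=1) / np.sqrt(len(samples))  # illustrative CI
        lo, hi = mean - half_width, mean + half_width
        if candidate_value < lo:
            return candidate_value - lo   # negative bias estimate
        if candidate_value > hi:
            return candidate_value - hi   # positive bias estimate
        return 0.0                        # cannot reject the null hypothesis of zero bias

In the small-sample regime the interval is wide, so model-based estimates are effectively treated as unbiased; as data accumulates the interval shrinks and biased estimates are eventually down-weighted, which matches the adaptivity pattern noted above.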
To conclude, I think the WDR extension is relatively incremental, and I would hope that the content on WDR could be compressed and more space given to the elaboration of MAGIC and to theoretical justification of the finite-sample performance of the new estimators.

Minor issues:

- Line 229: When you say "the horizon is known/unknown", what exactly do you mean? Isn't the horizon part of the problem definition? Of course, you can have an infinite horizon, but the difficulty that arises from such infinity is a separate issue. And if L = infinity but trajectories are finite, Eq. 2 cannot fix such a mismatch either. The experiments in this paper also use a given, finite L. Overall, I feel that the author(s) need to be a bit more precise in this paragraph.

- Line 282: I am not sure I agree that point estimates are more important than confidence intervals for hard estimation problems like off-policy evaluation. I feel that quite a few of the papers the author(s) cite argue in the opposite direction, and the relative importance should probably depend on the application scenario. (After all, this is a matter of opinion and I won't worry too much about it.)

- Line 482: The author(s) provide some intuitive explanation of why DR/WDR have high variance when the domain has significant stochasticity. For DR, this exact fact is explained directly by Theorem 1 of Jiang & Li, which you can refer to as a justification.

- Line 588: The notation in the first line of Eq (4) is a bit odd: the superscripts of IS^j and DM^j have different semantics, one referring to the steps before j and the other to the steps after j. Consider making the semantics consistent (like IS^{1:j} + DM^{j+1:infty}); see the schematic form sketched after this list.
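To make the suggested superscript semantics concrete, a schematic form of the j-step return (importance sampling up to step j, the direct/model-based estimate thereafter), written here with illustrative notation and without the DR control-variate correction terms that the paper's version also carries:

    g_i^{(j)} \;=\; \underbrace{\sum_{t=1}^{j} \gamma^{t-1}\,\rho_{1:t}^{i}\,r_{t}^{i}}_{\mathrm{IS}^{1:j}}
    \;+\; \underbrace{\gamma^{\,j}\,\rho_{1:j}^{i}\,\hat{v}^{\pi_e}\!\big(s_{j+1}^{i}\big)}_{\mathrm{DM}^{(j+1):\infty}},
    \qquad
    \rho_{1:t}^{i} \;=\; \prod_{k=1}^{t} \frac{\pi_e(a_k^i \mid s_k^i)}{\pi_b(a_k^i \mid s_k^i)}.

One extreme of j then recovers a purely model-based estimate and the other a purely importance-sampled (WDR-style) estimate, with the MSE-minimizing weights mixing the intermediate returns.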
=====
Review #3
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper provides several extensions of existing off-policy evaluation methods for approximating the value function using behavior policies. The first extension removes the assumption of a fixed trajectory length needed by the doubly robust (DR) estimator proposed by Jiang & Li, 2015. The second extension introduces normalized importance weights into the DR estimator for more stable estimation. The third extension combines the DR estimator with a direct estimator, which estimates the stationary state probability from all available trajectory samples, and controls the effect of each estimator so as to minimize the approximation error of the value function. The effectiveness of the proposed extensions is demonstrated on gridworld problems---the third extension is shown to have the best performance in comparison with existing methods.

Clarity - Justification:

This paper contains several proposed extensions, but it is not so clear what the main contribution is, since the proposals are explained step by step with redundant and lengthy descriptions. That is, the third extension, which seems to be the main contribution of this paper, is only explained in Section 8 after a long build-up (including the first two extensions)---I suspect that many readers will not wait until Section 8. Thus, the authors need to reorganize the paper for acceptance.

Significance - Justification:

The problem of off-policy policy evaluation has a long history of more than 10 years, and its motivation comes from real-world applications of reinforcement learning (RL), where the sampling cost is prohibitively expensive. From this point of view, I believe that off-policy RL should be evaluated not only on grid-world problems but also on more realistic problems in which the sampling cost is high. In this sense, the evaluation results are not significant for me---I really do not see how useful the proposed methods are in recent real-world problems, e.g., website advertisement, medical treatment tests, and personalized curricula, which the authors introduce in the Introduction. Of course, an application viewpoint is not always required in machine learning papers if a strong and novel methodology is proposed. However, this paper provides extensions of existing doubly robust estimation in RL, and thus it is not highly significant in that respect either.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

1) Is the paragraph above Section 5 about model-free/model-based RL? If so, I cannot understand the sentence "The DR estimator is not purely model based, since it uses importance weights". Where is the model (state-action transition) in equation 2?

=====
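Regarding the last question: a schematic form of the step-wise DR estimator, in the spirit of Jiang & Li (2015) and written with illustrative notation that may differ from the paper's equation 2, helps locate the model (here \rho_{1:t}^{i} is the cumulative importance weight up to step t, with \rho_{1:0}^{i} \equiv 1):

    \mathrm{DR}(\mathcal{D}) \;=\; \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{\infty} \gamma^{t-1}
    \Big[\, \rho_{1:t}^{i}\, r_{t}^{i}
    \;-\; \big(\rho_{1:t}^{i}\,\hat{q}^{\pi_e}(s_{t}^{i}, a_{t}^{i}) - \rho_{1:t-1}^{i}\,\hat{v}^{\pi_e}(s_{t}^{i})\big) \Big].

The transition model never appears explicitly: it enters only through the control variates \hat{q}^{\pi_e} and \hat{v}^{\pi_e}, which are typically computed from an approximate model, while the observed rewards are reweighted by importance weights---hence "not purely model based".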