Paper ID: 312
Title: Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper studies the problem of off-policy value evaluation, which differs from standard off-policy evaluation in that the value must be estimated on average over an initial state distribution. Apart from this difference, the setting is the usual one: a behavior policy collects a dataset which is then used to evaluate the quality of a target policy. The authors extend the doubly robust (DR) estimator, previously developed for the simpler case of contextual stochastic bandits, to the sequential case. In particular, an independent estimate of the Q-value function (and of the V-function as well) is provided and used to "correct" the estimator. The authors give a decomposition of the variance of the DR estimator in terms of the accuracy of the side Q-function and the intrinsic randomness of the state transitions. An extension of the Cramér-Rao lower bound shows that when the estimate of Q is exact, the DR estimator achieves the smallest possible error. The empirical analysis compares DR with importance sampling, weighted importance sampling, and regression (i.e., model-based) methods.

Clarity - Justification:
The paper is overall well written. The only part which is not easy to follow is Section 5, where it is not clear to what extent the tree MDP (and the introduction of observations and histories) is actually needed to obtain the result.

Significance - Justification:
Off-policy evaluation is indeed a very important problem, and providing low-variance estimates has the potential for many practical applications. The technical contribution of the paper is relatively limited, since the extension of the DR estimator from contextual bandits to RL is relatively simple. Nonetheless, its analysis is not trivial, and the extension in Section 4.4 and the derivation of the lower bound make the contribution solid and overall interesting.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1- In the derivation of Theorem 2 the authors introduce the discrete tree MDP model. I am not sure I understand why this model is needed and why the result would not hold in the case of a general MDP. This point may need further justification/clarification.
2- In 6.1, using "train" for generating the original policy is a bit misleading, because we may expect "train" to be the samples used to train the estimator and "eval" the samples used to evaluate it. Please clarify that the results compute the "true" error and not a test error.
3- The results in Fig. 1 and Fig. 2 are overall a bit disappointing. While it is true that DR performs better than IS and WIS, REG is almost always the best choice (meaning that the MDP approximation is quite accurate), apart from the most on-policy case where \pi_1 only uses 0.25 of \pi_train.
4- The empirical validation is quite extensive and covers different scenarios and uses of the DR estimator, from simple policy evaluation to actual policy improvement. Furthermore, a "realistic" MDP is also used in the KDD experiment.

Recommendation
=================
While the technical contribution may be relatively limited, I think the paper provides a sound extension of DR to the sequential setting, a thorough theoretical analysis, and an empirical validation w.r.t. the state of the art.
For this reason I propose weak accept.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a doubly robust (DR) estimator for off-policy evaluation of RL policies. The authors show that the estimator is unbiased and has lower variance than standard importance sampling (IS) estimators. They also provide a lower bound on the achievable variance. Experiments on three problems, two synthetic and one real, show that the approach is better than IS.

Clarity - Justification:
Overall, the paper is well written, but it could be improved significantly. The major equations (see detailed comments) need to be explained better. The proofs could also use better explanations.

Significance - Justification:
The paper is an incremental advancement of the DR estimator in bandits to a DR estimator in RL. The results are reasonable, but that does not mean one could not obtain the same results with different techniques (see detailed comments).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The requirements for independence need to be explained in more detail. Why does the fitted R function need to be independent of the data used in equation 8? Why is it OK for the target policy and the fitted reward function to be dependent?

The authors need to derive equation 10. An expansion of the formula seems to have three major sums:
$V_{DR} = \sum_{t=1}^H \gamma^{t-1} \rho_{1:t-1} \hat{V}(s_t) + \sum_{t=1}^H \gamma^{t-1} \rho_{1:t} r_t - \sum_{t=1}^H \gamma^{t-1} \rho_{1:t} \hat{Q}(s_t, a_t)$.
The first is the importance-sampled value (based on the model) with a one-step lag; the second is the importance-sampled reward received; the third is the importance-sampled Q-value. The estimator obviously combines the actual reward and the model, but each term should be clearly explained as to what it captures (a small numerical sketch of this expansion is included after these comments).

The authors explain why the estimator is unbiased and has less variance in the proof of Theorem 1 in the appendix. The proof has some parenthesis problems, and not enough explanation is given on the various steps. The authors should explain plainly each term of the variance equation in Theorem 1.

The second DR estimator (equation 12) is motivated by the fact that even if a good model is obtained, the variance will still be high. This estimator requires the true transition model, which is impractical. Since the model is assumed to be good, the ratio of the transition models in the equation is close to 1 and can be removed. How does the remaining equation compare to equation 10? It seems that rather than using the expected value of the next state, it simply uses the state observed in the data. Why does this estimator have less variance?

For Observation 1, you say that the variance of DR is equal to the lower bound when Q = \hat{Q}. In a way, you are saying it cannot be better than the variance of the model. But later you say that this is because of the variance in the transition dynamics. What about variance in the reward function?

In the experiments, where does the D_eval data come from? Is it from \pi_0?

Figures 1 and 2 can be misleading. The IS comparisons should be done with all the data in D_eval. Why, in Figures 1 and 2, does DR not do as well as REG when the IS estimator is bad (0 data points are assigned to it)?

The experiment in Figure 3 could also be thought of as a negative result. When the model is good we can just use REG. When the behavior policy is similar to the target policy we can just use WIS. Why go through the extra effort of doing both?
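To make the question about equation 10 concrete, here is a minimal Python sketch, under my reading of the recursion $V_{DR}^{H+1-t} = \hat{V}(s_t) + \rho_t (r_t + \gamma V_{DR}^{H-t} - \hat{Q}(s_t, a_t))$; the function names and toy numbers are my own, not the authors' code. It checks numerically that the step-wise recursion and the three-sum expansion quoted above coincide on a single trajectory.

```python
# Minimal sketch (my own, not the authors' code): the step-wise DR recursion
# and its unrolled three-sum expansion, evaluated on one trajectory.
# rhos[t] is the per-step importance ratio pi_e(a_t|s_t) / pi_0(a_t|s_t);
# q_hat[t] = Q_hat(s_t, a_t) and v_hat[t] = V_hat(s_t) come from the fitted model.

def dr_recursive(rewards, rhos, q_hat, v_hat, gamma):
    """V_DR^{H+1-t} = V_hat(s_t) + rho_t * (r_t + gamma * V_DR^{H-t} - Q_hat(s_t, a_t))."""
    v_dr = 0.0
    for r, rho, q, v in zip(reversed(rewards), reversed(rhos),
                            reversed(q_hat), reversed(v_hat)):
        v_dr = v + rho * (r + gamma * v_dr - q)
    return v_dr

def dr_unrolled(rewards, rhos, q_hat, v_hat, gamma):
    """sum_t gamma^{t-1} [rho_{1:t-1} V_hat(s_t) + rho_{1:t} r_t - rho_{1:t} Q_hat(s_t, a_t)]."""
    total, cum_rho = 0.0, 1.0                      # cum_rho tracks rho_{1:t-1}
    for t, (r, rho, q, v) in enumerate(zip(rewards, rhos, q_hat, v_hat)):
        total += gamma ** t * (cum_rho * v + cum_rho * rho * (r - q))
        cum_rho *= rho
    return total

# Toy trajectory of length H = 3; the two forms agree.
rewards, rhos = [1.0, 0.0, 2.0], [0.8, 1.3, 0.5]
q_hat, v_hat = [1.9, 1.1, 2.2], [2.0, 1.0, 2.0]
assert abs(dr_recursive(rewards, rhos, q_hat, v_hat, 0.95)
           - dr_unrolled(rewards, rhos, q_hat, v_hat, 0.95)) < 1e-9
```

Written this way, the first sum is the model-based baseline and the remaining two sums are an importance-weighted correction of the model's residual $r_t - \hat{Q}(s_t, a_t)$; this is the kind of term-by-term explanation I would like to see in the paper.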
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper suggests extending the doubly robust estimator to the RL problem of off-policy evaluation. The authors describe the estimator and analyze its variance. In addition, they show a hardness result for the problem of off-policy evaluation and show that DR achieves this lower bound under some assumptions.

Clarity - Justification:
While the paper was very readable and I found no typos or errors, I feel some subtle points were missed or should have been given more space (see detailed comments). In addition, even though DR might be an old statistical method, I did not know it before this paper, and the paper gave me little to no intuition behind it. In my opinion, some experiments could definitely be cast aside to allow more space for these issues.

Significance - Justification:
Despite being the main subject of the paper, by my understanding the doubly robust estimator and its analysis have some issues that make it a somewhat problematic contribution on its own (see detailed comments). What I did like a lot is the hardness result (Section 5): an elegant proof, nice conceptually, even with all the restrictions.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Even though I think the paper is generally good, I have several issues that I think would make the paper much less impactful if ignored.

1. The DR estimator seems to rely heavily on a good estimate of Q: it reaches the lower bound, and both summands it is comprised of are less biased. But if you have a good estimate of Q, can't you just evaluate \hat{V} as r + \hat{Q}? If this estimator is what you call REG in the experiments, it seems to generally perform better, so why bother with DR? This is one of the subtle points I was referring to (a toy comparison of REG and DR is sketched after this review).

2. Related to the previous point, DR might achieve the hardness bound when Q = \hat{Q}, but in that case won't \hat{V} = r + \hat{Q} surpass the CRLB, with only var[r] in the estimate? I might be missing something here, but the heavy reliance on Q seems problematic to me.

3. As far as I understood, the idea behind DR is also to handle cases where \rho is unknown. But these cases are usually less considered in the off-policy setting. Is this the setup in which DR shines? If so, maybe it is better to swap \hat{\rho} for \rho and try to work with that, though I suspect much more data would be needed to pull off experiments in that case. Either way, this is another one of the subtle points I feel is missing an explanation.

4. Regarding the hardness result: like I said, it looks impressive and I like the proof, but it might be a bit tiresome to include this proof in the paper. Instead, I would have loved to read about why the proof cannot be extended to general MDPs (without the restrictions), and a simplified explanation of the difference between a discrete tree MDP and a regular MDP.

In summary, while the paper in my opinion does contribute, without addressing the points above I am not sure researchers would know if and when to use DR. The CRLB bound looks interesting enough, but as mentioned in point 2, I find its specific usage here problematic.

=====
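To illustrate point 1 above, here is a small toy comparison (my own construction, with made-up numbers and a deliberately biased \hat{Q}; it is not an experiment from the paper). The purely model-based estimate, which is what I take REG to be, inherits the bias of \hat{Q}, while the DR estimate keeps the importance-weighted residual r - \hat{Q} and so stays unbiased at the cost of extra variance.

```python
# Toy one-step (H = 1) example, my own construction: REG vs. DR when Q_hat is
# biased. Two actions; the behavior policy picks action 1 with prob. 0.5 and
# the target policy with prob. 0.9. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
p_behavior, p_target = 0.5, 0.9
true_r = {0: 0.0, 1: 1.0}                     # true expected rewards
q_hat = {0: 0.3, 1: 0.6}                      # deliberately biased fitted Q
v_hat = p_target * q_hat[1] + (1 - p_target) * q_hat[0]   # V_hat under the target policy

reg_vals, dr_vals = [], []
for _ in range(100_000):
    a = int(rng.random() < p_behavior)        # action drawn from the behavior policy
    r = true_r[a] + rng.normal(0.0, 0.1)      # noisy observed reward
    rho = (p_target / p_behavior) if a == 1 else (1 - p_target) / (1 - p_behavior)
    reg_vals.append(v_hat)                            # REG: trust the model only
    dr_vals.append(v_hat + rho * (r - q_hat[a]))      # DR with H = 1

true_value = p_target * true_r[1] + (1 - p_target) * true_r[0]
print(true_value, np.mean(reg_vals), np.mean(dr_vals))
# 0.9 (truth), 0.57 (REG, biased), ~0.9 (DR, unbiased but higher variance)
```

Of course this toy case does not settle whether the extra variance is worth paying when \hat{Q} happens to be accurate, which is exactly the trade-off I would like the authors to discuss.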