We thank the reviewers for the helpful suggestions on improving the exposition, and we will incorporate them in the revision. We first address common misunderstandings before responding to individual comments.

[R1,R5] Relatively limited technical contributions? We make two important technical contributions. Algorithmically, as far as we know, Eq.9 and its insight are novel: stepwise IS was originally proposed as a variance-reduction patch for IS, but Eq.9 reveals a direct, new connection to bandit IS, which naturally leads to our MDP DR estimator (Sec 4.1) and an extended version (Sec 4.4). Before arriving at Eq.9, we had studied several other seemingly promising ways to extend bandit DR, all of which are either biased or inefficient. The other major contribution is theoretical: the lower bound. On one hand, we developed nontrivial proof techniques; on the other hand, the bound precisely quantifies the fundamental difficulty of off-policy evaluation in the sequential setting. Such hardness results also have important implications for other RL problems such as model selection (cf. thesis of Farahmand).

[R1,R4,R5] Why not just REG? The good performance of REG is in a sense an artifact of the well-understood benchmarks; in complex, real-world applications, learning an accurate model with function approximation (FA) is challenging. Furthermore, it is practically difficult to verify or falsify whether a particular FA is appropriate for the problem at hand (line 233), which fundamentally limits the credibility of REG. In contrast, DR always retains the unbiasedness guarantee of IS, no matter how poor the Q^ estimate is (e.g., DR-bsl in Figs 1,2); if we happen to build a reasonable model, DR *automatically* takes advantage of it. We therefore recommend DR over IS and REG for most problems. Moreover, one can always make IS and its variants outperform REG by changing the experimental setup. For example, one may use enough evaluation data, since REG has non-vanishing bias (cf. Thomas et al., AAAI: 100,000 trajectories for M-Car; we use 5,000), or one may reduce the trajectory length (cf. thesis of Thomas: 30 on average; ours are mostly 100). Compared to these state-of-the-art results, DR impressively outperforms REG in long-horizon problems with limited data.

[R4,R5] Why tree MDPs? While they are technically special cases of MDPs, they make much weaker assumptions about the environment and therefore improve the credibility of evaluation results (lines 448-451). In particular, they capture a very general, partially observable RL setting (typical in medical, dialog, education, and many other applications) where states are represented by the interaction history. That said, we agree it is an interesting open problem to derive similar lower bounds for usual MDPs, which are useful when one is comfortable with the Markov assumption. We take a first step in this direction; see the DAG-MDP results in the appendix. The difficulty with regular MDPs is that a state can appear at any time step, which complicates the calculation of the Cramér-Rao (C-R) lower bound. We conjecture that the ratio in Thm 2 (appendix) would be replaced by some average over all steps, but the precise form remains open.

[R1] Why DR-v2 reduces more variance: in the ideal case of Q^=Q, DR-v2 eliminates the variance due to state transitions, precisely because it uses the next state recorded in the data.
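To make the DR recursion concrete (Sec 4.1; the per-trajectory quantity averaged in Eq.10), we include a minimal illustrative sketch in Python. It is not a verbatim transcription of Eq.10: the function and variable names (dr_trajectory_estimate, pi1, pi0, q_hat) are ours for this response only, pi1/pi0 denote target/behavior action probabilities, and we assume a finite action set and a finite-horizon trajectory.

    def dr_trajectory_estimate(traj, pi1, pi0, q_hat, actions, gamma=1.0):
        """Illustrative per-trajectory doubly robust estimate (sketch only).

        traj     : list of (s, a, r) tuples in time order
        pi1, pi0 : callables giving target/behavior probabilities pi(a | s)
        q_hat    : callable giving the approximate action value Q^(s, a)
        actions  : iterable over the (finite) action set
        """
        v_dr = 0.0
        # Walk the trajectory backwards, applying one DR correction per step:
        # a model-based baseline V^(s) plus an importance-weighted residual
        # that corrects whatever error Q^ has at the visited (s, a).
        for s, a, r in reversed(traj):
            rho = pi1(a, s) / pi0(a, s)                             # per-step IS ratio
            v_hat = sum(pi1(b, s) * q_hat(s, b) for b in actions)   # V^(s) from Q^
            v_dr = v_hat + rho * (r + gamma * v_dr - q_hat(s, a))
        return v_dr

The final estimate is the average of this quantity over the evaluation trajectories. The Q^-dependent terms cancel in expectation (E[rho * Q^(s_t, a_t) | s_t] equals the baseline V^(s_t)), which is why unbiasedness does not depend on how Q^ was obtained; Q^ only affects the variance.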
Variance of reward in Obs 1: in the simplification of tree MDPs, we assume that the reward only occurs at the end of the trajectory and that its randomness is encoded as an additional state transition (line 464), so there is no reward variance. D_eval: yes, it is generated using pi_0. Why DR≠REG when no data is left for IS: DR's final estimate is the average of V_DR (Eq.10) over a subset of the data (the same subset IS uses). When all data go to REG, it has a good Q^ but there is simply nothing to average over. Why not WIS in Fig 3: the idea of WIS does not conflict with DR, and we are aware of follow-up work that combines them into “weighted DR”.

[R4] 1. See “Why not just REG”. Note also that the correct way to convert Q to V is Eq.3, not r+Q (the identity is spelled out at the end of this response). 2. If Q^=Q, the variance of V^ (Eq.3) is indeed 0. However, no estimator can give Q^=Q for *all* problem instances in a general family of MDPs. In practice we encode strong assumptions in the FA (line 446), so that the Q^ given by REG is close to Q on the subset of instances where those assumptions hold; DR approaches the lower bound if the true MDP lies in this subset, and if not, DR still enjoys unbiasedness. 3. Yes, DR helps when rho is unknown, but it also helps reduce variance. We focus on the latter because high variance is the central issue of IS in RL (it is exponentially larger than in bandits).

[R5] 2. Sorry for the confusion; we will choose better names. The estimators are “trained” (in the reviewer's terminology) on D_eval, and evaluated by comparing their estimates against the true value of the target policy.
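For completeness, the Q-to-V conversion referred to in [R4] 1. is the standard identity (Eq.3 is of this form): for any policy pi,

    V^{\pi}(s) \;=\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\, Q^{\pi}(s,a) \,\big] \;=\; \sum_{a} \pi(a \mid s)\, Q^{\pi}(s,a).

Since Q^{\pi}(s,a) already includes the expected immediate reward at (s,a), forming r + Q counts that reward twice.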