We thank the reviewers for their time and effort. In the following we briefly address the questions raised in the reviews.

* Motivation *
Our principal motivation for studying differential privacy (DP) in reinforcement learning (RL) comes from medical applications. Learning and evaluating dynamic treatments is a crucial part of personalised medicine and adaptive clinical trials. Privacy at the level of trajectories is a natural requirement in these applications because each trajectory represents the evolution of a single patient during their treatment, and may include information such as the sequence of medications given, as well as possible side effects. We will include this example in the introduction to clarify the motivation of our work.

* Novelty *
Policy evaluation and private ERM are indeed well-studied problems, but combining the two in a meaningful way is not straightforward. In particular, we do not think our privacy and utility analyses, which are based on the smooth sensitivity of optimisation problems, follow directly from existing work, for two reasons. First, the stability of policy evaluation methods has not been thoroughly studied, and the perturbation model in which a full trajectory is replaced differs from the usual one in ERM, where a single regression target is replaced, because the information of several correlated examples changes (see Section 7 of the paper for a more thorough discussion). Second, the typical output and objective perturbation techniques for ERM are based on global sensitivity, whereas for policy evaluation the smooth sensitivity framework is needed to obtain good utility (a short reminder of this framework is sketched after the responses below).

* Output vs. Objective Perturbation *
It is known that for least-squares objectives it is possible in most cases to translate directly between objective and output perturbation. We will modify the paper to clarify this and add a paragraph explaining how to interpret DP-LSW and DP-LSL as output perturbation methods; a sketch of this translation for a generic least-squares objective is given below.

* Derivation and Parameters for DP-LSW *
The derivation of DP-LSW is motivated by the similarity between equations (5) and (7) in the paper. In (5), a source of instability in the solution is the dependence of the regression weights on X through Gamma_X. In (7), this dependence is removed by taking the parameters w_s as input; they can be used to adjust the relative importance of the value estimate in different states. They play a role similar to that of the rho_s in DP-LSL, but we gave them a different name because, as described in line 340, there is a choice of w_s that mimics the asymptotic behaviour of LSL with given rho_s. If no prior knowledge is available, the w_s can be chosen to be uniform (as we do in our experiments), and Corollary 12 in our utility analysis shows that DP-LSL behaves well when these weights satisfy a mild condition.

* Relation to RL *
Our ultimate goal is to design DP algorithms for the full RL problem. Policy evaluation is an important building block, and by invoking the composition and post-processing properties of DP one can see that our algorithms can be plugged into any RL algorithm, making the policy evaluation step, and anything derived from it, private. For example, one could use our algorithms to derive a DP version of Q-learning by estimating a state-action value function instead of a state value function. We will investigate this and other alternatives for achieving DP RL algorithms in the control setting in future work (e.g. combining this policy evaluation approach with ideas from differential privacy for bandits).
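* Sketch: Smooth Sensitivity (re Novelty) *
For completeness, we recall the smooth sensitivity framework of Nissim, Raskhodnikova and Smith that the point above refers to; the notation here is generic (a data set D of trajectories and a statistic f) rather than the paper's. The local sensitivity of f at D and its beta-smooth upper bound are
$$ LS_f(D) = \max_{D' : d(D, D') = 1} \| f(D) - f(D') \|, \qquad S_{f,\beta}(D) = \max_{D'} e^{-\beta\, d(D, D')} \, LS_f(D'), $$
where d(D, D') counts the trajectories in which D and D' differ. Calibrating noise to S_{f,beta}(D) rather than to the global sensitivity max_D LS_f(D) is what allows the perturbation to adapt to the stability of the regression problem induced by the observed trajectories.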
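* Sketch: Objective vs. Output Perturbation for Least Squares *
As an illustration of the translation mentioned above, consider a generic ridge-regularised least-squares problem; the notation A, c, lambda, b is generic and does not reproduce the exact objectives of DP-LSW and DP-LSL. Objective perturbation adds a random linear term b^T theta, and the perturbed minimiser is available in closed form:
$$ \hat{\theta}_b = \operatorname*{argmin}_{\theta} \; \|A\theta - c\|_2^2 + \lambda \|\theta\|_2^2 + b^\top \theta = (A^\top A + \lambda I)^{-1} \Big( A^\top c - \tfrac{1}{2} b \Big) = \hat{\theta}_0 - \tfrac{1}{2} (A^\top A + \lambda I)^{-1} b. $$
That is, for quadratic objectives, perturbing the objective is equivalent to adding the correlated output noise -(1/2)(A^T A + lambda I)^{-1} b to the non-private solution; this is the observation behind the output-perturbation interpretation we will add to the paper.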
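* Sketch: Privacy of Downstream RL Algorithms (re Relation to RL) *
The argument above only uses the two standard properties of DP, which we restate for convenience. Post-processing: if M is (epsilon, delta)-DP, then g composed with M is (epsilon, delta)-DP for any g that does not access the raw trajectories again. Basic sequential composition:
$$ M_i \text{ being } (\epsilon_i, \delta_i)\text{-DP for } i = 1, \dots, k \;\Longrightarrow\; (M_1, \dots, M_k) \text{ is } \Big( \textstyle\sum_i \epsilon_i, \sum_i \delta_i \Big)\text{-DP}, $$
with tighter advanced-composition bounds also available. Hence a control algorithm whose only access to the trajectories is through k calls to DP-LSW or DP-LSL inherits the corresponding cumulative privacy guarantee.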
* Further Clarifications *
- The parameter N in the complexity of the algorithms is the number of states in S.
- We will include a paragraph in the experimental section describing how the different parameters of the algorithms can be selected in practice.