Paper ID: 984
Title: Differentially Private Policy Evaluation

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): See below.

Clarity - Justification: See below.

Significance - Justification: See below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

The paper presents a differentially private method for policy evaluation in an MDP. In particular, the goal is to estimate a parameter that approximates the value of a given policy for a given MDP. The corresponding problem is a least squares problem, and the authors provide an output-perturbation-based DP solution using two methods: (a) reformulation of the problem, and (b) adding a strongly convex regularizer to the least squares formulation.

Comments:

a. The work is incremental and essentially applies standard DP ideas to standard ideas from MDPs.

b. The problem is poorly motivated, with no compelling application for studying MDPs in the DP setting. In particular, no scenarios are provided where preserving the privacy of individual trajectories of an MDP is important. The experiments are over a toy MDP, with no real justification as to why preserving the path of the given MDP is important.

c. The proposed DP-LSW algorithm is vague and needs more explanation. It is not clear how the w_s are selected; is there some relation of \rho_s to w_s? Is the proposed formulation standard in the MDP literature, and how does it perform on realistic problems?

d. Once the strong convexity of the optimization problems is established (either through reformulation or by explicitly adding a regularizer), the algorithm as well as the privacy/utility analysis (in terms of the objective function) follows directly from existing work. It would be more interesting if the authors could provide more insight into how the objective-function guarantee translates into a guarantee for the estimation of the value function for certain specific instances.

e. It is well known that objective perturbation tends to perform significantly better than output perturbation (Chaudhuri et al., 2011). Was there any specific reason why output perturbation was selected in your setting? (A minimal illustrative sketch of output perturbation for a regularized formulation is given after this review.)

Summary: The paper provides a DP algorithm for value function estimation in an MDP; the problem is formulated as a DP version of a standard least squares problem. The resulting algorithm and analysis are also standard and, as such, do not provide significantly novel insight into the problem. Finally, the problem itself is poorly motivated, with no compelling scenarios where a DP algorithm for value function estimation would be critical.
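The following is a minimal sketch of what output perturbation for a strongly convex (L2-regularized) least-squares objective looks like, to make points (d) and (e) above concrete. It is not the paper's DP-LSW or DP-LSL algorithm: the function and variable names, the toy data, and the value passed as `l2_sensitivity` are illustrative assumptions, and the noise is calibrated with the standard Gaussian-mechanism formula for (epsilon, delta)-differential privacy.

```python
import numpy as np

def fit_private_ridge(X, y, lam, epsilon, delta, l2_sensitivity, rng):
    """Output perturbation for ridge regression (illustrative sketch only).

    X : (n, d) features, y : (n,) targets, lam : ridge parameter.
    l2_sensitivity : assumed bound on how far the exact minimizer can move
        when a single record changes; for strongly convex objectives this
        typically scales like O(1 / (n * lam)), but the exact constant
        depends on the data bounds and is treated as a given here.
    """
    n, d = X.shape
    # Exact minimizer of (1/n) * ||X w - y||^2 + lam * ||w||^2.
    w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    # Gaussian mechanism: noise scale derived from the assumed L2 sensitivity.
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return w_star + rng.normal(scale=sigma, size=d)

# Toy usage (all numbers illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=500)
w_priv = fit_private_ridge(X, y, lam=0.1, epsilon=1.0, delta=1e-5,
                           l2_sensitivity=2.0 / (500 * 0.1), rng=rng)
```

Objective perturbation (Chaudhuri et al., 2011) would instead add a random linear term to the objective before minimizing; the reviewer's question is why that route was not taken.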
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper proposes a differentially private policy evaluation algorithm. The goal of policy evaluation is to evaluate a fixed policy in an unknown environment (MDP), which is an important part of many reinforcement learning algorithms. The paper proposes two ways to deal with the non-unique-solution problem in the traditional policy evaluation formulation, then gives two kinds of output perturbation to obtain differential privacy. The paper gives guarantees of both privacy and utility.

Clarity - Justification: The paper is well written.

Significance - Justification: The paper proposes two ways to deal with the non-unique-solution problem in the traditional policy evaluation formulation, then gives two kinds of output perturbation to obtain differential privacy. The paper gives guarantees of both privacy and utility. The paper also provides new insight for reinforcement learning, which, in my view, is the main contribution of this paper.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

This paper proposes a differentially private policy evaluation algorithm. The goal of policy evaluation is to evaluate a fixed policy in an unknown environment (MDP), which is an important part of many reinforcement learning algorithms. The paper proposes two ways to deal with the non-unique-solution problem in the traditional policy evaluation formulation, then gives two kinds of output perturbation to obtain differential privacy. The paper gives guarantees of both privacy and utility.

Overall, the theory in the paper seems solid, though it is very dense. The paper also provides new insight for reinforcement learning, which, in my view, is the main contribution of this paper. I think it is worthwhile to explore the use of DP in reinforcement learning, since many applications of reinforcement learning and multi-armed bandit problems are user-centric (e.g., recommender systems, personal healthcare, etc.). However, I need to point out that policy evaluation is usually part of a reinforcement learning algorithm, working together with a learning mechanism (and an exploration policy). So it would be better if the authors could analyze the effect of the privacy mechanism on the result of the whole learning algorithm.

Minor comments:

- It would be better if the authors could give a more detailed description of the experiments, such as the \omega_s in DP-LSW and the \lambda in DP-LSL, and how to determine them.

- In Section 3.1, it is said that \phi_X^\omega can be computed in time O(K_X N). What does N mean?

- The method that DP-LSW uses to avoid the non-unique-solution problem seems a little tricky. Could the authors explain where it comes from and how to determine the value of \omega_s?
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper considers the problem of policy evaluation for a fixed policy in an MDP. The assumption is that the states of the MDP are (represented by) points in R^d. Given a set of sample trajectories from the MDP, one wants to learn a linear approximation to the value function. The paper deals with settings where the input comes from potentially sensitive data, such as the browsing history of users. One therefore wants to ensure privacy, and the goal is to learn a linear approximation in a differentially private way.

The most natural way to solve such a problem is to compute the global sensitivity and add noise proportional to it. That approach is likely doomed for this problem, as the global sensitivity can be quite large. This work shows how to apply the smoothed sensitivity framework to the problem. The second solution applies a regularizer to get better control of the sensitivity. The authors prove utility bounds for these algorithms and empirically evaluate them on synthetic data from a simple MDP.

The problem is a linear regression problem, made complicated in the private case by the fact that it is a weighted regression problem, and the weights themselves depend on the private data (a minimal non-private sketch of this weighted formulation is given after the reviews). The first solution replaces these weights by their expectations and applies the smoothed sensitivity framework. The second adds a regularizer to control the sensitivity.

Clarity - Justification: The paper is well written.

Significance - Justification: I find the problem to be of interest. Due to the complicated way in which the private data affects the problem objective, standard DP ERM algorithms don't quite apply. The authors show that the smoothed sensitivity framework can be applied and leads to interesting results.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

p. 3, "main obstruction to stability": this concern is slightly exaggerated, as we know of at least a few problems where, even though the solution is unstable in certain directions, those are directions that don't matter (e.g., in PCA, the top eigenvector is unstable within the span of the eigenvectors with large eigenvalues). We know of techniques that can handle these problems privately quite well.

=====
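To illustrate the structure the last review points to (both the regression targets and the weights are functions of the private trajectories), here is a minimal non-private sketch of a weighted least-squares fit of a linear value function to Monte Carlo return estimates. All names, the every-visit return estimator, and the visit-count weights are illustrative assumptions, not the paper's DP-LSW/DP-LSL definitions.

```python
import numpy as np

def empirical_targets_and_weights(trajectories, num_states, gamma=0.9):
    """Per-state return estimates and visit-based weights (illustrative).

    trajectories : list of [(state, reward), ...] generated by the policy
        being evaluated; states are integers in [0, num_states).
    Returns (v_hat, w): both are computed from the trajectories, so both
    depend on the private data.
    """
    returns_sum = np.zeros(num_states)
    visits = np.zeros(num_states)
    for traj in trajectories:
        G = 0.0
        # Accumulate discounted returns backwards through the trajectory.
        for state, reward in reversed(traj):
            G = reward + gamma * G
            returns_sum[state] += G
            visits[state] += 1
    v_hat = returns_sum / np.maximum(visits, 1)
    w = visits / max(visits.sum(), 1)  # weights are data-dependent too
    return v_hat, w

def weighted_value_fit(phi, v_hat, w, lam=1e-3):
    """Solve min_theta sum_s w_s (phi(s)^T theta - v_hat(s))^2 + lam ||theta||^2."""
    d = phi.shape[1]
    A = phi.T @ (w[:, None] * phi) + lam * np.eye(d)
    b = phi.T @ (w * v_hat)
    return np.linalg.solve(A, b)
```

Because adding or removing a single trajectory changes both v_hat and w, the minimizer's sensitivity is harder to bound than in a plain regression with fixed weights; this is the complication that motivates replacing the weights by their expectations in one solution and adding a regularizer in the other, as described in the reviews.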