We thank the reviewers for their comments.

Reviewer 2:
- On sample complexity: The question is non-trivial when the underlying function is not convex, and there are no peer-reviewed published results on this, as far as we know. The question is nevertheless interesting. For convex objectives, finite-time bounds could be derived, but assuming that the CPT-value is convex is not realistic.
- On the experiments: "if we want to optimize CPT-value, we should use a specialized algorithm": We believe the need for a specialized algorithm is a non-trivial practical question, since surrogate losses have been shown to give good results in other contexts. Our experiments show that it is worthwhile to design a specialized algorithm, which in turn justifies the whole work and confirms that the CPT criterion is quite different from EUT and AVG. Since this work lays the foundation for using CPT in a learning setting and there is no prior work or competing algorithm, we believe the empirical evidence we provide is essential.
- On "miss interesting structures": Good point. We were unable to identify any interesting structure for CPT+RL, but we do know that there is no Bellman equation in general. We believe obtaining structural results would require additional assumptions, which may not be well aligned with the principal ideas of CPT.
- On the title: Agreed. We shall modify the title accordingly.

Reviewer 4:
- We agree with the suggestion that the paper be positioned as optimizing the CPT-value in an "active learning" setting, with RL treated only as an application. As you pointed out, the technical content (Sections 2, 3 and 4) is already geared towards optimizing the CPT-value in the "active learning" setting. The introduction will be updated accordingly.

Reviewer 6:
- On CPT parameters: We assume that the *solver* knows these parameters in advance (estimated previously by other means). Behavioral scientists have proposed specific choices for the weight and utility functions and have shown, via experiments involving human subjects, that these choices work across multiple domains and realistically capture human preferences. For instance, several papers conclude that the weight function is inverted-S shaped (cf. pp. 348-349 and Section 5.1.1 of the (Starmer, 2000) reference in our paper). We acknowledge that robustness to unknown parameters should indeed be analyzed in a sensitivity study. We plan to do this after covering the basics (i.e., what we do in the present work).
- We believe traffic signal control is a major application with significant interest (indeed, most major cities are involved in this task). The current setup fixes the routes and optimizes the delays in the CPT sense, while your suggested application does the reverse. We believe "giving driving directions" is another interesting application for experimenting with CPT.
- On "Compared to expected utility... less universal": We believe the opposite is true, as CPT encompasses expected utility (EU). EU is appealing mainly for theoretical reasons, as it is simple and easy to optimize. Yet behavioral science shows major shortcomings of EU in explaining human preferences.
- On "Does the theory of CPT say whether utility is accumulated first?": Intuitively, humans in a given task care about the final outcome more than the individual rewards. For instance, consider the "giving driving directions" example: humans view driving, say to the workplace, as a single task.
If the overall driving time has a 'nice' distribution, they are happy and may not care so much about the individual delays. This setup has actually been investigated previously [1]. On a different note, the paper has contributions beyond the sequential decision-making setting: the technical content (Sections 2, 3 and 4) is already geared towards optimizing the CPT-value in the "active learning" setting, which is not necessarily sequential. RL can be treated as an application where CPT is applied to the return.
- On "We do not know the distribution": By CPT-value estimation we mean the following: design an algorithm that does not know the distribution of X, but can obtain i.i.d. samples from it and also knows the utility and weight functions. We will add this clarification.

Minor comments:
- "quantile-based estimator": The CPT-value definition involves integrating the distorted distribution, which implies that one requires a good estimate of the entire distribution; hence, a quantile-based approach makes sense (an illustrative sketch of such an estimator is given after the references below).
- "reference point": The status quo is suggested as a good baseline in the original CPT paper and is often used in the literature. We acknowledge that choosing a reference point may need application-specific considerations. In our experiments, we chose the delay given by a pre-timed controller as the reference, since it is the de facto standard.

References:
[1] Gao et al. "Adaptive route choices in risky traffic networks: A prospect theory approach." Transportation Research Part C (2010).
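
For concreteness, here is a minimal, illustrative sketch of a quantile-based CPT-value estimator of the kind discussed above: it sorts i.i.d. samples of X and applies the weight functions to the empirical tail probabilities. The Tversky-Kahneman weight function and the identity utilities used below are illustrative placeholders (not necessarily the exact choices in our experiments), and the reference point is taken to be zero.

```python
import numpy as np

def tk_weight(p, eta=0.61):
    """Tversky-Kahneman inverted-S probability weighting (illustrative choice)."""
    return p**eta / (p**eta + (1.0 - p)**eta)**(1.0 / eta)

def cpt_estimate(samples,
                 u_plus=lambda x: np.maximum(x, 0.0),    # utility of gains (identity here)
                 u_minus=lambda x: np.maximum(-x, 0.0),  # utility of losses (identity here)
                 w_plus=tk_weight, w_minus=tk_weight):
    """Quantile-based CPT-value estimate from i.i.d. samples of X,
    measured relative to a reference point of zero."""
    x = np.sort(np.asarray(samples, dtype=float))  # order statistics X_[1] <= ... <= X_[n]
    n = len(x)
    i = np.arange(1, n + 1)
    # Gains part: decision weight of X_[i] is w+((n+1-i)/n) - w+((n-i)/n).
    gains = np.sum(u_plus(x) * (w_plus((n + 1 - i) / n) - w_plus((n - i) / n)))
    # Losses part: decision weight of X_[i] is w-(i/n) - w-((i-1)/n).
    losses = np.sum(u_minus(x) * (w_minus(i / n) - w_minus((i - 1) / n)))
    return gains - losses

# Example: CPT-value estimate of a noisy outcome, with 0 as the reference point.
rng = np.random.default_rng(0)
print(cpt_estimate(rng.normal(loc=0.5, scale=1.0, size=10000)))
```

As a sanity check, with the identity utilities and no probability distortion (eta = 1, so w(p) = p), every decision weight reduces to 1/n and the estimate coincides with the sample mean, i.e., the AVG criterion.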