Paper ID: 680
Title: Cumulative Prospect Theory Meets Reinforcement Learning: Prediction and Control

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper provides an algorithmic foundation for sequential decision making (RL) in the cumulative prospect theory (CPT) framework. CPT has proven to be a much better model of human preferences than classical expected utility, so RL tasks that involve decisions optimized with respect to human preferences could be a good fit for this framework.

Clarity - Justification:
The motivation and mathematical development are very solid. However, my major concern about clarity centers on the question of how I should think about all this. Typically, CPT is used to model the preferences expressed by a person, and there are many free parameters to fit. Are we imagining the solver *knows* these parameters in advance? If not, where might they come from? If so, what justifies the use of any particular parameter setting? In the computational example, "average" preferences for a population are used to devise a control strategy for traffic lights that maximizes preferential outcomes. Is that the main use case? Or should/would it make sense to imagine applications like driving directions, where the choices are made with respect to a specific user's preferences? Compared to expected utility, CPT seems much more idiosyncratic and less universal. If you want to argue that CPT is useful in sequential decision making, these issues need to be addressed. (For what it's worth, one reason that CPT hasn't found more use in the RL community is this issue. I know of at least one attempt to bring CPT to bear that never saw the light of day because this issue was unresolved.)

In addition, it's not even clear that an MDP has a coherent interpretation under a CPT regime. In particular, the paper says "we apply the CPT-functional to the return of a policy." What justifies this particular viewpoint? Does the theory of CPT say whether utility is accumulated first? In truth, a policy induces a random *sequence* of rewards. When humans are asked to choose amongst outcomes sequentially, their preferences can exhibit temporal non-stationarities. Have we ruled such preferences out by applying CPT transformations to accumulated reward?

"However, we do not know the distribution": Who are "we"? The algorithm designer? The analyst? The learning algorithm? The human user? This issue is related to the earlier one: the use case is somewhat unclear.

Significance - Justification:
At this stage, it's difficult to tell. If the philosophical/methodological issues can be addressed, then this paper could be a seminal work providing an algorithmic framework for applying CPT to real-world problems. If they can't, it's just some interesting math.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- "enterprises or hard" -> "enterprises work hard"?
- "In this paper, we consider a natural quantile-based estimator and analyze its behavior.": What is the justification for choosing this estimator?
- "of a the" -> "of the".
- "the inverted-s": Wouldn't that be a reversed s? (s is the same when inverted.)
- "smaller the global" -> "smaller than the global"?
- "captures realistically captures" -> "realistically captures"?
- "This is similar to EUT, except that no distinction between gains and losses via utility functions nor distort using weights as in CPT.": Grammar issue?
"An important recommendation of CPT is to employ a reference point to calculate gains and losses.": Indeed. Do we need per-application guidelines for choosing these reference points? "Figure 4. Histogram": Perhaps put on same scale to aid in comparison? ===== Review #2 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): The contributions of this paper are: 1) Connect the ideas of cumulative prospect theory (CPT), simulation-based optimization and reinforcement learning (RL); 2) Propose an algorithm to estimate the CPT-value based on order statistics (Algorithm 1), and discuss the asymptotic convergence and sample complexity of the proposed algorithm under various technical assumptions (Holder continuity, Lipschitz continuity and locally Lipschitz continuity), see Section 3; 3) Propose a gradient-based algorithm for CPT-value optimization (Algorithm 2), and prove its convergence under mild technical assumptions (Theorem 1), see Section 4; a second-order CPT-value optimization scheme is discussed in the appendix. 4) Demonstrate preliminary experiment results, see Section 5. Clarity - Justification: The paper is well-written and easy to follow. Significance - Justification: I think the paper is significant since it connects the ideas of cumulative prospect theory (CPT), simulation-based optimization and reinforcement learning (RL). I think this is a very promising direction and hopefully this paper will motivate more research works along this direction. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The contributions of the paper are summarized above. In general, I think this paper is significant since it connects the ideas of cumulative prospect theory (CPT), simulation-based optimization and reinforcement learning (RL). I think this is a very promising direction and hopefully this paper will motivate more research works along this direction. This paper is well-written and easy to follow. Thus, I vote for accept. However, I vote for "weak accept", rather than "strong accept", for the following reasons: 1) This paper only proves that the proposed algorithm (Algorithm 2) is convergent (Theorem 1); there is no sample complexity (convergence rate) result. I think the authors should either provide a sample complexity result, or discuss why sample complexity analysis is challenging in this case. 2) The simulation result is not interesting: since EUT-SPSA and AVG-SPSA have different objectives, it is not surprising that they perform worse if measured in CPT-values. So the only thing we have learned from the experiment is that "if we want to optimize CPT-value, we should use a specialized algorithm", which is not surprising and not so interesting. 3) A minor comment: The title of the paper and the contribution of the paper are somewhat misaligned: the title says this paper connects CPT and RL; however, the major contribution of this paper is about how to optimize CPT-value via simulation-based optimization. I fully understand that simulation-based optimization can be used in RL, as shown in this paper, but still feel there are some misalignments, for two reasons: (a) obviously simulation-based optimization can be applied to problems other than RL; (b) I feel that if we think of RL in this way, we might miss many interesting structures in the CPT-RL problem. 
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper integrates cumulative prospect theory (CPT) into the reinforcement learning (RL) setting, which, in principle, can align the results with the real preferences of human beings. The first step is to estimate the distribution, and in this paper the authors use the empirical distribution. The proposed CPT-value is then estimated with a quantile-based estimator. Finally, simultaneous perturbation stochastic approximation (SPSA) is used for the simulation-based optimization. Both theoretical guarantees and simulation experiments are given.

Clarity - Justification:
Suggestions:
1. Please explain Fig 2 more clearly, if it is a necessary illustration.
2. Add a brief review of the different techniques used at the different stages, or at least explain why each specific algorithm was chosen for the problem.

Significance - Justification:
Generally, this paper incorporates CPT into the RL framework and uses a series of techniques to solve the different problems arising at different stages of RL. It is a colossal problem in which each stage needs to be dealt with carefully. The take-home point of this paper is to consider CPT within the RL framework; however, in general utility theory the measure of preference is more qualitative than quantitative, which makes behavioral economics itself seem somewhat paradoxical.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
See above.

===== Review #4 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper introduces the idea of Cumulative Prospect Theory (CPT) to the active learning setting, where it can be used to better model human preferences in such domains. It introduces a numerical scheme to estimate the CPT objective and an algorithm based on Simultaneous Perturbation Stochastic Approximation (SPSA) to optimize a parameterized solution (a minimal sketch of such an update appears after the reviews).

Clarity - Justification:
The paper is clearly written and overall easy to follow. There are plenty of non-trivial references to the appendix that should probably have been included in the main text. I did not check all the proofs in the appendix, but the theoretical analysis is sound.

Significance - Justification:
The use of CPT in optimization is not novel to this work, but the treatment of the more general model-free setting, without assumptions on nested Bellman operators, adds to the related literature. However, I strongly object to the framing of the work as it relates to RL. The methods and algorithms introduced in the paper treat the problem domain as an "active learning" setting, where one can only try a solution parameter and observe (several) stochastic outcomes. Even the gradient approach is a generic active learning mechanism without any trace of the sequential nature of a generic RL domain.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I believe the paper adds an interesting paradigm to the literature on active learning and is thus relevant and significant for the ICML audience. However, IMHO most mentions of RL in the paper should be replaced with the more appropriately related setting of active learning. It would be interesting to use the sequential nature of the RL domain, along with importance sampling methods on off-policy trajectories, to estimate the gradient of the numerical CPT approximation without drawing new samples.

Minor notes:
- A small notation table would help readability.
- Line 321: Mention whether the optimal CPT policy is stationary.
- Line 538: a -> \rho. Also in the appendix.
- Line 569: biased -> asymptotically unbiased. If this estimator is biased, that was not discussed clearly in the text.

=====
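For reference against the control-side discussion in Reviews #2 and #4, here is a minimal sketch of a two-sided SPSA ascent step on a simulation-based CPT objective of the kind Algorithm 2 optimizes. The gain-sequence constants, the projection-free update, and the simulate_returns helper in the usage comment are illustrative assumptions, not the paper's exact algorithm.

import numpy as np

def spsa_step(theta, cpt_of, k, a=1.0, c=1.9, A=50.0, alpha=0.602, gamma=0.101, rng=None):
    # One two-sided SPSA ascent step for maximizing a noisy, simulation-based
    # CPT-value objective. `cpt_of(theta)` is assumed to return a CPT-value
    # estimate (e.g., the order-statistics estimator sketched earlier) computed
    # from returns simulated under the policy parameter `theta`.
    rng = np.random.default_rng() if rng is None else rng
    a_k = a / (k + 1 + A) ** alpha                           # step size (standard SPSA gain schedule)
    c_k = c / (k + 1) ** gamma                               # perturbation size
    delta = rng.choice([-1.0, 1.0], size=np.shape(theta))    # Rademacher perturbation
    grad_hat = (cpt_of(theta + c_k * delta) - cpt_of(theta - c_k * delta)) / (2.0 * c_k * delta)
    return theta + a_k * grad_hat                            # ascent, since the CPT-value is maximized

# Hypothetical usage (simulate_returns is an assumed helper that draws episode
# returns under parameter theta, e.g., from a traffic simulator):
# theta = np.zeros(4)
# for k in range(500):
#     theta = spsa_step(theta, lambda t: cpt_value(simulate_returns(t, m=500)), k)

Because the update only needs noisy CPT-value evaluations at perturbed parameters, it makes no use of the sequential structure of the returns, which is the substance of Review #4's objection about the active-learning framing.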