Paper ID: 470
Title: Estimating Maximum Expected Value through Gaussian Approximation

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes an alternative to Double Q-learning that uses a weighted estimator, based on the assumption that the sample mean has a Gaussian distribution.

Clarity - Justification:
The paper is nicely written. The main limitation, vs. Double Q-learning, is that this method seems tied to tabular value function representations. The authors state this in the last sentence of the paper, but it seems like an important enough limitation that it could be stated earlier on.

Significance - Justification:
It's nice to see a different approach to this question. The significance may be limited at this point by the connection to tabular value function representations.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
One thing that comes to mind is some really old work by Russell and Wefald on MGSS*. If I remember correctly, that work tried to estimate the distribution over the max in search trees and then decided which nodes to expand based on this distribution. That previous work does not diminish the significance of the submitted paper, but it could be an interesting connection to make.

Typo: OWEare -> OWE are

Some further discussion of OWE vs. WE might be helpful. It may seem counterintuitive that OWE has higher bias than WE. It would be good to discuss in what sense OWE is optimal.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose a new estimator for the maximum of the expected values of a set of random variables. The bias and variance of this new estimator are analysed. This is an important problem in reinforcement learning, where it is needed to correct the known bias of algorithms that use the naive max estimator, such as Q-learning.

Clarity - Justification:
Clearly written paper, and the material is presented in a sensible order.

Significance - Justification:
The empirical results are in small-scale domains but somewhat promising. The proposed estimator sounds reasonable, but I'm left wondering whether the assumptions (independence, normality) will prevent this estimator from working in large-scale sequential domains. (The bibliography seems incomplete, so it is difficult to fully assess the novelty of this approach; others (Powell et al.) have looked at correcting the bias under similar assumptions.)

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The proposed estimator (called the Weighted Estimator) is a weighted sum of the sample means. The weights are computed from a Gaussian approximation of how likely each sample mean is to be greater than the others. For example, this has the effect of correcting the bias introduced when one sample mean, obtained from a small number of samples, dominates the others by chance. I think Section 3 could be expanded a bit to provide some more intuition about this new estimator, for example by showing how the number of samples for each set affects the variance and how that in turn affects the WE.
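To make this concrete, here is a rough sketch of how I read the weight computation (my own reconstruction, assuming each sample mean is approximated by an independent Gaussian with variance sigma_i^2 / n_i; I am not claiming this matches the authors' implementation):

# Sketch of a Weighted Estimator as described above: weight_i is the
# probability, under the Gaussian approximation, that sample mean i is the
# largest, and the estimate is the weighted sum of the sample means.
import numpy as np
from scipy import stats, integrate

def weighted_estimator(sample_sets):
    """sample_sets: list of 1-D arrays (>= 2 samples each), one per variable."""
    means = np.array([s.mean() for s in sample_sets])
    # Standard error of each sample mean (CLT / Gaussian approximation).
    std_errs = np.array([s.std(ddof=1) / np.sqrt(len(s)) for s in sample_sets])

    def prob_is_max(i):
        # P(mean_i >= mean_j for all j != i)
        #   = integral of pdf_i(x) * prod_{j != i} cdf_j(x) dx
        def integrand(x):
            p = stats.norm.pdf(x, means[i], std_errs[i])
            for j in range(len(means)):
                if j != i:
                    p *= stats.norm.cdf(x, means[j], std_errs[j])
            return p
        lo = means.min() - 8 * std_errs.max()
        hi = means.max() + 8 * std_errs.max()
        return integrate.quad(integrand, lo, hi)[0]

    weights = np.array([prob_is_max(i) for i in range(len(means))])
    weights /= weights.sum()        # guard against small numerical error
    return float(weights @ means)   # WE = sum_i w_i * mean_i

# Toy usage: three variables with 20 noisy samples each.
data = [np.random.default_rng(i).normal(0.0, 1.0, 20) for i in range(3)]
print(weighted_estimator(data))

Showing how the weights sharpen or flatten as the n_i grow or shrink would, I think, provide exactly the intuition this section is missing.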
Some comments/questions:
* While the authors duly compare the approach to Double Q-learning, they have ignored other work such as "Bias-Corrected Q-Learning to Control Max-Operator Bias in Q-Learning", which would seem important to mention and compare against.
* It is not clear whether the proposed approach is fully justified in an MDP setting, where the independence assumption between the values of different actions breaks down. What would be a worst-case scenario for your estimator in that setting?
* In the experiments, how were the hyperparameters selected? In particular, the learning rate: Double Q-learning might require a larger learning rate than Q-learning because the errors are often smaller on average.
* In the forex example, why are different exploration policies compared, with only W Q-learning allowed the better exploration policy? Without this, W Q-learning actually seems to perform worse than Q-learning. Is it the case that the over-optimism of Q-learning actually helps exploration in this domain? In any case, this confound makes the example not very convincing.
* Looking at examples with function approximation, as suggested in the discussion, would help to convince the reader that the WE is useful in an RL setting.

Minor comments:
* line 139: founding --> finding, forming?
* Hado van Hasselt is cited either as Hado or *V*an Hasselt, both of which are incorrect.
* Equation 3: f_i(s) --> f_i(x)?
* line 175: "This is the reason why...": it is unclear from this alone why the bias is positive; maybe worth explaining.
* line 188: The mechanism of the DE is not really explained here. You should say how the estimator is picked based on a*.
* line 356: no -> not; also "AS" and "OWEare"
* line 410: Theorem 4 -> 3
* line 531: for compute --> to compute
* line 595: "maximizing" is repeated
* Algorithm 1: poor notation for the weights; why not use the notation introduced in the paper?
* line 699: 3x3: x --> \times

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper develops a new estimator, WE, for the maximal mean over a set of distributions. This estimation problem is an essential building block of many decision-making problems in machine learning (in particular, reinforcement learning). A thorough bias/variance analysis and an empirical evaluation on bandit/MDP problems are done to compare WE to existing estimators.

Clarity - Justification:
It's hard to extract take-away messages from the analysis. The MDP-related content has some blurry points.

Significance - Justification:
The new estimator is well justified in the large-sample regime, but real-world challenges usually place us in the small-sample regime, which the paper does not address enough.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The problem of estimating the max over means is commonly found in bandit/RL scenarios, and the paper proposes a new estimator with a thorough theoretical examination of its bias/variance compared to existing methods.

A general concern is: how well does the method work when data is insufficient and the central limit theorem does not apply? The paper claims that the estimator finds important use in reinforcement learning, but in a real-world RL problem we often face data insufficiency, hence it is crucial to understand how the estimator performs in the small-sample regime, both empirically and theoretically. Perhaps little can be said in theory, but the experiments could easily have explored the small-sample setting, which I did not find in the paper; the MDP results may be free of this issue, but they have some separate problems (see below).
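For instance, a quick synthetic check along the following lines (my own sketch with a hypothetical setup, using a sampling-based approximation of the WE weights rather than the paper's numerical integration) would already show how the estimators behave as the number of samples per variable shrinks:

# Rough sketch of the kind of small-sample bias experiment I have in mind:
# a few Gaussian variables with known means, n samples each, and the Monte
# Carlo bias of the Maximum Estimator (ME), the Double Estimator (DE), and a
# Gaussian-approximation Weighted Estimator (WE) against the true max mean.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.0, 0.0, 0.0, 0.0, 1.0])   # hypothetical setup
true_max = true_means.max()

def me(x):                      # max of the sample means
    return x.mean(axis=1).max()

def de(x):                      # two-fold cross-validation (A/B roles swapped)
    a, b = x[:, ::2], x[:, 1::2]
    est_ab = b.mean(axis=1)[a.mean(axis=1).argmax()]
    est_ba = a.mean(axis=1)[b.mean(axis=1).argmax()]
    return 0.5 * (est_ab + est_ba)

def we(x, draws=2000):          # weights ~ P(sample mean i is the largest)
    mu = x.mean(axis=1)
    se = x.std(axis=1, ddof=1) / np.sqrt(x.shape[1])
    sims = rng.normal(mu, se, size=(draws, len(mu)))
    w = np.bincount(sims.argmax(axis=1), minlength=len(mu)) / draws
    return w @ mu

for n in (2, 4, 8, 16, 64):     # samples per variable
    runs = [rng.normal(true_means[:, None], 1.0, size=(len(true_means), n))
            for _ in range(2000)]
    for name, est in (("ME", me), ("DE", de), ("WE", we)):
        bias = np.mean([est(x) for x in runs]) - true_max
        print(f"n={n:3d}  {name}  bias={bias:+.3f}")

Even something this simple, reported alongside the existing experiments, would tell the reader whether the CLT-based approximation degrades gracefully when only a handful of samples are available.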
My second major concern is with the MDP results. In Algorithm 1, the estimated Q-value for a pair (s,a) is not a plain average over some i.i.d. samples; how is the weight computed in this case? This is a very important detail; is it specified somewhere in the paper? On a related note, Q-values have a recursive definition through the Bellman equation, hence Q(s,a) for the same s and different a are correlated with each other in a way much more complicated than the i.i.d. picture depicted in the theoretical sections. I am aware that the Double Q-learning paper suggested this approximate viewpoint, but it is probably more shielded from this criticism because of its simple treatment, whereas your solution makes heavy use of the i.i.d. nature of the variables (via the central limit theorem).

A final major comment is on presentation: all the result sections need an explicit summary of the theoretical/empirical findings. For example, at the end of Section 4, I am desperately expecting a paragraph stating what I should take away from all the theorems above. I have similar feelings about Section 5 (a lack of sharp sentences making the advantages of WE explicit).

Below are two concrete technical comments:
1. The text in Sec 3 confuses me. It argues that ideally the weights should be P(hat_mu_i = max); aren't the ideal weights 1 on the true maximum and 0 everywhere else (which is of course unrealistic)? My understanding is that OWE is a denoised version of DE: using the notation of Eq. (4), DE uses dataset A to sample an item from the distribution P(hat_mu_i = max), whereas OWE takes the expectation w.r.t. this distribution instead of a noisy sample from it, and so has a mean similar to DE's but enjoys reduced variance. From this I can conclude that OWE is better than DE (and so, hopefully, is anything that approximates OWE well, such as WE), but I don't understand why it is necessarily "optimal".
2. Sec 5.2 shows that the W-policy is useful as it induces some kind of "clever" exploration (this is my feeling upon reading the sentence at line 764). However, there is also the possibility that the W-policy performs better simply because it results in more plain exploration, and every other algorithm would also improve when given more plain exploration (e.g., in 5.2.1, multiply epsilon by some constant > 1). You may need to show some empirical results to rule out such a possibility.

To summarize, I think the paper needs to address the small-sample characteristics of the estimator and be more precise about the MDP-related content.

Minor issues:
- Line 356: two typos, "AS" and "OWEare".
- Line 423: DE has the largest variance because it only uses half the data -- doesn't "swapping the role of A and B" (Line 85) remove this weakness?
- Sec 5.1.1 has at least 10000 samples per variable, which seems like a lot of data. Consequently, the MSE in Fig. 5 is on the order of 10^-5. Given such a small scale, is the relative performance of the different estimators really important?
- Your 4th bib entry probably has the wrong order of last/first names.

=====