We thank the reviewers for their insightful comments and helpful suggestions for improving the paper. Here we first answer common questions and then address individual comments.

MDP-AR6,AR8: The application of the proposed approach to the Q-learning algorithm shows that estimating the maximum with WE is effective even when its assumptions are not satisfied (samples are not i.i.d., Q-value estimates vary over time, and action values are not independent). The choice of applying our estimator to Q-learning is motivated by the decision to test WE in the same settings used to test DE. Nonetheless, other RL algorithms, such as Fitted Q-Iteration, do not present the issues mentioned above, making the use of WE better justified.

OWE-AR7,AR8: The word "optimal" is indeed inappropriate and misleading. AR8's interpretation of OWE is correct. The difference between OWE and WE is that the former computes the probabilities using the true means and true variances, while the latter uses the sample means and sample variances. We will replace "optimal" with "distribution-aware".

WQ-learning-AR6,AR8: Since WE outputs the probability that each action is the best one, it seems natural to use such probabilities in the action-selection mechanism (a short sketch of the weight computation and of this weighted policy is given below). This approach effectively tackles the exploration-exploitation trade-off, since it favors the selection of actions that either have large values or have large variance. Since these probabilities are computed only by WE, we have not used this exploration strategy with the other estimators. As suggested by AR8, we have performed tests with epsilon-greedy policies with larger epsilon values. The results show that with more exploration the max-Q estimate of DE improves, but it remains worse than that of WE, while the overall performance degrades significantly. In conclusion, the weighted policy works effectively without requiring any parameter tuning.

AR6: We thank the reviewer for pointing out the paper by Lee et al. (2013), which we will definitely consider in the final version. They propose a modified version of Q-learning that, assuming Gaussian rewards, corrects the positive bias of ME by subtracting from each Q-value a quantity that depends on the standard deviation of the reward and on the number of actions. We do not believe that this work reduces the novelty of our proposal, since the approaches are significantly different. The main differences are:
- the correction term introduced by Lee et al. (2013) depends only on the number of available actions in the state and on the standard deviation of the Gaussian reward of the updated state-action pair. However, the amount of bias does not depend only on the number of actions and is significantly affected by the actual distributions of the different actions. Their choice may be justified in problems where the action values have (nearly) the same expected value; in fact, the only domain they considered was roulette;
- their approach assumes the rewards to be Gaussian, while our results hold asymptotically for any distribution.
For all the algorithms we used the same learning rates. Since we did not perform hand-tuning, all the algorithms could benefit from different choices of the hyper-parameters, but we do not think we have unfairly favored one algorithm at the expense of another.
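To make the WE/OWE distinction and the weighted policy concrete, here is a minimal Python sketch of the idea (function names and the use of scipy are illustrative; details such as the numerical integration are simplified with respect to Algorithm 1 in the paper): WE plugs in the sample means and the standard deviations of the sample means, whereas OWE would plug in the true parameters, and the weighted policy samples each action with probability equal to its weight.

```python
import numpy as np
from scipy import stats, integrate

def we_weights(means, stds):
    # Probability that each variable has the largest mean, under a Gaussian
    # approximation of the sample means. WE uses sample means and the standard
    # deviations of the sample means; OWE would plug in the true parameters.
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    n = len(means)
    weights = np.empty(n)
    for i in range(n):
        others = np.delete(np.arange(n), i)

        def integrand(x, i=i, others=others):
            # pdf of variable i at x times the probability that all the
            # other variables fall below x.
            return stats.norm.pdf(x, means[i], stds[i]) * \
                np.prod(stats.norm.cdf(x, means[others], stds[others]))

        weights[i], _ = integrate.quad(integrand, -np.inf, np.inf)
    return weights / weights.sum()  # renormalize against numerical error

def we_estimate(means, stds):
    # WE estimate of the maximum expected value: weighted sum of sample means.
    return float(np.dot(we_weights(means, stds), means))

def weighted_policy(q_means, q_stds, rng=None):
    # Weighted policy: pick each action with probability equal to its
    # (estimated) probability of being the best one.
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(len(q_means), p=we_weights(q_means, q_stds))
```

For instance, `we_estimate([0.0, 0.1], [0.5, 0.5])` returns a value between the two sample means rather than their maximum, which is where the reduction of the positive bias of ME comes from.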
The better performance of Q-learning w.r.t. the other estimators in the Forex experiment is explained by the optimal action being significantly better than the other ones, which is a best case for ME (as shown in Section 4).

AR7: As suggested by the reviewer, we will state earlier that the proposed estimator is meant to work with a finite number of random variables, thus requiring tabular representations when used to solve MDPs. Nonetheless, there are several ways to extend WE to continuous MDPs. One possibility that we have already developed consists of approximating the Q-function through Gaussian Processes, thus modeling the uncertainty about the Q-values (a hypothetical sketch is given below). For the sake of space, we decided to focus on the core idea, leaving the extension to continuous domains to future work. We thank the reviewer for suggesting the connection with MGSS*, which we will definitely consider for the camera-ready version.

AR8: We agree that, for practical purposes, the estimator of the maximum needs to perform well also when only a few samples are available. In Figure 5a, we show the effect of the number of samples on the accuracy of the different estimators. It can be noticed that, as the number of samples decreases, the difference between DE and ME shrinks, while the error of WE remains significantly lower than that of both the other approaches. The weights in Algorithm 1 are computed using the integral component in Eq. (6), where the distribution parameters are computed over the Q-values. We will better detail the algorithm in the final version.
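Regarding the continuous-MDP extension mentioned above, the following is only a hypothetical sketch of the GP-based idea, not the code of the extension we have developed: scikit-learn's GaussianProcessRegressor, the function name `we_max_q`, and its parameters are used purely for illustration. The GP posterior mean and standard deviation at a set of candidate actions play the role of the sample means and variances fed to the weight computation (the `we_weights` function of the previous sketch).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def we_max_q(sa_pairs, q_targets, state, candidate_actions):
    # Fit a GP on observed (state, action) -> Q-target pairs; its posterior
    # mean/std at the candidate actions provide the distribution parameters
    # fed to we_weights (see the previous sketch).
    gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
    gp.fit(sa_pairs, q_targets)

    actions = np.asarray(candidate_actions, dtype=float)
    actions = actions.reshape(len(actions), -1)
    queries = np.column_stack([np.tile(state, (len(actions), 1)), actions])
    mu, sigma = gp.predict(queries, return_std=True)

    w = we_weights(mu, np.maximum(sigma, 1e-8))  # guard against zero std
    return float(np.dot(w, mu))                  # WE estimate of max_a Q(state, a)
```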