We thank the reviewers for their careful reading and encouraging comments.
General responses:
>On parameters alpha and beta of CW/ECW-RMED
First, parameter alpha urges the draw of each pair for o(log T) times and is used for assuring the quality of the estimator \hat{\mu}_{i,j}. Such quality-assuring parameters are often used in some variants bandit problems (e.g, 8log n in UCB1-NORMAL [Auer+2002], log(log n) in CKL-UCB [Magureanu+2014]).
Second, we apologize for a mistake: the true value of beta is 0.01 (not 0.1) throughout all experiments in this paper, which means that the exploration by Step 4 occurs less frequently than in the algorithm with beta=0.1. Regardless, the algorithm is not sensitive to the choice of beta as long as it is small. In fact, we experimentally confirmed that setting beta=0 yields almost the same results as the ones in the paper. Based on this observation, we conjecture that parameter beta is a theoretical artifact. Technically, beta is required for bounding the regret when the quality of the estimation is low (equation (32)).
In summary, a positive alpha seems to be necessary both theoretically and practically, but its influence on the regret is easily controlled.
A large beta sometimes causes enormous regret, but a very small beta or beta=0 is practically sufficient.
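The quality-assuring role of alpha described above can be sketched as a forced-exploration rule; the threshold form alpha*sqrt(log t) and the function name below are our illustrative assumptions, not the paper's exact rule (what matters is only that the threshold grows as o(log t)):

```python
import math

def underexplored_pairs(counts, t, alpha=3.0):
    """Pairs (i, j) drawn fewer than alpha * sqrt(log t) times at step t.

    Any such pair is forced to be compared, so each pair is drawn only
    o(log T) times in total -- enough to keep mu_hat_{i,j} reliable without
    contributing to the leading O(log T) regret term.
    """
    threshold = alpha * math.sqrt(math.log(t))  # o(log t) by construction
    return [pair for pair, n in counts.items() if n < threshold]

counts = {(0, 1): 1, (0, 2): 10, (1, 2): 4}
print(underexplored_pairs(counts, t=100))  # -> [(0, 1), (1, 2)]
```

Pairs (0, 1) and (1, 2) fall below the threshold (about 6.44 at t=100) and would be forced; pair (0, 2) has been drawn often enough.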
>Theorem 3 and 4
We think that the regularity conditions are not very restrictive because the solution of an LP is non-unique only in degenerate cases. For example, the solution becomes unique if the 0.8s in the MultiSol preference matrix are replaced by values that are slightly different from each other (e.g., the 0.8s are replaced by 0.8001, 0.8002, 0.8003, and so on). Moreover, these conditions are required only for deriving the optimal constant in front of the O(log T) term: it is not difficult to prove that the algorithms have O(log T) regret without the regularity conditions.
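The degeneracy argument can be illustrated with a toy two-variable LP (the vertices and cost vectors below are hypothetical, chosen only to mimic the tied-0.8 situation): when two vertices have exactly tied costs the optimum is non-unique, and an arbitrarily small perturbation of the ties restores uniqueness.

```python
# Toy LP: minimize c . x over the two vertices of {x >= 0, x0 + x1 >= 1}.
# With tied cost coefficients (the analogue of the repeated 0.8s),
# both vertices are optimal; a tiny perturbation breaks the tie.
VERTICES = [(1.0, 0.0), (0.0, 1.0)]

def optima(c):
    """Return all cost-minimizing vertices for cost vector c."""
    costs = [c[0] * x0 + c[1] * x1 for x0, x1 in VERTICES]
    best = min(costs)
    return [v for v, cost in zip(VERTICES, costs) if abs(cost - best) < 1e-12]

print(optima((0.8, 0.8)))     # tied costs -> two optima (degenerate)
print(optima((0.8001, 0.8)))  # perturbed  -> unique optimum [(0.0, 1.0)]
```

The perturbed instance has a strictly better vertex, so the solution set collapses to a single point; this is why the regularity conditions fail only on a measure-zero set of preference matrices.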
>best-arm identification
We agree that best-arm identification is sometimes more appropriate than reward maximization. We would also like to note that most (though not all) of the best-arm identification literature deals with fixed-time problems (fixed-confidence or fixed-budget), and thus regret-based bandit algorithms are easier to deploy when the number of time steps (users) is difficult to estimate beforehand, even when best-arm identification is the actual objective.
Responses to Reviewer 1:
Regarding parameter beta, please see the general response above. We will add a discussion of the role of these parameters in the camera-ready version.
We will add a simple example and compare the regret of RMED and CCB in the supplementary material. This will (i) help in understanding the lower bound and (ii) clarify where the gap between the regret bounds of CW/ECW-RMED and CCB comes from.
Responses to Reviewer 2:
>best-arm identification
As pointed out by Reviewer 3, the primary application of the dueling bandit problem is interleaved comparison in ranker evaluation tasks, where it is natural to consider online utility (reward) maximization. In this case, comparing an arm with itself corresponds to a pure ranker (without interleaved comparison), and the utility is based on the goodness of each ranker. Please also see the general responses above.
>parameters alpha and beta
The logarithmic regret bound of CW/ECW-RMED is independent of the values of alpha and beta. For any alpha, beta>0, CW-RMED is optimal, and ECW-RMED has a divergence-based regret bound that is close to optimal.
Responses to Reviewer 3:
>mu_ij != 1/2
We think that it is difficult to remove this assumption, as it is also made in previous work (Zoghi+ NIPS2015).
Relaxing this assumption (possibly via an \epsilon-tolerant Copeland winner) is possible future work from a practical point of view, because mu_ij is sometimes close to 1/2.