Paper ID: 579
Title: Copeland Dueling Bandit Problem: Regret Lower Bound, Optimal Algorithm, and Computationally Efficient Algorithm

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper addresses the Copeland dueling bandit problem, a generalization of the dueling bandit problem in which the problem might not have an arm that beats all other arms (a so-called Condorcet winner); in that case the goal is replaced by finding an arm that beats the highest number of other arms, called a Copeland winner. The authors prove a regret lower bound for the problem and then, inspired by the lower bound, provide a computationally inefficient algorithm (i.e. CW-RMED) with a matching upper bound. To remedy this inefficiency, the authors propose a more practical algorithm (called ECW-RMED) that has a slightly worse regret bound. Moreover, they prove in Section 4.3 that ECW-RMED is asymptotically optimal as long as the Copeland winner is not unique, which I found a bit counter-intuitive, since that includes the Condorcet case, which is supposed to be easier. In addition to the theoretical results, the authors provide compelling experimental evidence for the efficacy of ECW-RMED in solving the Copeland dueling bandit problem.

Clarity - Justification:
The formal proofs are reasonably easy to check because they are essentially machine readable, but there is a certain amount of austerity on the part of the authors when it comes to giving intuition for both the results and the proof techniques. In particular, the statement of the lower bound was very difficult to parse; some intuition should be given, e.g., by estimating its value on example preference matrices. To be fair, this is a rather complicated problem, a fair bit more so than the regular dueling bandit problem with a Condorcet winner, so it is understandable that the expressions are complicated as well, but the authors should put some effort into explaining the results using some simple examples to help the reader.

Significance - Justification:
The authors give a rather clean solution to a challenging problem, improving substantially over the state of the art.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I think the results included in this paper are about as good as one can expect from a bandit paper on a new, practically relevant problem (hence the overall rating), and most of my criticisms (listed below) concern the exposition, which can easily be remedied in the CRC. The only thing that I would like the authors to respond to is the following criticism of the algorithm: from a practical point of view, the most unsatisfying thing about these algorithms is the second inequality in Line 4 of Alg 1, which essentially requires the user to lower bound the smallest gap in the preference matrix using the parameter \beta. For instance, if your smallest gap is 0.001 and you happen to set \beta to be 0.01, then you might have to wait e^{e^{10}} \approx e^{22026} iterations for the loop inside Line 4 to terminate (see the quick calculation below). Given this, the only question I want the authors to answer is how the rest of the algorithm would need to be modified to get rid of this condition and the parameter \beta. I think it is important to add a paragraph or two in the main body of the paper discussing this.
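To put this figure in perspective, here is the back-of-the-envelope arithmetic (nothing specific to the algorithm, just taking the doubly exponential form at face value):

    e^{10} \approx 22026, so e^{e^{10}} \approx e^{22026} = 10^{22026 / \ln 10} \approx 10^{9566},

i.e., no realistic number of iterations comes anywhere near the point where the condition is guaranteed to hold.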
Other things that I think should be discussed in the paper are as follows, most of which can be added to the supplementary material:
1. A more elaborate discussion of the constants in the lower bound, using specific examples.
2. A quantitative comparison of the constants in the lower bound against those of the lower bound in the Condorcet case proven in (Komiyama et al., COLT 2015), in particular in light of Inequality (9) and my comment in the summary of the paper.
3. It would be nice if the authors could expand on the numbers in Table 1 by discussing one of the more glaring examples in the appendix, just to explain where the improvement comes from.
Aside from this, I think it's a great paper.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors consider the regret optimization problem in the so-called dueling bandit problem, where the learner is only allowed to compare two arms and observe the outcome of the stochastic pairwise comparison. An arm i is said to be preferred to arm j if p_{i,j} > 1/2, that is, if the probability of a win for i in a pairwise duel against j is bigger than 1/2. In general, there is no Condorcet winner in this setting, therefore the Copeland winners are considered to be optimal; this notion is well defined for an arbitrary preference matrix (a small illustrative sketch of this computation is included at the end of my comments). The regret is defined in terms of the difference between the Copeland scores of an optimal arm and the chosen arms. A regret lower bound is derived in the spirit of the Lai-Robbins lower bound. An algorithm based on the DMED algorithm is proposed which achieves the lower bound. This algorithm has prohibitively high computational complexity because it solves an LP in every step. Therefore the authors propose an improved algorithm, based on the observation that the top-k solution of the LP can be selected using a simple criterion. The regret bound for the improved algorithm holds only under a stronger condition. The proposed algorithms are tested on LTR data from LETOR, on the SUSHI data, and on synthetic data.

Clarity - Justification:
The notation is very complicated. I spent a significant amount of time looking for notation that was defined earlier. It might be worth summarizing the notation used in the paper in a table.

Significance - Justification:
The problem considered is interesting. The proposed algorithms are based on DMED. The lower bound is interesting.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
If there is only a single Copeland winner, then the regret is zero if and only if that winner is compared to itself. In fact, this raises the question of whether it makes sense to optimize regret in this setting at all: the motivation for this notion of regret is not clear, and best arm identification seems more reasonable here. Moreover, this is also suggested by the fact that the proposed algorithms first identify the best arm or set of best arms with a certain confidence, and then compare these arms to each other. In the standard setting, Munos and Bubeck showed that an algorithm that is optimal for the finite-horizon objective (e.g. best arm identification) cannot be optimal for the infinite horizon. But here the difference does not seem to be essential. Therefore the regret optimization does not seem well motivated. Regarding Theorem 3: why are \alpha and \beta needed in the theorem? (The same applies to Theorem 4.)
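As a side note on the setup described in my summary above, the Copeland winner computation itself is simple; here is a minimal sketch (my own illustration of the standard definition, not code from the paper):

    import numpy as np

    def copeland_winners(P):
        """Return (Copeland winners, Copeland scores) for a preference matrix P.

        The Copeland score of arm i is the number of arms j != i with
        P[i, j] > 1/2; the Copeland winners are the arms with maximal score."""
        beats = P > 0.5                    # beats[i, j]: does arm i beat arm j?
        np.fill_diagonal(beats, False)     # an arm is never compared with itself
        scores = beats.sum(axis=1)         # Copeland score of each arm
        return np.flatnonzero(scores == scores.max()), scores

    # A 3-arm cyclic example (rock-paper-scissors): there is no Condorcet winner,
    # but every arm beats exactly one other arm, so all three are Copeland winners.
    P = np.array([[0.5, 0.6, 0.4],
                  [0.4, 0.5, 0.6],
                  [0.6, 0.4, 0.5]])
    winners, scores = copeland_winners(P)  # winners = [0, 1, 2], scores = [1, 1, 1]

When a Condorcet winner exists it attains the maximal possible score K - 1, so the Copeland winner indeed generalizes the Condorcet winner.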
It is somewhat disappointing that, while the original motivation of the paper is that the preference matrix can be arbitrary because the Copeland ranking is used, the regret bounds only hold when certain regularity properties of the preference matrix are satisfied. In my opinion, Figure 1 does not add much; it was already clear from the text why this is an important gap.
line 285: An strong -> A strong

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper contributes an asymptotic regret lower bound for the k-armed duelling bandit problem, proposes an algorithm for that problem with an asymptotically optimal regret bound, and proposes a computationally efficient variation of that algorithm which is near-optimal.

Clarity - Justification:
The paper is well written and well organised. Some concepts that are mentioned but not explained, e.g., the Borda winner, will be unclear to readers without background on duelling bandits, but this is a minor issue. All proofs are relegated to the supplementary material, which is a shame for a theoretical paper, but perhaps unavoidable.

Significance - Justification:
To my knowledge, the only previously existing lower bound for duelling bandits was the one due to Yue et al., which applies only to a restrictive class of duelling bandit problems. Therefore, the contribution of this new lower bound strikes me as quite significant. The algorithmic contributions are also substantial, offering both an asymptotically optimal algorithm and a near-optimal but more practical alternative.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, this is an excellent paper. In addition to the theoretical and algorithmic contributions, there is also a solid set of experiments comparing the proposed methods against strong baselines. A couple of minor quibbles:
- The assumption of uniqueness of the optimal solution in 4.4 is a bit awkward. The authors show that this doesn't matter experimentally, but only for one particular preference matrix. It would be nice to have an argument as to why this wouldn't matter in general.
- A similar concern applies to the assumption that \mu_{ij} != 1/2 if i != j. Can the authors comment on how likely this is to hold in practice?
- I'm not convinced by the argument in the intro about the generality of the duelling bandit problem. Sure, relative feedback arises in many settings. But where, besides information retrieval, do you care about regret minimisation instead of just best-arm identification? It seems to me that this happens only when you have something like interleaved comparisons, which allow you to elicit feedback online without bothering the user. If there are other examples, the authors should mention them.

=====