Paper ID: 566
Title: DCM Bandits: Learning to Rank with Multiple Clicks
Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The authors extend the cascading bandits of Kveton et al. 2015 to the case where the system has to produce a ranked list of documents and the user may click on several items (rather than a single one). The authors consider the "dependent click model", in which the user goes through the list from first to last, may click on an item at each position, and, should the user be satisfied, may then decide to terminate the navigation. The reward is 1 if the user chooses to stop, and 0 if the user goes through all the items without voluntarily stopping. Under an independence assumption on click/stopping probabilities given the items/clicks, and assuming that the order of stopping probabilities is known, the objective can be formulated as placing items with higher click probability at the positions with higher terminating probability. The authors show an upper bound on the regret of a UCB-like algorithm for this setting, as well as a matching lower bound. Some experiments on toy and real datasets show that the algorithm performs well compared to some baselines, according to their reward function. The main originality is a form of combinatorial bandit with a non-linear reward function (even though the independence assumptions severely reduce the "combinatorial" nature of the problem). A second originality of the framework is that the learner does not get to observe whether non-clicks at the end of the list arise because the user voluntarily stopped (reward 1) or because the user went through the whole list (reward 0).
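To make the objective described in this summary concrete, here is a minimal sketch (my own illustration, not code from the paper; the probability values are made up) of the expected reward under the DCM independence assumption, 1 - prod_k (1 - v_k * w_k), showing that matching higher click probabilities to positions with higher termination probability increases the reward:

```python
from math import prod

def dcm_expected_reward(click_probs, term_probs):
    """Expected reward 1 - prod_k (1 - v_k * w_k) under the DCM
    independence assumption: the user is satisfied (reward 1) unless
    every position fails to produce a satisfying click."""
    return 1.0 - prod(1.0 - v * w for v, w in zip(term_probs, click_probs))

# Illustrative probabilities (not from the paper).
v = [0.9, 0.5, 0.1]          # termination probabilities by position
w_sorted = [0.8, 0.5, 0.2]   # higher click prob at higher-termination position
w_reversed = [0.2, 0.5, 0.8]

# Matching high click probability with high termination probability wins:
assert dcm_expected_reward(w_sorted, v) > dcm_expected_reward(w_reversed, v)
```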
Clarity - Justification: The paper is well written overall, even though I think some effort could have been made to make the presentation a bit clearer.
Significance - Justification: The framework studied in the paper seems original and interesting, even though the independence assumption on clicks is rather strong. Technically, the contribution is limited, as the algorithm and the proofs are adapted from Kveton et al. 2015, but it is still valuable.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Being able to consider several clicks on the same result page seems important, since it probably happens in practice. Overall, I rated weak reject because the assumption of independent click probabilities given the items seems strong, and because the experiments seem a little weak. I think the presentation could be improved:
- The ranked bandits, e.g. of Slivkins et al. 2013, were designed to tackle the problem that click probabilities between items are not conditionally independent (e.g. if docs 1 and 3 are similar but the user did not click on doc 1, then he most likely will not click on doc 3). It would be interesting to clarify the intended differences in behavior between such algorithms and the DCM model considered here.
- Lemma 1 and Lemma 3 are very similar.
- Line 479: I do not see why Lemma 1 can be applied here. What is the difference with the proof of Theorem 2, line 526?
- The idea that "the user is satisfied when (s)he terminates navigation after a click" probably makes sense but is still fairly debatable. In the experiments, we can see that the estimated termination probabilities are not monotonically decreasing with the rank (see e.g. position 2 vs. position 3 for query 2 in Figure 4.a: position 3 has a much higher termination probability, even accounting for standard deviations). The practical conclusion we are supposed to draw is that, in this search-engine scenario, the algorithm should rank documents with higher click probabilities at rank 3 rather than rank 2 (at least for that query). That does not make much sense to me. On this kind of data, it would also be natural for the ranked bandits not to perform well, since they assume that lower ranks are better. Maybe evaluating the new algorithm on the rankedBandit objective would give a better picture of the behavior of the different algorithms.
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper considers the problem of learning to rank in an online setting. The learning task consists of choosing, in each round, a subset of web pages and a ranking over them, which is then recommended to the user. The user may then click on more than one item. The attraction probabilities and termination probabilities are stationary. The proposed method is based on a reduction technique that decomposes the regret into position-wise regrets. The position-wise regret is upper bounded similarly to the analysis of KL-UCB. Regret lower and upper bounds are given that depend linearly on the number of web pages and on the length of the ranked list shown to the user. The theoretical findings are supported by numerical and benchmark experiments in which the ranked bandits of Radlinski et al. 2008 are used as a baseline.
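For context on the analysis this summary alludes to, the standard Bernoulli KL-UCB index can be sketched as follows. This is my own illustrative sketch of the generic KL-UCB upper confidence bound (computed by bisection), not the paper's dcmKL-UCB algorithm itself:

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Binary KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(p_hat, pulls, t, iters=60):
    """Largest q >= p_hat such that pulls * d(p_hat, q) <= log t,
    found by bisection; this is the usual KL-UCB confidence index."""
    budget = math.log(t)
    lo, hi = p_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pulls * kl_bernoulli(p_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

The index shrinks toward the empirical mean as the number of pulls grows, which is what drives the logarithmic regret in KL-UCB-style analyses.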
Clarity - Justification: I think the paper should be polished carefully. There are some mistakes and inconsistencies regarding the notation but this can be easily corrected.
Significance - Justification: The paper relies on Kveton et al. 2015, where a similar setup with single clicks was introduced. To the best of my knowledge, the proposed multi-click approach is novel.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): I liked the paper. The authors tackled a real-world problem by introducing an interesting model, which is then rigorously analysed by means of lower and upper bounds. Even if the paper should be polished (see minor comments), I recommend that it be accepted. Minor comments:
- In some references: Szepesvari -> Szepesv\'ari
- line 154: [0,1]^{E} -> [0,1]^L
- line 201: a should not be bold, as in line 225
- line 213: r should not be bold since it is a value
- line 223: v, w and A should be bold; at least in this way the notation would be more consistent
- line 232: {0,1}^E -> {0,1}^L
- Proposition 1: C should be a function of i
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper presents DCM bandits, which formalize the problem of learning to rank from multiple clicks. The paper formulates this new problem and proposes dcmKL-UCB for solving it. An analysis of dcmKL-UCB provides gap-dependent regret bounds for the algorithm.
Clarity - Justification: Since cascading bandits are so close to the problem this paper presents, the paper would benefit from providing a formal background on cascading bandits. This would help avoid vague statements like "An analogous property was used in the design and analysis of algorithms for cascading bandits." and make it possible to formally explain the proof of Proposition 1.
Significance - Justification: The paper builds on existing work to propose a new problem, a new algorithm, and an analysis of that algorithm. I think the paper contains enough work to make a significant advance toward the state of the art. The experiments could be improved to make the efficacy of the algorithm more convincing.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Overall, I like this paper. It is well written and, in most places, the explanations and intuition behind the ideas are clearly presented. It tackles an important problem, learning to rank from multiple clicks, so it is of interest to the machine learning community. Some comments:
- The paper regularly refers to cascading bandits for results, concepts and ideas. Thus, it would be helpful to include a formal description of cascading bandits. This would help make the contributions clearer as well.
- The experiments are not the most convincing part of the paper. Sections 5.1 and 5.2 do not add anything to the paper but take a lot of space. The real-data experiments are interesting but could have been more convincing if a greater number of queries had been tested. Also, I think the baseline 'rankedKL-UCB', although relevant and important, is a very weak baseline for DCM bandits as proposed in this paper, which also makes the experiments less convincing. In Figure 4.c, something weird is going on for very small values of n.
Minor points:
- A mathematical definition of w = Pr(?) and v = Pr(?) would help.
- Is w \in [0,1]^{E} correct? Since E is a set, shouldn't it be w \in [0,1]^{L}?
- Section 4.3, Lower Bound, is a bit sloppy and disconnected from the paper. The reader has to press really hard to understand the point here.
=====