Dear reviewers,$
We want to thank you for your reviews, praising our contributions, and suggesting acceptance in 2 out of 3 cases. We start our rebuttal by addressing your three major concerns:
* Modeling assumptions in DCM bandits *
We agree that the independence and stopping assumptions in DCM are strong, and therefore we stress them throughout the paper. Nevertheless, based on the cited survey of Chuklin et al. (2015), DCM performs very well in practice. In particular, SDCM (DCM trained by MLE) can explain click data almost as well as CCM and SDBN (Figure 7.1), where the user can leave without clicking. SDCM can be estimated orders of magnitude faster than CCM and also faster than SDBN (Table 7.2).
The benefit of our modeling assumptions is that the number of learned parameters is L, one probability per item. In ranked bandits, the number of parameters is K L, one probability per item-position pair. Therefore, when our assumptions hold / nearly hold, we expect an order of K lower regret than in ranked bandits. Slivkins et al. (2013) study contextual ranked bandits. We also believe that our approach can be contextualized and benefit from features.
* Algorithmic contributions *
dcmKL-UCB can be viewed as a natural variant of CascadeKL-UCB where the learning agent observes multiple clicks. Therefore, the design of dcmKL-UCB is not surprising. However, it is not obvious which multi-click objective is optimized by dcmKL-UCB. We identify this click model, DCM; and show not only that dcmKL-UCB optimizes this model, but also that dcmKL-UCB is regret optimal on a large subclass of these models. This is highly non-trivial.
* Real-world experiments *
We choose the Yandex dataset because it is one of the largest public click datasets. Figure 4a shows what happens when the estimation procedure of Guo et al. (2009b) is applied to our data and we honestly report our findings. Both trends slope downwards but are not monotonic. We recognize that our model may not perfectly fit real-world data and discuss why we can still learn in this setting below.
Our real-world experiments in the submitted paper are limited. To demonstrate that our improvements are real and significant, we compare dcmKL-UCB to RankedKL-UCB and RankedExp3 on 20 most popular queries in the Yandex dataset. RankedExp3 is a ranked bandit with Exp3 that can model correlations between recommended positions. We parameterize Exp3 as in the Auer's original paper. All algorithms assume that higher ranked positions are more valuable, which would be the setting in practice. This is not necessarily true (Figure 4a). However, this assumption is quite reasonable because most of our estimated query models look as follows. The first position is most terminating and the most attractive item is significantly more attractive than the other items. Therefore, any solution that puts the most attractive item at the first position tends to perform well. We compare all algorithms by the average n-step regret over all 20 queries, with 5 runs per query:
dcmKL-UCB: 33.08 (n=100), 190.33 (n=1k), 569.80 (n=10k)
RankedKL-UCB: 47.24 (n=100), 352.14 (n=1k), 2024.67 (n=10k)
RankedExp3: 54.33 (n=100), 528.49 (n=1k), 3258.27 (n=10k)
All of our improvements are significant at p < 0.05. At n = 10k, our regret is almost 4 times lower than that of our baselines. The reason for this improvement is discussed in our modeling assumptions above. An interesting idea for future work is to adaptively switch from DCM bandits to ranked bandits when we detect that our assumptions are significantly violated.
* Other comments *
1) In line 479, A^* is a vector of items with highest attraction probabilities. Since \bar{v}(k) are identical for all k, and \omega(x) is invariant to the permutation of x, the items in A^* can permuted such that any optimal item in A_t matches the corresponding item in A^*. This implies that \bar{w} \odot \bar{v}|_{A^*} \geq \bar{w} \odot \bar{v}|_{A_t} and therefore we can apply Lemma 1. This is not obvious and we forgot to explain it. This trick cannot be applied in line 526 because the termination probabilities are not identical. Therefore, we introduce an auxiliary solution A_t' (line 531), which is A_t where the items are sorted in decreasing order of their attraction probabilities. From the definition of A_t' and the optimality of A^*, \bar{w} \odot \bar{v}|_{A^*} \geq \bar{w} \odot \bar{v}|_{A_t'} and we use Lemma 1 to bound the corresponding regret. The difference in the payoffs of A_t' and A_t is bounded by Lemma 3. Lemma 3 is an upper bound for a permutation of x and decreasing termination probabilities. It differs from Lemma 1, which is an upper bound for any x and y such that x \geq y.
2) [0, 1]^E denotes the set of vectors whose entries are in [0, 1] and are indexed by the members of E. Since E = [L], this is equivalent to [0, 1]^L.