Paper ID: 581
Title: Contextual Combinatorial Cascading Bandits
Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper proposes several extensions of combinatorial cascading bandits. These extensions are not groundbreaking but also not obvious. The authors propose a new learning algorithm, bound its regret, and evaluate it on three problems. One problem is synthetic and two are real-world.
Clarity - Justification: The paper is reasonably well written and easy to follow.
Significance - Justification: The proposed extensions of combinatorial cascading bandits are not groundbreaking, but they are also not obvious. I have several comments on the derived upper bounds and the experimental results; see my detailed comments.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Major comments:

1) This paper proposes three extensions to combinatorial cascading bandits: more general reward functions, position discounting, and linear generalization. It is confusing to propose so many extensions in a single paper; the authors should focus on one extension and do it well. The position-discounting extension seems trivial, and I suggest that the authors discuss it separately from the main contribution, which is context. The results on more general reward functions (Section 4.2.1) are also not particularly strong. For example, the regret bound in Theorem 4.3 is not meaningful, since 1 / p^\ast can be arbitrarily large. This cannot happen when p^\ast is tied to the reward function, but that is not the case in Section 4.2.1. The key step in this particular analysis is a reduction that Kveton et al. (NIPS 2015) identify as problematic, and I am not sure why the authors adopt it. It is also not clear what the real benefit of more general reward functions is. These functions have to be monotonic, which seems like a strong assumption, and the authors never give an example of a more general reward function or experiment with one. In summary, I suggest that the authors focus on the main extension, which is context, and discuss the rest separately.

2) The design of Algorithm 1 is not surprising and essentially follows http://jmlr.org/proceedings/papers/v37/wen15.pdf, which proposes similar algorithms for stochastic combinatorial semi-bandits with linear generalization. The only difference in C^3-UCB is that the model is updated based only on the weights of observed items. The above paper also shows how to conduct an analysis, similar to that of this paper, for Thompson sampling.

3) The empirical results do not look good.
In particular, both the regret and the cumulative reward of Algorithm 1 in all plots in Figures 1-3 seem to be linear in time. This suggests that the proposed algorithm does not learn over time: if the algorithm learned, the regret would be a concave function of time and the cumulative reward would be a convex function of time. The real-world experiment in Section 5.2 is especially bad; as far as I can tell, neither of the two compared algorithms learns. The real-world experiment in Section 5.3 does not seem very realistic. In particular, the authors assume that the expected weights of items are in the span of the features. This rarely happens in practice, and it is therefore important to evaluate Algorithm 1 in a more realistic setting where this design assumption is violated.

Minor comments:
- Line 291: The assumption of monotonicity seems pretty strong. Can you discuss its implications?
- Line 376: The V_t norm is not defined.
- Algorithm 1: Why is the proposed algorithm called C^3-UCB? I do not get this. A more appropriate name would be something that reveals the components of the algorithm and the solved problem, such as combinatorial, cascading, and LinUCB.
- Line 498: Double dot in the equation.
- Last paragraph of Section 4.3: I am not sure what the authors mean.
- Line 812: It is not clear what "simple paths" means.
- Last paragraph of Section 5.3: It is not clear what this experiment is.
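The shape test applied above can be made concrete with a small synthetic sketch (all numbers are illustrative, not results from the paper): a learning algorithm's cumulative regret grows sublinearly, so its per-round increments shrink over time, whereas a linear regret curve has constant increments.

```python
# Synthetic illustration (not data from the paper): cumulative regret
# of a learning algorithm grows like O(sqrt(t)) and is concave, while
# a non-learning algorithm accumulates regret linearly.
import math

T = 10_000
learning = [5.0 * math.sqrt(t) for t in range(1, T + 1)]  # concave, O(sqrt(t))
not_learning = [0.05 * t for t in range(1, T + 1)]        # straight line, O(t)

# Per-round increments: decreasing for the learner, constant otherwise.
early_inc = learning[1000] - learning[999]
late_inc = learning[T - 1] - learning[T - 2]
assert late_inc < early_inc  # concavity: the learner's regret flattens

lin_early = not_learning[1000] - not_learning[999]
lin_late = not_learning[T - 1] - not_learning[T - 2]
assert abs(lin_late - lin_early) < 1e-9  # linear: no flattening, no learning
```

Eyeballing whether a plotted regret curve flattens in this way is exactly the check the review performs on Figures 1-3.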
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper extends the cascading bandit framework to the contextual bandit setting. The reward of each base arm has a linear nature (sub-Gaussian, with the expected value being the inner product of the feature vector of the arm and the context of that arm in round t), but observability is limited according to the cascading model. The dependence of the overall reward on the base arms can be defined using the disjunction or the conjunction model, as in previous works, but also in a much more general way. The authors propose a LinUCB-based solution called C^3-UCB and derive regret bounds (one for the general case, and one for the two special cases: the disjunction model and the conjunction model). Finally, they provide an experimental comparison of C^3-UCB with some baseline methods.
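For intuition, the kind of LinUCB-style computation summarized above can be sketched as follows; the variable names, the exploration parameter, and the rank-one update rule are illustrative assumptions on my part, not the paper's exact pseudocode. The key point is that, under cascading feedback, only the observed prefix of the recommended list updates the regression statistics.

```python
# Illustrative sketch (assumptions, not the authors' pseudocode):
# LinUCB-style optimistic scores with cascading feedback, where only
# the observed prefix of the recommended list updates the statistics.
import numpy as np

d = 5                # feature dimension (assumed)
alpha = 1.0          # exploration parameter (assumed)
V = np.eye(d)        # regularized Gram matrix (V_t in the paper's notation)
b = np.zeros(d)      # weight-weighted sum of observed feature vectors

def ucb(x):
    """Optimistic score for an arm with feature vector x."""
    theta_hat = np.linalg.solve(V, b)                   # ridge estimate
    bonus = alpha * np.sqrt(x @ np.linalg.solve(V, x))  # ||x||_{V^{-1}}
    return float(x @ theta_hat + bonus)

def update(observed_features, observed_weights):
    """Cascading feedback: update the model with observed items only."""
    global V, b
    for x, w in zip(observed_features, observed_weights):
        V += np.outer(x, x)
        b += w * x
```

Items below the stopping position in the cascade contribute nothing to V and b, which is the main place this differs from full semi-bandit feedback.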
Clarity - Justification: The paper is full of technicalities (even at the level of intermediate results), but it is relatively easy to follow.
Significance - Justification: The proposed model is reasonable; in fact, contextuality seems essential in many potential applications of the cascading bandits.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The paper gives a thorough introduction to the related literature and builds heavily on previous results. The paper is packed with (necessary) technicalities, which could be made easier to digest by discussing (at least at an introductory level) how the model translates into the two special cases: the settings with the conjunctive and the disjunctive objectives. The derived results meet expectations, and the few unexpected details are discussed thoroughly. The experimental results are somewhat strange: except for the synthetic setup, the regret does not seem to scale with \sqrt{n}.

Minor comments:
- line 606: f^* is only defined in Theorem 4.6.
- lines 657-663 are hard to understand.
- line 791: m should be m_j.
- Section 5.3: it is not clear from the text how x_{t,a} was generated and what information it contains. Or is this setting not contextual? But then it is not clear why C^3-UCB performs so much better than CombCascade.
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper introduces a new problem called contextual combinatorial cascading bandits, which incorporates contextual information into combinatorial cascading bandits. The author(s) propose an algorithm (C^3-UCB) as a general solution for contextual, cascading, position-discounted optimization.
Clarity - Justification: Most sections are quite understandable and well organized. It would be easier for readers to follow if the author(s) described the problem setup in more detail or in a better order (e.g., weights, then reward). The figures are easy to read, except for the labels.
Significance - Justification: The consideration of contextual information (with a disjoint linear model) and of position discounting is novel. This paper also treats the weights and the reward function in a more general way. But why use alpha-regret instead of the true regret?
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The introduction, related work, algorithm, and results are in good shape. More effort should be put into the problem-setting part for a clearer explanation. The use of alpha-regret should also be justified. Regarding the experiments, it would be much better to see more comparisons among different contextual bandit algorithms. The horizons of all three experiments do not seem long enough to confirm sublinear regret. I would consider the true rating to be between weak accept and weak reject.
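For reference, the alpha-regret the review asks about is typically defined as follows (in standard notation, which may differ from the paper's); it is the natural benchmark when the offline combinatorial problem can only be solved to an alpha-approximation:

```latex
% Standard definition of \alpha-regret (notation assumed, not copied
% from the paper): A_t is the action chosen in round t, f_t its expected
% reward in that round, and A_t^{*} the per-round optimal action.
R^{\alpha}(n) = \mathbb{E}\left[ \sum_{t=1}^{n} \bigl( \alpha\, f_t(A_t^{*}) - f_t(A_t) \bigr) \right]
```

Since an exact oracle for the offline problem may be computationally unavailable, comparing against alpha times the optimum keeps the benchmark achievable by an efficient alpha-approximation oracle.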
=====