Paper ID: 770
Title: No-Regret Algorithms for Heavy-Tailed Linear Bandits

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper studies linear stochastic bandit problems with heavy-tailed noise. The authors propose two algorithms. The first algorithm truncates losses when they become too large and uses the truncated losses to estimate the parameter vector. Since truncated losses are bounded, self-normalized bounds can be derived and the performance of the algorithm can be analyzed using existing techniques. The second algorithm divides time into stages and, in each stage, plays a single action. The observations in each stage are divided into smaller batches, and the average loss in each batch is computed. The median of these batch means is returned as the loss estimate for the fixed action. This estimate is fed into a standard linear stochastic bandit algorithm. For both algorithms, sublinear regret bounds are shown under a moment condition on the noise. Numerical experiments show that the proposed algorithms outperform a standard linear bandit algorithm on a problem with heavy-tailed noise.

Clarity - Justification:
The paper is mostly well written. Typos and minor comments:
* Abstract: "admits admits"
* References have extra parentheses!
* Intro, page 2: "an scenario"
* V_t is not defined in line 234.
* In Algorithm 4: i=j?
* In the Introduction, (Liu and Zhao, 2011) is cited as a paper on the linear bandit problem, but that paper seems to discuss only finite-armed bandit problems.
* Introduction: Abbasi-Yadkori et al. show an O(n sqrt{T}) regret bound, not an O(sqrt{n T}) bound.

Significance - Justification:
The authors study an interesting problem. The theoretical contributions are nice, but I am not sure they are sufficient for publication. The authors could make the paper stronger with experimental studies on a more challenging problem.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The authors study an interesting problem. The theoretical contributions are nice, but I am not sure they are sufficient for publication. The authors could make the paper stronger with experimental studies on a more challenging problem.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper looks at the stochastic linear bandit problem. Here the action is a vector from Euclidean space, and the reward is the inner product with some unknown 'mean vector' plus noise. The new ingredient in this paper is that the noise distribution is heavy-tailed; previous approaches for this setting rely on sub-Gaussian noise. Two new methods are proposed. The first is based on truncating the reward. The second is based on the median-of-means estimator, which is applied in batches.
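For concreteness, here is a minimal generic sketch of a median-of-means estimate of a fixed action's expected loss. The batch count and stage schedule in the paper depend on epsilon and v; everything below (function name, batch count, noise distribution) is illustrative, not the authors' exact procedure:

```python
import numpy as np

def median_of_means(samples, num_batches):
    """Median-of-means estimate of the mean of `samples`.

    Split the observations into `num_batches` groups, average each
    group, and return the median of the batch means. Unlike the plain
    empirical mean, this is robust to heavy-tailed noise.
    """
    batches = np.array_split(np.asarray(samples, dtype=float), num_batches)
    return float(np.median([batch.mean() for batch in batches]))

# Illustration with heavy-tailed (Pareto) noise: finite mean but
# infinite variance, so the plain empirical mean is unstable.
rng = np.random.default_rng(0)
rewards = rng.pareto(1.5, size=900)
print(median_of_means(rewards, num_batches=9))
print(rewards.mean())
```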
There are three questions I would like to see addressed in the rebuttal:

1. Both algorithms need to be tuned with knowledge of the noise moment parameters (epsilon and v). These parameters are typically unknown. Are they hard to estimate on the fly, i.e., can we adapt to them online?

2. About the experiments: you boast about a 20% regret reduction, which seems rather minor to me. If you have better scaling in any of the problem parameters (and you should, because previous methods are not designed for your heavy-tailed regime at all), then why can you not make the advantage arbitrarily large? Also, why does the linear appearance of the regret in Figure 1(a) not contradict the regret bound?

3. What is the reasoning behind the choice of the noise in the experiment? Your setting for gamma ensures that we have about one 'shock' in the entire dataset (as clearly visible in Figure 1(b)). How can it be that you say the error bars are tiny with only 20 replications?

Clarity - Justification:
The paper is carefully structured. I enjoyed the discussion of strengths and especially weaknesses on pages 7 (bottom) and 8 (top), with the specific suggestion for improvement via sharper concentration bounds.

Significance - Justification:
The paper presents a new combination of existing methods and ideas. Certainly technically nontrivial, but not surprising.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Solid technical paper.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper considers the problem of stochastic linear bandits under heavy-tailed reward distributions. The main contribution is a modification of the ConfidenceBall/OFUL algorithm that is guaranteed to achieve sublinear regret even when the reward distributions do not conform to the widely used sub-Gaussian assumption but still have some regularity. More precisely, the authors prove that if the noise has a finite moment of order higher than one, their algorithms achieve sublinear regret, and the usual O(sqrt{T}) rates are recovered when all moments exist. The theoretical results are illustrated by some simple experiments.

Clarity - Justification:
The paper is well written and the technical quality is high.

Significance - Justification:
The paper considers a very important problem that has received little attention so far and provides technically sound results. The algorithms themselves are based on a natural combination of ideas by Bubeck, Cesa-Bianchi and Lugosi (2013) and Abbasi-Yadkori, Pal and Szepesvari (2011): essentially, the main idea is to replace the ridge-regression estimator in OFUL by a robust estimate and to adjust the confidence balls accordingly (a generic sketch of such a robust plug-in estimate appears after this section). This combination and the analysis do require several non-trivial ideas, so I have no problems with the paper in terms of originality.

I am surprised to see, however, that the authors were not able to recover the O(sqrt{T}) rates for the case of finite variance, as was achieved by Bubeck et al. (2013) in the standard multi-armed bandit case. The authors argue that the main obstacle to achieving this improvement is the apparent lack of concentration results for self-normalized processes of the Freedman type (as opposed to the Hoeffding type). However, I am not sure how to interpret this claim (line 802); I thought that self-normalized concentration inequalities are fundamentally stronger than Freedman-type inequalities (as they are in terms of empirical quantities and not the predictable variances). Can the authors elaborate a bit on this argument?

Another apparent shortcoming is that the results seem to need prior knowledge of epsilon to achieve the claimed rates. Is there any way of overcoming this problem?

These questions aside, I think the theoretical results themselves are quite strong---to my knowledge, they constitute the first theoretical guarantees for the linear bandit problem with (essentially) unbounded rewards. The experiments are simple, but they are effective in illustrating the advantage of the proposed algorithms.
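To make the "robust estimate" referred to above concrete, here is a minimal sketch of a truncation-based ridge estimator in the spirit of the paper's first algorithm, as summarized by the reviews. The fixed `threshold` is a placeholder for the growing, (epsilon, v)-dependent truncation level that the actual analysis requires, and the function itself is illustrative rather than the authors' exact estimator:

```python
import numpy as np

def truncated_ridge_estimate(X, y, threshold, reg=1.0):
    """Ridge-regression estimate computed from truncated rewards.

    Rewards whose magnitude exceeds `threshold` are zeroed out before
    the regularized least-squares step, so the estimator only ever sees
    bounded observations (the property that lets self-normalized
    concentration bounds go through).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    y_trunc = np.where(np.abs(y) <= threshold, y, 0.0)
    V = reg * np.eye(X.shape[1]) + X.T @ X  # regularized design matrix V_t
    return np.linalg.solve(V, X.T @ y_trunc)
```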
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I think that the paper is definitely worth publishing. Some detailed comments:

050: What randomness does this expectation integrate over?
156: "It thereby quantifies the price of heavy-tailed loss distributions in this scenario"---this is a bit misleading. "Quantifying" this price, to me, would require matching lower bounds proving that there is no way of avoiding this extra cost.
159: It is not obvious to me that these bounds are "even better". Maybe you mean that they are better when epsilon is small?
237: The bounds of Abbasi-Yadkori et al. (2011) actually scale linearly with n.
290: "Notice that the second term can be bonded directly by T"---I don't think this comment is helpful for understanding the result.
387: Shouldn't this be an inequality?
393: Is it really necessary to use the union bound here? Abbasi-Yadkori et al. (2011) managed to avoid it, and their arguments really don't seem that different from yours.
420: "The eigenvalues of the matrix [...] are less than 1."---how do we know this? Also, shouldn't the second matrix be inverted?
425: Where does the second inequality come from? Maybe from the fact that l_s has bounded moments?
576: "the first term can be bounded by using the union bound"---and maybe also Markov's inequality?
604: "The fact that E[...|...]=-E[...|...]"---I don't see why this is true. Please clarify.
782: What does it mean that you plot the results "over 20 replicas of the same experiment"? What was the difference between the experiments---only the realization of the noise? If so, where is the huge jump in Figure 1(b) coming from? I would expect to see huge error bars if the noise was freshly drawn (i.e., if the jump did not occur in the exact same round in all "replicas").

=====