We thank all the reviewers for their time and useful comments. Our responses are below.

Reviewer_1:
(1) Indeed, self-normalized inequalities are tighter since they are empirical. However, this is only an improvement over the non-self-normalized counterpart. More precisely, the bound of Abbasi-Yadkori et al. is a self-normalized version of Hoeffding's inequality, which does not discriminate between large and small tail events the way a Freedman-type inequality does. This distinction turns out to be key to the tighter bounds in our problem.
(2) This issue seems to be unavoidable, at least for confidence-bound-based algorithms. These algorithms require a quantifiable (\epsilon-dependent) confidence bound to decide on the optimal action (see, for instance, Bubeck et al. 2013). In fact, even in the simpler scenario of bounded noise, knowledge of an explicit upper bound is necessary to design a bandit algorithm. It remains an interesting open question, however, whether there exist algorithms that can forgo this knowledge.
050: The expectation is over the rewards and the algorithm's own randomization. Note that this is not the regret we use, which is defined on line 191.
156: The Big-O notation was meant to signify an upper bound on the quantity. We will reword this to make it clearer.
159: Yes, that is what we meant by less regularity, and why we specified \epsilon < 1 in parentheses. We will clarify this.
237: Thank you for the correction.
290: We will remove it.
387: Yes, thank you.
393: After a careful review of our proof, it seems that replacing \alpha_t with \alpha_T in the first term of the bound in Lemma 3 would allow us to apply the stopping-time argument of Abbasi-Yadkori. This is because an application of this lemma requires a uniform upper bound on the deviation of ||X_t \eta_t||, not a time-dependent one like the one we have given. This replacement appears to yield worse constants in the final bound, but it removes the logarithmic term in T coming from the union bound.
Thank you for pointing this out to us. We will make sure to correct this in the final version.
420: Yes, that was a typo. After fixing it, X_t^T X_t is SPSD and (I + X_t^T X_t)^{-1} is SPD, and any eigenvalue \lambda > 0 of the first matrix corresponds, on the same eigenvector, to the eigenvalue 1/(1 + \lambda) of the second matrix. Therefore the eigenvalues of the product X_t^T X_t (I + X_t^T X_t)^{-1} are of the form \lambda / (1 + \lambda) < 1.
425: Using the indicator function, we can add the factor (|l_s|/|\alpha_s|)^\epsilon for free, and then remove the indicator function. The extra moments on |l_s| allow us to conclude that this quantity is still finite.
576: Yes, together with Markov's inequality.
604: Moving both terms to the same side yields E[X_i | F_{i-1}], which we assumed to be 0.
782: Yes, the only difference between the experiments is the realization of the noise. Figure 1(b) is the graph of a single replication. We apologize for the confusion and will clarify the caption.

Reviewer_2:
Comments on Abstract, References, Introduction, Algorithm 4: Thank you for spotting these typos. We will fix them.
V_t on line 234: This is meant to be the definition of V_t, which should make sense at that point of the paper since X_t is already defined. We will clarify this.
Since ours is a theory paper, the experiments serve mainly as a proof of concept that our results hold. We are currently trying to procure a real-world data set with heavy tails on which to try our algorithm.

Reviewer_3:
Online adaptation of epsilon: Please see (2) in the response to Reviewer_1.
Advantage in experiments: You are correct that we can make the advantage arbitrarily large. In particular, since previous methods require an l_\infty bound as input, it is possible to make their confidence radius close to infinite while keeping the second moment close to 0. We will adapt our experiments to make the difference more dramatic.
Linear appearance of regret in Figure 1(a): The regret appears linear because the time horizon is short.
We will let our algorithm run for more rounds in the final version of the paper. It is worth pointing out, however, that the same linear appearance can be seen in Figure 2 of Abbasi-Yadkori et al. (2011), which is why we did not address the issue earlier.
Choice of noise: Since our experiments are only meant as a proof of concept, the choice of noise was primarily a matter of convenience: it lets us fix the second moment while making the l_\infty bound arbitrarily large. Since the scale of the plots is large, even a standard deviation of 100 (as in these experiments) would not be visible in the plots. Also, the jump seen in plot (b) appears in only one realization of the experiment. Therefore, even if the regret curve of the "bounded" algorithm had a jump at that point for that particular realization, the jump would not appear in the other 19 experiments, thereby smoothing out the averaged regret curve. We will clarify the caption in the final version.
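To make the noise construction above concrete, here is a minimal sketch (in Python) of one way to fix the second moment while letting the l_\infty bound grow. The function name, the two-point distribution, and the parameter values are ours for illustration only, not the exact distribution used in the experiments:

```python
import numpy as np

def two_point_noise(rng, size, sigma2=1.0, B=100.0):
    """Zero-mean noise with second moment sigma2 but sup-norm bound B.

    Takes the value +B or -B each with probability sigma2 / (2 * B**2),
    and 0 otherwise, so E[eta] = 0 and E[eta^2] = sigma2 while
    |eta| can be as large as B, which can be made arbitrarily large.
    """
    p = sigma2 / (2 * B**2)
    u = rng.random(size)
    return np.where(u < p, B, np.where(u < 2 * p, -B, 0.0))

rng = np.random.default_rng(0)
eta = two_point_noise(rng, size=1_000_000, sigma2=1.0, B=100.0)
print(eta.mean(), (eta**2).mean(), np.abs(eta).max())
```

With sigma2 held fixed, increasing B leaves the second moment unchanged while the sup-norm bound (and hence the confidence radius of any method that takes an l_\infty bound as input) grows without limit.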