Paper ID: 344
Title: Heteroscedastic Sequences: Beyond Gaussianity

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper addresses the problem of heteroscedastic noise in time-series prediction. The authors treat the problem as one of online learning, in contrast to the standard approach, which typically learns a model of the noise from batch data. One advantage of the online approach is that partially adversarial heteroscedasticity can be treated; that is, the assumption that the environment is entirely stochastic is not needed. The paper proposes using a modification of Lazy Online Gradient Descent to simultaneously predict the signal and its variance. The authors provide an expected cumulative regret bound in a semi-adversarial setting, and also specialise their result to the ARCH model framework for time-series prediction. For the latter setting they furthermore provide supporting synthetic experimental results, comparing to a maximum likelihood approach.

Clarity - Justification:
The paper is generally easy to read. However, the proofs are not easy to follow, for the reasons given below in the detailed comments.

Significance - Justification:
At the moment, I have some issues with the proofs of the results; consequently, the impact of the paper is greatly diminished. However, if the proofs can be corrected, the contribution is much stronger: it provides an advance both for the online learning community and for the time-series prediction community.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
As briefly mentioned above, I have some issues with the proofs in the supplementary material for this paper. These aside, I am broadly positive about the contribution, and I hope the authors (or other reviewers) can explain to me why my worries are not important; my Overall Rating will then rise accordingly.

Problems in the proof of Proposition 3.1:

- First, in lines 1023/1024 (the second display of the proof), it should be noted that $$b_t(a_t) = E[\nabla \tilde{l}_t(a_t) \mid F_{t-1}] - \nabla l_t(a_t)$$ and not $$b_t(a_t) = E[\nabla \tilde{l}_t(a_t) - \nabla l_t(a_t) \mid F_{t-1}]$$ as has been used. This can be rectified by taking the conditional expectation with respect to the \tilde-term only in the inner product in lines 1020-1022. A similar issue is present in the following display.

- A deeper problem lies in the last step of the proof of the proposition. In particular, a^* depends on the loss functions l_t, and these loss functions are themselves random variables over which the expectation is being taken. Therefore we cannot simply substitute a^*, which is a random variable and not a constant, into the final display of the proof to finish the argument. Indeed, note that $$E[R_T(l_1, \ldots, l_T)] = \sum_{t=1}^T E[l_t(a_t)] - \sum_{t=1}^T E[l_t(a^*)],$$ and since a^* is a random variable depending on all of the functions l_1, \ldots, l_T, the argument given in lines 1028-1032 cannot be applied.

I would be very happy if the authors could find a solution to this problem, as it is the crucial result on which their arguments depend. I am hopeful that a fix exists, in light of the synthetic experiments, but I could not find it myself.
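To state the gap in a single display (in my notation, not the authors'): writing $a^* = \arg\min_{a \in K} \sum_{t=1}^T l_t(a)$, the expected regret satisfies $$E[R_T] = E\Big[\max_{a \in K} \sum_{t=1}^T \big(l_t(a_t) - l_t(a)\big)\Big] \ge \max_{a \in K} E\Big[\sum_{t=1}^T \big(l_t(a_t) - l_t(a)\big)\Big],$$ and the argument in lines 1028-1032 bounds only the right-hand side, since it fixes the comparator a before taking expectations. The inequality can be strict whenever a^* is correlated with the realised losses, so such a bound does not by itself control the expected regret.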
Other typos and clarity issues:
041: 'resulted' -> resulting
128: 'need of' -> need for
129/130: 'In the years to follow' -> In the following years
206: 'useful to model' -> useful for modelling
214: 'are compatible to deal with kernels' -> are compatible with kernels
215: 'We turn to present' -> We therefore present
271: it should be clearly stated that u_0 and v_0 are chosen a priori by the environment
415: 'is the regret bound' -> is a regret bound
591: what is the matrix norm used?
732: 'recall' -> Recall
760: 'attaining regret bound' -> attaining the regret bound
858: 'As evident' -> as evidenced

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper applies online learning algorithms to the (statistical) problem of estimation and prediction of heteroscedastic signals. The algorithm that is developed – which combines online learning, kernel methods, and statistical modeling – is applied to an ARCH model on synthetic data.

Clarity - Justification:
The paper is clearly motivated, the idea being to apply online learning ideas in such a way that the guarantees lie somewhere between worst-case and standard statistical guarantees. The theoretical and empirical results are presented well.

Significance - Justification:
I very much like the idea of bridging the gap between statistical and worst-case guarantees. The technical contributions, however, are not extensive, and the empirical results are minimal.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I like the idea behind the paper, which is to provide guarantees that sit between worst-case and distribution-dependent ones. However, the core issue with the paper is that it is neither a compelling theory paper nor a compelling methodology paper. As a theory paper, it mostly exploits known results: the proofs largely consist of combining known regret bounds, with some innovation, e.g., in Prop 3.1. The paper has more promise as a methodology paper backed by solid theory, but it currently falls short in this regard. The authors claim that their "approach is more general than a [maximum likelihood (ML)]-based approach" and is "capable of coping with rather complex scenarios and models." Neither claim is discussed again. The first claim seems plausible but is not elaborated on; the second is rather vague and needs to be substantiated. More generally, in order to claim that the proposed methodology is an improvement over maximum likelihood, further experiments are needed. The paper would especially benefit from some real-data examples. If no ground truth is available, the online learning and ML estimation methods could still be compared on their predictive performance (a sketch of such a comparison follows these comments). It would also be interesting to investigate the performance when using different kernels with the online method. Section 4 is not very compelling, so it could be moved to an appendix in favor of more experiments.
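To make the suggested comparison concrete, here is a minimal sketch of the kind of prequential (one-step-ahead) evaluation I have in mind. This is my own illustration on simulated ARCH(1) data, not the authors' algorithm: the online update is a generic projected gradient descent stand-in for their lazy-OGD variant, and the step size, grid, and projection bounds are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate ARCH(1): y_t = sigma_t * eps_t with sigma_t^2 = w + alpha * y_{t-1}^2.
T, w_true, a_true = 2000, 0.5, 0.3
y = np.zeros(T)
for t in range(1, T):
    y[t] = np.sqrt(w_true + a_true * y[t - 1] ** 2) * rng.standard_normal()

def nll(y_t, var):
    # Gaussian one-step predictive negative log-likelihood (prequential score).
    return 0.5 * (np.log(2 * np.pi * var) + y_t ** 2 / var)

half = T // 2  # score both methods on the second half only, for a fair comparison

# Online: projected gradient descent on the one-step NLL in (w, alpha).
# (A generic stand-in for the paper's lazy-OGD variant; eta is not tuned.)
w_hat, a_hat, eta, online_loss = 1.0, 0.1, 0.01, 0.0
for t in range(1, T):
    var = w_hat + a_hat * y[t - 1] ** 2
    if t >= half:
        online_loss += nll(y[t], var)
    g_var = 0.5 * (1.0 / var - y[t] ** 2 / var ** 2)  # d NLL / d var
    w_hat = max(w_hat - eta * g_var, 1e-3)            # projection keeps var > 0
    a_hat = min(max(a_hat - eta * g_var * y[t - 1] ** 2, 0.0), 1.0)

# Batch ML on the first half via a coarse grid search (a crude stand-in for a
# proper optimiser), scored prequentially on the second half.
best = (np.inf, (None, None))
for gw in np.linspace(0.1, 1.5, 15):
    for ga in np.linspace(0.0, 0.9, 10):
        var = gw + ga * y[:half - 1] ** 2
        best = min(best, (nll(y[1:half], var).sum(), (gw, ga)))
w_ml, a_ml = best[1]
ml_loss = nll(y[half:], w_ml + a_ml * y[half - 1:-1] ** 2).sum()

print(f"online   avg NLL (2nd half): {online_loss / (T - half):.3f}")
print(f"batch ML avg NLL (2nd half): {ml_loss / (T - half):.3f}")
```

Nothing hinges on the specifics here; the point is that average predictive log-loss gives a ground-truth-free yardstick on real data.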
Small points:
Prop 3.3: for this result, are you using Alg 2 but with a modified version of \ell_t^{sig}, etc.?
Line 20: theoretic => theoretical
Line 41: resulted => resulting
Line 255: a* is not defined
Line 262: it is not clear that this section is going to define the heteroscedastic signal prediction problem
Lines 320-329: unclear paragraph
Line 372: incapability => inability
Section 3: inconsistent use of P and \Pr
Line 858: evident => evidenced

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper considers online linear regression (with kernels) when the responses y_t are sampled from a distribution chosen by an adversary subject to certain constraints on the mean and the variance. It then goes on to estimate the variance as well, in a parallel online learning task, establishing sqrt(T) regret bounds for both tasks. The difficulty in the variance estimation task is that the true mean is not known and can only be approximated by the estimated mean. The paper also constructs average confidence bounds as in (5), but this requires estimating the means in a different way, so the two regret bounds do not hold simultaneously. Finally, regret bounds are established for ARCH models, with a proof similar to that for the linear regression setting.

Clarity - Justification:
The paper is well-written and the exposition is clear.

Significance - Justification:
The goal of the paper is to bridge the gap between statistical results, which rely on stronger distributional assumptions such as symmetry of the noise, and adversarial online learning methods. From the perspective of online learning, the paper might be viewed as introducing uncertainty quantification, which is clearly important but appears not to have been considered before. Having the adversary select distributions from a restricted class instead of outcomes is natural, and even unavoidable if one wants to quantify uncertainty/variance, because it does not seem possible to define uncertainty in a satisfactory way when the adversary simply chooses outcomes. The proofs are relatively easy (which is of course a good thing), so the contributions are mainly conceptual.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Minor comments:
In lines 135-139: u -> v?
Start of section 2.2: it would be common to assume that the action set K is also closed.
Assuming boundedness of the losses does not seem necessary; only boundedness of the gradient norms and of the diameter of K is needed (see the display at the end of this review).
In Thm 3.2 and Corollary 5.1, you presumably mean that eta_sig = eta_var = eta.
Corollary 5.1 seems to be a new theorem rather than a corollary, except that the details of the proof are not included. (I agree that the sketched proof should work, though.)
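To make the boundedness remark precise (this is the standard online gradient descent guarantee, not anything specific to this paper): for convex losses on a convex set K of diameter D, with gradient norms bounded by G, online gradient descent with step size $\eta = D/(G\sqrt{T})$ already gives $$\sum_{t=1}^T l_t(a_t) - \min_{a \in K} \sum_{t=1}^T l_t(a) \le \frac{D^2}{2\eta} + \frac{\eta G^2 T}{2} = DG\sqrt{T},$$ which involves no bound on the losses themselves.

=====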