Paper ID: 898
Title: Barron and Cover's Theory in Supervised Learning and its Application to Lasso

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper is about extending the result of Barron and Cover (BC) on the MDL (Minimum Description Length) principle for model selection. BC (1991) showed that a risk based on the Rényi divergence is tightly bounded from above by the redundancy of an MDL estimator, which is defined as the minimizer of the total codelength of a two-stage code. In this paper, the authors provide an extension of BC theory to supervised learning without quantization in random design cases.

Clarity - Justification:
The paper is rather straightforward to read. The contributions are clearly marked, the differences from related work are highlighted, and the analysis is clearly presented.

Significance - Justification:
I think that the results of the paper are novel and go beyond the previous work on BC theory in supervised learning (e.g. the work of Yamanishi, 1992, which assumes independence between the description length and the sample sequence x^n). I think an important implication of the results of this paper is that they give a mathematical justification of the MDL principle in supervised learning.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I think that, overall, this is a decent paper. I would like to ask the authors to elaborate more on the difference between their results and the work of Yamanishi (1992). In particular, can the authors provide precise examples where the bounds of Theorems 1 and 2 are not tight and Theorems 5-7 yield better bounds?

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
In a series of papers, Barron and co-authors have developed information-theoretic techniques to interpret penalized likelihood procedures on continuous parameter spaces as two-stage codelengths, with accompanying risk bounds in terms of the information-theoretic redundancy. A limitation of these techniques, when applied in regression or any other supervised context where covariates x are present, is that the two-stage code has to be the same for all possible values of the covariates. This is no problem if one analyses a fixed design, but it breaks down when analysing random designs. The present paper extends the techniques of Barron et al. to random designs (Thm 5) and derives risk bounds for the lasso with a normally distributed random design (Thm 7 and Eqn. (7)). If I understand correctly, the main technical idea is to restrict attention to the typical set for the covariates, for which it is then possible to construct a two-stage codelength that is the same for all possible values of the covariates (cf. Definition 4). The risk bounds are derived conditional on this typical set, which is unusual.

Clarity - Justification:
The paper's main focus is technical: it studies how Barron et al.'s techniques can be extended to the random design setting. Deriving insight into the behaviour of the lasso, which might be gained by the new analysis, comes second in the presentation. As a consequence, its target audience is technical specialists interested in Barron et al.'s techniques. For this audience, the presentation should be clear. One strange choice in the presentation is that the main result for the lasso is only given as equation (7) and not as a separate theorem.
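For concreteness, my reading is that the results (Theorem 5 and equation (7)) are conditional statements of roughly the following shape. I stress that this is my own paraphrase, not the paper's exact statement: here $A_\varepsilon$ denotes the typical set of covariate sequences, $d_\lambda$ the Rényi-type divergence of Definitions 3-4, $\hat{\theta}$ the two-stage (penalized likelihood) estimator, and $\tilde{L}$ the codelength/penalty term:
\[
\mathbb{E}\!\left[\, d_\lambda(\theta^*, \hat{\theta}) \,\middle|\, x^n \in A_\varepsilon \right]
\;\le\;
\frac{1}{n}\,\mathbb{E}\!\left[\, \min_{\theta}\left\{ \log\frac{p_{\theta^*}(y^n \mid x^n)}{p_{\theta}(y^n \mid x^n)} + \tilde{L}(\theta) \right\} \,\middle|\, x^n \in A_\varepsilon \right] + (\text{lower-order terms}),
\]
i.e. the Rényi risk, conditional on the typical set, is controlled by the conditional expected redundancy of the two-stage code.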
It is also strange that this result is stated only with expectations that are conditioned on the typical set for the covariates.

Significance - Justification:
I believe the main technical contributions are: (1) the restriction of the analysis to typical sets for the covariates, and (2) the demonstration that the resulting condition of "epsilon-risk validity" is satisfied by the lasso with a sufficiently large penalization factor lambda. Although (1) is technically straightforward, it is interesting to see that this approach works for the lasso. Step (2), although technically analogous to the analysis for Gaussian graphical models by Chatterjee and Barron, 2014b, requires a significant amount of new work in Appendix C to derive a suitable second-order Taylor approximation for the Rényi divergence.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
-

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
I think the direction of this work is a very good one. The extension of BC theory to supervised learning using a modified notion of risk validity, based on a typical set, seems interesting and well motivated by the LASSO application. The proofs seem sophisticated, and the form of the results on the LASSO differs from the typical kinds of guarantees produced within that community; however, it does not seem that the authors go beyond the sub-Gaussian assumption on the covariates made in previous works, and so it is unclear whether this work expands the regimes in which the LASSO can be analyzed. The paper itself also suffers from numerous clarity issues, as described below; resolving these would drastically improve the paper.

Clarity - Justification:
My detailed comments discuss the clarity issues. Some lack of clarity stems from language issues, but I tried to overlook those as much as possible to give a fairer assessment. I did find the Introduction to be clear for the most part.

Significance - Justification:
The authors' idea of extending BC theory to supervised learning, via their modified form of risk validity based on a typical set, at first seems like a quite brittle extension that may not have useful applications. However, the authors show that they can handle the case of the normalized LASSO (where the normalization matches what I see in the literature), and so already it appears that their extension of BC theory may constitute actual progress. I would regard the authors' contribution as more significant if they could argue that their ``typical set''-modified risk validity may apply in some general subclass of settings; even providing just one additional example beyond the LASSO would amplify the weight of their contribution. With regard to the LASSO analysis, it seems that the authors still rely on a sub-Gaussian assumption, and so the final results for that application might fail to offer something new.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, I found this work quite unclear at many points. I tried to list the various areas below (as well as other points not related to clarity), in the hope that this work can improve in the future.

The authors mention that most other works on the LASSO assume boundedness of features or moment conditions.
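To fix terminology for the comments below (this shorthand is mine, not the paper's): by boundedness I mean conditions such as $\max_{i,j} |x_{ij}| \le C$, and by a sub-Gaussian design I mean rows $x_i$ satisfying
\[
\mathbb{E}\,\exp\big( \langle v, x_i \rangle \big) \;\le\; \exp\!\big( \sigma^2 \|v\|_2^2 / 2 \big) \qquad \text{for all } v,
\]
a condition that in particular covers i.i.d. $N(0,\Sigma)$ rows with $\sigma^2$ equal to the largest eigenvalue of $\Sigma$.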
Indeed, the works I am aware of that go beyond boundedness assume that the random design matrix consists of i.i.d. Gaussian random vectors (see e.g. ``Sharp thresholds for high-dimensional and noisy recovery of sparsity'' by Wainwright, but there are many more works in the more recent past that likely make the same point); however, this is precisely what the authors do as well (as per Theorem 7)!

Technically, this work seems strong from what I have checked of the proofs, but some proofs are missing. The authors say they omitted proofs of some results due to space restrictions, but surely they know that ICML allows supplementary material of unbounded length precisely for this reason. The proof of Theorem 6 really should appear in the supplementary material.

Sparsity is a common perspective in the previous non-BC-theory analyses of the LASSO. Is it possible to interpret the authors' main results for the LASSO in the case when a sparse solution, or an approximately sparse solution, is optimal? This would help frame the current results in the context of previous results on the LASSO.

In the statement of Theorem 6, I think the '$>$' inside the $\Pr(\cdot)$ should instead be '$<$'; otherwise the result says that with high probability the Rényi divergence is high!

Why do you use $n \cdot d_\lambda(\ldots)$ in Definition 4 but just $d_\lambda(\ldots)$ in Definition 3?

Minor comments:

In the math display immediately after line 210, the second $\arg\min$ should involve not $p_{\beta 2-p}(y^n, x^n)$ as defined, but instead $p_\theta(y^n \mid x^n) \cdot \exp(-L(\theta \mid x^n))$.

In the proof of Theorem 5, there is a repeated typographical error in the first math display on page 6 (column 1). Starting from the second line there, the negative sign inside the first $\exp$ should be removed on all the lines until the end.

Can you explain further the statement following equation (3): ``In lasso, the third term in the left side of (3) is unbounded with respect to $x^n$\ldots''? Why is this the case? Is it because we might have an $x^n$ with points of unbounded norm, so that choosing $\theta$ equal to the zero vector makes $p_\theta(y^n \mid x^n)$ blow up? A short explanation would be helpful here.

I don't understand the second-to-last sentence of Section 3: ``This problem would never happen if $\tilde{L}$ could depend on $x^n$.'' Is this obvious? Does it follow from later results in the authors' paper? What if one chooses a bad $\tilde{L}$? Why do we know that a good enough $\tilde{L}$ (one that depends on $x^n$) exists?

Theorem 7: change ``Gauss distribution'' to ``Gaussian distribution''.

=====