Paper ID: 44
Title: Additive Approximations in High Dimensional Nonparametric Regression via the SALSA

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper considers kernel-based non-parametric least squares regression under the assumption that the target and/or the hypothesis space has an additive structure of some order d. The main results are a trick that keeps the computations feasible for d > 2, some theoretical analysis, and extensive experiments.

Clarity - Justification:
The paper is clearly written.

Significance - Justification:
The algorithmic trick to reduce the computational complexity is really neat.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I had to review the paper for another conference last year, and back then it was, after a long discussion, rejected. Fortunately, the authors addressed some of the concerns raised then, in particular with regard to the theoretical analysis. I now find it far more impressive, and the only question I have is how the authors obtained the improvement.

Regarding the experiments, I still have some concerns, summarized in the text I wrote back then: I ran some experiments as a sanity check (an SVM with least squares loss and Gaussian kernel, hyperparameter selection by 5-fold CV on a standard 10x10 hyperparameter grid, 100 repetitions using their data generation routines for training and test data, no manual interaction).
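In code, the protocol was roughly the following (a minimal sketch only: scikit-learn's KernelRidge stands in for the least-squares SVM with Gaussian kernel, random placeholder data replaces their data generation routines, and the grid ranges shown are illustrative rather than the exact ones I used):

```python
# Sketch of the sanity-check protocol: least-squares SVM (kernel ridge) with
# Gaussian kernel, 5-fold CV over a 10x10 hyperparameter grid, 100 repetitions.
# The data below is a random placeholder for the authors' generation routines.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
errors = []
for rep in range(100):                       # 100 repetitions
    # placeholder train/test data; replace with the per-dataset routines
    X_train, y_train = rng.normal(size=(500, 10)), rng.normal(size=500)
    X_test, y_test = rng.normal(size=(500, 10)), rng.normal(size=500)

    # standard 10x10 grid, 5-fold CV, no manual interaction
    param_grid = {
        "alpha": np.logspace(-6, 3, 10),     # regularization strength
        "gamma": np.logspace(-4, 5, 10),     # Gaussian (RBF) kernel width
    }
    model = GridSearchCV(KernelRidge(kernel="rbf"), param_grid, cv=5)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    errors.append(np.mean((y_pred - y_test) ** 2))   # test error per repetition

print(f"error {np.mean(errors):.6f} +- {np.std(errors):.6f}")
```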
Here are the results:

data set               (rank SVM, rank SALSA)   error +- std
skillcraft             (1, 2)                   0.541807 +- 0.032881
CCPP                   (5, 2)                   0.072172 +- 0.003203
blog                   (1, 5)                   0.829976 +- 0.668556
speech                 (11, 3)                  0.035644 +- 0.005651
music                  (12, 4)                  0.757494 +- 0.075209
telemonitoring-total   (6, 5)                   0.036535 +- 0.004982
airfoil                (4, 6)                   0.497779 +- 0.033582

I did not consider the following two larger data sets:
galaxy: no shuffling routine in the code
fMRI: no code to load and scale the data
In addition, I did not consider the data sets with training size < 500, since results in such scenarios are usually very noisy.

I guess the first conclusion is that SALSA performs (a bit?) more stably than the SVM, since it ranks more consistently near the top. The second conclusion is that a couple of these rankings are a bit random (see the standard deviations), so 'strong conclusions' should not be made in any direction. The last conclusion, taken from their code, is that they did not use the standard label for prediction but another feature. For example, for the skillcraft dataset they predicted column 16 (out of 20) instead of the usual column 2 (column 1 is a player ID). Among the datasets above, this approach was also taken for 'speech' and 'telemonitoring'. Unfortunately, they did not mention this with a single word (at least I could not find a hint). No wonder I had a hard time making a sanity check when reviewing the paper. I checked a few numbers in the old and the new submission, and my impression is that the experimental section was not changed (please let me know if I am wrong). Therefore, my old concerns are still valid.

Recommendation: The concerns regarding the experimental section should be addressed, e.g. by some change of wording. If this is done, I think the paper should clearly be accepted.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a framework for learning additive models. The contributions are: i) the additive models use up to d-th order variable interactions; ii) some theoretical analyses of the excess risk of the model are provided; iii) experimental comparisons with a large number of models are provided.

Clarity - Justification:
The paper is clearly written and well motivated, and the contributions are relevant.

Significance - Justification:
The contributions are interesting. However, their relation to the work of Bach, "High-Dimensional Non-Linear Variable Selection through Hierarchical Kernel Learning", is not discussed at all. Bach's paper aims at the same objective as the authors and uses kernels similar to theirs (e.g., see equation 10 of Bach's paper). The experimental results compare the performance of the method against a large number of different ones, but on few small-scale datasets. The results show the robustness of the method in terms of performance.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Strengths:
* nice framework for high-order additive models
* theoretical support
Weaknesses:
* no discussion of the work of Bach
* the efficiency of computing equation (5) is not demonstrated

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose a d-th order additive approximation to regression functions. As somewhat expected, in comparison to first-order additive models, d-th order additive models are able to reduce bias while keeping a limited excess risk (shown to be polynomial in the dimensionality D when the regression function is additive). The order d can also be used to control variance in light of the bias-variance tradeoff. A comprehensive empirical assessment of the proposed method is carried out.

Clarity - Justification:
The paper is technically sound. The theoretical treatment is adequate, yet not terribly deep. The paper is well written and organized, although a brief discussion of the relevance and usefulness of the proposed method for solving challenging problems would be valuable to motivate readers.

Significance - Justification:
The contribution is not very novel in the sense that additive kernels have already been discussed in the literature. Also, additive models with high-order interactions are not a novel approach. However, based on the experiments, the proposal seems to give a significant improvement over existing additive models.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The sentence "if each k^(j) is a Gaussian kernel, then it is an ESP kernel if we set the bandwidths appropriately." is rather vague. In this regard, you have applied Gaussian kernels; which other kernels could be used? Please fix the captions in Figure 1, which are hiding parts of the plots. Keeping all the plots on the same scale may help readers judge the sensitivity with respect to D. The empirical evaluation is good and comprehensive. However, the part concerning the computational cost is not that great; I would suggest that it be either strengthened or dropped. Overall, the results of the paper are nice. I would rate it as weak accept.

=====