Paper ID: 1319
Title: Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces an algorithm for adapting some neural network hyperparameters online during training, specifically some continuous regularization parameters. Relative to the standard technique of leaving regularization parameters fixed during training, the evidence in the paper suggests that there is little harm and significant benefit to allowing them to change slowly during training, so as to minimize a continuous approximation of validation set error.

Clarity - Justification:
The notation in equations 1 and 2 is non-standard for neural networks; it looks like conditional probabilities of a graphical model. The symbols in the figures are too small, and the axes are sometimes hard to understand. For example, "Test: before T1-T2" doesn't mean the test error before running the T1-T2 algorithm; I think it means the test error from not using the T1-T2 algorithm and doing things the standard way. The axes are sometimes mislabeled. For example, in Figure 5, the title of the plot could be something like "Knowing the answer vs. finding it" instead of "% Error", to help the reader understand what question the figure is addressing. The vertical axis should be labeled "Training with fixed hyperparameters (SVHN Test)" and the horizontal axis should be labeled "Training with T1-T2 (SVHN Test)". The caption should briefly explain the experiment protocol and the takeaway. Figures should stand on their own; the reader should be drawn to read details from the text by looking at the figures.

Significance - Justification:
The paper contributes evidence that a particular algorithm for online hyperparameter adaptation is effective. I would use the proposed algorithm in my own work. It would have been fun to see a tighter correlation in Figure 5, but regardless it's interesting to see what happened. I am glad the authors included that figure. Figure 4 is intriguing. I'm curious about how reliably the T1-T2 algorithm gathers a wide range of hyperparameters (maybe drawn randomly) into a tighter region where e.g. random search would tend to fare better. It seems that this could be the basis for new algorithm design, since every training run now tells us more information about the hyperparameter space compared to fixed-hyperparameter training.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- In the abstract: "bellow" -> "below".
- The sentence "Additional cost can be set below 30% of that of backpropagation, regardless of model size" doesn't make sense to me.
- Quotes around "hysteresis effect" are rendered wrongly by LaTeX (use `` and '').
- "A third often used regularization" -> "A third, often-used, regularization".
- The last paragraph seems unsupported: using this method to tune a "continuous version" of the number of layers does not seem like an obvious extension of this work.
=====

Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This is a very interesting paper concerning gradient-based optimization of continuous regularization hyperparameters *while* the optimization is running (as opposed to an outer-loop hyperparameter optimization).
In a nutshell, the authors propose to alternate (1) standard gradient-based optimization steps on the model parameters (gradients with respect to the training loss) and (2) non-standard gradient-based updates of the *hyperparameters* (using a validation set). In contrast to recent work by Maclaurin et al., the hyperparameter optimization steps are performed locally, rather than by considering the effect of the hyperparameters on the entire optimization trajectory. Intuitively, this makes perfect sense, but in practice it may be tricky to get to work well, since the process is not even guaranteed to converge.

Clarity - Justification:
The paper is written very well. Downsides: Figures 2 and 3 are not referenced in the text. Figure 6 (left) is not referenced, either.

Significance - Justification:
While the topic is very important, the experimental results are too preliminary to make this paper significant.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
On the positive side, the paper's topic is very relevant and novel, the method makes intuitive sense, and the paper is well written. On the negative side, convergence is unclear (which might be hard to change in practice, but which might also not necessarily be crucial in practice), and the experimental analysis is quite limited in scope: results are reported for MNIST and SVHN and are quite far from the state of the art (MNIST) and very far from it (SVHN), respectively. This is important since it is typically easier to improve a suboptimal method than it is to improve the state of the art. If the method really works, it seems like it should be possible to apply it to a state-of-the-art network on some larger datasets.

I have two questions for the rebuttal:
1) Do you have any results for networks that achieve state-of-the-art performance?
2) Figure 4: what is the best test-set performance of "test: before T1-T2"? It looks like it might be smaller than the best T1-T2 error of 0.8. Is this clearer when using log-log axes?

Overall, I'm rating this paper as a weak reject for ICML. It presents a very nice idea that could become very important, and I like its high novelty and relevance, but the experiments are too preliminary.
=====

Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This is probably the first paper on hyperparameter selection that claims only a minimal overhead over normal training time, thanks to a very simple joint optimization trick that yields excellent results on two datasets.

Clarity - Justification:
This paper reads very well, from a fantastic introduction to very thorough experiments. The main issue I found is that some details are overlooked (sloppy notation, missing references for the datasets).

Significance - Justification:
This work is, curiously, both incremental and potentially a major breakthrough. The idea presented here can be interpreted as a simplification of previous work, but the improvements in training speed are so large that it may make many practitioners like me change their minds about hyperparameter selection (we currently think it is too time consuming).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Originality: the authors go to great pains to show that their approach is actually a simplification of two previous gradient-based approaches for hyperparameters. While this may position the work as incremental, they should also be rewarded for their honesty and methodological cleverness.
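To make sure I read the method correctly before commenting on the equations, here is a minimal sketch of the alternation as I understand it, written for a toy ridge-regression problem with a single L2 hyperparameter tuned in log-space. The one-step hypergradient, the learning rates, and every name in the code are my own illustrative assumptions, not the authors':

    import numpy as np

    # Toy setup for illustration only: ridge regression with a held-out
    # validation split and a single L2 hyperparameter, lam = exp(log_lam).
    rng = np.random.default_rng(0)
    w_true = rng.normal(size=10)
    X_tr = rng.normal(size=(80, 10)); y_tr = X_tr @ w_true + 0.5 * rng.normal(size=80)
    X_va = rng.normal(size=(40, 10)); y_va = X_va @ w_true + 0.5 * rng.normal(size=40)

    w = np.zeros(10)          # elementary parameters
    log_lam = np.log(1e-2)    # regularization hyperparameter, tuned in log-space
    alpha, beta = 1e-2, 1e-1  # learning rates (arbitrary choices of mine)

    def grad_train(w, lam):
        # gradient of C1 = MSE(train) + lam * ||w||^2 with respect to w
        r = X_tr @ w - y_tr
        return 2.0 * X_tr.T @ r / len(y_tr) + 2.0 * lam * w

    def grad_val(w):
        # gradient of C2 = MSE(validation) with respect to w (no regularizer)
        r = X_va @ w - y_va
        return 2.0 * X_va.T @ r / len(y_va)

    for step in range(2000):
        lam = np.exp(log_lam)
        # (1) standard descent step on the elementary parameters (training loss)
        w_new = w - alpha * grad_train(w, lam)
        # (2) local hyperparameter step: differentiate C2(w_new) with respect to
        #     lam through this single update only; d(w_new)/d(lam) = -alpha * 2 * w
        d_c2_d_lam = grad_val(w_new) @ (-alpha * 2.0 * w)
        log_lam -= beta * d_c2_d_lam * lam  # chain rule for the log parameterization
        w = w_new

    print("adapted lambda:", np.exp(log_lam))

The point of this sketch is only that the hyperparameter update uses gradient information local to the current step, rather than backpropagating through the whole training trajectory as in the earlier work.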
Technical quality: All the intuitions in this paper sound right, and the simplifications seem justified; it is unfortunate that the notation in section 2.1 is so hard to follow and sometimes sloppy. The notation in the equations is quite approximate: the authors use the gradient and the partial derivative notation interchangeably, and I could not detect a consistent pattern. Nothing indicates when one has a vector (gradient) or a simple scalar (partial derivative). How the authors obtain equations (5)-(7) from Maclaurin et al. is rather mysterious to me, as a momentum term in the original work would make this simple accumulation tricky. Even if one just sticks to the intuition (which is excellent, I must say), there must be a typo in Eq. (7), where the last term should be subtracted. What is g' in Eq. (7)?

Detailed comments:
- line 267: does C1 also depend on lambda?
- line 308: the subscript W is missing on the nabla C2.

The experiments are very well thought out and show clear improvements on two datasets, MNIST and SVHN (the authors should have described, or at least given a reference for, each dataset).
=====