Thank you for the valuable comments and feedback; we will update the paper based on the reviews. We are glad the reviewers found the paper interesting and saw the possibility of high practical significance. Below we comment on the specific points raised by the reviewers.

Reviewer 1:

“...reported for MNIST and SVHN and are quite far from the state of the art (MNIST) and very far from it (SVHN), respectively. This is important since it is typically easier to improve a suboptimal method than it is to improve the state of the art.”

It is true that a comparison using more complex state-of-the-art models would be more interesting. Our goal, however, is to compare the method against search methods that require multiple training runs through the data, such as grid and random search. It seems likely that the relative performance of hyperparameter search methods is orthogonal to the exact choice of model. This "orthogonality to the model" assumption is further supported by the variety of model settings we use (Figures 4-6).

“Figure 4: what is the best test-set performance of "test: before T1-T2"? It looks like it might be smaller than the best T1-T2 error of 0.8. Is this clearer when using log-log axes?”

Indeed, the best test error without T1-T2 is lower than the best test error with T1-T2. T1-T2 is not guaranteed to find the absolute optimum: given enough runs, exhaustive search will likely find a better solution. Moreover, T1-T2 optimizes the negative log-likelihood cost, whose minimum does not correspond exactly to the best classification error (Figure 7).

Reviewer 2:

“How the authors obtain equation (5,6,7) from Maclaurin et al is rather mysterious to me, as a momentum term in the original work would make this simple accumulation tricky. Even if one just sticks to the intuition (excellent I must say), there must be a typo in Eq(7) where the last term should be subtracted. What is g' in EQ. (7)?”

Thank you for the comments. Indeed, the momentum term does make this more complicated; Eq. (7) is an approximation that considers only the contribution of the previous gradient to the momentum term. The symbol g refers to the function describing the dependency of each update \delta w_{t,t+1} on the previous gradients, i.e. \delta w_{t,t+1} = g(grad(w_t), grad(w_{t-1}), grad(w_{t-2}), ...), which is also why the minus sign disappears. The conclusions of the paper are clearest in the case without a momentum term, i.e. \delta w_{t,t+1} = -lr * grad(w_t). We will make sure to clarify this.
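To make the no-momentum case concrete, here is a minimal JAX sketch of a greedy, single-step hyperparameter gradient taken through the update \delta w = -lr * grad(w_t). This is an illustration of the general idea rather than our implementation or the exact Eq. (5)-(7); the linear model, the squared-error losses, and the single regularization hyperparameter lam are illustrative assumptions.

    import jax
    import jax.numpy as jnp

    def train_loss(w, lam, batch):
        # Training cost C(w, lam): squared error plus an L2 penalty whose
        # strength lam plays the role of the T2 hyperparameter (illustrative choice).
        x, y = batch
        return jnp.mean((x @ w - y) ** 2) + lam * jnp.sum(w ** 2)

    def valid_loss(w, batch):
        # Validation cost: no regularization term.
        x, y = batch
        return jnp.mean((x @ w - y) ** 2)

    def t1_t2_step(w, lam, lr, hyper_lr, train_batch, valid_batch):
        # T1: plain SGD update of the elementary parameters,
        # delta_w = -lr * grad_w C(w_t, lam).
        g_w = jax.grad(train_loss, argnums=0)(w, lam, train_batch)
        w_new = w - lr * g_w

        # T2: differentiate the validation cost through that single update,
        # i.e. d C_valid(w - lr * grad_w C(w, lam)) / d lam.
        def valid_after_update(lam_):
            g = jax.grad(train_loss, argnums=0)(w, lam_, train_batch)
            return valid_loss(w - lr * g, valid_batch)

        g_lam = jax.grad(valid_after_update)(lam)
        lam_new = lam - hyper_lr * g_lam
        return w_new, lam_new

With a momentum term, the inner update would additionally depend on earlier gradients, which is exactly the complication discussed above.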
“Experiments are very well thought and show clear improvements over 2 datasets: MNIST and SVHN (the authors should have described or at least give a reference for each dataset)”

Thank you, we will add references to the datasets.

Reviewer 3:

“The symbols on the figures are too small, and the axes are sometimes hard to understand”

Thank you, we will improve the figures and captions.

“Figure 4 is intriguing. I'm curious about how reliably the T1-T2 algorithm gathers a wide range of hyperparameters (maybe drawn randomly) into a tighter region where e.g. random search would tend to fare better. It seems that this could be the basis for new algorithm design...”

This is indeed an interesting avenue for further study!

“Sentence "Additional cost can be set below 30% of that of backpropagation, regardless of model size" doesn't make sense to me.”

As argued in Section 2.2, the additional cost of using T1-T2 depends on which T2 hyperparameters are updated and how often. For tuning noise, a T2 update costs at most 3 times as much as a T1 update (i.e. the cost of backpropagation). In our experiments we perform 10 T1 updates per T2 update, so the added cost is at most 3/10 = 30% of that of backpropagation. The cost could be lowered further by performing T2 updates less often, although we have not tried this. We will clarify this in the paper.

“The last paragraph seems unsupported - using this method to tune a "continuous version" of the number of layers does not seem like an obvious extension of this work.”

One possible implementation of a continuously parametrized number of layers is the following: the final softmax layer takes input from all the hidden layers, but the contribution of each layer is weighted by a continuous function such that one layer and its neighbouring layers contribute the most. For example, output = softmax(\sum_{i=1}^{L} W_i h_i a(i)), where L is the number of layers and a(i) = e^{-\frac{(i-m)^2}{v}}. Which layer contributes most to the output is thus set by a differentiable function, and the parameters of this function (m and v in the example) can in principle be trained with T1-T2 (a short sketch of this construction is appended at the end of this response). Moreover, stochastic hyperparameters could be trained using a method for approximating gradients of stochastic variables, e.g. Tang et al. (2013). We will add the example to the paper.

Overall: We would also like to thank the reviewers for all the other comments on improving the clarity of the presentation, and we will update the paper accordingly.
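As promised above, a short sketch of the continuously weighted depth construction. This is purely illustrative: the tanh nonlinearity and the parameter containers hidden_params/out_params are our assumptions here, not part of the paper.

    import jax
    import jax.numpy as jnp

    def forward(hidden_params, out_params, x, m, v):
        # hidden_params: list of (W, b) pairs producing h_1 ... h_L;
        # out_params: list of W_i mapping each h_i to the logits.
        # m and v parametrise the differentiable layer weighting
        # a(i) = exp(-(i - m)^2 / v), so "which layer matters most" is itself
        # a continuous quantity that could in principle be tuned with T1-T2.
        h = x
        logits = 0.0
        for i, ((W, b), W_out) in enumerate(zip(hidden_params, out_params), start=1):
            h = jnp.tanh(h @ W + b)                 # h_i
            a_i = jnp.exp(-((i - m) ** 2) / v)      # layer weight a(i)
            logits = logits + a_i * (h @ W_out)     # a(i) * W_i * h_i
        return jax.nn.softmax(logits)

Gradients of a validation objective with respect to m and v are then available through ordinary automatic differentiation, in the same way as for other continuous hyperparameters.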