We thank the reviewers for their positive feedback.

* On constructing the tree (AR1, AR4, AR5)

Although the construction of the tree can further benefit from expert knowledge, in all of the experiments presented in the paper the tree is built using the natural Euclidean structure, with a recursive top-down clustering into two clusters. Only for the CIFAR experiments do we initially use the labeling. The later splits over the deformations are also based on the Euclidean structure.

* Novelty (AR4)

We are puzzled by the statement that "the general idea is very standard". The only example we know of a technique related to importance sampling during training is the recent paper by Zhao & Zhang. Their paper is theoretical and provides results restricted to a strict subclass of linear predictors. To some extent, Loosli & Bottou's active sampling already incorporated related ideas, but it presents neither density modeling nor an explicit formulation of the samples' importance, and offers no speed-up of the training phase. Regarding the IST itself, we are not aware of any work proposing a similar recursive adaptive sampling method. The only related methods are MCTS, which -- while similar in the way they recursively balance exploration and exploitation -- do not address the sampling problem at all, but rather optimize a functional, and as such are quite remote from what we propose (as pointed out by AR1). We would be very interested in receiving, through the meta-review, bibliographic references if we have indeed missed a relevant body of work.

* Clarity of the math (AR1)

The camera-ready version will address the remarks regarding the sign and magnitude in equations (2) and (4). We will clarify the distinction between the \mu_\Theta, which are the probabilities at the leaves resulting from the policy we use to visit the tree (i.e. from the importance sampling) and are always defined for all the samples (initially uniform), and the w_n, which are the actual weights of the samples, known only for those we have already actually sampled. The confusing sentence "s(x) is the sum of the weight estimates" means that we sum over the iterations the unbiased estimators of the weights. Phi(x) is a function of u(x) and l(x), as stated in line 305. In the experimental section (lines 695 to 710), Phi_mean(x) is a function of u(x) and l(x), Phi_max(x) is a function of u(x), and it was clearer to directly write theta(x) for the bound case. A formal definition of the linear stumps will be added.

* Convergence of the estimate (AR1)

This is a very good point, as the non-stationarity of the target distribution could indeed appear problematic at first. We ran additional experiments to assess the stability of the convergence by computing, after each epoch, the correlation between the \hat{w} obtained in two different runs of the CNN training on CIFAR. The correlation goes from -0.017 after the first epoch to 0.9897 after 10 epochs, and remains above this value afterwards. This will be added to the paper.

* Cost of sampling (AR1)

The cost of sampling is log N and amounts to a few thousand CPU operations per sample, to be compared to the millions or billions of FLOPs per sample. In practice, an epoch lasts roughly 30 min (1.8M images on GPU) and the sampling takes 4 min in our current implementation, which involves non-optimized bindings between Lua and C.
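To make the log N figure concrete, here is a minimal sketch (in Python, with hypothetical names) of a per-sample descent in a binary tree whose internal nodes store the total mass of their subtree. It only illustrates the per-sample cost; the actual visiting policy of the paper, which balances exploration and exploitation to update the weight estimates, is not reproduced here.

```python
# Minimal sketch: drawing one sample by descending a binary tree.
# Each internal node stores the total mass of its subtree (hypothetical
# structure); the walk visits O(log N) nodes for a balanced tree.
import random

class Node:
    def __init__(self, left=None, right=None, index=None, mass=1.0):
        self.left, self.right = left, right   # children (None for a leaf)
        self.index = index                    # sample index stored at a leaf
        self.mass = mass                      # total mass of the subtree

def build_balanced(indices):
    """Balanced tree over the given sample indices, with uniform masses."""
    if len(indices) == 1:
        return Node(index=indices[0], mass=1.0)
    mid = len(indices) // 2
    left, right = build_balanced(indices[:mid]), build_balanced(indices[mid:])
    return Node(left=left, right=right, mass=left.mass + right.mass)

def sample_leaf(node):
    """Draw one leaf index; cost is one comparison per level of the tree."""
    while node.left is not None:
        p_left = node.left.mass / (node.left.mass + node.right.mass)
        node = node.left if random.random() < p_left else node.right
    return node.index

root = build_balanced(list(range(1024)))
print(sample_leaf(root))  # each call visits ~log2(1024) = 10 nodes
```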
* Cost of clustering (AR1)

For the CIFAR experiments, although the tree contains 1.8M images, only the 50,000 original images are clustered (a minimal sketch of this recursive splitting is given at the end of this response). The rest of the tree is defined analytically and unfolded on the fly when the tree is visited. The cost of clustering the support vectors is virtually zero because in practice it could be done once.

* Related works and introduction (AR1)

We will move part of the related works (e.g. MCTS) to the discussion in the conclusion, since their relation to our method is indeed more through general principles such as sampling and exploration / exploitation than as existing methods that we extend.
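Coming back to the construction of the tree, below is a minimal sketch (in Python, using scikit-learn's KMeans and hypothetical parameter names) of the recursive top-down splitting into two Euclidean clusters applied to the original images. It is only an illustration of the procedure, not the implementation used for the experiments.

```python
# Minimal sketch: recursive top-down clustering of the data into two
# Euclidean clusters, producing a binary tree over the sample indices.
# Hypothetical parameters; not the code used in the paper.
import numpy as np
from sklearn.cluster import KMeans

def build_tree(X, indices, min_size=16):
    """Recursively split the samples in `indices` into two clusters."""
    if len(indices) <= min_size:
        return {"indices": indices}                    # leaf: a small group
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[indices])
    left, right = indices[labels == 0], indices[labels == 1]
    if len(left) == 0 or len(right) == 0:              # degenerate split
        return {"indices": indices}
    return {"left": build_tree(X, left, min_size),
            "right": build_tree(X, right, min_size)}

# Toy usage on random vectors standing in for flattened images.
X = np.random.randn(512, 32).astype(np.float32)
tree = build_tree(X, np.arange(512))
```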