We appreciate the reviewers’ constructive and critical comments, which improve the quality of this paper. The raised concerns, namely comparison with related work (R1 & R6), contribution and novelty of the proposed method (R6), and experiments (R6), are addressed as follows.

“there are other previous attempts to adjust standard error estimation within tree-based models for the problem of being based on small samples” (R1)
A: Among all related work reviewed in “A Comparative Analysis of Methods for Pruning Decision Trees” [1] (R6), Quinlan’s binomial-distribution-based correcting factor and Cestnik’s m-probability-based Minimum Error Pruning are the two closest pre-pruning methods, and both share the computational advantage of the proposed method. A comparison with these two methods will be carried out and reported in the camera-ready version of this paper (if accepted) and in the journal version now in preparation. Although empirical comparisons are not available yet, a close inspection of the error terms used in Quinlan’s and Cestnik’s methods reveals crucial differences from the summation of the Variance and Bias terms in (10) of our paper. Indeed, the m-probability-based expected error (EER) is equivalent to a special case of the Bias term in (13) with m = 1/2, while ignoring the influence of the Variance term completely. Nevertheless, Variance terms often make non-negligible contributions for leaf nodes with small sample sizes, e.g. fewer than 10 samples, and we will make this point more explicit in the revised paper (see the numerical sketch below).

“… within regression trees Torgo (1997) has presented a similar correction approach based on the Chi-square distribution” (R1)
A: In Torgo’s paper, an error criterion relying on the sample variance was adopted to rank pruned trees in a post-pruning manner. As admitted by the author, this error estimate was unreliable for most real-world problems, in which the assumption of normality of the labels may fail. Whether this error estimate can be used for the pre-pruning problem in our case remains to be seen.

“The problem is well-known and the novelty is disputable”, “the solution the authors propose is rather simple, but presented in a very complicated manner” (R6)
A: To the best of our knowledge, the decomposition of the upper bound of the generalization error into a quantized error and a sampling error in (6) has not been reported before for tree-based classifiers. Since the sampling error depends on the unknown quantized posterior probability $\overline\eta^y(A_i)$, the upper bound of its expectation is estimated by the Variance and Bias terms in Theorem 3.1. In particular, the introduction of the Variance term in (10) is a main differentiator from existing methods; in terms of accuracy and robustness, we are not aware of any existing method that is superior to k-fold Cross Validation. Note that both the Variance and Bias terms of the sampling error are due to the lack of sufficient samples in small regions. One should not confuse the Variance and Bias terms in (10) with the bias-variance trade-off in data fitting, in which “bias” often refers to the lack of fit to data over large regions, e.g. in [1]; in our method, the quantized error plays that role instead.

“a constraint between the values $\overline\eta^1(A_i), \ldots, \overline\eta^M(A_i)$ should be defined (e.g. $\sum_k \overline\eta^k(A_i) = 1$)” (R6)
A: In Lines 283-285, the posterior probability is normalized w.r.t. the probability measure associated with $A_i$. This normalization is not an imposed constraint, but is derived from the theoretical consistency analysis framework.
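For concreteness, a schematic form of this derivation (assuming the quantized posterior is the class probability conditioned on the cell $A_i$; the exact notation in Lines 283-285 may differ):
$$\overline\eta^y(A_i) \;=\; \frac{P(Y = y,\; X \in A_i)}{P(X \in A_i)} \quad\Longrightarrow\quad \sum_{y=1}^{M} \overline\eta^y(A_i) \;=\; \frac{\sum_{y=1}^{M} P(Y = y,\; X \in A_i)}{P(X \in A_i)} \;=\; 1 .$$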
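To make the remark on Variance terms concrete, below is a minimal numerical sketch. It uses Cestnik’s m-probability correction with m = 1/2 as a proxy for the Bias term and a plain binomial standard error as a proxy for the Variance term; both are illustrative stand-ins rather than the exact terms (10) and (13) of the paper.

```python
import numpy as np

def m_probability_error(n_err, n, m=0.5, prior=0.5):
    """Cestnik's m-probability estimate of a leaf's expected error rate.

    A uniform class prior for a binary problem is assumed here;
    m = 1/2 matches the special case of the Bias term discussed above.
    """
    return (n_err + m * prior) / (n + m)

def binomial_std_error(n_err, n):
    """Illustrative Variance proxy: the standard error of the empirical
    error rate under a binomial model (an assumption of this sketch,
    not the exact Variance term (10) of the paper)."""
    p_hat = n_err / n
    return np.sqrt(p_hat * (1.0 - p_hat) / n)

# Compare the two contributions as the leaf sample size shrinks.
for n in (100, 50, 20, 10, 5):
    n_err = max(1, n // 10)  # roughly a 10% empirical error rate
    bias_correction = m_probability_error(n_err, n) - n_err / n
    variance_term = binomial_std_error(n_err, n)
    print(f"n={n:3d}  bias correction={bias_correction:.3f}  "
          f"variance term={variance_term:.3f}")
```

Under this toy model, the variance proxy is several times larger than the bias correction once the leaf holds fewer than about 10 samples, which is precisely the regime the Variance term in (10) is designed to capture.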
“I am wondering whether this line of research is still active and how much this topic fits ICML scopes nowadays… The analysis in the case of learning random forests can be very useful… However the authors did not perform such a study” (R6)
A: A theoretical analysis of the driving forces of random forests is our ultimate goal and will be pursued in future work. In order to analyze the random forest error rate, we first had to work out a robust error estimate for individual tree classifiers, and this study turned out to be useful in its own right: the proposed error estimate provides an efficient substitute for k-fold Cross Validation for tree-based classifiers (see the sketch at the end of this response).

“Experiments should show the performances when the specific case of small samples is taken into account and when this aspect is not taken into account” (R6)
A: Our experiments (Figs. 3-6) show that as tree depths grow exceedingly large, the generalization error often flattens out or even increases due to large overfitting errors. This trend is more pronounced when the number of samples is smaller (e.g. the upper-left plot in Fig. 3). In the journal version of this paper, we also plan to report estimated errors w.r.t. the number of samples assigned to each leaf node.

[1] F. Esposito, D. Malerba, and G. Semeraro. A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Trans. Pattern Anal. Mach. Intell. 19(5):476-491, May 1997.
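Finally, to make the claimed efficiency over k-fold Cross Validation concrete: k-fold CV must train k separate trees, whereas a plug-in estimate of the kind we propose needs only one. A minimal sketch using scikit-learn follows; the per-leaf correction here is a generic binomial stand-in, not the paper’s actual Variance and Bias terms.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# k-fold CV: trains k separate trees to estimate generalization error.
k = 10
cv_err = 1.0 - cross_val_score(
    DecisionTreeClassifier(max_depth=8, random_state=0), X, y, cv=k
).mean()

# Plug-in style estimate: train once, then correct each leaf's training
# error for its sample size. The correction used here (a simple binomial
# standard-error term) is an illustrative stand-in for the paper's
# Variance + Bias terms.
tree = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X, y)
leaf_ids = tree.apply(X)
errors = tree.predict(X) != y
plug_in = 0.0
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    n = mask.sum()
    p_hat = errors[mask].mean()
    plug_in += (n / len(y)) * (p_hat + np.sqrt(p_hat * (1 - p_hat) / n))

print(f"{k}-fold CV error estimate: {cv_err:.3f}")
print(f"single-pass plug-in bound: {plug_in:.3f}")
```

The single-pass estimate touches each training sample once per leaf lookup, while CV repeats the full tree-induction cost k times; this is the computational advantage referred to above.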