Paper ID: 138
Title: Accurate Robust and Efficient Error Estimation for Decision Trees

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper presents and proposes an approach for the estimation of the errors committed by decision trees.  

Clarity - Justification: A simple solution presented in a complicated way.

Significance - Justification: The problem is well-known and the novelty is disputable. 

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The method the authors present has theoretical foundations on Bayes error analyzed (for decision trees) by Devroye et al. 20 years ago.  Apart some few exceptions, all the references are from nineties. I am wondering whether this line of research is still active and how much this topic fits ICML scopes nowadays.   The analysis in the case of learning random forests can be very useful in this case, especially because the authors mention this aspect as one of the main motivations for their study.  However, the authors did not perform such a study.  Moreover, the authors claim that the error estimation they propose is suitable for preventing overfitting. In this perspective I would have expected a comparison with classical decision tree pruning techniques such as those summarized and evaluated several years ago: Floriana Esposito, Donato Malerba, and Giovanni Semeraro. 1997. A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Trans. Pattern Anal. Mach. Intell. 19, 5 (May 1997), 476-491.    Methodologically, the solution the authors propose is rather simple, but presented in a very complicated manner. Moreover, I’m not sure whether the contribution of theorem 3.1 is significant or not. Experiments should show the performances when the specific case of small samples is taken into account and when this aspect is not taken into account. Finally, for multiclass data, the authors do not impose any constraint on the values $\overline\eta^1(A_i)… \overline\eta^M(A_i)$. In particular, if the multiclass problem the authors are considering is 1 vs. all (only one class value is associated to each example), a constraint between the values  $\overline\eta^1(A_i)… \overline\eta^M(A_i)$ should be defined (e.g. $\sum_k{\overline\eta^k(A_i)}=1$. Otherwise, if the multiclass problem the authors are considering is (actually) multilabel, then the definition the authors are using is correct. According to the datasets the authors use, I suppose that the answer to my question is  

=====

Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper introduces a method for accurately estimating errors in leaf nodes of decision trees.  The method is quite fast and outperforms K-fold cross-validation.  The paper presents both theoretical derivations of the error decomposition and empirical validation.

Clarity - Justification: The paper is well organized and clear.  The only thing reducing the clarity is the amount of math, but that's unavoidable.

Significance - Justification: The advance is incremental but significant.  This looks to become the state of the art in error estimation.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Only small corrections.        "into each leaf nodes" -> "into each leaf node"        "thus obviating the need for a separated test datasets" -- phrase should be either singular or plural, not mixed.        "note that overfiiting"        "still unknown" -> "is still unknown"        "However when tree depths are getting exceedingly large," -> "However, as tree depths get exceedingly large"  

=====

Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper is an interesting contribution to the topic of obtaining reliable estimates of the error at tree nodes. This is a relevant issue for instance for overfitting avoidance out also to improve the trees class probability estimates.   The authors compare their approach to cross validation estimates and claim their estimates are more accurate and more efficient to obtain in terms of computation costs. 

Clarity - Justification: The paper is very well written and reasonably easy to follow, although some of the more theoretical parts may eventually be more challenging to follow.

Significance - Justification: This is an interesting contribution to the important issue of obtaining reliable estimates of the decision trees error. The claims are well supported though I think the authors have missed some related work that should have been included in the comparisons.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The authors propose a method that provides a correction factor to the observed estimates based on the sample size. This is an interesting account that allows saving computation time when compared to sampling-based alternatives like for instance cross validation estimators. The authors do a convincing job in showing that the proposed estimator is competitive with this CV estimates, and moreover with obvious gains in terms of computation requirements.  My main problem with this work is that it does not refer other related approaches. Moreover, I think that these other approaches should in effect be compared to the current proposal, so it is not just a question of referring these works. In effect, there are other previous attempts to adjust standard error estimation within tree-based models for the problem of being based on small samples, particularly on lower level tree nodes. These alternatives share the same goals as the current work and have also the same computational advantage over more traditional sampling-based estimators. Whether the current estimator is more accurate or not is something to be shown and this would be interesting to know but I’m afraid the time constraints of conference papers do not allow this to be carried out in time. So, in my opinion, if this paper is accepted at least these other related works should be referenced though the ideal would be to include them in the comparisons.  Some of the works I've mentioned above include: - Quinlan (1993) used a correcting factor to resubstitution estimates in C4.5 based on the binomial distribution. - Cestnik (1990) proposed m-estimators that were used in Cestnik & Bratko (1991) to estimate class probabilities in the context of pruning decision trees. This approach was then generalized to regression trees by Karalic & Cestnik (1991). Also within regression trees Torgo (1997) has presented a similar correction approach based on the Chi-square distribution.  References: - Cestnik,B. (1990) : Estimating probabilities : A crucial task in Machine Learning. In Proc. of the 9th European Conference on Artificial Intelligence (ECAI-90), Pitman Publishers. - Cestnik,B. &  Bratko,I. (1991): On estimating probabilities in tree pruning, Machine Learning—EWSL-91, 1991 - Karalic,A., Cestnik,B. (1991) : The bayesian approach to tree-structured regression. In proceedings of the ITI-91. - Quinlan,J.R. (1993) : C4.5: programs for machine learning. Morgan Kaufmann, 1993. - Torgo,L. (1998): A Comparative Study of Reliable Error Estimators for Pruning Regression Trees, in Proceedings of the Iberoamericam Conference on AI (IBERAMIA-98)

=====