We present a statistical analysis of the AUC as an evaluation
criterion for classification scoring models. First, we consider
significance tests for the difference between AUC scores of two
algorithms on the same test set. We derive exact moments under
simplifying assumptions and use them to examine approximate practical
methods from the literature. We then compare the AUC to the empirical
misclassification error when the prediction goal is to {\em minimize
the future error rate}. We show that the AUC may be preferable to
empirical error even in this case and discuss the tradeoff between
approximation error and estimation error underlying this phenomenon.
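
For concreteness, the empirical AUC analyzed here coincides with the
normalized Wilcoxon--Mann--Whitney statistic: with $m$ positive
examples $x_i^{+}$, $n$ negative examples $x_j^{-}$, and scoring
function $s$,
\[
\widehat{\mathrm{AUC}}(s) \;=\; \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}
\Big(\mathbf{1}\big[s(x_i^{+})>s(x_j^{-})\big]
+\tfrac{1}{2}\,\mathbf{1}\big[s(x_i^{+})=s(x_j^{-})\big]\Big).
\]
The sketch below computes this statistic and runs a generic paired
permutation test for the difference between two models' AUC scores on
the same test set. It is an illustration only: the function names and
the score-swapping scheme are our assumptions, not the approximate
methods examined in the paper.

\begin{verbatim}
import numpy as np

def empirical_auc(pos, neg):
    # Fraction of (positive, negative) pairs ranked correctly,
    # counting ties as one half (Wilcoxon-Mann-Whitney statistic).
    diff = np.asarray(pos, float)[:, None] - np.asarray(neg, float)[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def paired_auc_permutation_test(scores_a, scores_b, labels,
                                n_perm=10000, seed=0):
    # Under the null that the two models are exchangeable, swapping
    # their scores example-by-example leaves the distribution of the
    # AUC difference unchanged; the p-value is the fraction of swaps
    # yielding a difference at least as large as the observed one.
    rng = np.random.default_rng(seed)
    a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    y = np.asarray(labels, bool)
    diff = lambda u, v: (empirical_auc(u[y], u[~y])
                         - empirical_auc(v[y], v[~y]))
    observed = diff(a, b)
    hits = 0
    for _ in range(n_perm):
        swap = rng.random(y.size) < 0.5
        sa, sb = np.where(swap, b, a), np.where(swap, a, b)
        if abs(diff(sa, sb)) >= abs(observed):
            hits += 1
    return observed, hits / n_perm
\end{verbatim}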