When Can We Trust Survival Model Evaluation?
Abstract
Evaluating survival models under censoring is inherently challenging, yet standard evaluation practices are often applied without explicitly assessing how censoring distorts metric reliability. In a large experimental study, we analyze and quantify how survival evaluation metrics are affected in fundamentally different ways by both the censoring rate and the censoring mechanism. Using a controlled semi-synthetic framework, we vary the censoring mechanism (administrative, independent, or covariate-dependent) and the censoring rate, and compare standard evaluations based on censored data with oracle evaluations that use fully observed event times. This controlled setting lets us quantify distortions along two complementary axes: numerical bias and preservation of model rankings. Across datasets and metric families, we find that censoring induces systematic, mechanism-dependent distortions. Even moderate numerical bias, if left unaddressed, can make model comparisons unreliable as censoring increases. These findings reveal fundamental limitations of common benchmarking practices and call for more careful interpretation of survival evaluation under realistic censoring.
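To make the comparison concrete, below is a minimal, self-contained sketch of the kind of semi-synthetic experiment the abstract describes: event times are simulated from a known model, censoring is applied under three illustrative mechanisms, and Harrell's concordance index computed on the censored data is compared against an oracle value computed from the fully observed times. The exponential distributions, parameter values, and linear risk score are assumptions chosen for illustration, not the paper's actual experimental design.

```python
import numpy as np

rng = np.random.default_rng(0)


def c_index(time, event, risk):
    """Harrell's concordance index: among comparable pairs (i observed to
    fail before j), the fraction where the higher-risk subject fails first.
    With event all True, this is the oracle C-index on uncensored data."""
    num, den = 0.0, 0.0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # pair (i, j) is comparable only if i's event is observed
        for j in range(n):
            if time[i] < time[j]:
                den += 1
                if risk[i] > risk[j]:
                    num += 1.0
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den


# Semi-synthetic data: simulate event times from a known risk model
# so the oracle (uncensored) evaluation is available by construction.
n, p = 500, 5
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
risk = X @ beta                      # linear risk score of the evaluated "model"
T = rng.exponential(np.exp(-risk))   # true event times (higher risk -> earlier)

# Three illustrative censoring mechanisms (names and rates are assumptions).
mechanisms = {
    "administrative": np.full(n, np.quantile(T, 0.6)),            # fixed study cutoff
    "independent": rng.exponential(np.median(T) * 2, n),          # covariate-free
    "covariate-dependent": rng.exponential(np.exp(-0.8 * risk)),  # tied to covariates
}

oracle = c_index(T, np.ones(n, dtype=bool), risk)
print(f"oracle C-index (uncensored): {oracle:.3f}")

for name, C in mechanisms.items():
    Y = np.minimum(T, C)   # observed follow-up time
    delta = T <= C         # event indicator (True = event observed)
    censored = c_index(Y, delta, risk)
    print(f"{name:>20s}: censoring rate={1 - delta.mean():.0%}, "
          f"C-index={censored:.3f}, bias={censored - oracle:+.3f}")
```

Running this sketch shows the two distortion axes from the abstract in miniature: the gap between the censored and oracle C-index is the numerical bias, and repeating the loop over several candidate risk scores would reveal whether their ranking is preserved under each mechanism.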