Subgroup Discovery with the Cox Model
Abstract
We study the problem of subgroup discovery for survival analysis, where the goal is to find an interpretable subset of the data on which a Cox model is highly accurate. We examine why existing quality functions are insufficient for this problem and introduce two technical innovations: the expected prediction entropy (EPE), a novel metric for evaluating survival models that predict hazard functions, and the conditional rank statistics (CRS), which quantify how far an individual point deviates from a subgroup's survival time distribution. We study the EPE and CRS theoretically and show that they address problems with existing metrics. We then introduce seven algorithms for Cox subgroup discovery. Our main algorithm is based on the DDGroup framework of Izzo et al. (2023) and leverages both the EPE and CRS, which allows us to give theoretical correctness guarantees in well-specified settings. Empirical evaluation on synthetic and real data confirms our theory, showing that our methods recover ground-truth subgroups in well-specified cases and achieve better model fit than a Cox model naively fit to the entire dataset. A case study on NASA jet engine simulation data demonstrates that the discovered subgroups uncover known nonlinearities in the data and suggest design choices mirrored in practice.