Are We Overconfident in Models and Results for Semi-Supervised 3D Medical Image Segmentation?
Abstract
Semi-supervised learning has become a dominant paradigm for reducing annotation costs. However, we argue that current progress is clouded by a dual problem of overconfidence. Algorithmically, prevailing pseudo-labeling frameworks often conflate prediction confidence with uncertainty, leading to severe confirmation bias and poorly calibrated models. Methodologically, because several benchmark datasets lack validation sets, some studies repurpose the test set for validation, inflating reported results; subsequent methods, compelled to adopt the same strategy to surpass the reported state of the art, fuel an arms race of overfitting. Exciting numerical gains may therefore reflect test-set overfitting rather than genuine progress. To address both issues, we propose TCSeg, a tri-space calibrated segmentation framework built on a principled dual-axis reliability assessment engine. It explicitly decouples confidence from uncertainty and uses this signal to detect and correct confirmation bias collaboratively across the feature, probability, and image spaces. Across three benchmarks, TCSeg delivers consistently strong performance at both the best and final checkpoints, and we report results under a multi-run protocol to reset the benchmark from a more realistic perspective. Core code is available at: https://github.com/BubbleDirk/temporaryanonymoustcseg.