Shortcut-Resistant CAM Distillation for Long-Tailed Recognition
Abstract
Real-world datasets often follow a long-tailed distribution, making generalization to tail classes difficult. We revisit this problem through the lens of shortcut learning, where models prefer the easiest predictive cues (e.g., background or textures) over object-centric semantics, especially under scarce and biased supervision. We find that this tendency is amplified for tail classes: their limited examples often share similar contexts, making non-semantic signals highly correlated with the label and therefore attractive as shortcuts, whereas head classes with diverse appearances and environments encourage more stable object-focused representations. Motivated by this observation, we propose Shortcut-Resistant CAM Distillation (SRCD), a plug-and-play framework that transfers object-focused explanations from head to tail classes. SRCD operates in the Class Activation Map (CAM) space, where a CAM provides a class-specific spatial evidence map for a prediction. SRCD aggregates the CAMs of a small set of head-class candidates into a shortcut-resistant teacher using an energy-based weighting that combines coherence (Laplacian smoothness) and concentration (Hoyer sparsity), and distills this teacher into the tail-class CAM. We provide a theoretical analysis that quantifies shortcut reliance as the shortcut-region evidence mass in CAM space and shows that SRCD suppresses shortcut reliance for tail classes. Extensive experiments on long-tailed benchmarks show that SRCD consistently improves strong baselines.
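The sketch below illustrates the teacher construction and distillation step described above, assuming the head-class candidate CAMs are already extracted as a tensor of shape (K, H, W) and the tail-class CAM as (H, W). The specific scoring functions, the temperature `tau`, and the KL-based distillation loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def laplacian_smoothness(cam: torch.Tensor) -> torch.Tensor:
    """Coherence score: negative mean squared response to a discrete Laplacian
    (smoother, more spatially coherent maps score higher)."""
    kernel = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                          device=cam.device).view(1, 1, 3, 3)
    lap = F.conv2d(cam.unsqueeze(0).unsqueeze(0), kernel, padding=1)
    return -lap.pow(2).mean()

def hoyer_sparsity(cam: torch.Tensor) -> torch.Tensor:
    """Concentration score: Hoyer sparsity in [0, 1]; values near 1 mean the
    evidence mass is concentrated in few locations."""
    x = cam.flatten().abs() + 1e-8
    n = torch.tensor(float(x.numel()), device=cam.device)
    return (n.sqrt() - x.sum() / x.norm()) / (n.sqrt() - 1.0)

def srcd_teacher(head_cams: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Aggregate K head-class candidate CAMs into one teacher CAM, weighting
    each candidate by a softmax over its coherence + concentration score
    (an assumed form of the energy-based weighting)."""
    scores = torch.stack([laplacian_smoothness(c) + hoyer_sparsity(c)
                          for c in head_cams])
    weights = torch.softmax(scores / tau, dim=0)
    return (weights.view(-1, 1, 1) * head_cams).sum(dim=0)

def srcd_distill_loss(tail_cam: torch.Tensor, teacher_cam: torch.Tensor) -> torch.Tensor:
    """Distill the teacher's spatial evidence into the tail-class CAM by
    matching the two maps as normalized spatial distributions."""
    log_p = torch.log_softmax(tail_cam.flatten(), dim=0)
    q = torch.softmax(teacher_cam.flatten(), dim=0)
    return F.kl_div(log_p, q, reduction="sum")

if __name__ == "__main__":
    head_cams = torch.rand(5, 7, 7)                      # 5 head-class candidate CAMs
    tail_cam = torch.rand(7, 7, requires_grad=True)      # tail-class CAM (student)
    loss = srcd_distill_loss(tail_cam, srcd_teacher(head_cams))
    loss.backward()                                      # gradient flows into the tail CAM
    print(float(loss))
```

In this reading, the weighting favors head-class CAMs whose evidence is both smooth and concentrated, so diffuse or background-heavy candidates contribute little to the teacher that the tail-class CAM is pulled toward.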