Spectral–Spatial Mixing with Morphology-Aware Adaptive Loss for Medical Image Segmentation
Sanaullah Chowdhury ⋅ Lameya Sabrin
Abstract
Medical image segmentation requires balancing global context with computational efficiency, where self-attention mechanisms suffer from quadratic $\mathcal{O}((HW)^2 C)$ complexity. We propose S2M-Net, a parameter-efficient architecture (4.7M parameters) that achieves computational savings through Spectral--Spatial Token Mixing (SSTM). SSTM achieves $\mathcal{O}(HWC^2)$ complexity by efficiently combining $\mathcal{O}(HWC \log(HW))$ frequency-domain processing with $\mathcal{O}(HWCd)$ bottlenecked spatial gating ($d{=}16$), exploiting spectral concentration: over $93\%$ of energy is captured by $K{=}32$ low-frequency components ($\sim$0.8\% of the spectrum at $352{\times}352$ resolution). This design avoids self-attention's prohibitive $\mathcal{O}((HW)^2C)$ attention map computations while preserving global receptive fields. To handle geometric diversity, we introduce Morphology-Aware Adaptive Segmentation Loss (MASL), which automatically modulates five loss objectives based on per-sample morphological descriptors (tubularity, compactness, irregularity, and scale). Evaluation across 15 datasets spanning 8 modalities shows that S2M-Net obtains the best score on 14 of 15 datasets, with statistically significant improvements ($p < 0.0033$, Bonferroni-corrected) on 7 challenging tasks (complex morphology, class imbalance, and multi-class segmentation) and clinically meaningful gains ($0.5$--$1.6\%$ Dice) on 8 mature benchmarks. Notably, S2M-Net achieves $83.43\%$ Dice on EndoVis17 multi-class instrument segmentation ($+8.69\%$ over TransUNet and $+9.14\%$ over UMamba at $74.29\%$), while using $12.8{\times}$ fewer parameters (4.7M vs.\ 60M).
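The SSTM idea described above (a global frequency-domain branch over the $K$ lowest frequencies, combined with a cheap bottlenecked spatial gate of width $d$) can be illustrated with a minimal numpy sketch. The weight shapes, initialisation, and the exact rule for fusing the two branches are illustrative assumptions; this is not the paper's implementation.

```python
import numpy as np

def sstm_mix(x, K=32, d=16, seed=0):
    """Hedged sketch of Spectral--Spatial Token Mixing (SSTM).

    x: feature map of shape (H, W, C). The spectral branch modulates only
    the K x K lowest-frequency FFT coefficients (O(HWC log(HW)) for the
    transforms); the spatial branch is a channel bottleneck C -> d -> C
    gate (O(HWCd)). Exact fusion and weight shapes are assumptions.
    """
    H, W, C = x.shape
    rng = np.random.default_rng(seed)

    # Spectral branch: 2-D FFT, reweight K x K low frequencies, inverse FFT.
    F = np.fft.fft2(x, axes=(0, 1))
    w = 1.0 + 0.02 * rng.standard_normal((K, K, C))  # learnable in practice
    F[:K, :K, :] *= w
    spec = np.fft.ifft2(F, axes=(0, 1)).real

    # Spatial branch: bottlenecked per-pixel gate, sigmoid-activated.
    W1 = rng.standard_normal((C, d)) / np.sqrt(C)
    W2 = rng.standard_normal((d, C)) / np.sqrt(d)
    gate = 1.0 / (1.0 + np.exp(-((x @ W1) @ W2)))

    return spec * gate  # global spectral context, locally gated
```

Keeping only $K{=}32$ of the frequency rows/columns is what the spectral-concentration claim licenses: the discarded coefficients carry less than 7% of the energy at $352{\times}352$.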
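The morphological descriptors that drive MASL can likewise be sketched from a binary mask. The descriptor formulas below (isoperimetric compactness, its inverse as a tubularity proxy, relative scale) and the softmax mapping to loss-term weights are assumptions chosen for illustration; the paper's five objectives and exact modulation rule are not reproduced.

```python
import numpy as np

def masl_weights(mask, eps=1e-6):
    """Hedged sketch of MASL's per-sample morphology-driven weighting.

    mask: binary (H, W) foreground mask. Returns a weight vector over
    five hypothetical loss objectives. All formulas are illustrative.
    """
    H, W = mask.shape
    area = mask.sum() + eps

    # Perimeter proxy: foreground pixels with a background 4-neighbour.
    pad = np.pad(mask, 1)
    nb = pad[:-2, 1:-1] + pad[2:, 1:-1] + pad[1:-1, :-2] + pad[1:-1, 2:]
    perimeter = ((mask == 1) & (nb < 4)).sum() + eps

    compactness = 4.0 * np.pi * area / perimeter**2   # 1 for a disc
    tubularity = 1.0 / compactness                    # elongated -> large
    irregularity = max(1.0 - compactness, 0.0)        # boundary roughness
    scale = area / (H * W)                            # relative object size

    # Softmax over descriptors (plus a constant base term) -> loss weights.
    z = np.array([tubularity, compactness, irregularity, scale, 1.0])
    w = np.exp(z - z.max())
    return w / w.sum()
```

A thin vessel-like mask yields high tubularity and shifts weight toward the corresponding objective, while a compact lesion shifts it toward compactness, which is the adaptive behaviour the abstract describes.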