Poster Mon, Jul 6, 2026 • 10:00 PM – 11:45 PM PDT HALL A #1000

SCRWKV: Ultra-Compact Structure-Calibrated Vision-RWKV for Topological Crack Segmentation

Hanxu Zhang ⋅ Chen Jia ⋅ Hui Liu ⋅ Xu Cheng ⋅ Fan Shi ⋅ Shengyong Chen

Abstract

Pixel-accurate segmentation of structural cracks across diverse scenarios remains a formidable challenge. Existing methods struggle to balance crack topology modeling with computational efficiency, and often fail to reconcile high segmentation quality with low resource demands. To address these limitations, we propose the Ultra-Compact Structure-Calibrated Vision RWKV (SCRWKV), a network that achieves high-precision modeling via a novel Structure-Field Encoder (SFE) backbone while maintaining linear complexity. The SFE integrates the Adaptive Multi-scale Cascaded Modulator (AMCM) to enhance texture representation and uses the Structure-Calibrated Insight Unit (SCIU) as its core engine. Specifically, the SCIU employs the Geometry-guided Bidirectional Structure Transformation (GBST) to capture topological correlations and integrates the Dynamic Self-Calibrating Decay (DSCD) into Dy-WKV to suppress noise propagation. Furthermore, we introduce a lightweight Cross-Scale Harmonic Fusion (CSHF) decoder to achieve precise feature aggregation. Systematic evaluations on multiple benchmarks characterized by complex textures and severe interference demonstrate that SCRWKV, with only 1.22M parameters, significantly outperforms state-of-the-art (SOTA) methods. On the TUT dataset, it achieves an F1 score of 0.8428 and an mIoU of 0.8512, indicating strong potential for efficient real-world deployment. The code is available at https://github.com/zhxhzy/SCRWKV.
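
The abstract reports F1 and mIoU. As a rough illustration of how such scores are commonly computed for binary crack masks, here is a minimal NumPy sketch. The function names `binary_f1` and `mean_iou` are hypothetical, the masks are assumed to be already binarized at a fixed threshold, and mIoU is assumed to be the average over the background and crack classes. The authors' exact evaluation protocol may differ.

```python
import numpy as np


def binary_f1(pred, target, eps=1e-9):
    """F1 score of the crack (foreground) class for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.logical_and(pred, target).sum()    # crack pixels found
    fp = np.logical_and(pred, ~target).sum()   # false alarms
    fn = np.logical_and(~pred, target).sum()   # missed crack pixels
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return float(2 * precision * recall / (precision + recall + eps))


def mean_iou(pred, target):
    """IoU averaged over background and crack (assumed 2-class mIoU)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    ious = []
    for cls in (False, True):
        p, t = pred == cls, target == cls
        union = np.logical_or(p, t).sum()
        if union > 0:  # skip a class that appears in neither mask
            ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))


# Toy 2x4 example: 2 true positives, 1 false positive, 1 false negative.
pred = np.array([[1, 1, 0, 0],
                 [0, 1, 0, 0]])
target = np.array([[1, 1, 1, 0],
                   [0, 0, 0, 0]])
f1 = binary_f1(pred, target)
miou = mean_iou(pred, target)
print(f"F1={f1:.4f}  mIoU={miou:.4f}")
```

In the toy example, precision and recall are both 2/3, so F1 is 2/3. The crack IoU is 2/4 and the background IoU is 4/6, giving an mIoU of 7/12.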

Lay Summary

Accurate pixel-level crack detection in diverse scenarios is a major challenge. Existing methods struggle to balance tracing crack shapes with computational efficiency, and they often fail to combine high-quality results with low computing resource demands. To solve this, we propose a highly compact network that achieves high precision while remaining efficient. Our model uses new mechanisms to improve texture representation and capture the continuous shapes of cracks. It also includes a dynamic feature that suppresses distracting background noise. Finally, a lightweight decoder precisely brings all the information together. Systematic tests on multiple datasets with complex backgrounds and severe interference show that our model significantly outperforms current leading methods. It achieves this while using only 1.22 million parameters, confirming its strong potential for efficient real-world use.