STD-Former: Image-Conditioned Texture Dictionary Encoding with Sparse Topological Supervision for Texture Recognition
Abstract
Texture recognition is often framed as matching an image to a static dictionary or codebook built from the training set. In practice, this assumption is brittle: label-preserving transformations (illumination, scale, compression, blur) can shift test features away from the fixed training dictionary, producing a training-set codebook misalignment that limits accuracy. We propose STD-Former (Simple Texture Dictionary Transformer), a lightweight framework for image-conditioned texture dictionary encoding. Instead of comparing against a static codebook, STD-Former extracts a compact set of Intrinsic Textons (dictionary atoms, or codewords) from the input image itself, yielding representations that are self-aligned at inference. Our design is intentionally simple and follows a decoupled two-stage recipe. In Stage 1, a Texture Dictionary Extractor (TDE) is pre-trained with a self-supervised Texton Coverage Loss that encourages the learned textons to collectively cover the image patch feature manifold. In Stage 2, a classifier is trained on the encoded dictionary representation; optionally, we add a Sparse Topological Loss derived from 0D persistent homology, which is equivalent to supervising only the (B-1) edges of a minimum spanning tree (MST) over each batch of B samples, providing efficient structural regularization. Across six standard texture benchmarks, STD-Former and its topologically regularized variant, STD-Former+, achieve new state-of-the-art results.
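The MST equivalence stated in the abstract can be made concrete: for a Vietoris-Rips filtration on a batch of B feature vectors, the finite death times of the 0-dimensional persistent homology classes coincide with the (B-1) edge weights of a Euclidean minimum spanning tree, so a 0D topological loss only ever touches those edges. The sketch below, in which the function names and the particular loss form are illustrative assumptions and not taken from the paper, computes these edge weights with Prim's algorithm in NumPy:

```python
import numpy as np

def mst_edge_weights(x):
    """Return the (B-1) MST edge weights of points x with shape (B, D),
    under Euclidean distance, via Prim's algorithm.

    For a Vietoris-Rips filtration on these points, the finite 0D
    persistence death times are exactly these edge weights.
    """
    B = x.shape[0]
    d = np.linalg.norm(x[:, None] - x[None, :], axis=-1)  # (B, B) pairwise distances
    in_tree = np.zeros(B, dtype=bool)
    in_tree[0] = True
    min_dist = d[0].copy()  # cheapest connection of each node to the current tree
    edges = []
    for _ in range(B - 1):
        min_dist[in_tree] = np.inf          # never re-select tree nodes
        j = int(np.argmin(min_dist))        # closest node outside the tree
        edges.append(min_dist[j])
        in_tree[j] = True
        min_dist = np.minimum(min_dist, d[j])
    return np.array(edges)

def sparse_topological_loss(x, target=1.0):
    """One hypothetical loss form (not the paper's): pull the MST edge
    lengths toward a target scale, touching only B-1 pairwise distances."""
    return float(np.mean((mst_edge_weights(x) - target) ** 2))
```

The efficiency claim follows directly: of the B(B-1)/2 pairwise distances in a batch, only the B-1 that lie on the MST enter the loss, so gradient signal is sparse by construction.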