Interpretable Self-Supervised Learning via Representer Landmarks and Nyström Approximation
Abstract
Self-supervised learning (SSL) effectively learns representations from massive unlabeled data, yet the resulting models typically operate as black boxes, necessitating domain-specific post-hoc explanations. We introduce KREPES, a unified framework that learns inherently interpretable representations under arbitrary SSL objectives, including SimCLR, BYOL, and VICReg. By bridging empirical neural tangent kernel approximations of neural networks with the Representer Theorem for kernels, we express the learned latent space directly via "Representer Landmarks": the representations of influential unlabeled training examples. We introduce two novel metrics, the "Sample-Specific Influence Score" and the "Conceptual Influence Profile", to quantify the transparency of the learned representations. KREPES enables a direct audit of the latent space without supervision, for example revealing an algorithmic bias in the Adult-1M dataset where SSL uses demographic proxies for income. Finally, to ensure scalability to SSL benchmarks with over one million samples (ImageNet-1K, Adult-1M), KREPES employs a novel Nyström approximation-based optimization applicable to any non-convex SSL objective.