A Benchmarked Diagnostic for Sparse Decomposability of Dense Causal Subspaces
Abstract
Benchmarks for mechanistic interpretability should test not only whether a causal variable can be localized, but also whether the localized subspace can be recovered in a reusable representation basis. We propose sparse decomposability as a benchmarked diagnostic for dense causal subspaces: given a DAS-style teacher and a fixed pretrained SAE dictionary, causal sparse distillation (CSD) measures how much interchange-intervention behavior survives when the intervention is constrained to a small set of SAE latents. The diagnostic is calibrated on a 16-cell synthetic benchmark with ground-truth supports, where CSD-L1 recovers correlated-distractor support (F1 = 1.00) while the DBM and DiffMean controls fail. On dense-valid Gemma/Qwen tuples, two compact pre-CSD decoder-geometry statistics predict CSD/dense recovery with leave-one-out R² = 0.89 (bootstrap 95% CI [0.79, 0.95]), whereas model size alone gives R² = -2.00. Public-harness evaluations and matched diagnostic controls then separate positive selector-specific MCQA cases from random-K-degenerate RAVEL rows and sparse-limited 27B sites. In particular, the matched random-K controls show that high CSD/dense recovery is not by itself evidence of meaningful feature selection. The result is a benchmark-facing instrument: it maps where dense causal variables are sparse-decomposable in existing SAE bases and where benchmark scores should not be interpreted as SAE-level explanations.
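
A negative leave-one-out R² such as the one reported for model size alone means the held-out predictions are worse than simply predicting the sample mean. The following is a minimal sketch of that computation, not the paper's evaluation code: it assumes ordinary least squares with an intercept and a total sum of squares taken about the full-sample mean, neither of which is specified in the abstract. The function name `loo_r2` and the toy data are hypothetical.

```python
import numpy as np

def loo_r2(X, y):
    """Leave-one-out R^2 for OLS with an intercept (assumed definition).

    Each point is predicted by a model fit to the remaining n - 1 points;
    R^2 = 1 - SS_res / SS_tot, with SS_tot taken about the full-sample mean.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n = len(y)
    A = np.column_stack([np.ones(n), X])  # design matrix with intercept column
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i          # hold out point i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        preds[i] = A[i] @ coef
    ss_res = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy data (hypothetical, not from the paper).
x = np.arange(6.0)
y_linear = 2.0 * x + 1.0                  # exactly linear in x
y_noise = np.array([1.0, -1.0, 2.0, -2.0, 0.5, -0.5])

r2_informative = loo_r2(x, y_linear)                # close to 1.0
r2_mean_only = loo_r2(np.empty((6, 0)), y_noise)    # 1 - (6/5)**2 = -0.44
```

Even an intercept-only model scores below zero under leave-one-out, because each held-out mean excludes the point being predicted; a predictor that adds no signal pushes the score lower still.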