When Internal Feedback Goodharts: SAE Rewards Fail to Improve Robot Success
Abstract
World feedback for robot learning is sparse and delayed, which motivates dense proxy signals for filtering, credit assignment, or fine-tuning. Sparse autoencoders (SAEs) expose dense internal features in diffusion robot policies, but it is unclear whether these features provide useful feedback or are merely interpretable correlates. We report a controlled negative result that tests this boundary in Octo-Base on SimplerEnv. SAE features show task and phase structure, native-state local action selectivity, and outcome association in a small pilot. They are also optimizable: a no-training action-chunk reranker can select candidates with much higher SAE reward than random-chunk controls. However, optimizing this internal reward does not improve closed-loop task success. On four episode seeds where native Octo succeeds every time and our manual sampler succeeds on three, SAE-feature reranking succeeds on only one, matching random-chunk and random-feature reranking. This proxy-success gap is the main contribution: sparse policy features can be valid diagnostics and optimizable proxies while still failing as grounded world feedback under naive optimization.
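To make the reranking setup concrete, the following is a minimal sketch of best-of-N action-chunk reranking together with a random-chunk control and a random-feature control. It is an illustration under stated assumptions, not the implementation used in the experiments: the names `make_reward`, `rerank`, `random_chunk`, and `toy_features` are hypothetical, the reward is assumed to be a sum of selected feature activations, and `toy_features` is a toy stand-in for an SAE encoder rather than anything derived from Octo.

```python
import random
from typing import Callable, List, Sequence

Chunk = List[float]                         # hypothetical: a flattened action chunk
FeatureFn = Callable[[Chunk], List[float]]  # hypothetical: chunk -> feature activations


def make_reward(features: FeatureFn, idx: Sequence[int]) -> Callable[[Chunk], float]:
    """Internal reward (assumed form): summed activation of the chosen features."""
    def reward(chunk: Chunk) -> float:
        acts = features(chunk)
        return sum(acts[i] for i in idx)
    return reward


def rerank(candidates: Sequence[Chunk], reward: Callable[[Chunk], float]) -> int:
    """Best-of-N: index of the candidate with the highest reward (first on ties)."""
    return max(range(len(candidates)), key=lambda i: reward(candidates[i]))


def random_chunk(candidates: Sequence[Chunk], rng: random.Random) -> int:
    """Random-chunk control: ignore the reward and pick uniformly at random."""
    return rng.randrange(len(candidates))


def toy_features(chunk: Chunk) -> List[float]:
    """Toy stand-in for an SAE encoder: ReLU of a few fixed linear readouts."""
    total = sum(chunk)
    return [max(0.0, total), max(0.0, -total), max(0.0, chunk[0])]


if __name__ == "__main__":
    rng = random.Random(0)
    # Eight candidate chunks; in a real system these would be sampled from the policy.
    candidates = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(8)]
    sae_reward = make_reward(toy_features, idx=[0])                     # selected feature
    rand_feat_reward = make_reward(toy_features, idx=[rng.randrange(3)])  # random-feature control
    chosen = rerank(candidates, sae_reward)
    control = random_chunk(candidates, rng)
    print("reranked reward:", sae_reward(candidates[chosen]))
    print("random-chunk reward:", sae_reward(candidates[control]))
    print("random-feature pick:", rerank(candidates, rand_feat_reward))
```

By construction, the reranked candidate scores at least as high on the internal reward as any control pick. This shows only that the proxy is optimizable; it says nothing about whether the selected chunk leads to task success, which is the gap the abstract reports.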