Poster in Affinity Event: The 6th Muslims in ML (MusIML) Workshop

When Depth Adds Nothing

Maryam Fatima

Abstract

Chain-of-thought (CoT) monitoring is increasingly recognized as a fragile alignment affordance whose adequacy is threatened by latent-reasoning architectures such as recurrent-depth transformers (RDTs). A natural candidate replacement is to probe the depth dimension of the recurrent loop directly. We test this by fine-tuning a 2.3M-parameter RDT with Group Relative Policy Optimization (GRPO) on a task instrumented with an input-channel reward leak, and by training linear probes at every loop depth. Task probes achieve AUROC $= 1.0$ at every depth. However, two pre-registered controls, probes on the pre-RL base model and probes on the input embedding alone, also achieve AUROC $= 1.0$, and a single-bit feature indicating leak presence in the input achieves AUROC $= 0.99$. We conclude that, for an input-channel exploit in this architecture, the recurrent loop contributes no monitoring information beyond what is already available from the input, and that input-layer baselines should be a mandatory control for any depth-probing study on a recurrent architecture. We also identify three exploit classes for which a positive depth-localization result would be expected.
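
The input-layer control described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the `auroc` helper, the variable names, and the eight toy examples are all hypothetical and chosen only to show the comparison. The idea is to score the same labels twice, once with per-depth probe outputs and once with a single-bit "leak present" input feature, and to report the difference as the monitoring information attributable to depth.

```python
def auroc(scores, labels):
    """Pairwise AUROC: probability that a random positive outscores a
    random negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT results from the poster).
# labels: 1 = episode exploited the reward leak, 0 = it did not.
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Input-layer baseline: a single bit saying whether the leak is in the input.
leak_bit = [1, 1, 1, 0, 0, 0, 0, 0]

# Hypothetical linear-probe scores at two loop depths.
probe_scores_by_depth = {
    1: [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1],
    2: [0.9, 0.8, 0.7, 0.35, 0.4, 0.3, 0.2, 0.1],
}

baseline = auroc(leak_bit, labels)

# Gain over the input baseline at each depth; a gain near zero means the
# probe at that depth adds nothing beyond what the input already reveals.
gain_by_depth = {
    depth: auroc(scores, labels) - baseline
    for depth, scores in probe_scores_by_depth.items()
}

for depth, gain in sorted(gain_by_depth.items()):
    print(f"depth {depth}: gain over input baseline = {gain:.4f}")
```

Reporting the gain rather than the raw probe AUROC makes the control explicit: under this framing, a probe that reaches a high AUROC only counts as evidence of depth-localized information if the input-only baseline does noticeably worse on the same labels.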