Chain-of-Thought Gradient Descent
Hong-Yu Chen ⋅ Venkat Ganti ⋅ Jerry Yao-Chieh Hu ⋅ Hude Liu ⋅ Han Liu
Abstract
We show that Chain-of-Thought (CoT) enables a fixed single-layer transformer to efficiently approximate, in-context, the training process of an $N$-layer feed-forward network (FFN). Since FFNs are universal approximators, this result provides strong theoretical evidence for the expressive power of CoT. Specifically, we improve the computational cost of the prior best in-context result [Wu et al., ICML 2025] by a factor of $O(N)$. Exploiting the recursive nature of CoT, we reuse a single-layer transformer autoregressively, rather than stacking identical transformer blocks, to perform multiple In-Context Gradient Descent (ICGD) updates. The key novelty is a dynamic-masking scheme: at each CoT step, the attention heads attend only to the tokens needed to compute the current forward or backward-pass update. This selective reuse contrasts with earlier ICGD proofs for neural network optimization, which must carry all information through every layer simply because a later gradient update might need it. Our numerical validations back up our theory.
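To make the autoregressive-reuse idea concrete, here is a minimal sketch (not the paper's construction): one attention layer with fixed weights is applied repeatedly, and a per-step mask hides all tokens except those the current step needs. All names (single_layer_attention, cot_rollout, visible_fn) and the toy visibility rule are hypothetical illustrations.

```python
# Sketch of dynamic masking with a reused single-layer attention module.
# Assumption: each CoT step appends one token and only needs a subset of
# the existing tokens; the real paper's masks encode forward/backward steps.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_layer_attention(tokens, Wq, Wk, Wv, mask):
    # tokens: (T, d); mask: (T,) boolean, True = visible to the current step.
    q = tokens[-1] @ Wq                       # query from the latest token
    k = tokens @ Wk
    v = tokens @ Wv
    scores = k @ q / np.sqrt(k.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # dynamic masking: hide unneeded tokens
    return softmax(scores) @ v

def cot_rollout(prompt_tokens, Wq, Wk, Wv, n_steps, visible_fn):
    # Reuse the SAME weights at every step instead of stacking layers.
    tokens = prompt_tokens.copy()
    for step in range(n_steps):
        mask = visible_fn(step, tokens.shape[0])   # which tokens this step may see
        new_token = single_layer_attention(tokens, Wq, Wk, Wv, mask)
        tokens = np.vstack([tokens, new_token])    # append the CoT token autoregressively
    return tokens

# Toy usage: a hypothetical rule where each step sees only the last 3 tokens.
d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
prompt = rng.normal(size=(5, d))
out = cot_rollout(prompt, Wq, Wk, Wv, n_steps=4,
                  visible_fn=lambda step, T: np.arange(T) >= T - 3)
print(out.shape)  # (9, 8): 5 prompt tokens + 4 CoT tokens
```

The point of the sketch is the control flow: depth comes from the number of autoregressive steps with step-dependent masks, not from the number of distinct layers.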