Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Zhiyuan Li · Hong Liu · Denny Zhou · Tengyu Ma
Generating a sequence of intermediate steps, \emph{a.k.a.}, a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Formally, given input length $n$, we show that constant-depth transformers can solve $\mathsf{NC}^1$-complete problems, such as the word problem of $S_5$, provided with $O(n)$ steps of CoT and $\mathsf{poly}(n)$ embedding size. We further show that constant-depth transformers can solve any problem in $\mathsf{P/poly}$ provided with $O(\mathsf{poly}(n))$ steps of CoT and $O(\log(n))$ embedding size. In contrast, it has been shown (Liu et al., 2022) that constant-depth transformers without CoT can only solve problems in $\mathsf{TC}^0$. Under the unproven but widely believed assumption that $\mathsf{TC}^0 \subsetneq \mathsf{NC}^1 \subsetneq \mathsf{P/poly}$, allowing a longer chain of thought fundamentally increases the expressiveness of transformers. Empirically, enabling CoT dramatically improves accuracy on tasks that are hard to parallelize, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.
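To make the word problem of $S_5$ concrete, here is a minimal sketch (not from the paper's code) of the task: given a word, i.e. a sequence of permutations of $\{0,\dots,4\}$, decide whether their composition is the identity. Folding the running product left to right is exactly the kind of inherently serial, step-by-step computation that a chain of thought lets the model emulate, one CoT step per token.

```python
def compose(p, q):
    """Return the permutation p∘q, i.e. (p∘q)(i) = p[q[i]]."""
    return tuple(p[i] for i in q)

def word_problem(perms):
    """Serially fold a word over S_5; True iff the product is the identity."""
    identity = (0, 1, 2, 3, 4)
    product = identity
    for p in perms:
        # One serial step per input symbol -- analogous to one CoT step.
        product = compose(product, p)
    return product == identity

# A transposition composed with itself is the identity:
swap01 = (1, 0, 2, 3, 4)
print(word_problem([swap01, swap01]))  # True
print(word_problem([swap01]))          # False
```

Because $S_5$ is non-solvable, no fixed-depth parallel shortcut is believed to exist for this fold (the problem is $\mathsf{NC}^1$-complete), which is what makes it a natural separator between transformers with and without CoT.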
Author Information
Zhiyuan Li (Stanford University)
Hong Liu (Stanford University)
Denny Zhou (Google DeepMind)
Tengyu Ma (Stanford University)
More from the Same Authors
-
2021 : Label Noise SGD Provably Prefers Flat Global Minimizers »
Alex Damian · Tengyu Ma · Jason Lee -
2021 : Value-Based Deep Reinforcement Learning Requires Explicit Regularization »
Aviral Kumar · Rishabh Agarwal · Aaron Courville · Tengyu Ma · George Tucker · Sergey Levine -
2021 : Learning Barrier Certificates: Towards Safe Reinforcement Learning with Zero Training-time Violations »
Yuping Luo · Tengyu Ma -
2023 : The Marginal Value of Momentum for Small Learning Rate SGD »
Runzhe Wang · Sadhika Malladi · Tianhao Wang · Kaifeng Lyu · Zhiyuan Li -
2023 : Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization »
Kaiyue Wen · Tengyu Ma · Zhiyuan Li -
2023 : DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining »
Sang Michael Xie · Hieu Pham · Xuanyi Dong · Nan Du · Hanxiao Liu · Yifeng Lu · Percy Liang · Quoc Le · Tengyu Ma · Adams Wei Yu -
2023 : Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training »
Hong Liu · Zhiyuan Li · David Hall · Percy Liang · Tengyu Ma -
2023 Oral: Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models »
Hong Liu · Sang Michael Xie · Zhiyuan Li · Tengyu Ma -
2023 Poster: Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models »
Hong Liu · Sang Michael Xie · Zhiyuan Li · Tengyu Ma -
2023 Tutorial: Recent Advances in the Generalization Theory of Neural Networks * »
Tengyu Ma · Alex Damian -
2022 : Implicit Bias of Gradient Descent on Reparametrized Models: On Equivalence to Mirror Descent »
Zhiyuan Li · Tianhao Wang · Jason Lee · Sanjeev Arora -
2022 Poster: Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification »
Ling Pan · Longbo Huang · Tengyu Ma · Huazhe Xu -
2022 Poster: Near-Optimal Algorithms for Autonomous Exploration and Multi-Goal Stochastic Shortest Path »
Haoyuan Cai · Tengyu Ma · Simon Du -
2022 Spotlight: Near-Optimal Algorithms for Autonomous Exploration and Multi-Goal Stochastic Shortest Path »
Haoyuan Cai · Tengyu Ma · Simon Du -
2022 Spotlight: Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification »
Ling Pan · Longbo Huang · Tengyu Ma · Huazhe Xu -
2022 Poster: Robust Training of Neural Networks Using Scale Invariant Architectures »
Zhiyuan Li · Srinadh Bhojanapalli · Manzil Zaheer · Sashank Jakkam Reddi · Sanjiv Kumar -
2022 Oral: Robust Training of Neural Networks Using Scale Invariant Architectures »
Zhiyuan Li · Srinadh Bhojanapalli · Manzil Zaheer · Sashank Jakkam Reddi · Sanjiv Kumar -
2022 Poster: Understanding Gradient Descent on the Edge of Stability in Deep Learning »
Sanjeev Arora · Zhiyuan Li · Abhishek Panigrahi -
2022 Spotlight: Understanding Gradient Descent on the Edge of Stability in Deep Learning »
Sanjeev Arora · Zhiyuan Li · Abhishek Panigrahi -
2021 Poster: Risk Bounds and Rademacher Complexity in Batch Reinforcement Learning »
Yaqi Duan · Chi Jin · Zhiyuan Li -
2021 Spotlight: Risk Bounds and Rademacher Complexity in Batch Reinforcement Learning »
Yaqi Duan · Chi Jin · Zhiyuan Li -
2020 Poster: On the Expressivity of Neural Networks for Deep Reinforcement Learning »
Kefan Dong · Yuping Luo · Tianhe (Kevin) Yu · Chelsea Finn · Tengyu Ma -
2020 Poster: The Implicit and Explicit Regularization Effects of Dropout »
Colin Wei · Sham Kakade · Tengyu Ma -
2020 Poster: Individual Calibration with Randomized Forecasting »
Shengjia Zhao · Tengyu Ma · Stefano Ermon -
2020 Poster: Understanding Self-Training for Gradual Domain Adaptation »
Ananya Kumar · Tengyu Ma · Percy Liang -
2019 Poster: Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks »
Sanjeev Arora · Simon Du · Wei Hu · Zhiyuan Li · Ruosong Wang -
2019 Oral: Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks »
Sanjeev Arora · Simon Du · Wei Hu · Zhiyuan Li · Ruosong Wang