Large language models like GPT-4 exhibit emergent general-purpose capabilities, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how small transformers, trained from scratch, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that include intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges. Our work highlights the importance of high-quality, instructive data that considers the particular characteristics of the next-word prediction objective for rapidly eliciting arithmetic capabilities.
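To make the data-formatting idea concrete, the sketch below generates addition samples in three styles: a conventional "a+b=c" string, a reversed-output string (answer digits written least-significant first, matching the order in which carries are resolved), and a chain-of-thought style string with intermediate per-digit results. This is a minimal illustration, not the paper's exact templates; the function names and formats are assumptions.

```python
import random

def plain_sample(a: int, b: int) -> str:
    # Conventional format: result written most-significant digit first.
    return f"{a}+{b}={a + b}"

def reversed_sample(a: int, b: int) -> str:
    # Reversed-output format: result digits written least-significant first,
    # so the model can emit each digit as soon as its carry is known.
    return f"{a}+{b}={str(a + b)[::-1]}"

def cot_sample(a: int, b: int) -> str:
    # Chain-of-thought style format (illustrative template): spell out the
    # per-digit sums and carries before emitting the final answer.
    steps, carry = [], 0
    da, db = str(a)[::-1], str(b)[::-1]
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        s = x + y + carry
        steps.append(f"{x}+{y}+{carry}={s} (digit {s % 10}, carry {s // 10})")
        carry = s // 10
    if carry:
        steps.append(f"leading digit {carry}")
    return f"{a}+{b}: " + ", ".join(steps) + f" => {a + b}"

if __name__ == "__main__":
    a, b = random.randint(0, 999), random.randint(0, 999)
    print(plain_sample(a, b))
    print(reversed_sample(a, b))
    print(cot_sample(a, b))
```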
Author Information
Nayoung Lee (University of Wisconsin-Madison)
Kartik Sreenivasan (University of Wisconsin-Madison)
Jason Lee (Princeton University)
Kangwook Lee (UW Madison, KRAFTON AI)
Dimitris Papailiopoulos (University of Wisconsin-Madison)
More from the Same Authors
- 2021 : Empirical Study on the Effective VC Dimension of Low-rank Neural Networks »
  Daewon Seo · Hongyi Wang · Dimitris Papailiopoulos · Kangwook Lee
- 2023 : Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding »
  Seongjun Yang · Gibbeum Lee · Jaewoong Cho · Dimitris Papailiopoulos · Kangwook Lee
- 2023 : Looped Transformers are Better at Learning Learning Algorithms »
  Liu Yang · Kangwook Lee · Robert Nowak · Dimitris Papailiopoulos
- 2023 : Scaling In-Context Demonstrations with Structured Attention »
  Tianle Cai · Kaixuan Huang · Jason Lee · Mengdi Wang · Danqi Chen
- 2023 : Fine-Tuning Language Models with Just Forward Passes »
  Sadhika Malladi · Tianyu Gao · Eshaan Nichani · Jason Lee · Danqi Chen · Sanjeev Arora
- 2023 : Reward Collapse in Aligning Large Language Models: A Prompt-Aware Approach to Preference Rankings »
  Ziang Song · Tianle Cai · Jason Lee · Weijie Su
- 2023 : Provable Offline Reinforcement Learning with Human Feedback »
  Wenhao Zhan · Masatoshi Uehara · Nathan Kallus · Jason Lee · Wen Sun
- 2023 : How to Query Human Feedback Efficiently in RL? »
  Wenhao Zhan · Masatoshi Uehara · Wen Sun · Jason Lee
- 2023 : 🎤 Fine-Tuning Language Models with Just Forward Passes »
  Sadhika Malladi · Tianyu Gao · Eshaan Nichani · Alex Damian · Jason Lee · Danqi Chen · Sanjeev Arora
- 2023 Poster: Efficient displacement convex optimization with particle gradient descent »
  Hadi Daneshmand · Jason Lee · Chi Jin
- 2023 Poster: Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning »
  Yulai Zhao · Zhuoran Yang · Zhaoran Wang · Jason Lee
- 2023 Poster: Computationally Efficient PAC RL in POMDPs with Latent Determinism and Conditional Embeddings »
  Masatoshi Uehara · Ayush Sekhari · Jason Lee · Nathan Kallus · Wen Sun
- 2023 Poster: Looped Transformers as Programmable Computers »
  Angeliki Giannou · Shashank Rajput · Jy-yong Sohn · Kangwook Lee · Jason Lee · Dimitris Papailiopoulos
- 2023 Poster: Transformers as Algorithms: Generalization and Stability in In-context Learning »
  Yingcong Li · Muhammed Ildiz · Dimitris Papailiopoulos · Samet Oymak
- 2023 Poster: Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing »
  Jikai Jin · Zhiyuan Li · Kaifeng Lyu · Simon Du · Jason Lee
- 2022 Poster: GenLabel: Mixup Relabeling using Generative Models »
  Jy yong Sohn · Liang Shang · Hongxu Chen · Jaekyun Moon · Dimitris Papailiopoulos · Kangwook Lee
- 2022 Spotlight: GenLabel: Mixup Relabeling using Generative Models »
  Jy yong Sohn · Liang Shang · Hongxu Chen · Jaekyun Moon · Dimitris Papailiopoulos · Kangwook Lee
- 2021 : Dreaming of Federated Robustness: Inherent Barriers and Unavoidable Tradeoffs »
  Dimitris Papailiopoulos
- 2021 Poster: Coded-InvNet for Resilient Prediction Serving Systems »
  Tuan Dinh · Kangwook Lee
- 2021 Oral: Coded-InvNet for Resilient Prediction Serving Systems »
  Tuan Dinh · Kangwook Lee
- 2021 Poster: Discrete-Valued Latent Preference Matrix Estimation with Graph Side Information »
  Changhun Jo · Kangwook Lee
- 2021 Spotlight: Discrete-Valued Latent Preference Matrix Estimation with Graph Side Information »
  Changhun Jo · Kangwook Lee
- 2020 Poster: FR-Train: A Mutual Information-Based Approach to Fair and Robust Training »
  Yuji Roh · Kangwook Lee · Steven Whang · Changho Suh
- 2020 Poster: Closing the convergence gap of SGD without replacement »
  Shashank Rajput · Anant Gupta · Dimitris Papailiopoulos
- 2019 Workshop: Coding Theory For Large-scale Machine Learning »
  Viveck Cadambe · Pulkit Grover · Dimitris Papailiopoulos · Gauri Joshi
- 2019 Poster: Does Data Augmentation Lead to Positive Margin? »
  Shashank Rajput · Zhili Feng · Zachary Charles · Po-Ling Loh · Dimitris Papailiopoulos
- 2019 Oral: Does Data Augmentation Lead to Positive Margin? »
  Shashank Rajput · Zhili Feng · Zachary Charles · Po-Ling Loh · Dimitris Papailiopoulos
- 2018 Poster: DRACO: Byzantine-resilient Distributed Training via Redundant Gradients »
  Lingjiao Chen · Hongyi Wang · Zachary Charles · Dimitris Papailiopoulos
- 2018 Oral: DRACO: Byzantine-resilient Distributed Training via Redundant Gradients »
  Lingjiao Chen · Hongyi Wang · Zachary Charles · Dimitris Papailiopoulos
- 2018 Poster: Stability and Generalization of Learning Algorithms that Converge to Global Optima »
  Zachary Charles · Dimitris Papailiopoulos
- 2018 Oral: Stability and Generalization of Learning Algorithms that Converge to Global Optima »
  Zachary Charles · Dimitris Papailiopoulos