Memory Savings at What Cost? A Study of Alternatives to Backpropagation
Kunjal Panchal ⋅ Sunav Choudhary ⋅ Yuriy Brun ⋅ Hui Guan
Abstract
Forward-mode automatic differentiation (FmAD) and zero-order (ZO) optimization are increasingly proposed as memory-efficient, backpropagation-free alternatives for large language model (LLM) fine-tuning, yet their benefits are typically evaluated only against standard backpropagation (BP), omitting memory-efficient variants such as activation checkpointing. We present a unified theoretical and empirical comparison of BP, checkpointed BP, FmAD, and ZO for LLM and vision-language model training. While FmAD and ZO reduce activation memory, they trade it for higher computational cost and longer wall-clock time to convergence, yielding lower accuracy and slower training, especially under constrained perturbation budgets. Across models, BP with checkpointing outperforms FmAD and ZO variants, including variance-reduced methods, achieving up to 31.1% higher accuracy, 34.8% faster convergence, and 3.8$\times$ fewer computations at comparable memory usage; our experiments also expose instability-related failure modes in FmAD and ZO. Overall, our results correct a one-sided benchmarking narrative by showing that memory-efficient methods entail fundamentally different trade-offs, and that ignoring these distinctions has led to misleading conclusions about LLM optimization in prior work.
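To make the distinction between the compared estimators concrete, the following is a minimal sketch (not the paper's code) of the two BP-free gradient estimates the abstract refers to, contrasted with the exact reverse-mode gradient. It assumes JAX; the toy loss function `loss` and parameter vector `theta` are hypothetical stand-ins for a fine-tuning objective and model parameters.

```python
import jax
import jax.numpy as jnp

def loss(theta):
    # Toy quadratic loss standing in for an LLM fine-tuning objective.
    return jnp.sum((theta - 1.0) ** 2)

key = jax.random.PRNGKey(0)
theta = jnp.zeros(4)

# Forward-mode AD (FmAD): a JVP gives the directional derivative along a
# random tangent v; scaling v by that scalar yields an unbiased gradient
# estimate without storing activations for a backward pass.
v = jax.random.normal(key, theta.shape)
_, dir_deriv = jax.jvp(loss, (theta,), (v,))
fmad_grad_estimate = dir_deriv * v

# Zero-order (ZO): a two-sided finite difference along the same direction
# approximates the directional derivative using only function evaluations.
eps = 1e-3
zo_dir_deriv = (loss(theta + eps * v) - loss(theta - eps * v)) / (2 * eps)
zo_grad_estimate = zo_dir_deriv * v

# Exact gradient from reverse-mode AD (backpropagation), for comparison.
bp_grad = jax.grad(loss)(theta)
print(fmad_grad_estimate, zo_grad_estimate, bp_grad)
```

Both BP-free estimates are noisy rank-one projections of the true gradient; averaging over more random directions (a larger perturbation budget) reduces variance but multiplies the number of forward passes, which is the memory-versus-compute trade-off the paper quantifies.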