Gradient Smoothing: Coupling Layer-wise Updates for Improved Optimization
Abstract
Deep neural networks built from repeated blocks, such as transformers and ResNets, often develop closely related representational structure across layers over the course of training. Motivated by this observation, we introduce Gradient Smoothing, a general training paradigm that couples gradient updates across blocks and admits a natural interpretation as a preconditioning method. Our framework applies structured smoothing operators, such as weighted averages and exponential moving averages, to layer-wise updates with minimal computational overhead. We evaluate Gradient Smoothing across a range of architectures and training regimes, including reinforcement learning (RL) post-training of LLMs on reasoning tasks, as well as diffusion and classification with Vision Transformers. Across these settings, Gradient Smoothing consistently improves generalization performance while also promoting structured representation evolution across layers. These results suggest that Gradient Smoothing is a simple and broadly applicable technique for improving training in modern deep networks.
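To make the abstract's description concrete, the following is a minimal sketch (not the authors' implementation) of one variant mentioned above: coupling the gradients of corresponding parameters in repeated blocks via an exponential moving average taken along the depth dimension before the optimizer step. The function name, the `beta` coefficient, and the assumption that repeated blocks live in an `nn.ModuleList` with identical parameter names are illustrative choices, not details from the paper.

```python
# Hypothetical sketch of depth-wise EMA gradient smoothing across repeated blocks.
# Assumes loss.backward() has already populated .grad on every block parameter.
import torch
import torch.nn as nn


def ema_smooth_block_gradients(blocks: nn.ModuleList, beta: float = 0.5) -> None:
    """Replace each block's gradients with a depth-wise EMA of the gradients of
    the corresponding parameters in preceding blocks (shallow to deep)."""
    ema = None  # running EMA of per-parameter gradients, keyed by parameter name
    for block in blocks:
        grads = {n: p.grad for n, p in block.named_parameters() if p.grad is not None}
        if ema is None:
            ema = {n: g.clone() for n, g in grads.items()}
        else:
            for n, g in grads.items():
                ema[n] = beta * ema[n] + (1.0 - beta) * g
        # Write the smoothed gradient back so the optimizer sees the coupled update.
        for n, p in block.named_parameters():
            if p.grad is not None:
                p.grad.copy_(ema[n])


# Usage: call after loss.backward() and before optimizer.step(), on a model whose
# repeated blocks are stored in an nn.ModuleList of structurally identical blocks.
if __name__ == "__main__":
    blocks = nn.ModuleList(nn.Linear(8, 8) for _ in range(4))
    h = torch.randn(2, 8)
    for blk in blocks:
        h = blk(h)
    h.sum().backward()
    ema_smooth_block_gradients(blocks, beta=0.5)
```

Other smoothing operators described in the abstract, such as fixed weighted averages over neighboring blocks, would fit the same pattern: compute a structured combination of per-block gradients and write it back before the optimizer step.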