ICML Poster Online Learning and Information Exponents: The Importance of Batch size & Time/Complexity Tradeoffs


Luca Arnaboldi · Yatin Dandi · Florent Krzakala · Bruno Loureiro · Luca Pesce · Ludovic Stephan

Hall C 4-9 #1505


Abstract: We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as measured by its information exponents. We show that performing gradient updates with large batches $n_b \lesssim d^{\frac{\ell}{2}}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned and $d$ is the input dimension. However, batch sizes beyond this threshold, $n_b \gg d^{\frac{\ell}{2}}$, are detrimental to the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, *Correlation loss SGD*, which suppresses the auto-correlation terms in the loss function. We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments.
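To make the training protocol in the abstract concrete, here is a minimal sketch of one-pass SGD with a correlation loss. It is an illustration under simplifying assumptions, not the authors' implementation: it uses a single neuron rather than a two-layer network, a single-index target with link function $\mathrm{He}_3$ (information exponent 3), a `tanh` student activation, and a projection back onto the unit sphere after each step. All function names and parameters are hypothetical. The correlation loss is taken here to be $-y\,\sigma(\langle w, x\rangle)$, so its gradient differs from that of the squared loss exactly by the term in which the student interacts with itself.

```python
import numpy as np

def sample_batch(rng, w_star, n_b):
    """Draw n_b fresh isotropic Gaussian inputs with labels y = He_3(<w_star, x>)."""
    x = rng.standard_normal((n_b, w_star.shape[0]))
    z = x @ w_star
    y = z**3 - 3.0 * z  # He_3 link: illustrative target with information exponent 3
    return x, y

def grad_correlation(w, x, y):
    """Batch-averaged gradient of the correlation loss -y * tanh(<w, x>)."""
    pre = x @ w
    dact = 1.0 - np.tanh(pre) ** 2  # derivative of tanh
    return -(y * dact) @ x / len(y)

def grad_squared(w, x, y):
    """Batch-averaged gradient of 0.5 * (y - tanh(<w, x>))**2, for comparison."""
    pre = x @ w
    act = np.tanh(pre)
    dact = 1.0 - act**2
    return -((y - act) * dact) @ x / len(y)

def one_pass_sgd(w0, w_star, n_b, steps, lr, seed=0):
    """Run `steps` updates, drawing a fresh batch of size n_b each time (one pass)."""
    rng = np.random.default_rng(seed)
    w = w0 / np.linalg.norm(w0)
    overlaps = [float(w @ w_star)]
    for _ in range(steps):
        x, y = sample_batch(rng, w_star, n_b)  # every sample is used exactly once
        w = w - lr * grad_correlation(w, x, y)
        w = w / np.linalg.norm(w)  # spherical projection (an assumption of this sketch)
        overlaps.append(float(w @ w_star))
    return w, overlaps
```

In this sketch the total number of samples consumed is `n_b * steps`, which is the quantity the abstract refers to as the sample complexity, while `steps` plays the role of the iteration time $T$. The returned `overlaps` list records $\langle w, w^\star\rangle$ after each step, a low-dimensional summary of the kind one would compare against ODE predictions.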
