The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study
Daniel Park · Jascha Sohl-Dickstein · Quoc Le · Samuel L Smith

Tue Jun 11th 04:35 -- 04:40 PM @ Hall B

We investigate how the behavior of stochastic gradient descent is influenced by model size. By studying families of models obtained by increasing the number of channels in a base network, we examine how the optimal hyperparameters---the batch size and learning rate at which the test error is minimized---correlate with the network width. We find that the optimal "normalized noise scale," which we define to be a function of the batch size, learning rate and the initialization conditions, is proportional to the number of channels (in the absence of batch normalization). This conclusion holds for MLPs, ConvNets and ResNets. A surprising consequence is that if we wish to maintain optimal performance as the network width increases, we must use increasingly small batch sizes. Based on our experiments, we also conjecture that there may be a critical width, beyond which the optimal performance of networks trained with constant SGD ceases to improve unless additional regularization is introduced.
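The scaling described above can be illustrated with a small sketch. Prior work (Smith & Le) defines the SGD noise scale as g = εN/B, for learning rate ε, training-set size N and batch size B; the paper's "normalized noise scale" additionally folds in the initialization conditions, and its exact form is not given in this abstract. The snippet below is therefore an illustrative assumption, not the authors' definition: it shows only how, if the optimal (normalized) noise scale grows in proportion to width, the optimal batch size at fixed learning rate must shrink as the network widens.

```python
def noise_scale(learning_rate: float, train_size: int, batch_size: int) -> float:
    """Classic SGD noise scale g = epsilon * N / B (Smith & Le)."""
    return learning_rate * train_size / batch_size


def batch_size_for_noise_scale(g: float, learning_rate: float, train_size: int) -> float:
    """Batch size needed to realize a target noise scale g at fixed learning rate."""
    return learning_rate * train_size / g


# Hypothetical numbers for illustration (e.g. a CIFAR-sized training set).
lr, n_train = 0.1, 50_000
base_g = noise_scale(lr, n_train, batch_size=128)

# If the optimal noise scale is proportional to width (the paper's finding),
# doubling the width doubles the target g, which halves the required batch
# size -- the abstract's "increasingly small batch sizes".
for width_multiplier in (1, 2, 4):
    target_g = base_g * width_multiplier
    b = batch_size_for_noise_scale(target_g, lr, n_train)
    print(f"width x{width_multiplier}: target g = {target_g:.1f}, batch size = {b:.0f}")
```

Running this prints batch sizes 128, 64 and 32 for width multipliers 1, 2 and 4, making the inverse relationship concrete.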

Author Information

Daniel Park (Google Brain)
Jascha Sohl-Dickstein (Google Brain)
Quoc Le (Google Brain)
Samuel L Smith (DeepMind)
