
The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study
Daniel Park · Jascha Sohl-Dickstein · Quoc Le · Samuel Smith

Tue Jun 11 06:30 PM -- 09:00 PM (PDT) @ Pacific Ballroom #55

We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.
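
The last sentence of the abstract can be illustrated with a rough sketch. The abstract does not give the formula for the normalized noise scale, so the code below assumes, purely for illustration, that the noise scale is the learning rate divided by the batch size, that its optimal value is a hypothetical constant `c` times the width, and that the learning rate cannot exceed a hypothetical cap `lr_max`. The function name and all numeric values are made up for the example and are not taken from the paper.

```python
def max_batch_size(width, lr_max, c):
    """Largest batch size compatible with the optimal noise scale.

    Assumptions (illustrative, not from the abstract):
      - noise scale g = learning_rate / batch_size
      - optimal noise scale g_opt = c * width
      - learning_rate <= lr_max (largest stable learning rate)
    """
    g_opt = c * width       # optimal noise scale grows linearly with width
    return lr_max / g_opt   # solve g_opt = lr / B for B with lr at its cap


# Hypothetical values: doubling the width halves the largest usable batch size.
for width in (64, 128, 256):
    print(width, max_batch_size(width, lr_max=1.0, c=1e-4))
```

Under these assumptions the largest usable batch size scales as one over the width, which matches the direction of the trend the abstract reports.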

Author Information

Daniel Park (Google Brain)
Jascha Sohl-Dickstein (Google Brain)
Quoc Le (Google Brain)
Samuel Smith (DeepMind)
