Oral
Jumpout: Improved Dropout for Deep Neural Networks with ReLUs
Shengjie Wang · Tianyi Zhou · Jeff Bilmes

Tue Jun 11th 05:10 -- 05:15 PM @ Hall A

Dropout is a simple and effective way to improve the generalization performance of deep neural networks (DNNs) and prevent overfitting. This paper discusses three novel observations about dropout when applied to DNNs with rectified linear units (ReLUs): 1) dropout encourages each local linear model of a DNN to be trained on data points from nearby regions; 2) applying the same dropout rate to different layers can result in significantly different (effective) deactivation rates; and 3) when batch normalization is also used, the rescaling factor of dropout causes a normalization inconsistency between training and testing. These observations lead to three simple but nontrivial modifications of dropout, resulting in our proposed method, "jumpout." Jumpout samples the dropout rate from a monotone decreasing distribution (e.g., the right half of a Gaussian), so each local linear model is trained, with high probability, to work better for data points from nearby regions than for data points from more distant regions. Moreover, jumpout adaptively normalizes the dropout rate at each layer and for every training batch, so the effective deactivation rate applied to the activated neurons is kept the same. Furthermore, it rescales the outputs for a better trade-off that keeps both the variance and mean of neurons more consistent between training and test phases, thereby mitigating the incompatibility between dropout and batch normalization. Jumpout shows significantly improved performance on CIFAR10, CIFAR100, Fashion-MNIST, STL10, SVHN, ImageNet-1k, etc., while introducing negligible additional memory and computation costs.
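The three modifications described in the abstract can be illustrated with a minimal NumPy sketch. This is one plausible reading of the abstract, not the authors' implementation. The function names, the parameters `sigma`, `p_max`, and `exponent`, the choice to normalize by dividing the rate by the fraction of active neurons, and the clipping to `p_max` are all assumptions made for illustration. The rescaling relies on a simple fact: with keep probability 1 - p, a factor of (1 - p)^-1 preserves the mean of a neuron and a factor of (1 - p)^-0.5 preserves its second moment, so an exponent between 0.5 and 1 trades one against the other. The default of 0.75 is just the midpoint and is assumed here.

```python
import numpy as np


def sample_rate(rng, sigma=0.1, p_max=0.5):
    """Draw a dropout rate from the right half of a zero-mean Gaussian.

    Small rates are the most likely, larger ones progressively rarer.
    sigma and p_max are hypothetical hyperparameters (assumed, p_max < 1).
    """
    return min(abs(rng.normal(0.0, sigma)), p_max)


def jumpout_sketch(h, rng, sigma=0.1, p_max=0.5, exponent=0.75, training=True):
    """Apply a jumpout-style mask to post-ReLU activations h (any shape).

    Assumption-laden sketch: not the reference implementation.
    """
    h = np.asarray(h, dtype=float)
    if not training:
        # At test time the layer is left untouched.
        return h

    # 1) Sample the rate from a monotone decreasing distribution.
    p = sample_rate(rng, sigma, p_max)

    # 2) Normalize by the fraction of activated (positive) neurons in this
    #    batch, so the rate seen by active neurons is comparable across
    #    layers. Clipping to p_max is an assumed safeguard.
    active_frac = np.mean(h > 0)
    if active_frac > 0:
        p = min(p / active_frac, p_max)

    # 3) Mask, then rescale with an exponent between 0.5 (second moment
    #    preserved) and 1.0 (mean preserved).
    keep = rng.random(h.shape) >= p
    return h * keep / (1.0 - p) ** exponent
```

A call such as `jumpout_sketch(np.maximum(x, 0.0), np.random.default_rng(0))` applies the sketch to a batch of ReLU outputs `x`. Because the rate is resampled on every call, most batches see light dropout and only occasional batches see heavy dropout, which is the behavior the abstract attributes to the monotone decreasing distribution.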