Session

Deep Learning (Theory) 2

Wed 11 July 8:00 - 8:20 PDT

Essentially No Barriers in Neural Network Energy Landscape

Felix Draxler · Kambis Veschgini · Manfred Salmhofer · Fred Hamprecht

Training neural networks involves finding minima of a high-dimensional non-convex loss function. Going beyond simple linear interpolation, we construct continuous paths between minima of recent neural network architectures on CIFAR10 and CIFAR100. Surprisingly, these paths are essentially flat in both the training and test landscapes. This implies that minima are perhaps best seen as points on a single connected manifold of low loss, rather than as the bottoms of distinct valleys.
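To make the path construction concrete, the sketch below relaxes a piecewise-linear path between two minima so that the highest loss along it decreases, in the spirit of elastic-band methods. It is a minimal illustration, not the authors' implementation; loss_fn, grad_fn, and all hyperparameters are assumed placeholders that operate on flattened parameter vectors.

import numpy as np

def relax_path(theta_a, theta_b, loss_fn, grad_fn,
               n_pivots=8, steps=100, lr=1e-2, k_spring=1.0):
    # Start from the linear interpolation between the two minima.
    ts = np.linspace(0.0, 1.0, n_pivots + 2)[1:-1]
    path = [theta_a] + [theta_a + t * (theta_b - theta_a) for t in ts] + [theta_b]
    for _ in range(steps):
        for i in range(1, len(path) - 1):
            # Elastic-band-style update: descend the loss while a spring
            # force keeps the pivots spread out along the path.
            g_loss = grad_fn(path[i])
            spring = k_spring * (path[i - 1] + path[i + 1] - 2.0 * path[i])
            path[i] = path[i] - lr * g_loss + lr * spring
    barrier = max(loss_fn(p) for p in path)  # height of the relaxed path
    return path, barrier

A path whose barrier stays at the level of the endpoint losses is the "essentially flat" behavior described in the abstract.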

Wed 11 July 8:20 - 8:30 PDT

Comparing Dynamics: Deep Neural Networks versus Glassy Systems

Marco Baity-Jesi · Levent Sagun · Mario Geiger · Stefano Spigler · Gerard Arous · Chiara Cammarota · Yann LeCun · Matthieu Wyart · Giulio Biroli

We analyze numerically the training dynamics of deep neural networks (DNNs) using methods developed in the statistical physics of glassy systems. The two main issues we address are the complexity of the loss landscape and of the dynamics within it, and the extent to which DNNs share similarities with glassy systems. Our findings, obtained for different architectures and datasets, suggest that during training the dynamics slows down because of an increasingly large number of flat directions. At large times, when the loss approaches zero, the system diffuses at the bottom of the landscape. Despite some similarities with the dynamics of mean-field glassy systems, in particular the absence of barrier crossing, we find distinctive dynamical behaviors in the two cases, showing that the statistical properties of the corresponding loss and energy landscapes are different. In contrast, when the network is under-parametrized we observe typical glassy behavior, suggesting the existence of different phases depending on whether the network is under-parametrized or over-parametrized.
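The abstract does not spell out the observables, but a standard diagnostic from the glass literature for "diffusion at the bottom of the landscape" is the two-time mean-squared displacement of the weights. A minimal sketch, assuming flattened weight snapshots are recorded at regular intervals during training (the helper name is hypothetical):

import numpy as np

def mean_squared_displacement(snapshots, t_w):
    # snapshots: list of flattened weight vectors recorded during training;
    # t_w: index of the "waiting time" used as the reference configuration.
    w_ref = snapshots[t_w]
    return np.array([np.mean((w - w_ref) ** 2) for w in snapshots[t_w:]])

Comparing this quantity for several waiting times t_w exposes aging and long-time diffusion, the glassy signatures discussed above.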

Wed 11 July 8:30 - 8:40 PDT

Not All Samples Are Created Equal: Deep Learning with Importance Sampling

Angelos Katharopoulos · Francois Fleuret

Deep neural network training spends most of its computation on examples that are already handled correctly and could therefore be ignored. We propose to mitigate this phenomenon with a principled importance sampling scheme that focuses computation on "informative" examples and reduces the variance of the stochastic gradients during training. Our contribution is twofold: first, we derive a tractable upper bound on the per-sample gradient norm, and second, we derive an estimator of the variance reduction achieved with importance sampling, which enables us to switch it on only when it will result in an actual speedup. The resulting scheme can be used by changing a few lines of code in a standard SGD procedure, and we demonstrate experimentally on image classification, CNN fine-tuning, and RNN training that, for a fixed wall-clock time budget, it reduces the training loss by up to an order of magnitude and improves test error by 5% to 17% in relative terms.
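A hedged sketch of the general recipe, not the authors' implementation: score a large candidate batch with a cheap per-sample bound on the gradient norm, sample a smaller batch proportionally to those scores, and reweight by 1/(N p_i) so the gradient estimate stays unbiased. The cross-entropy-specific score used here, the norm of softmax(z) minus the one-hot label, is a stand-in for the paper's upper bound.

import torch
import torch.nn.functional as F

def importance_sampled_step(model, optimizer, x_big, y_big, batch_size):
    with torch.no_grad():
        logits = model(x_big)
        # Cheap proxy for the per-sample gradient norm: the gradient of the
        # cross-entropy loss with respect to the logits.
        probs = F.softmax(logits, dim=1)
        one_hot = F.one_hot(y_big, probs.shape[1]).float()
        scores = (probs - one_hot).norm(dim=1) + 1e-8
        p = scores / scores.sum()
    idx = torch.multinomial(p, batch_size, replacement=True)
    weights = 1.0 / (len(x_big) * p[idx])  # importance weights keep the estimate unbiased
    optimizer.zero_grad()
    losses = F.cross_entropy(model(x_big[idx]), y_big[idx], reduction="none")
    (weights * losses).mean().backward()
    optimizer.step()

In the paper, an additional estimator of the achieved variance reduction decides when to switch this sampler on; otherwise a plain uniform minibatch is used.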

Wed 11 July 8:40 - 8:50 PDT

Learning Deep ResNet Blocks Sequentially using Boosting Theory

Furong Huang · Jordan Ash · John Langford · Robert Schapire

We prove a multi-channel telescoping sum boosting theory for ResNet architectures, which simultaneously creates a new technique for boosting over features (in contrast with labels) and provides a new algorithm for ResNet-style architectures. Our proposed training algorithm, BoostResNet, is particularly suitable for non-differentiable architectures. Our method only requires the relatively inexpensive sequential training of T "shallow ResNets". We prove that the training error decays exponentially with the depth T if the weak module classifiers that we train perform slightly better than some weak baseline. In other words, we propose a weak learning condition and prove a boosting theory for ResNet under that condition. A generalization error bound based on margin theory is proved and suggests that ResNet could be resistant to overfitting when using a network with l_1-norm-bounded weights.
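The sketch below shows only the "train T shallow ResNets one at a time" structure suggested above; the full BoostResNet algorithm additionally maintains boosting-style example weights and the telescoping-sum weak module classifiers from the theory. The residual blocks, the shared feature width feat_dim, the auxiliary linear head, and a loader yielding feature vectors of that width are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def train_blocks_sequentially(blocks, feat_dim, n_classes, loader, epochs=1, lr=1e-3):
    trained = []
    for block in blocks:
        head = nn.Linear(feat_dim, n_classes)  # auxiliary classifier for this stage
        opt = torch.optim.Adam(list(block.parameters()) + list(head.parameters()), lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                with torch.no_grad():          # earlier blocks stay frozen
                    h = x
                    for b in trained:
                        h = h + b(h)           # residual connections of trained blocks
                out = head(h + block(h))       # only the new block and head receive gradients
                loss = F.cross_entropy(out, y)
                opt.zero_grad()
                loss.backward()
                opt.step()
        trained.append(block)
    return trained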

Wed 11 July 8:50 - 9:00 PDT

An Optimal Control Approach to Deep Learning and Applications to Discrete-Weight Neural Networks

Qianxiao Li · Shuji Hao

Deep learning is formulated as a discrete-time optimal control problem. This allows one to characterize necessary conditions for optimality and develop training algorithms that do not rely on gradients with respect to the trainable parameters. In particular, we introduce the discrete-time method of successive approximations (MSA), which is based on Pontryagin's maximum principle, for training neural networks. A rigorous error estimate for the discrete MSA is obtained, which sheds light on its dynamics and on the means to stabilize the algorithm. The developed methods are applied to train, in a rather principled way, neural networks with weights constrained to take values in a discrete set. We obtain competitive performance and, interestingly, very sparse weights in the case of ternary networks, which may be useful for model deployment on low-memory devices.
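A minimal sketch of one MSA iteration under the discrete-time Pontryagin maximum principle, for a layered model x_{t+1} = f_t(x_t, theta_t) with terminal loss Phi(x_T) and no running cost. The layer interface (forward map, Jacobian with respect to the state, and a finite set of candidate parameter values) is an assumed illustration, not the paper's code.

import numpy as np

def msa_step(layers, thetas, x0, dPhi):
    T = len(layers)
    # 1. Forward pass with the current parameters.
    xs = [x0]
    for t in range(T):
        xs.append(layers[t].forward(xs[t], thetas[t]))
    # 2. Backward pass for the costates p_t, with terminal condition from the loss.
    ps = [None] * (T + 1)
    ps[T] = -dPhi(xs[T])
    for t in reversed(range(T)):
        ps[t] = layers[t].dx(xs[t], thetas[t]).T @ ps[t + 1]
    # 3. Per-layer Hamiltonian maximization, H_t = p_{t+1} . f_t(x_t, theta).
    #    For weights restricted to a small discrete set (e.g. ternary), the
    #    argmax can be enumerated; for layers linear in the parameters it
    #    decomposes entrywise.
    new_thetas = []
    for t in range(T):
        new_thetas.append(max(layers[t].candidates(),
                              key=lambda th: ps[t + 1] @ layers[t].forward(xs[t], th)))
    return new_thetas

The error estimate mentioned in the abstract informs how to stabilize this basic iteration; the sketch shows only the unregularized update.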