Workshop: Beyond first order methods in machine learning systems

Talk by Rio Yokota - Degree of Approximation and Overhead of Computing Curvature, Information, and Noise Matrices

Hessian, Fisher, and covariance matrices are used not only to precondition optimizers but also in generalization metrics, hyperparameter prediction, and Bayesian inference. These matrices contain valuable information that can advance statistical learning theory, but they are very expensive to compute exactly for modern deep neural networks with billions of parameters. We use a highly optimized implementation that computes these matrices at various degrees of approximation to close the gap between theory and practice. A hybrid data-parallel + model-parallel approach significantly reduces the overhead of computing them.
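
To make "degrees of approximation" concrete, here is a minimal sketch on a toy logistic regression model. It is not the implementation described in the talk. The function names (`curvature_matrices`, `diagonal`) and the data are hypothetical, chosen only for illustration. It computes the exact Hessian of the mean loss, the empirical Fisher (the average outer product of per-sample gradients), and a diagonal approximation that stores d numbers instead of d × d.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def outer(u, v):
    # Outer product of two vectors as a list of rows.
    return [[ui * vj for vj in v] for ui in u]

def mean_matrices(mats):
    # Element-wise average of a list of equally sized square matrices.
    n, d = len(mats), len(mats[0])
    return [[sum(m[i][j] for m in mats) / n for j in range(d)] for i in range(d)]

def curvature_matrices(X, y, w):
    """Return (hessian, empirical_fisher) of the mean logistic loss at w."""
    hess_terms, fisher_terms = [], []
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x_i)))
        # Per-sample Hessian of the logistic loss: p (1 - p) x x^T.
        hess_terms.append([[p * (1 - p) * v for v in row] for row in outer(x_i, x_i)])
        # Per-sample gradient (p - y) x; its outer product is the empirical Fisher term.
        g = [(p - y_i) * xj for xj in x_i]
        fisher_terms.append(outer(g, g))
    return mean_matrices(hess_terms), mean_matrices(fisher_terms)

def diagonal(m):
    # Diagonal approximation: drop every off-diagonal entry.
    return [m[i][i] for i in range(len(m))]

# Hypothetical toy data: a bias feature plus one input feature.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, 0.0]]
y = [1, 0, 1, 0]
w = [0.1, -0.2]

H, F = curvature_matrices(X, y, w)
H_diag = diagonal(H)
```

Even in this toy, the full matrices grow quadratically with the number of parameters while the diagonal grows linearly. That quadratic growth is what makes the exact matrices impractical for networks with billions of parameters.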
