
Principled Approaches to Deep Learning
Andrzej Pronobis · Robert Gens · Sham Kakade · Pedro Domingos

Wed Aug 09 03:30 PM -- 12:30 AM (PDT) @ C4.5
Event URL: http://padl.ws

The recent advancements in deep learning have revolutionized the field of machine learning, enabling unparalleled performance and many new real-world applications. Yet, the developments that led to this success have often been driven by empirical studies, and little is known about the theory behind some of the most successful approaches. While theoretically well-founded deep learning architectures have been proposed in the past, they came at the price of increased complexity and reduced tractability. Recently, we have witnessed considerable interest in principled deep learning. This has led to a better theoretical understanding of existing architectures as well as the development of more mature deep models with solid theoretical foundations. In this workshop, we intend to review the state of those developments and provide a platform for the exchange of ideas between theoreticians and practitioners in the growing deep learning community. Through a series of invited talks by experts in the field, contributed presentations, and an interactive panel discussion, the workshop will cover recent theoretical developments, provide an overview of promising and mature architectures, highlight their challenges and unique benefits, and present the most exciting recent results.

Wed 3:30 p.m. - 3:45 p.m.
Welcome and Opening Remarks (Talk)
Wed 3:45 p.m. - 4:15 p.m.

Do GANs Actually Learn the Distribution? Some Theory and Empirics

The Generative Adversarial Nets (GANs) framework (Goodfellow et al., 2014) for learning distributions differs from older ideas such as autoencoders and deep Boltzmann machines in that it scores the generated distribution using a discriminator net instead of a perplexity-like calculation. It appears to work well in practice; e.g., the generated images look better than those from older techniques. But how well do these nets learn the target distribution?

Our Paper 1 (ICML 2017) shows that GAN training may not have good generalization properties; e.g., training may appear successful, but the trained distribution may be far from the target distribution in standard metrics. We show theoretically that this can happen even though the two-player game between discriminator and generator is in near-equilibrium, where the generator appears to have "won" (with respect to natural training objectives).

Paper 2 (arXiv, June 26) empirically tests whether this lack of generalization occurs in real-life training. The paper introduces a new quantitative test for the diversity of a distribution based on the famous birthday paradox. This test reveals that distributions learned by some leading GAN techniques have fairly small support (i.e., suffer from mode collapse), which implies that they are far from the target distribution.

Paper 1: "Equilibrium and Generalization in GANs" by Arora, Ge, Liang, Ma, Zhang. (ICML 2017)

Paper 2: "Do GANs actually learn the distribution? An empirical study." by Arora and Zhang (https://arxiv.org/abs/1706.08224)
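The birthday-paradox test lends itself to a quick simulation. The sketch below is illustrative only (function names are made up, and a real test would look for near-duplicate images rather than exact integer matches): with a batch of size s drawn from a distribution of support N, a duplicate appears with probability roughly 1 - exp(-s²/2N), so frequent collisions at batch size s suggest a support of roughly s².

```python
import random

def has_collision(batch):
    """True if the batch contains an exact duplicate."""
    return len(set(batch)) < len(batch)

def collision_rate(support_size, batch_size, trials=2000, seed=0):
    """Fraction of batches drawn from a uniform distribution over
    `support_size` values that contain at least one duplicate."""
    rng = random.Random(seed)
    hits = sum(
        has_collision([rng.randrange(support_size) for _ in range(batch_size)])
        for _ in range(trials)
    )
    return hits / trials

# With batch_size ~ sqrt(support_size), duplicates are already likely,
# so frequent near-duplicates in a GAN's samples point to small support.
rate_small_support = collision_rate(support_size=400, batch_size=20)
rate_large_support = collision_rate(support_size=400_000, batch_size=20)
```

For support 400 and batches of 20 (s ≈ √N), duplicates show up in over a third of batches; with support 400,000 they almost never do, which is how the test distinguishes mode collapse from a genuinely diverse distribution.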

Wed 4:15 p.m. - 4:30 p.m.

Towards a Deeper Understanding of Training Quantized Neural Networks

Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, Tom Goldstein

Training neural networks with coarsely quantized weights is a key step towards learning on embedded platforms that have limited computing resources, memory capacity, and power consumption. Numerous recent publications have studied methods for training quantized networks, but these studies have been purely experimental. In this work, we investigate the theory of training quantized neural networks. We analyze the convergence properties of commonly used quantized training methods. We also show that training algorithms that exploit high-precision representations have an important annealing property that purely quantized training methods lack, which explains many of the observed empirical differences between these types of algorithms.
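The annealing property can be illustrated on a toy one-dimensional loss, assuming a BinaryConnect-style scheme in which a full-precision buffer is kept alongside the quantized weights (the quantizer, loss, and hyperparameters here are illustrative, not the paper's):

```python
import numpy as np

def quantize(w, step=0.1):
    """Round a weight to a coarse uniform grid (illustrative quantizer)."""
    return float(step * np.round(w / step))

def train(w0, grad, lr=0.02, iters=200, high_precision=True):
    """Toy gradient descent on a 1-D loss with quantized weights.

    high_precision=True keeps a full-precision buffer: gradients are taken
    at the quantized weight but accumulate in the float buffer, so many
    small steps can eventually flip the quantized value.
    high_precision=False rounds after every update, so any step smaller
    than half the grid spacing is erased and training stalls.
    """
    w = float(w0)
    for _ in range(iters):
        g = grad(quantize(w))
        if high_precision:
            w -= lr * g                  # update the float buffer
        else:
            w = quantize(w - lr * g)     # update lives on the grid
    return quantize(w)

grad = lambda w: 2.0 * (w - 0.37)        # loss (w - 0.37)^2, minimum off-grid
w_hp = train(1.0, grad, high_precision=True)
w_lp = train(1.0, grad, high_precision=False)
```

The purely quantized run never moves: each gradient step is smaller than half the grid spacing and rounds away. The high-precision buffer accumulates those same small steps and descends to the grid points nearest the minimum, mirroring the annealing behavior described in the abstract.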

Wed 4:30 p.m. - 5:00 p.m.

On the Beneficial Role of Dynamic Criticality and Chaos in Deep Learning

What does a generic deep function “look like” and how can we understand and exploit such knowledge to obtain practical benefits in deep learning? By combining Riemannian geometry with dynamic mean field theory, we show that generic nonlinear deep networks exhibit an order-to-chaos phase transition as synaptic weights vary from small to large. In the chaotic phase, deep networks acquire very high expressive power: measures of functional curvature and the ability to disentangle classification boundaries both grow exponentially with depth, but not with width. Moreover, we apply tools from free probability theory to study the propagation of error gradients through generic deep networks. We find, at the phase transition boundary between order and chaos, that not only the norms of gradients, but also angles between pairs of gradients are preserved even in infinitely deep sigmoidal networks with orthogonal weights. In contrast, ReLU networks do not enjoy such isometric propagation of gradients. In turn, this isometric propagation at the edge of chaos leads to training benefits, where very deep sigmoidal networks outperform ReLU networks, thereby pointing to a potential path to resurrecting saturating nonlinearities in deep learning.
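The order-to-chaos transition can be observed in a small Monte Carlo experiment, a numerical stand-in for the mean-field analysis (widths, depths, and the weight and bias scales below are illustrative choices, not the paper's):

```python
import numpy as np

def correlation_after_depth(sigma_w, sigma_b=0.3, depth=30, width=1000, seed=0):
    """Propagate two nearby inputs through a deep tanh network with i.i.d.
    Gaussian weights of scale sigma_w / sqrt(width) and biases of scale
    sigma_b, returning their final cosine similarity."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    y = x + 0.1 * rng.standard_normal(width)   # slightly perturbed copy
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
        b = rng.standard_normal(width) * sigma_b
        x, y = np.tanh(W @ x + b), np.tanh(W @ y + b)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Ordered phase: nearby inputs converge (correlation -> 1).
# Chaotic phase: nearby inputs decorrelate exponentially with depth.
c_ordered = correlation_after_depth(sigma_w=0.5)
c_chaotic = correlation_after_depth(sigma_w=2.5)
```

Small weight variance contracts the two trajectories onto each other, while large variance drives them apart; the expressive power discussed in the abstract lives on the chaotic side of this transition.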

Wed 5:00 p.m. - 5:45 p.m.
Coffee Break and Poster Session (Break)
Wed 5:45 p.m. - 6:15 p.m.

Neural Map: Structured Memory for Deep Reinforcement Learning

A critical component to enabling intelligent reasoning in partially observable environments is memory. Despite this importance, Deep Reinforcement Learning (DRL) agents have so far used relatively simple memory architectures, with the main methods to overcome partial observability being either a temporal convolution over the past k frames or an LSTM layer. In this talk, we will introduce a memory system with an adaptable write operator that is customized to the sorts of 3D environments that DRL agents typically interact with. This architecture, called the Neural Map, uses a spatially structured 2D memory image to learn to store arbitrary information about the environment over long time lags. We demonstrate empirically that the Neural Map surpasses previous DRL memories on a set of challenging 2D and 3D maze environments and show that it is capable of generalizing to environments that were not seen during training.

Joint work with Emilio Parisotto
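A rough sketch of the idea of a spatially structured memory, assuming a localized write at the agent's map position and an attention-based global read (the function name, blending rule, and shapes are illustrative, not the Neural Map's exact operators):

```python
import numpy as np

def neural_map_step(memory, pos, write_vec, query):
    """One read/write cycle on a spatially structured 2D memory.

    memory:    (H, W, C) array, one feature vector per map location
    pos:       (row, col) of the agent in map coordinates
    write_vec: C-dim vector to store at the agent's current cell
    query:     C-dim vector for a content-based (attention) global read
    """
    flat = memory.reshape(-1, memory.shape[-1])
    # Global read: softmax attention over all locations.
    scores = flat @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ flat
    # Local write: only the agent's current cell is updated, which ties
    # the memory to the spatial structure of the environment.
    new_memory = memory.copy()
    r, c = pos
    new_memory[r, c] = 0.5 * new_memory[r, c] + 0.5 * write_vec
    return new_memory, context

# Store a landmark feature at (1, 2), then retrieve it by content later.
mem = np.zeros((4, 4, 3))
mem, _ = neural_map_step(mem, pos=(1, 2), write_vec=np.array([1.0, 0, 0]),
                         query=np.zeros(3))
_, context = neural_map_step(mem, pos=(0, 0), write_vec=np.zeros(3),
                             query=np.array([1.0, 0, 0]))
```

Because writes are indexed by the agent's position, information persists at the map location where it was observed, regardless of how many steps have passed, which is what lets such a memory bridge long time lags.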

Wed 6:15 p.m. - 6:45 p.m.

The Sum-Product Theorem: A Foundation for Learning Tractable Deep Models

Inference in expressive probabilistic models is generally intractable, which makes them difficult to learn and limits their applicability. Sum-product networks are a class of deep models where, surprisingly, inference remains tractable even when an arbitrary number of hidden layers are present. In this talk, I generalize this result to a much broader set of learning problems: all those where inference consists of summing a function over a semiring. This includes satisfiability, constraint satisfaction, optimization, integration, and others. In any semiring, for summation to be tractable it suffices that the factors of every product have disjoint scopes. This unifies and extends many previous results in the literature. Enforcing this condition at learning time thus ensures that the learned models are tractable. I illustrate the power and generality of this approach by applying it to a new type of structured prediction problem: learning a nonconvex function that can be globally optimized in polynomial time. I show empirically that this greatly outperforms the standard approach of learning without regard to the cost of optimization. (Joint work with Abram Friesen)
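A minimal sketch of the semiring view: the same decomposable sum-product structure (children of every product have disjoint scopes) evaluates a marginal in the sum-product semiring and a MAP-style query in the max-product semiring, with one bottom-up pass each. The tree encoding and names below are illustrative.

```python
def evaluate(node, leaf_values, plus, times):
    """Bottom-up evaluation of a nested ('sum'|'prod'|'leaf', ...) tree
    over an arbitrary semiring given by `plus` and `times`."""
    kind = node[0]
    if kind == 'leaf':
        return leaf_values[node[1]]
    children = [evaluate(c, leaf_values, plus, times) for c in node[1]]
    out = children[0]
    for c in children[1:]:
        out = times(out, c) if kind == 'prod' else plus(out, c)
    return out

# 0.6 * P1(X1) * P1(X2)  +  0.4 * P2(X1) * P2(X2)
# Each product combines factors with disjoint scopes ({X1} and {X2}),
# which is the sufficient condition for tractability.
spn = ('sum', [('prod', [('leaf', 'w1'), ('leaf', 'x1'), ('leaf', 'x2')]),
               ('prod', [('leaf', 'w2'), ('leaf', 'y1'), ('leaf', 'y2')])])
vals = {'w1': 0.6, 'x1': 0.9, 'x2': 0.8, 'w2': 0.4, 'y1': 0.2, 'y2': 0.5}

marginal = evaluate(spn, vals, plus=lambda a, b: a + b,
                    times=lambda a, b: a * b)
map_value = evaluate(spn, vals, plus=max, times=lambda a, b: a * b)
```

Swapping `plus` from addition to `max` turns summation into maximization without touching the network, which is the sense in which the one structural condition covers satisfiability, optimization, integration, and the rest.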

Wed 6:45 p.m. - 7:00 p.m.

LibSPN: A Library for Learning and Inference with Sum-Product Networks and TensorFlow

Andrzej Pronobis, Avinash Ranganath, Rajesh Rao

Sum-Product Networks (SPNs) are a probabilistic deep architecture with solid theoretical foundations, which has demonstrated state-of-the-art performance in several domains. Yet, surprisingly, there are no mature, general-purpose SPN implementations that would serve as a platform for the community of machine learning researchers centered around SPNs. Here, we present a new general-purpose Python library called LibSPN, which aims to become such a platform. The library is designed to make it straightforward and effortless to apply various SPN architectures to large-scale datasets and problems. The library achieves scalability and efficiency thanks to a tight coupling with TensorFlow, a framework already used by a large community of researchers and developers in multiple domains. We describe the design and benefits of LibSPN, give several use-case examples, and demonstrate the applicability of the library to real-world problems using the example of spatial understanding in mobile robotics.

Wed 7:00 p.m. - 8:30 p.m.
Lunch (Break)
Wed 8:30 p.m. - 9:00 p.m.
Invited Talk 5 - Tomaso Poggio (Talk)
Wed 9:00 p.m. - 9:15 p.m.

Emergence of invariance and disentangling in deep representations

Alessandro Achille, Stefano Soatto

We show that invariance in a deep neural network is equivalent to the information minimality of the representation it computes, and that stacking layers and injecting noise during training naturally bias the network towards learning invariant representations. Then, we show that overfitting is related to the quantity of information stored in the weights, and derive a sharp bound between this information and the minimality and Total Correlation of the layers. This allows us to conclude that implicit and explicit regularization of the loss function not only help limit overfitting, but also foster invariance and disentangling of the learned representation. We also shed light on the properties of deep networks in relation to the geometry of the loss function.

Wed 9:15 p.m. - 9:45 p.m.

Geometry, Optimization and Generalization in Multilayer Networks

What is it that enables learning with multi-layer networks? What causes the network to generalize well despite the model class having extremely high capacity? In this talk I will explore these questions through experimentation, analogy to matrix factorization (including some new results on the energy landscape and implicit regularization in matrix factorization), and study of alternate geometries and optimization approaches.

Wed 9:45 p.m. - 10:00 p.m.

The Shattered Gradients Problem: If resnets are the answer, then what is the question?

David Balduzzi, Brian McWilliams, Marcus Frean, John Lewis, Lennox Leary, Kurt Wan Duo Ma

A long-standing obstacle to progress in deep learning is the problem of vanishing and exploding gradients. Although the problem has largely been overcome via carefully constructed initializations and batch normalization, architectures incorporating skip-connections such as highway networks and resnets perform much better than standard feedforward architectures despite well-chosen initialization and batch normalization. In this paper, we identify the shattered gradients problem. Specifically, we show that the correlation between gradients in standard feedforward networks decays exponentially with depth, resulting in gradients that resemble white noise, whereas the gradients in architectures with skip-connections are far more resistant to shattering, decaying sublinearly. Detailed empirical evidence is presented in support of the analysis, on both fully-connected networks and convnets. Finally, we present a new "looks linear" (LL) initialization that prevents shattering, with preliminary experiments showing that the new initialization allows very deep networks to be trained without skip-connections.
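The "looks linear" idea can be sketched with concatenated ReLUs and mirrored weight blocks: since relu(h) - relu(-h) = h, a layer with weights [W, -W] acting on [relu(h), relu(-h)] computes exactly W h, so the network is linear (and gradients unshattered) at initialization. This is a minimal sketch of that identity, not the paper's full initialization scheme:

```python
import numpy as np

def crelu(h):
    """Concatenated ReLU: [relu(h), relu(-h)] keeps both half-spaces."""
    return np.concatenate([np.maximum(h, 0), np.maximum(-h, 0)])

def ll_weights(W):
    """'Looks linear' mirroring: [W, -W] applied to CReLU features computes
    W @ (relu(h) - relu(-h)) = W @ h."""
    return np.concatenate([W, -W], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W1 = rng.standard_normal((8, 8))
W2 = rng.standard_normal((8, 8))

# Two-layer CReLU network with mirrored weights vs. the plain linear map.
out_net = ll_weights(W2) @ crelu(ll_weights(W1) @ crelu(x))
out_lin = W2 @ (W1 @ x)
```

At initialization the two outputs coincide exactly; nonlinearity only emerges as training breaks the W/-W symmetry, which is why shattering is avoided early on.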

Wed 10:00 p.m. - 10:45 p.m.
Coffee Break 2 and Poster Session (Break)
Wed 10:45 p.m. - 11:00 p.m.

Towards Deep Learning Models Resistant to Adversarial Attacks

Aleksander Mądry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu

Recent work has demonstrated that neural networks are vulnerable to adversarial examples, i.e., inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify general methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. This suggests that adversarially resistant deep learning models might be within our reach after all.
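The inner maximization of the robust-optimization view is commonly solved with projected gradient descent (PGD): ascend the loss within a small l_inf ball around the clean input, projecting back after each step. The sketch below uses a toy logistic model in place of a neural network; the model, numbers, and step sizes are illustrative.

```python
import numpy as np

def pgd_attack(x, y, w, b, eps=0.3, step=0.05, iters=40):
    """Projected gradient ascent on the cross-entropy loss of a logistic
    model p(y=1|x) = sigmoid(w @ x + b), constrained to the l_inf ball
    of radius eps around the clean input x."""
    x_adv = x.copy()
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(w @ x_adv + b)))
        grad_x = (p - y) * w                      # d(loss)/dx
        x_adv = x_adv + step * np.sign(grad_x)    # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the ball
    return x_adv

w = np.array([1.0, -2.0, 0.5]); b = 0.1
x = np.array([0.5, -0.5, 1.0]); y = 1             # clean logit = 2.1
x_adv = pgd_attack(x, y, w, b)
logit_clean = float(w @ x + b)
logit_adv = float(w @ x_adv + b)                  # pushed toward the boundary
```

Adversarial training then minimizes the loss at these worst-case points rather than at the clean inputs, which is the saddle-point formulation the abstract alludes to.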

Wed 11:00 p.m. - 12:20 a.m.
Panel Discussion
Thu 12:20 a.m. - 12:30 a.m.
Closing Remarks and Awards (Talk)

Author Information

Andrzej Pronobis (University of Washington)

Andrzej Pronobis is a Research Associate in the Department of Computer Science and Engineering at the University of Washington in Seattle, as well as a Senior Researcher at KTH Royal Institute of Technology in Stockholm, Sweden. His research is at the intersection of robotics, deep learning and computer vision, with focus on perception and spatial understanding mechanisms for mobile robots and their role in the interaction between robots and human environments. His recent interests include application of tractable probabilistic deep models to planning and learning semantic spatial representations. He is a recipient of a prestigious Swedish Research Council Grant for Junior Researchers and a finalist for the Georges Giralt Ph.D. award for the best European Ph.D. thesis in robotics.

Robert Gens (Google)
Sham Kakade (University of Washington)

Sham Kakade is a Washington Research Foundation Data Science Chair, with a joint appointment in the Department of Computer Science and the Department of Statistics at the University of Washington, and is a co-director of the Algorithmic Foundations of Data Science Institute. He works on the mathematical foundations of machine learning and AI. Sham's thesis helped in laying the foundations of the PAC-MDP framework for reinforcement learning. With his collaborators, his additional contributions include: one of the first provably efficient policy search methods, Conservative Policy Iteration, for reinforcement learning; developing the mathematical foundations for the widely used linear bandit models and the Gaussian process bandit models; the tensor and spectral methodologies for provable estimation of latent variable models (applicable to mixtures of Gaussians, HMMs, and LDA); and the first sharp analysis of the perturbed gradient descent algorithm, along with the design and analysis of numerous other convex and non-convex algorithms. He is the recipient of the IBM Goldberg best paper award (2007) for contributions to fast nearest neighbor search and of the INFORMS Revenue Management and Pricing Section best paper prize (2014). He was program chair for COLT 2011. Sham was an undergraduate at Caltech, where he studied physics and worked under the guidance of John Preskill in quantum computing. He then completed his Ph.D. in computational neuroscience at the Gatsby Unit at University College London, under the supervision of Peter Dayan. He was a postdoc in the Department of Computer Science, University of Pennsylvania, where he broadened his studies to include computational game theory and economics under the guidance of Michael Kearns. Sham has been a Principal Research Scientist at Microsoft Research, New England, an associate professor in the Department of Statistics, Wharton, UPenn, and an assistant professor at the Toyota Technological Institute at Chicago.

Pedro Domingos (University of Washington)
