## Deep Learning Theory 1

Moderator: Quanquan Gu

Abstract:

### Chat is not available.

Tue 20 July 6:00 - 6:20 PDT
##### Let's Agree to Degree: Comparing Graph Convolutional Networks in the Message-Passing Framework

Floris Geerts · Filip Mazowiecki · Guillermo Perez

In this paper we cast neural networks defined on graphs as message-passing neural networks (MPNNs) to study the distinguishing power of different classes of such models. We are interested in when certain architectures are able to tell vertices apart based on the feature labels given as input with the graph. We consider two variants of MPNNS: anonymous MPNNs whose message functions depend only on the labels of vertices involved; and degree-aware MPNNs whose message functions can additionally use information regarding the degree of vertices. The former class covers popular graph neural network (GNN) formalisms for which the distinguished power is known. The latter covers graph convolutional networks (GCNs), introduced by Kipf and Welling, for which the distinguishing power was unknown. We obtain lower and upper bounds on the distinguishing power of (anonymous and degree-aware) MPNNs in terms of the distinguishing power of the Weisfeiler-Lehman (WL) algorithm. Our main results imply that (i) the distinguishing power of GCNs is bounded by the WL algorithm, but they may be one step ahead; (ii) the WL algorithm cannot be simulated by plain vanilla'' GCNs but the addition of a trade-off parameter between features of the vertex and those of its neighbours (as proposed by Kipf and Welling) resolves this problem.

[ ]
Tue 20 July 6:20 - 6:25 PDT

Mohammad Mehrabi · Adel Javanmard · Ryan A. Rossi · Anup Rao · Tung Mai

[ ]
Tue 20 July 6:25 - 6:30 PDT
##### Towards Understanding Learning in Neural Networks with Linear Teachers

Roei Sarussi · Alon Brutzkus · Amir Globerson

Can a neural network minimizing cross-entropy learn linearly separable data? Despite progress in the theory of deep learning, this question remains unsolved. Here we prove that SGD globally optimizes this learning problem for a two-layer network with Leaky ReLU activations. The learned network can in principle be very complex. However, empirical evidence suggests that it often turns out to be approximately linear. We provide theoretical support for this phenomenon by proving that if network weights converge to two weight clusters, this will imply an approximately linear decision boundary. Finally, we show a condition on the optimization that leads to weight clustering. We provide empirical results that validate our theoretical analysis.

[ ]
Tue 20 July 6:30 - 6:35 PDT
##### Continual Learning in the Teacher-Student Setup: Impact of Task Similarity

Sebastian Lee · Sebastian Goldt · Andrew Saxe

[ ]
Tue 20 July 6:35 - 6:40 PDT
##### A Functional Perspective on Learning Symmetric Functions with Neural Networks

Aaron Zweig · Joan Bruna

Symmetric functions, which take as input an unordered, fixed-size set, are known to be universally representable by neural networks that enforce permutation invariance. These architectures only give guarantees for fixed input sizes, yet in many practical applications, including point clouds and particle physics, a relevant notion of generalization should include varying the input size. In this work we treat symmetric functions (of any size) as functions over probability measures, and study the learning and representation of neural networks defined on measures. By focusing on shallow architectures, we establish approximation and generalization bounds under different choices of regularization (such as RKHS and variation norms), that capture a hierarchy of functional spaces with increasing degree of non-linear learning. The resulting models can be learned efficiently and enjoy generalization guarantees that extend across input sizes, as we verify empirically.

[ ]
Tue 20 July 6:40 - 6:45 PDT
##### Weisfeiler and Lehman Go Topological: Message Passing Simplicial Networks

Cristian Bodnar · Fabrizio Frasca · Yuguang Wang · Nina Otter · Guido Montufar · Pietro Lió · Michael Bronstein

The pairwise interaction paradigm of graph machine learning has predominantly governed the modelling of relational systems. However, graphs alone cannot capture the multi-level interactions present in many complex systems and the expressive power of such schemes was proven to be limited. To overcome these limitations, we propose Message Passing Simplicial Networks (MPSNs), a class of models that perform message passing on simplicial complexes (SCs). To theoretically analyse the expressivity of our model we introduce a Simplicial Weisfeiler-Lehman (SWL) colouring procedure for distinguishing non-isomorphic SCs. We relate the power of SWL to the problem of distinguishing non-isomorphic graphs and show that SWL and MPSNs are strictly more powerful than the WL test and not less powerful than the 3-WL test. We deepen the analysis by comparing our model with traditional graph neural networks (GNNs) with ReLU activations in terms of the number of linear regions of the functions they can represent. We empirically support our theoretical claims by showing that MPSNs can distinguish challenging strongly regular graphs for which GNNs fail and, when equipped with orientation equivariant layers, they can improve classification accuracy in oriented SCs compared to a GNN baseline.

[ ]
Tue 20 July 6:45 - 6:50 PDT
##### On the Random Conjugate Kernel and Neural Tangent Kernel

Zhengmian Hu · Heng Huang

We investigate the distributions of Conjugate Kernel (CK) and Neural Tangent Kernel (NTK) for ReLU networks with random initialization. We derive the precise distributions and moments of the diagonal elements of these kernels. For a feedforward network, these values converge in law to a log-normal distribution when the network depth $d$ and width $n$ simultaneously tend to infinity and the variance of log diagonal elements is proportional to ${d}/{n}$. For the residual network, in the limit that number of branches $m$ increases to infinity and the width $n$ remains fixed, the diagonal elements of Conjugate Kernel converge in law to a log-normal distribution where the variance of log value is proportional to ${1}/{n}$, and the diagonal elements of NTK converge in law to a log-normal distributed variable times the conjugate kernel of one feedforward network. Our new theoretical analysis results suggest that residual network remains trainable in the limit of infinite branches and fixed network width. The numerical experiments are conducted and all results validate the soundness of our theoretical analysis.

[ ]
Tue 20 July 6:50 - 6:55 PDT

[ ]