ICML 2022

Sat 5:50 a.m. - 6:00 a.m.

Introduction and Opening Remarks

Sat 6:00 a.m. - 6:30 a.m.

Neural Scaling of Deep Chemical Models (Invited Talk)

Massive scale, both in terms of data availability and computation, enables significant breakthroughs in key application areas of deep learning such as natural language processing (NLP) and computer vision. There is emerging evidence that scale may be a key ingredient in scientific deep learning, but the importance of physical priors in scientific domains makes the strategies and benefits of scaling uncertain. Here, we investigate neural scaling behavior in large chemical models by varying model and dataset sizes over many orders of magnitude, studying models with over one billion parameters, pre-trained on datasets of up to ten million datapoints. We consider large language models for generative chemistry and graph neural networks for machine-learned interatomic potentials. To enable large-scale scientific deep learning studies under resource constraints, we develop the Training Performance Estimation (TPE) framework to reduce the costs of scalable hyperparameter optimization by up to 90%. Using this framework, we discover empirical neural scaling relations for deep chemical models and investigate the interplay between physical priors and scale. Potential applications of large, pre-trained models for "prompt engineering" and unsupervised representation learning of molecules are shown.

Connor Coley · Nathan C. Frey
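
The abstract above mentions empirical neural scaling relations but does not state their functional form. As a minimal sketch, assuming a simple power law `loss ≈ a * size**(-b)` (a common choice, not confirmed by the abstract), such a relation could be fitted by linear regression in log-log space. The function name `fit_power_law` and the data points are hypothetical and purely illustrative.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss ≈ a * size**(-b) by ordinary least squares on log-log data.

    Assumption: a pure power law with no irreducible-loss offset.
    Returns the pair (a, b).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form slope and intercept of the least-squares line.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic data generated from a known power law (not real measurements).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, b = {b:.3f}")
```

On real measurements the fit would only be approximate, and a form with an additive offset may describe the data better.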

Sat 6:30 a.m. - 7:00 a.m.

Chinchillas, Flamingos, and Gatos: Few-Shot Learning through Pre-training (Invited Talk)

Three of our recent sequence models - Chinchilla, Flamingo, and Gato - leveraged one another and combined several state-of-the-art pre-training techniques. This talk will describe how this combination yielded even stronger capabilities for achieving complex tasks in the few-shot setting, beyond what could have been expected from their training regimes.

Oriol Vinyals

Sat 7:00 a.m. - 7:15 a.m.

Multimodal Masked Autoencoders Learn Transferable Representations (Oral)

Xinyang Geng · Hao Liu · Lisa Lee · Dale Schuurmans · Sergey Levine · Pieter Abbeel

Sat 7:15 a.m. - 7:45 a.m.

How Neural Networks See, Learn and Forget (Invited Talk)

Neural networks have been at the heart of machine learning breakthroughs for over a decade. But in just the past couple of years, new advances in model architectures, pretraining and scaling challenge our assumptions on how they function. In this talk I provide some insights into the workings of modern machine learning. Motivated by the ubiquity of Transformer architectures across tasks and data modalities, I discuss the recent successes of Transformers in computer vision and key similarities and differences to convolutional architectures. Next, I overview some of the salient properties of pretraining on Transformer representations and the effect of scale. I draw connections to results on catastrophic forgetting, the way in which forgetting manifests in representations and new mitigation methods suggested by these insights. I conclude with some open questions in these directions.

Maithra Raghu

Sat 7:45 a.m. - 8:15 a.m.

Program Synthesis, Program Semantics, and Large Language Models (Invited Talk)

I will describe our experience with two generations of large language models for code at Google. These models show a range of abilities, including generating small programs from natural language descriptions and engaging in dialog about code, incorporating human feedback to improve solutions. However, in a deeper sense, these models seem not to understand the code that they write, in the sense that they are generally unable to predict the output of a program given a specific input. I will discuss our subsequent efforts to improve the "code understanding" abilities of LMs, by asking them to emit intermediate computation steps as tokens onto a "scratchpad". These same models are able to perform complex multi-step computations when asked to perform the operation "step by step", showing the results of intermediate computations, even operations that the LM could not perform directly.

Charles Sutton
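
The "scratchpad" idea in the abstract above, where a model emits intermediate computation steps as tokens before the final answer, can be illustrated with a small sketch. The trace format and the function name `addition_scratchpad` are assumptions made for illustration, not the format used in the talk. The function renders multi-digit addition as the kind of step-by-step text a model could be asked to produce.

```python
def addition_scratchpad(a: int, b: int) -> str:
    """Render a + b as a digit-by-digit trace with explicit carries.

    Assumes non-negative integers. The line format is hypothetical.
    """
    xs, ys = str(a)[::-1], str(b)[::-1]  # least-significant digit first
    lines = [f"Input: {a} + {b}"]
    carry, digits = 0, []
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        total = x + y + carry
        digits.append(str(total % 10))
        lines.append(f"{x} + {y} + carry {carry} = {total}, write {total % 10}")
        carry = total // 10
    if carry:
        digits.append(str(carry))
        lines.append(f"final carry {carry}")
    lines.append(f"Answer: {''.join(reversed(digits))}")
    return "\n".join(lines)

print(addition_scratchpad(29, 57))
```

Each intermediate line exposes one small computation, so the final answer depends only on steps that are already written out.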

Sat 8:15 a.m. - 9:15 a.m.

Panel Discussion

Sat 10:30 a.m. - 11:00 a.m.

Exploring the Limits of Large Scale Pre-training (Invited Talk)

Recent developments in large-scale machine learning suggest that by scaling up data, model size and training time properly, one might observe that improvements in pre-training would transfer favorably to most downstream tasks. In this work, we systematically study this phenomenon and establish that, as we increase the upstream accuracy, the performance of downstream tasks saturates. In particular, we investigate more than 4800 experiments on Vision Transformers, MLP-Mixers and ResNets with the number of parameters ranging from ten million to ten billion, trained on the largest scale of available image data (JFT, ImageNet21K) and evaluated on more than 20 downstream image recognition tasks. We propose a model for downstream performance that reflects the saturation phenomenon and captures the nonlinear relationship in performance of upstream and downstream tasks. Delving deeper to understand the reasons that give rise to these phenomena, we show that the saturation behavior we observe is closely related to the way that representations evolve through the layers of the models.

Hanie Sedghi

Sat 11:00 a.m. - 11:15 a.m.	Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Prior (Oral)	Deep learning is increasingly moving towards a transfer learning paradigm whereby large "foundation models" are fine-tuned on downstream tasks, starting from an initialization learned on the source task. But an initialization contains relatively little information about the source task. Instead, we show that we can learn highly informative posteriors from the source task, which serve as the basis for priors that modify the whole loss surface on the downstream task. This simple modular approach enables significant performance gains and more data-efficient learning on various downstream classification and segmentation tasks, serving as a drop-in replacement for standard pre-training strategies.	Ravid Shwartz-Ziv · Micah Goldblum · Hossein Souri · Sanyam Kapoor · Chen Zhu · Yann LeCun · Andrew Wilson
Sat 11:15 a.m. - 11:45 a.m.	Simplifying and Simplifying Self-Supervised Visual Representation Pre-Training (Invited Talk)	In this talk, I am going to cover our recent works in the self-supervised learning space for visual representation pre-training. First is SimSiam, a non-contrastive, momentum-free framework that, to our surprise, can successfully avoid trivial solutions and achieve performance very competitive with more complicated methods like MoCo. Second is Masked Autoencoder (MAE), which simply and directly reconstructs input signals by predicting natural image patches as a further simplification of self-supervised frameworks for computer vision. MAE adopts a BERT-like algorithm with crucial changes for images, and exhibits BERT-like scaling behaviors, among other intriguing properties different from contrastive learning.	Xinlei Chen
Sat 11:45 a.m. - 12:00 p.m.	Plex: Towards Reliability using Pretrained Large Model Extensions (Oral)	A recent trend in artificial intelligence (AI) is the use of pretrained models for language and vision tasks, which has achieved extraordinary performance but also puzzling failures. Examining tasks that probe the model’s abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks such as uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and proper scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot learning). We devise 10 types of tasks over 36 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large model extensions (plex) for vision and language modalities, respectively. Plex greatly improves the state-of-the-art across tasks, and simplifies the traditional protocol as it does not require designing scores or tuning the model for each individual task. We demonstrate scaling effects over model sizes and pretraining dataset sizes up to 4 billion examples. We also demonstrate Plex’s capabilities on challenging tasks including zero-shot open set recognition, few-shot uncertainty, and uncertainty in conversational language understanding.	Dustin Tran · Andreas Kirsch · Balaji Lakshminarayanan · Huiyi Hu · Du Phan · D. Sculley · Jasper Snoek · Jeremiah Liu · JIE REN · Joost van Amersfoort · Kehang Han · Estefany Kelly Buchanan · Kevin Murphy · Mark Collier · Michael Dusenberry · Neil Band · Nithum Thain · Rodolphe Jenatton · Tim G. J. Rudner · Yarin Gal · Zachary Nado · Zelda Mariet · Zi Wang · Zoubin Ghahramani
Sat 12:00 p.m. - 1:30 p.m.	Poster Session
Sat 1:30 p.m. - 2:00 p.m.	Unified and Efficient Multimodal Pretraining across Vision and Language (Invited Talk)	Mohit Bansal
Sat 2:00 p.m. - 2:30 p.m.	Benefits and Challenges of Pre-training for Environmental Monitoring (Invited Talk)	We require systems to monitor species in real time and in greater detail to quickly understand which conservation and sustainability efforts are most effective and take corrective action. Current ecological monitoring systems generate data far faster than researchers can analyze it, making scaling up impossible without automated data processing. Pre-training, particularly methods that require minimal human supervision, is clearly well-aligned with this problem setting where large amounts of unlabeled data are available. However, ecological data collected in the field presents a number of challenges that current pre-training methods are often not designed to tackle. These include strong spatiotemporal correlations and domain shifts, imperfect data quality, fine-grained categories, and long-tailed distributions. I will discuss gaps between the current pre-training paradigm and what is needed for usable, impactful computer vision based environmental monitoring systems, and outline several interesting future directions at the intersection of pre-training and environmental monitoring.	Sara Beery
-	Efficient Task Adaptation by Mixing Discovered Skills (Poster)	Unsupervised skill discovery is one of the approaches by which the agent learns potentially useful and distinct behaviors without any explicit reward. The agent is then expected to quickly solve downstream tasks by properly using a set of discovered skills rather than learning everything from scratch. However, it is non-trivial to optimally utilize the discovered skills for each task, which can be viewed as a fine-tuning method, and this has been less considered in the literature in spite of its importance. In this paper, we compare some fine-tuning methods, showing how they inefficiently utilize the discovered skills, and also propose new methods, which are sample-efficient and effective by interpreting the skills as a perspective of how an agent transforms the input state. Our code is available at https://anonymous.4open.science/r/unsupervisedRL-87F3	Eunseok Yang · JUNGSUB RHIM · Taesup Kim
-	Non-Markovian Policies for Unsupervised Reinforcement Learning in Multiple Environments (Poster)	In recent years, the area of Unsupervised Reinforcement Learning (URL) has gained particular relevance as a way to foster generalization of reinforcement learning agents. In this setting, the agent's policy is first pre-trained in an unknown environment via reward-free interactions, often through a pure exploration objective that drives the agent towards a uniform coverage of the state space. It has been shown that this pre-training leads to improved efficiency in downstream supervised tasks later given to the agent to solve. When dealing with the unsupervised pre-training in multiple environments one should also account for potential trade-offs in the exploration performance within the set of environments, which leads to the following question: Can we pre-train a policy that is simultaneously optimal in all the environments? In this work, we address this question by proposing a novel non-Markovian policy architecture to be pre-trained with the common maximum state entropy objective. This architecture showcases significant empirical advantages when compared to state-of-the-art Markovian agents for URL.	Pietro Maldini · Mirco Mutti · Riccardo De Santi · Marcello Restelli
-	On the Importance of Hyperparameters and Data Augmentation for Self-Supervised Learning (Poster)	Self-Supervised Learning (SSL) has become a very active area of Deep Learning research where it is heavily used as a pre-training method for classification and other tasks. However, the rapid pace of advancements in this area comes at a price: training pipelines vary significantly across papers, which presents a potentially crucial confounding factor. Here, we show that, indeed, the choice of hyperparameters and data augmentation strategies can have a dramatic impact on performance. To shed light on these neglected factors and help maximize the power of SSL, we hyperparameterize these components and optimize them with Bayesian optimization, showing improvements across multiple datasets for the SimSiam SSL approach. Realizing the importance of data augmentations for SSL, we also introduce a new automated data augmentation algorithm, GroupAugment, which considers groups of augmentations and optimizes the sampling across groups. In contrast to algorithms designed for supervised learning, GroupAugment achieved consistently high linear evaluation accuracy across all datasets we considered. Overall, our results indicate the underestimated role of data augmentation for SSL.	Diane Wagner · Fabio Ferreira · Danny Stoll · Robin Tibor Schirrmeister · Samuel Gabriel Müller · Frank Hutter
-	Learning Large-scale Universal User Representation with Sparse Mixture of Experts (Poster)	Learning user sequence behaviour embeddings is very challenging due to complicated feature interactions over time and the high dimensionality of user features. Recently emerging foundation models, e.g., BERT and its variants, have encouraged a large body of researchers to investigate this field. However, unlike in natural language processing (NLP) tasks, the parameters of a user behaviour model come mostly from the user embedding layer, which makes most existing works fail to train a universal user embedding at large scale. Furthermore, user representations are learned from multiple downstream tasks, and past research did not address the seesaw phenomenon. In this paper, we propose SUPERMOE, a generic framework for obtaining high-quality user representations from multiple tasks. Specifically, the user behaviour sequences are encoded by an MoE transformer, so we can increase the model capacity to billions or even trillions of parameters. In order to deal with the seesaw phenomenon when learning across multiple tasks, we design a new loss function with task indicators. We perform extensive offline experiments on public datasets and online experiments on private real-world business scenarios. Our approach achieves the best performance over state-of-the-art models, and the results demonstrate the effectiveness of our user behaviour representation framework using the MoE transformer.	Caigao Jiang · Siqiao Xue · James Zhang · Lingyue Liu · Zhibo Zhu · Hongyan Hao
-	Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? (Poster)	Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark. To address this, we propose a novel self-supervised representation learning method Representation Learning via Invariant Causal Mechanisms v2 (ReLICv2) (based on ReLIC (Mitrovic et al., 2021)) which explicitly enforces invariance over spurious features such as background and object style. We conduct an extensive experimental evaluation across a varied set of datasets, learning settings and tasks. ReLICv2 achieves 77.1% top-1 accuracy on ImageNet using linear evaluation with a ResNet50 architecture and 80.6% with larger ResNet models, outperforming previous state-of-the-art self-supervised approaches by a wide margin. Moreover, we show a relative overall improvement exceeding +5% over the supervised baseline in the transfer setting and the ability to learn more robust representations than self-supervised and supervised models. Most notably, ReLICv2 is the first unsupervised representation learning method to consistently outperform a standard supervised baseline in a like-for-like comparison across a wide range of ResNet architectures. Finally, we show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.	Nenad Tomasev · Ioana Bica · Brian McWilliams · Lars Buesing · Razvan Pascanu · Charles Blundell · Jovana Mitrovic
-	How robust are pre-trained models to distribution shift? (Poster)	The vulnerability of machine learning models to spurious correlations has mostly been discussed in the context of supervised learning (SL). However, there is a lack of insight on how spurious correlations affect the performance of popular self-supervised learning (SSL) and auto-encoder based models (AE). In this work, we shed light on this by evaluating the performance of these models on both real world and synthetic distribution shift datasets. Following observations that the linear head itself can be susceptible to spurious correlations, we develop a new evaluation scheme with the linear head trained on out-of-distribution (OOD) data, to isolate the performance of the pre-trained models from a potential bias of the linear head used for evaluation. With this new methodology, we show that SSL models are consistently more robust to distribution shifts and thus better at OOD generalisation than AE and SL models.	Yuge Shi · Imant Daunhawer · Julia Vogt · Phil Torr · Amartya Sanyal
-	Multimodal Masked Autoencoders Learn Transferable Representations (Poster)	Building scalable models to learn from diverse, multimodal data remains an open challenge. For vision-language data, the dominant approaches are based on contrastive learning objectives that train a separate encoder for each modality. While effective, contrastive learning approaches introduce sampling bias depending on the data augmentations used, which can degrade performance on downstream tasks. Moreover, these methods are limited to paired image-text data, and cannot leverage widely available unpaired data. In this paper, we investigate whether a large multimodal model trained purely via masked token prediction, without using modality-specific encoders or contrastive learning, can learn transferable representations for downstream tasks. We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction. We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks. We demonstrate the scalability of M3AE with larger model size and training time, and its flexibility to train on both paired image-text data as well as unpaired data.	Xinyang Geng · Hao Liu · Lisa Lee · Dale Schuurmans · Sergey Levine · Pieter Abbeel
-	Is Self-Supervised Contrastive Learning More Robust Than Supervised Learning? (Poster)	Self-supervised contrastive pre-training is a powerful tool to learn visual representation without human labels. Prior works have primarily focused on the recognition accuracy of contrastive learning but have overlooked other behavioral aspects. Besides accuracy, robustness plays a critical role in machine learning's reliability. We design and conduct a series of robustness tests to quantify the robustness difference between contrastive learning and supervised learning. These tests leverage data corruptions at multiple levels, ranging from pixel-level to patch-level and dataset-level, of either downstream or pre-training data. Our tests unveil intriguing robustness behaviors of contrastive and supervised learning. On one hand, under downstream corruptions, contrastive learning is surprisingly more robust than supervised learning. On the other hand, under pre-training corruptions, contrastive learning is vulnerable to patch shuffling and pixel intensity change, yet less sensitive to dataset-level distribution change. We analyze these results through the role of data augmentation and feature properties which have implications on improving supervised pre-training's downstream robustness.	Yuanyi Zhong · Haoran Tang · Junkun Chen · Jian Peng · Yu-Xiong Wang
-	Leader-based Pre-training Framework for Cooperative Multi-Agent Reinforcement Learning (Poster)	A leader in the team enables efficient learning for other novices in the social learning setting for both humans and animals. This paper constructs the leader-based pre-training framework for Multi-Agent Reinforcement Learning and investigates whether the leader enables the learning of novices as well. We compare three different approaches to distilling a leader's experiences from the pre-training model: Linear Layer Dimension Reduction, Attentive Graph Pooling, and Attention-based Graph Neural Network. We successfully show that a leader-based pre-training framework can 1) enable agents to learn faster, cooperate more effectively, and escape local optima, and 2) promote the generalizability of agents in more challenging and unseen environments. The key to effective distillation is to maintain and aggregate important information.	Wenqi Chen · Xin Zeng · Amber Li
-	Pixel-level Correspondence for Self-Supervised Learning from Video (Poster)	While self-supervised learning has enabled effective representation learning in the absence of labels, for vision, video remains a relatively untapped source of supervision. To address this, we propose Pixel-level Correspondence (PiCo), a method for dense contrastive learning from video. By tracking points with optical flow, we obtain a correspondence map which can be used to match local features at different points in time. We validate PiCo on standard benchmarks, outperforming self-supervised baselines on multiple dense prediction tasks, without compromising performance on image classification.	Yash Sharma · Yi Zhu · Chris Russell · Thomas Brox
-	Pre-Training on a Data Diet: Identifying Sufficient Examples for Early Training (Poster)	A striking observation about iterative magnitude pruning (IMP; Frankle et al. 2020) is that, after just a few hundred steps of dense training, the method can find a sparse sub-network that can be trained to the same accuracy as the dense network. However, the same does not hold at step 0, i.e., random initialization. In this work, we seek to understand how this early phase of pre-training leads to a good initialization for IMP through the lens of the data distribution. Empirically we observe that, holding the number of pre-training iterations constant, training on a small fraction of (randomly chosen) data suffices to obtain an equally good initialization for IMP. We additionally observe that by pre-training only on "easy" training data we can decrease the number of steps necessary to find a good initialization for IMP compared to training on the full dataset or a randomly chosen subset. Combined, these results provide new insight into the role played by data in the early phase of training.	Mansheej Paul · Brett Larsen · Surya Ganguli · Jonathan Frankle · Gintare Karolina Dziugaite
-	Enhancing Multi-hop Connectivity for Graph Convolutional Networks (Poster)	Graph Convolutional Network and many of its variants are known to suffer from the dilemma between model depth and over-smoothing issues. Stacking layers of GCN usually leads to the exponential expansion of the receptive field (i.e., high-order neighbors). In order to incorporate the information from high-order neighbors to learn node representations without drastically increasing the number of graph convolution layers, we propose a simple and effective pre-processing technique to increase graph connectivity. Our approach selectively inserts connections between center nodes and informative high-order neighbors, with learnable weights to control the information flow through the connection. Experiments show that our approach improves the performance of GCN, and reduces the depth of GCNII without sacrificing its performance. Besides, our proposed homophily-based weight assignment can mitigate the effect of graph structural attacks.	Songtao Liu · Shixiong Jing · Tong Zhao · Zengfeng Huang · Dinghao Wu
-	Investigating Why Contrastive Learning Benefits Robustness against Label Noise ( Poster ) > link Self-supervised contrastive learning has recently been shown to be very effective in preventing deep networks from overfitting noisy labels. Despite its empirical success, the theoretical understanding of the effect of contrastive learning on boosting robustness is very limited. In this work, we rigorously prove that the learned representation matrix has certain desirable properties in terms of its SVD that benefit robustness against label noise. We further show that the low-rank structure of the Jacobian of deep networks pre-trained with contrastive learning allows them to achieve superior performance initially when fine-tuned on noisy labels. Finally, we demonstrate that the initial robustness provided by contrastive learning enables robust training methods to achieve state-of-the-art performance under extreme noise levels. Link	Yihao Xue · Kyle Whitecross · Baharan Mirzasoleiman 🔗
-	Pretraining a Neural Network before Knowing Its Architecture ( Poster ) > link Training large neural networks is possible by training a smaller hypernetwork that predicts parameters for the large ones. A recently released Graph HyperNetwork (GHN) trained this way on one million smaller ImageNet architectures is able to predict parameters for large unseen networks such as ResNet-50. While networks with predicted parameters lose performance on the source task, the predicted parameters have been found useful for fine-tuning on other tasks. We study whether fine-tuning based on the same GHN is still useful on novel strong architectures that were published after the GHN had been trained. We found that for recent architectures such as ConvNeXt, GHN initialization becomes less useful than for ResNet-50. One potential reason is the increased distribution shift of novel architectures from those used to train the GHN. We also found that the predicted parameters lack the diversity necessary to successfully fine-tune parameters with gradient descent. We alleviate this limitation by applying simple post-processing techniques to predicted parameters before fine-tuning them on a target task and improve fine-tuning of ResNet-50 and ConvNeXt. Link	Boris Knyazev 🔗
-	Improved Logical Reasoning of Language Models via Differentiable Symbolic Programming ( Poster ) > link Pre-trained large language models (LMs) struggle to perform logical reasoning reliably despite advances in scale and compositionality. In this work, we tackle this challenge through the lens of symbolic programming. We propose DSR-LM, a Differentiable Symbolic Reasoning framework where pre-trained LMs govern the perception of factual knowledge, and a symbolic module equipped with provenance generates top-k proofs by deductive reasoning. In contrast to works that rely on hand-crafted logic rules, our differentiable symbolic reasoning architecture efficiently learns weighted rules to further improve LMs. DSR-LM is scalable, interpretable, and allows easy integration of prior knowledge, thereby supporting extensive symbolic programming to robustly derive a logical conclusion. Our experiments show that DSR-LM leads to improved logical reasoning of pre-trained LMs and outperforms a spectrum of competitive baselines even under systematic distribution shifts on sequence lengths. Link	Hanlin Zhang · Ziyang Li · Jiani Huang · Mayur Naik · Eric Xing 🔗
-	Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Prior ( Poster ) > link Deep learning is increasingly moving towards a transfer learning paradigm whereby large "foundation models" are fine-tuned on downstream tasks, starting from an initialization learned on the source task. But an initialization contains relatively little information about the source task. Instead, we show that we can learn highly informative posteriors from the source task, which serve as the basis for priors that modify the whole loss surface on the downstream task. This simple modular approach enables significant performance gains and more data-efficient learning on various downstream classification and segmentation tasks, serving as a drop-in replacement for standard pre-training strategies. Link	Ravid Shwartz-Ziv · Micah Goldblum · Hossein Souri · Sanyam Kapoor · Chen Zhu · Yann LeCun · Andrew Wilson 🔗
-	How well do contrastively trained models transfer? ( Poster ) > link There are two prevailing methods for pre-training on large datasets to learn transferable representations: 1) supervised pre-training on large but weakly-labeled datasets; 2) contrastive training on images only or on (image, text) pairs. While supervised pre-training learns good representations that can be transferred to a wide range of tasks, contrastively trained models such as CLIP have demonstrated unprecedented zero-shot transfer. In this work, we compare the transferability of the two aforementioned methods to multiple downstream tasks. The pre-training distributions we consider include YFCC, Conceptual Captions, and ImageNet-21K while pre-training objectives range from supervised to SimCLR, CLIP, and SLIP. We observe that different pre-training methods with the same training source transfer similarly given their ImageNet accuracy. Link	M. Moein Shariatnia · Rahim Entezari · Mitchell Wortsman · Olga Saukh · Ludwig Schmidt 🔗
-	Vote for Nearest Neighbors Meta-Pruning of Self-Supervised Networks ( Poster ) > link Pruning plays an essential role in deploying deep neural nets (DNNs) on hardware with limited memory or computation. However, current high-quality iterative pruning can create a terrible carbon footprint when compressing a large DNN for a wide variety of devices and tasks. Can we reuse the pruning results on previous tasks to accelerate the pruning for a new task? Can we find a better initialization for a new task? We study this "nearest neighbors meta-pruning" problem by first investigating different choices of pre-trained models for pruning under limited iterations. Our empirical study reveals several advantages of the self-supervision pre-trained model when pruned for multiple tasks. We further study the overlap of pruned models for similar tasks and how the overlap changes for different layers. Inspired by these discoveries, we develop a simple but strong baseline, "Meta-Vote Pruning (MVP)", that significantly reduces the pruning iterations for a new task by initializing a sub-network from the pruned models of tasks similar to it. In experiments, we demonstrate the advantages of MVP through extensive empirical studies and comparisons with popular pruning methods. Link	Haiyan Zhao · Tianyi Zhou · Guodong Long · Jing Jiang · Chengqi Zhang 🔗
-	On Combining Global and Localized Self-Supervised Models of Speech ( Poster ) > link Self-supervised learning involves learning general-purpose representations that can be useful in a variety of downstream tasks. In this work, we study the application of speech embeddings derived from popular self-supervised learning frameworks such as wav2vec-2.0 and HuBERT to four speech-classification tasks: sentiment classification, command detection, emotion classification and depression detection. We distinguish between and discuss self-supervised training tasks that induce localized and global features of speech based on their temporal granularity: noting that self-supervised representation learning frameworks based on the masked language-modeling objective -- such as wav2vec-2.0 and HuBERT -- induce localized embeddings, we define a self-supervised learning framework based on SimSiam for learning global features of speech. Through our evaluations, we find that these global representations are better suited for tasks such as depression detection and emotion classification while the localized embeddings of speech can be very useful in tasks such as speech-command detection; we also find that our proposed model outperforms TRILL -- a popular model for learning global representations. Finally, we also propose and confirm empirically that combining the global and localized representations of speech helps obtain better performance across a range of downstream tasks than each of the individual embedding methods. Link	Sri Harsha Dumpala · Chandramouli Shama Sastry · Rudolf Uher · Sageev Oore 🔗
-	Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning ( Poster ) > link We present modality gap, an intriguing geometric phenomenon of the representation space of multi-modal models. Specifically, we show that different data modalities (e.g. images and text) are embedded at arm's length in their shared representation in multi-modal models such as CLIP. Our systematic analysis demonstrates that this gap is caused by a combination of model initialization and contrastive learning optimization. In model initialization, we show empirically and theoretically that the representation of a common deep neural network is restricted to a narrow cone. As a consequence, in a multi-modal model with two encoders, the representations of the two modalities are clearly apart when the model is initialized. During optimization, contrastive learning keeps the different modalities separate by a certain distance, which is influenced by the temperature parameter in the loss function. Our experiments further demonstrate that varying the modality gap distance has a significant impact in improving the model's downstream zero-shot classification performance and fairness. Our code and data are available at https://modalitygap.readthedocs.io/ Link	Weixin Liang · Yuhui Zhang · Yongchan Kwon · Serena Yeung · James Zou 🔗
-	Robustness to Adversarial Gradients: A Glimpse Into the Loss Landscape of Contrastive Pre-training ( Poster ) > link An in-depth understanding of deep neural network generalization can allow machine learning practitioners to design systems more robust to class balance shift, adversarial attacks, and data drift. However, the reasons for better generalization are not fully understood. Recent works provide empirical arguments suggesting flat minima generalize better. While recently proposed contrastive pre-training methods have also been shown to improve generalization, there is an incomplete understanding of the loss landscape of these models and why they generalize well. In this work, we analyze the loss landscape of contrastively trained models on the CIFAR10 dataset by looking at three sharpness measures: (1) the approximate eigenspectrum of the Hessian, (2) (Cε, A)-sharpness, and (3) robustness to adversarial gradients (RAG), a new efficient measure of sharpness. Our findings suggest models fine-tuned after contrastive training favor flatter solutions relative to baseline classifiers trained with a supervised objective. In addition, our proposed metric yields findings consistent with existing works, demonstrating impacts of learning rate and batch size on minima sharpness. Link	Philip Fradkin · Lazar Atanackovic · Michael Zhang 🔗
-	Self-Destructing Models: Increasing the Costs of Harmful Dual Uses in Foundation Models ( Poster ) > link A growing ecosystem of large, open-source foundation models has reduced the labeled data and technical expertise necessary to apply machine learning to many new problems. Yet foundation models pose a clear dual-use risk, indiscriminately reducing the costs of building both harmful and benign machine learning systems. To mitigate this risk, we propose the task blocking paradigm, in which foundation models are trained with an additional mechanism to impede adaptation to harmful tasks while retaining good performance on desired tasks. We call the resulting models self-destructing models, inspired by mechanisms that prevent adversaries from using tools for harmful purposes. We present an algorithm for training self-destructing models leveraging techniques from meta-learning and adversarial learning, showing that it can largely prevent a BERT-based model from learning to perform gender identification without harming the model's ability to perform profession classification. We conclude with a discussion of future directions. Link	Eric Mitchell · Peter Henderson · Christopher Manning · Dan Jurafsky · Chelsea Finn 🔗
-	Flaky Performances when Pre-Training on Relational Databases with a Plan for Future Characterization Efforts ( Poster ) > link We explore the downstream task performances for graph neural network (GNN) self-supervised learning (SSL) methods trained on subgraphs extracted from relational databases (RDBs). Intuitively, this joint use of SSL and GNNs allows us to leverage more of the available data, which could translate to better results. However, while we observe positive transfer in some cases, others showed systematic performance degradation, including some spectacular ones. We hypothesize a mechanism that could explain this behaviour and draft a plan for future work testing it by characterizing how much relevant information different strategies can (theoretically and/or empirically) extract from (synthetic and/or real) RDBs. Link	Shengchao Liu · David Vazquez · Jian Tang · Pierre-André Noël 🔗
-	Training strategies with unlabeled and few labeled examples under 1-pixel attack by combining supervised and self-supervised learning ( Poster ) > link Self-supervised learning pre-training exhibited excellent performance on feature learning by using only unlabeled examples. Still, it is not clear how different self-supervised tasks perform under distinct image domains and there are still training issues to be tackled under scenarios of limited labeled data. We investigate two self-supervised tasks: rotation and Barlow Twins, on three distinct image domains, exploring a combination of supervised and self-supervised learning. Our motivation is to work on scenarios where the proportion of labeled data with respect to unlabeled data is small, as well as investigate the model's robustness to 1-pixel attacks. The models that combine supervised with self-supervised tasks can take advantage of the unlabeled data to improve the learned representation in terms of the linear discrimination, as well as allowing learning even under attack. Link	Gabriel Biscaro Cavallari · Moacir Ponti 🔗
-	Plex: Towards Reliability using Pretrained Large Model Extensions ( Poster ) > link A recent trend in artificial intelligence (AI) is the use of pretrained models for language and vision tasks, which has achieved extraordinary performance but also puzzling failures. Examining tasks that probe the model’s abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks such as uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and proper scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot learning). We devise 10 types of tasks over 36 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large model extensions (plex) for vision and language modalities, respectively. Plex greatly improves the state-of-the-art across tasks, and simplifies the traditional protocol as it does not require designing scores or tuning the model for each individual task. We demonstrate scaling effects over model sizes and pretraining dataset sizes up to 4 billion examples. We also demonstrate Plex’s capabilities on challenging tasks including zero-shot open set recognition, few-shot uncertainty, and uncertainty in conversational language understanding. Link	Dustin Tran · Andreas Kirsch · Balaji Lakshminarayanan · Huiyi Hu · Du Phan · D. Sculley · Jasper Snoek · Jeremiah Liu · JIE REN · Joost van Amersfoort · Kehang Han · Estefany Kelly Buchanan · Kevin Murphy · Mark Collier · Michael Dusenberry · Neil Band · Nithum Thain · Rodolphe Jenatton · Tim G. J Rudner · Yarin Gal · Zachary Nado · Zelda Mariet · Zi Wang · Zoubin Ghahramani 🔗
-	Contrastive Learning Can Find An Optimal Basis For Approximately Invariant Functions ( Poster ) > link Contrastive learning is a powerful framework for learning self-supervised representations that generalize well to downstream supervised tasks. We show that multiple existing contrastive learning methods can be reinterpreted as learning a positive-definite kernel that approximates a particular contrastive kernel defined by the positive pairs. The principal components of the data under this kernel exactly correspond to the eigenfunctions of a positive-pair Markov chain, and these eigenfunctions can be used to build a representation that provably minimizes the worst-case approximation error of linear predictors under the assumption that positive pairs have similar labels. We give generalization bounds for downstream linear prediction using this optimal representation, and show how to approximate this representation using kernel PCA. We also explore kernel-based representations on a noisy MNIST task for which the positive pair distribution has a closed form, and compare the properties of the true eigenfunctions with their learned approximations. Link	Daniel D. Johnson · Ayoub El Hanchi · Chris Maddison 🔗

Workshop

The First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward

Huaxiu Yao · Hugo Larochelle · Percy Liang · Colin Raffel · Jian Tang · Ying WEI · Saining Xie · Eric Xing · Chelsea Finn

Hall F

Schedule