## The First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward

### Huaxiu Yao · Hugo Larochelle · Percy Liang · Colin Raffel · Jian Tang · Ying WEI · Saining Xie · Eric Xing · Chelsea Finn

##### Hall F
Abstract Workshop Website
Sat 23 Jul, 5:50 a.m. PDT

Abstract:

The past five years have seen rapid progress in large-scale pre-trained models across a variety of domains, such as computer vision, natural language processing, robotics, bioinformatics, etc. Leveraging a huge number of parameters, large-scale pre-trained models are capable of encoding rich knowledge from labeled and/or unlabeled examples. Supervised and self-supervised pre-training have been the two most representative paradigms, through which pre-trained models have demonstrated large benefits on a wide spectrum of downstream tasks. There are also other pre-training paradigms, e.g., meta-learning for few-shot learning, where pre-trained models are trained so that they quickly adapt to solve new tasks. However, there are still many remaining challenges and new opportunities ahead for pre-training, In this workshop, we propose to have the following two foci: (1) Which pre-training methods transfer across different applications/domains, which ones don't, and why? (2) In what settings should we expect pre-training to be effective, compared to learning from scratch?

Chat is not available.
Timezone: America/Los_Angeles »

### Schedule

 Sat 5:50 a.m. - 6:00 a.m. Introduction and Opening Remarks 🔗 Sat 6:00 a.m. - 6:30 a.m. Neural Scaling of Deep Chemical Models (Invited Talk)    Massive scale, both in terms of data availability and computation, enables significant breakthroughs in key application areas of deep learning such as natural language processing (NLP) and computer vision. There is emerging evidence that scale may be a key ingredient in scientific deep learning, but the importance of physical priors in scientific domains makes the strategies and benefits of scaling uncertain. Here, we investigate neural scaling behavior in large chemical models by varying model and dataset sizes over many orders of magnitude, studying models with over one billion parameters, pre-trained on datasets of up to ten million datapoints. We consider large language models for generative chemistry and graph neural networks for machine-learned interatomic potentials. To enable large-scale scientific deep learning studies under resource constraints, we develop the Training Performance Estimation (TPE) framework to reduce the costs of scalable hyperparameter optimization by up to 90%. Using this framework, we discover empirical neural scaling relations for deep chemical models and investigate the interplay between physical priors and scale. Potential applications of large, pre-trained models for "prompt engineering" and unsupervised representation learning of molecules are shown. Connor Coley · Nathan C. Frey 🔗 Sat 6:30 a.m. - 7:00 a.m. Chinchillas, Flamingos, and Gatos: Few-Shot Learning through Pre-training (Invited Talk)    Three of our recent sequence models - Chinchilla, Flamingo, and Gato - leveraged one another and combined several state-of-the-art pre-training techniques. This talk will describe how this combination yielded even stronger capabilities to achieving complex tasks, in the few-shot setting, beyond what could have been expected from their training regimes. Oriol Vinyals 🔗 Sat 7:00 a.m. - 7:15 a.m. Multimodal Masked Autoencoders Learn Transferable Representations (Oral) Xinyang Geng · Hao Liu · Lisa Lee · Dale Schuurmans · Sergey Levine · Pieter Abbeel 🔗 Sat 7:15 a.m. - 7:45 a.m. How Neural Networks See, Learn and Forget (Invited Talk)    Neural networks have been at the heart of machine learning breakthroughs for over a decade. But in just the past couple of years, new advances in model architectures, pretraining and scaling challenge our assumptions on how they function. In this talk I provide some insights into the workings of modern machine learning. Motivated by the ubiquity of Transformer architectures across tasks and data modalities, I discuss the recent successes of Transformers in computer vision and key similarities and differences to convolutional architectures. Next, I overview some of the salient properties of pretraining on Transformer representations and the effect of scale. I draw connections to results on catastrophic forgetting, the way in which forgetting manifests in representations and new mitigation methods suggested by these insights. I conclude with some open questions in these directions. Maithra Raghu 🔗 Sat 7:45 a.m. - 8:15 a.m. Program Synthesis, Program Semantics, and Large Language Models (Invited Talk)    I will describe our experience with two generations of large language models for code at Google. These models show a range of abilities, including generating small programs from natural language descriptions and engaging in dialog about code, incorporating human feedback to improve solutions. However, in a deeper sense, these models seem not to understand the code that they write, in the sense that they are generally unable to predict the output of a program given a specific input. I will discuss our subsequent efforts to improve the "code understanding" abilities of LMs, by asking them to emit intermediate computation steps as tokens onto a "scratchpad". These same models are able to perform complex multi-step computations when asked to perform the operation "step by step", showing the results of intermediate computations, even operations that the LM could not perform directly. Charles Sutton 🔗 Sat 8:15 a.m. - 9:15 a.m. Panel Discussion 🔗 Sat 10:30 a.m. - 11:00 a.m. Exploring the Limits of Large Scale Pre-training (Invited Talk) Recent developments in large-scale machine learning suggest that by scaling up data, model size and training time properly, one might observe that improvements in pre-training would transfer favorably to most downstream tasks. In this work, we systematically study this phenomena and establish that, as we increase the upstream accuracy, the performance of downstream tasks saturates. In particular, we investigate more than 4800 experiments on Vision Transformers, MLP-Mixers and ResNets with number of parameters ranging from ten million to ten billion, trained on the largest scale of available image data (JFT, ImageNet21K) and evaluated on more than 20 downstream image recognition tasks. We propose a model for downstream performance that reflects the saturation phenomena and captures the nonlinear relationship in performance of upstream and downstream tasks. Delving deeper to understand the reasons that give rise to these phenomena, we show that the saturation behavior we observe is closely related to the way that representations evolve through the layers of the models. Hanie Sedghi 🔗 Sat 11:00 a.m. - 11:15 a.m. Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Prior (Oral)  link »    Deep learning is increasingly moving towards a transfer learning paradigm whereby large foundation models'' are fine-tuned on downstream tasks, starting from an initialization learned on the source task. But an initialization contains relatively little information about the source task. %, and would not affect the final solution at all if we do a good job of optimization. Instead, we show that we can learn highly informative posteriors from the source task, which serves as the basis for priors that modify the whole loss surface on the downstream task. This simple modular approach enables significant performance gains and more data-efficient learning on various downstream classification and segmentation tasks, serving as a drop-in replacement for standard pre-training strategies. Link » Ravid Shwartz-Ziv · Micah Goldblum · Hossein Souri · Sanyam Kapoor · Chen Zhu · Yann LeCun · Andrew Wilson 🔗 Sat 11:15 a.m. - 11:45 a.m. Simplifying and Simplifying Self-Supervised Visual Representation Pre-Training (Invited Talk)    In this talk, I am going to cover our recent works in the self-supervised learning space for visual representation pre-training. First is SimSiam, a non-contrastive, momentum-free framework that to our supervise, can successfully avoid trivial solutions and achieve very competitive performance to more complicated methods like MoCo. Second is Masked Autoencoder (MAE), which simply and directly reconstructs input signals by predicting natural image patches as a further simplification of self-supervised frameworks for computer vision. MAE adopts a BERT-like algorithm with crucial changes for images, and exhibits BERT-like scaling behaviors, among other intriguing properties different from contrastive learning. Xinlei Chen 🔗 Sat 11:45 a.m. - 12:00 p.m. Plex: Towards Reliability using Pretrained Large Model Extensions (Oral)  link »    A recent trend in artificial intelligence (AI) is the use of pretrained models for language and vision tasks, which has achieved extraordinary performance but also puzzling failures. Examining tasks that probe the model’s abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks such as uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and proper scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot learning). We devise 10 types of tasks over 36 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large model extensions (plex) for vision and language modalities, respectively. Plex greatly improves the state-of-the-art across tasks, and simplifies the traditional protocol as it does not require designing scores or tuning the model for each individual task. We demonstrate scaling effects over model sizes and pretraining dataset sizes up to 4 billion examples. We also demonstrate Plex’s capabilities on challenging tasks including zero-shot open set recognition, few-shot uncertainty, and uncertainty in conversational language understanding. Link » Dustin Tran · Andreas Kirsch · Balaji Lakshminarayanan · Huiyi Hu · Du Phan · D. Sculley · Jasper Snoek · Jeremiah Liu · JIE REN · Joost van Amersfoort · Kehang Han · Estefany Kelly Buchanan · Kevin Murphy · Mark Collier · Michael Dusenberry · Neil Band · Nithum Thain · Rodolphe Jenatton · Tim G. J Rudner · Yarin Gal · Zachary Nado · Zelda Mariet · Zi Wang · Zoubin Ghahramani 🔗 Sat 12:00 p.m. - 1:30 p.m. Poster Session 🔗 Sat 1:30 p.m. - 2:00 p.m. Unified and Efficient Multimodal Pretraining across Vision and Language (Invited Talk) Mohit Bansal 🔗 Sat 2:00 p.m. - 2:30 p.m. Benefits and Challenges of Pre-training for Environmental Monitoring (Invited Talk)    We require systems to monitor species in real time and in greater detail to quickly understand which conservation and sustainability efforts are most effective and take corrective action. Current ecological monitoring systems generate data far faster than researchers can analyze it, making scaling up impossible without automated data processing. Pre-training, particularly methods that require minimal human supervision, is clearly well-aligned with this problem setting where large amounts of unlabeled data are available. However, ecological data collected in the field presents a number of challenges that current pre-training methods are often not designed to tackle. These include strong spatiotemporal correlations and domain shifts, imperfect data quality, fine-grained categories, and long-tailed distributions. I will discuss gaps between the current pre-training paradigm and what is needed for usable, impactful computer vision based environmental monitoring systems, and outline several interesting future directions at the intersection of pre-training and environmental monitoring. Sara Beery 🔗 - Efficient Task Adaptation by Mixing Discovered Skills (Poster)  link » Unsupervised skill discovery is one of the approaches by which the agent learns potentially useful and distinct behaviors without any explicit reward. The agent is then expected to quickly solve downstream tasks by properly using a set of discovered skills rather than learning everything from scratch.However, it is non-trivial to optimally utilize the discovered skills for each task, which can be viewed as a fine-tuning method, and this has been less considered in the literature in spite of its importance. In this paper, we compare some fine-tuning methods showing how they inefficiently utilize the discovered skills and also propose new methods, which are sample-efficient and effective by interpreting the skills as a perspective of how an agent transforms the input state. Our code is available at https://anonymous.4open.science/r/unsupervisedRL-87F3 Link » Eunseok Yang · JUNGSUB RHIM · Taesup Kim 🔗 - Non-Markovian Policies for Unsupervised Reinforcement Learning in Multiple Environments (Poster)  link » In recent years, the area of Unsupervised Reinforcement Learning (URL) has gained particular relevance as a way to foster generalization of reinforcement learning agents. In this setting, the agent's policy is first pre-trained in an unknown environment via reward-free interactions, often through a pure exploration objective that drives the agent towards a uniform coverage of the state space. It has been shown that this pre-training leads to improved efficiency in downstream supervised tasks later given to the agent to solve. When dealing with the unsupervised pre-training in multiple environments one should also account for potential trade-offs in the exploration performance within the set of environments, which leads to the following question: Can we pre-train a policy that is simultaneously optimal in all the environments? In this work, we address this question by proposing a novel non-Markovian policy architecture to be pre-trained with the common maximum state entropy objective. This architecture showcases significant empirical advantages when compared to state-of-the-art Markovian agents for URL. Link » Pietro Maldini · Mirco Mutti · Riccardo De Santi · Marcello Restelli 🔗 - On the Importance of Hyperparameters and Data Augmentation for Self-Supervised Learning (Poster)  link » Self-Supervised Learning (SSL) has become a very active area of Deep Learning research where it is heavily used as a pre-training method for classification and other tasks. However, the rapid pace of advancements in this area comes at a price: training pipelines vary significantly across papers, which presents a potentially crucial confounding factor. Here, we show that, indeed, the choice of hyperparameters and data augmentation strategies can have a dramatic impact on performance. To shed light on these neglected factors and help maximize the power of SSL, we hyperparameterize these components and optimize them with Bayesian optimization, showing improvements across multiple datasets for the SimSiam SSL approach. Realizing the importance of data augmentations for SSL, we also introduce a new automated data augmentation algorithm, GroupAugment, which considers groups of augmentations and optimizes the sampling across groups. In contrast to algorithms designed for supervised learning, GroupAugment achieved consistently high linear evaluation accuracy across all datasets we considered. Overall, our results indicate the underestimated role of data augmentation for SSL. Link » Diane Wagner · Fabio Ferreira · Danny Stoll · Robin Tibor Schirrmeister · Samuel Gabriel Müller · Frank Hutter 🔗 - Learning Large-scale Universal User Representation with Sparse Mixture of Experts (Poster)  link » Learning user sequence behaviour embedding is very sophisticated and challenging due to the complicate feature interaction over time and high dimension of user features. Recent emerging foundation models \textit{e}.\textit{g}. BERT and its variants, encourage a large body of researchers to investigate in this field. However, unlike natural language processing(NLP) tasks, the parameters of user behaviour model comes mostly from user embedding layer which makes most existing works fail to train an universal user embedding at large scale. Furthermore, user representations are learned from multiple downstream tasks, the past research did not address the seesaw phenomenon.In this paper, we propose SUPERMOE, a generic framework for obtain high quality user representation from multiple tasks. Specifically, the user behaviour sequences are encoded by MoE transformer, thus we can improve the model capacity to billions of parameters even trillions. In order to deal with seesaw phenomenon when learning across multiple tasks, we design a new loss function with task indicators. We perform extensive offline experiments on public datasets and online experiments on private real world business scenarios. Our approach achieves best performance over state-of-art models, the results demonstrate the effectiveness of our user behaviour representation framework using MOE transformer. Link » Caigao Jiang · Siqiao Xue · James Zhang · Lingyue Liu · Zhibo Zhu · Hongyan Hao 🔗 - Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? (Poster)  link » Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark. To address this, we propose a novel self-supervised representation learning method Representation Learning via Invariant Causal Mechanisms v2 (ReLICv2) (based on ReLIC (Mitrovic et al., 2021)) which explicitly enforces invariance over spurious features such as background and object style. We conduct an extensive experimental evaluation across a varied set of datasets, learning settings and tasks. ReLICv2 achieves 77.1% top-1 accuracy on ImageNet using linear evaluation with a ResNet50 architecture and 80.6% with larger ResNet models, outperforming previous state-of-the-art self-supervised approaches by a wide margin. Moreover, we show a relative overall improvement exceeding +5% over the supervised baseline in the transfer setting and the ability to learn more robust representations than self-supervised and supervised models. Most notably, ReLICv2 is the first unsupervised representation learning method to consistently outperform a standard supervised baseline in a like-for-like comparison across a wide range of ResNet architectures. Finally, we show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers. Link » Nenad Tomasev · Ioana Bica · Brian McWilliams · Lars Buesing · Razvan Pascanu · Charles Blundell · Jovana Mitrovic 🔗 - How robust are pre-trained models to distribution shift? (Poster)  link » The vulnerability of machine learning models to spurious correlations has mostly been discussed in the context of supervised learning (SL). However, there is a lack of insight on how spurious correlations affect the performance of popular self-supervised learning (SSL) and auto-encoder based models (AE). In this work, we shed light on this by evaluating the performance of these models on both real world and synthetic distribution shift datasets. Following observations that the linear head itself can be susceptible to spurious correlations, we develop a new evaluation scheme with the linear head trained on out-of-distribution (OOD) data, to isolate the performance of the pre-trained models from a potential bias of the linear head used for evaluation. With this new methodology, we show that SSL models are consistently more robust to distribution shifts and thus better at OOD generalisation than AE and SL models. Link » Yuge Shi · Imant Daunhawer · Julia Vogt · Phil Torr · Amartya Sanyal 🔗 - Multimodal Masked Autoencoders Learn Transferable Representations (Poster)  link » Building scalable models to learn from diverse, multimodal data remains an open challenge. For vision-language data, the dominant approaches are based on contrastive learning objectives that train a separate encoder for each modality. While effective, contrastive learning approaches introduce sampling bias depending on the data augmentations used, which can degrade performance on downstream tasks. Moreover, these methods are limited to paired image-text data, and cannot leverage widely-available unpaired data. In this paper, we investigate whether a large multimodal model trained purely via masked token prediction, without using modality-specific encoders or contrastive learning, can learn transferable representations for downstream tasks. We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction. We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks. We demonstrate the scalability of M3AE with larger model size and training time, and its flexibility to train on both paired image-text data as well as unpaired data. Link » Xinyang Geng · Hao Liu · Lisa Lee · Dale Schuurmans · Sergey Levine · Pieter Abbeel 🔗 - Is Self-Supervised Contrastive Learning More Robust Than Supervised Learning? (Poster)  link » Self-supervised contrastive pre-training is a powerful tool to learn visual representation without human labels. Prior works have primarily focused on the recognition accuracy of contrastive learning but have overlooked other behavioral aspects. Besides accuracy, robustness plays a critical role in machine learning's reliability. We design and conduct a series of robustness tests to quantify the robustness difference between contrastive learning and supervised learning. These tests leverage data corruptions at multiple levels, ranging from pixel-level to patch-level and dataset-level, of either downstream or pre-training data. Our tests unveil intriguing robustness behaviors of contrastive and supervised learning. On one hand, under downstream corruptions, contrastive learning is surprisingly more robust than supervised learning. On the other hand, under pre-training corruptions, contrastive learning is vulnerable to patch shuffling and pixel intensity change, yet less sensitive to dataset-level distribution change. We analyze these results through the role of data augmentation and feature properties which have implications on improving supervised pre-training's downstream robustness. Link » Yuanyi Zhong · Haoran Tang · Junkun Chen · Jian Peng · Yu-Xiong Wang 🔗 - Leader-based Pre-training Framework for Cooperative Multi-Agent Reinforcement Learning (Poster)  link » A leader in the team enables efficient learning for other novices in the social learning setting for both humans and animals. This paper constructs the leader-based pre-training framework for Multi-Agent Reinforcement Learning and investigates whether the leader enables the learning of novices as well. We compare three different approaches to distilling a leader's experiences from the pre-training model: Linear Layer Dimension Reduction, Attentive Graph Pooling, and Attention-based Graph Neural Network. We successfully show that a leader-based pre-training framework can 1) enable agents to learn faster, cooperate more effectively, and escape local optimum, and 2) promote the generalizability of agents in more challenging and unseen environments. The key to effective distillation is to maintain and aggregate important information. Link » Wenqi Chen · Xin Zeng · Amber Li 🔗 - Pixel-level Correspondence for Self-Supervised Learning from Video (Poster)  link » While self-supervised learning has enabled effective representation learning in the absence of labels, for vision, video remains a relatively untapped source of supervision. To address this, we propose Pixel-level Correspondence (PiCo), a method for dense contrastive learning from video. By tracking points with optical flow, we obtain a correspondence map which can be used to match local features at different points in time. We validate PiCo on standard benchmarks, outperforming self-supervised baselines on multiple dense prediction tasks, without compromising performance on image classification. Link » Yash Sharma · Yi Zhu · Chris Russell · Thomas Brox 🔗 - Pre-Training on a Data Diet: Identifying Sufficient Examples for Early Training (Poster)  link » A striking observation about iterative magnitude pruning (IMP; Frankle et al. 2020) is that—after just a few hundred steps of dense training—the method can find a sparse sub-network that can be trained to the same accuracy as the dense network. However, the same does not hold at step 0, i.e., random initialization. In this work, we seek to understand how this early phase of pre-training leads to a good initialization for IMP through the lens of the data distribution. Empirically we observe that, holding the number of pre-training iterations constant, training on a small fraction of (randomly chosen) data suffices to obtain an equally good initialization for IMP. We additionally observe that by pre-training only on "easy" training data we can decrease the number of steps necessary to find a good initialization for IMP compared to training on the full dataset or a randomly chosen subset. Combined, these results provide new insight into the role played by data in the early phase of training. Link » Mansheej Paul · Brett Larsen · Surya Ganguli · Jonathan Frankle · Gintare Karolina Dziugaite 🔗 - Enhancing Multi-hop Connectivity for Graph Convolutional Networks (Poster)  link » Graph Convolutional Network and many of its variants are known to suffer from the dilemma between model depth and over-smoothing issues. Stacking layers of GCN usually lead to the exponential expansion of the receptive field (i.e., high-order neighbors). In order to incorporate the information from high-order neighbors to learn node representations without drastically increasing the number of graph convolution layers, we propose a simple and effective pre-processing technique to increase graph connectivity. Our approach selectively inserts connections between center nodes and informative high-order neighbors, with learnable weights to control the information flow through the connection. Experiments show that our approach improves the performance of GCN, and reduce the depth of GCNII without sacrificing its performance. Besides, our proposed homophily-based weight assignment can mitigate the effect of graph structural attacks. Link » Songtao Liu · Shixiong Jing · Tong Zhao · Zengfeng Huang · Dinghao Wu 🔗 - Investigating Why Contrastive Learning Benefits Robustness against Label Noise (Poster)  link » Self-supervised contrastive learning has recently been shown to be very effective in preventing deep networks from overfitting noisy labels. Despite its empirical success, the theoretical understanding of the effect of contrastive learning on boosting robustness is very limited. In this work, we rigorously prove that learned the representation matrix has certain desirable properties in terms its SVD that benefit robustness against label noise. We further show that the low-rank structure of the Jacobian of deep networks pre-trained with contrastive learning allows them to achieve a superior performance initially, when fine-tuned on noisy labels. Finally, we demonstrate that the initial robustness provided by contrastive learning enables robust training methods to achieve state-of-the-art performance under extreme noise levels. Link » Yihao Xue · Kyle Whitecross · Baharan Mirzasoleiman 🔗 - Pretraining a Neural Network before Knowing Its Architecture (Poster)  link » Training large neural networks is possible by training a smaller hypernetwork that predicts parameters for the large ones. A recently released Graph HyperNetwork (GHN) trained this way on one million of smaller ImageNet architectures is able to predict parameters for large unseen networks such as ResNet-50. While networks with predicted parameters lose performance on the source task, the predicted parameters have been found useful for fine-tuning on other tasks. We study if fine-tuning based on the same GHN is still useful on novel strong architectures that were published after the GHN had been trained. We found that for recent architectures such as ConvNeXt, GHN initialization becomes less useful than for ResNet-50. One potential reason is the increased distribution shift of novel architectures from those used to train the GHN. We also found that the predicted parameters lack the diversity necessary to successfully fine-tune parameters with gradient descent. We alleviate this limitation by applying simple post-processing techniques to predicted parameters before fine-tuning them on a target task and improve fine-tuning of ResNet-50 and ConvNeXt. Link » Boris Knyazev 🔗 - Improved Logical Reasoning of Language Models via Differentiable Symbolic Programming (Poster)  link » Pre-trained large language models (LMs) struggle to perform logical reasoning reliably despite advances in scale and compositionality. In this work, we tackle this challenge through the lens of symbolic programming. We propose DSR-LM, a Differentiable Symbolic Reasoning framework where pre-trained LMs govern the perception of factual knowledge, and a symbolic module equipped with provenance generates top-k proofs by deductive reasoning. In contrast to works that rely on hand-crafted logic rules, our differentiable symbolic reasoning architecture efficiently learns weighted rules to further improve LMs. DSR-LM is scalable, interpretable, and allows easy integration of prior knowledge, thereby supporting extensive symbolic programming to robustly derive a logical conclusion. Our experiments show that DSR-LM leads to improved logical reasoning of pre-trained LMs and outperforms a spectrum of competitive baselines even under systematic distribution shifts on sequence lengths. Link » Hanlin Zhang · Ziyang Li · Jiani Huang · Mayur Naik · Eric Xing 🔗 - Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Prior (Poster)  link » Deep learning is increasingly moving towards a transfer learning paradigm whereby large foundation models'' are fine-tuned on downstream tasks, starting from an initialization learned on the source task. But an initialization contains relatively little information about the source task. %, and would not affect the final solution at all if we do a good job of optimization. Instead, we show that we can learn highly informative posteriors from the source task, which serves as the basis for priors that modify the whole loss surface on the downstream task. This simple modular approach enables significant performance gains and more data-efficient learning on various downstream classification and segmentation tasks, serving as a drop-in replacement for standard pre-training strategies. Link » Ravid Shwartz-Ziv · Micah Goldblum · Hossein Souri · Sanyam Kapoor · Chen Zhu · Yann LeCun · Andrew Wilson 🔗 - How well do contrastively trained models transfer? (Poster)  link » There are two prevailing methods for pre-training on large datasets to learn transferable representations: 1) supervised pre-training on large but weakly-labeled datasets; 2) contrastively training on image only and image, text pairs. While supervised pre-training learns good representations that can be transferred to a wide range of tasks, contrastively models such as CLIP have demonstrated unprecedented zero-shot transfer. In this work, we compare the transferability of the two aforementioned methods to multiple downstream tasks. The pre-training distributions we consider include YFCC, Conceptual Captions, and ImageNet-21K while pre-training objectives range from supervised to SimCLR, CLIP, and SLIP. We observe that different pre-training methods with the same training source transfer similarly given their ImageNet accuracy. Link » M. Moein Shariatnia · Rahim Entezari · Mitchell Wortsman · Olga Saukh · Ludwig Schmidt 🔗 - Vote for Nearest Neighbors Meta-Pruning of Self-Supervised Networks (Poster)  link » Pruning plays an essential role in deploying deep neural nets (DNNs) to the hardware of limited memory or computation. However, current high-quality iterative pruning can create a terrible carbon footprint when compressing a large DNN for a wide variety of devices and tasks. Can we reuse the pruning results on previous tasks to accelerate the pruning for a new task? Can we find a better initialization for a new task? We study this nearest neighbors meta-pruning'' problem by first investigating different choices of pre-trained models for pruning under limited iterations. Our empirical study reveals several advantages of the self-supervision pre-trained model when pruned for multiple tasks. We further study the overlap of pruned models for similar tasks and how the overlap changes for different layers. Inspired by these discoveries, we develop a simple but strong baselineMeta-Vote Pruning (MVP)'' that significantly reduces the pruning iterations for a new task by initializing a sub-network from the pruned models of tasks similar to it. In experiments, we demonstrate the advantages of MVP through extensive empirical studies and comparisons with popular pruning methods. Link » Haiyan Zhao · Tianyi Zhou · Guodong Long · Jing Jiang · Chengqi Zhang 🔗 - On Combining Global and Localized Self-Supervised Models of Speech (Poster)  link » Self supervised learning involves learning general-purpose representations that can be useful in a variety of downstream tasks.In this work, we study the application of speech-embeddings derived from popular self-supervised learning frameworks such as wav2vec-2.0 and HuBERT over four different speech-classification tasks such as sentiment classification, command detection, emotion classification and depression detection. We distinguish between and discuss self-supervised training tasks that induce localized and global features of speech based on their temporal granularity: noting that self-supervised representation learning frameworks based on the masked language-modeling objective -- such as wav2vec-2.0 and HuBERT -- induce localized embeddings, we define a self-supervised learning framework based on SimSiam for learning global features of speech. Through our evaluations, we find that these global representations are better suited for tasks such as depression detection and emotion classification while the localized embeddings of speech can be very useful in tasks such as speech-command detection; we also find that our proposed model outperforms TRILL -- a popular model for learning global representations. Finally, we also propose and confirm empirically that combining the global and localized representations of speech helps obtain better performance across a range of downstream tasks than each of the individual embedding methods. Link » Sri Harsha Dumpala · Chandramouli Shama Sastry · Rudolf Uher · Sageev Oore 🔗 - Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning (Poster)  link » We present modality gap, an intriguing geometric phenomenon of the representation space of multi-modal models. Specifically, we show that different data modalities (e.g. images and text) are embedded at arm's length in their shared representation in multi-modal models such as CLIP. Our systematic analysis demonstrates that this gap is caused by a combination of model initialization and contrastive learning optimization. In model initialization, we show empirically and theoretically that the representation of a common deep neural network is restricted to a narrow cone. As a consequence, in a multi-modal model with two encoders, the representations of the two modalities are clearly apart when the model is initialized. During optimization, contrastive learning keeps the different modalities separate by a certain distance, which is influenced by the temperature parameter in the loss function. Our experiments further demonstrate that varying the modality gap distance has a significant impact in improving the model's downstream zero-shot classification performance and fairness. Our code and data are available at https://modalitygap.readthedocs.io/ Link » Weixin Liang · Yuhui Zhang · Yongchan Kwon · Serena Yeung · James Zou 🔗 - Robustness to Adversarial Gradients: A Glimpse Into the Loss Landscape of Contrastive Pre-training (Poster)  link » An in-depth understanding of deep neural network generalization can allow machine learning practitioners to design systems more robust to class balance shift, adversarial attacks, and data drift. However, the reasons for better generalization are not fully understood. Recent works provide empirical arguments suggesting flat minima generalize better. While recently proposed contrastive pre-training methods have also been shown to improve generalization, there is an incomplete understanding of the loss landscape of these models and why they generalize well. In this work, we analyze the loss landscape of contrastive trained models on the CIFAR10 dataset by looking at three sharpness measures: (1) the approximate eigenspectrum of the Hessian, (2) (Cε, A)-sharpness, and (3) robustness to adversarial gradients (RAG), a new efficient measure of sharpness. Our findings suggest models fine-tuned after contrastive training favor flatter solutions relative to baseline classifiers trained with a supervised objective. In addition, our proposed metric yields findings consistent with existing works, demonstrating impacts of learning rate and batch size on minima sharpness. Link » Philip Fradkin · Lazar Atanackovic · Michael Zhang 🔗 - Self-Destructing Models: Increasing the Costs of Harmful Dual Uses in Foundation Models (Poster)  link » A growing ecosystem of large, open-source foundation models has reduced the labeled data and technical expertise necessary to apply machine learning to many new problems. Yet foundation models pose a clear dual-use risk, indiscriminately reducing the costs of building both harmful and benign machine learning systems. To mitigate this risk, we propose the task blocking paradigm, in which foundation models are trained with an additional mechanism to impede adaptation to harmful tasks while retaining good performance on desired tasks. We call the resulting models self-destructing models, inspired by mechanisms that prevent adversaries from using tools for harmful purposes. We present an algorithm for training self-destructing models leveraging techniques from meta-learning and adversarial learning, showing that it can largely prevent a BERT-based model from learning to perform gender identification without harming the model's ability to perform profession classification. We conclude with a discussion of future directions. Link » Eric Mitchell · Peter Henderson · Christopher Manning · Dan Jurafsky · Chelsea Finn 🔗 - Flaky Performances when Pre-Training on Relational Databases with a Plan for Future Characterization Efforts (Poster)  link » We explore the downstream task performances for graph neural network (GNN) self-supervised learning (SSL) methods trained on subgraphs extracted from relational databases (RDBs). Intuitively, this joint use of SSL and GNNs allows us to leverage more of the available data, which could translate to better results. However, while we observe positive transfer in some cases, others showed systematic performance degradation, including some spectacular ones. We hypothesize a mechanism that could explain this behaviour and draft the plan for future work testing it by characterize how much relevant information different strategies can (theoretically and/or empirically) extract from (synthetic and/or real) RDBs. Link » Shengchao Liu · David Vazquez · Jian Tang · Pierre-André Noël 🔗 - Training strategies with unlabeled and few labeled examples under 1-pixel attack by combining supervised and self-supervised learning (Poster)  link » Self-supervised learning pre-training exhibited excellent performance on feature learning by using only unlabeled examples. Still, it is not clear how different self-supervised tasks perform under distinct image domains and there are still training issues to be tackled under scenarios of limited labeled data. We investigate two self-supervised tasks: rotation and Barlow Twins, on three distinct image domains, exploring a combination of supervised and self-supervised learning. Our motivation is to work on scenarios where the proportion of labeled data with respect to unlabeled data is small, as well as investigate the model's robustness to 1-pixel attacks. The models that combine supervised with self-supervised tasks can take advantage of the unlabeled data to improve the learned representation in terms of the linear discrimination, as well as allowing learning even under attack. Link » Gabriel Biscaro Cavallari · Moacir Ponti 🔗 - Plex: Towards Reliability using Pretrained Large Model Extensions (Poster)  link » A recent trend in artificial intelligence (AI) is the use of pretrained models for language and vision tasks, which has achieved extraordinary performance but also puzzling failures. Examining tasks that probe the model’s abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks such as uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and proper scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot learning). We devise 10 types of tasks over 36 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large model extensions (plex) for vision and language modalities, respectively. Plex greatly improves the state-of-the-art across tasks, and simplifies the traditional protocol as it does not require designing scores or tuning the model for each individual task. We demonstrate scaling effects over model sizes and pretraining dataset sizes up to 4 billion examples. We also demonstrate Plex’s capabilities on challenging tasks including zero-shot open set recognition, few-shot uncertainty, and uncertainty in conversational language understanding. Link » Dustin Tran · Andreas Kirsch · Balaji Lakshminarayanan · Huiyi Hu · Du Phan · D. Sculley · Jasper Snoek · Jeremiah Liu · JIE REN · Joost van Amersfoort · Kehang Han · Estefany Kelly Buchanan · Kevin Murphy · Mark Collier · Michael Dusenberry · Neil Band · Nithum Thain · Rodolphe Jenatton · Tim G. J Rudner · Yarin Gal · Zachary Nado · Zelda Mariet · Zi Wang · Zoubin Ghahramani 🔗 - Contrastive Learning Can Find An Optimal Basis For Approximately Invariant Functions (Poster)  link » Contrastive learning is a powerful framework for learning self-supervised representations that generalize well to downstream supervised tasks. We show that multiple existing contrastive learning methods can be reinterpeted as learning a positive-definite kernel that approximates a particular contrastive kernel defined by the positive pairs. The principal components of the data under this kernel exactly correspond to the eigenfunctions of a positive-pair Markov chain, and these eigenfunctions can be used to build a representation thatprovably minimizes the worst-case approximation error of linear predictors under the assumption that positive pairs have similar labels. We give generalization bounds for downstream linear prediction using this optimal representation, and show how to approximate this representation using kernel PCA. We also explore kernel-based representations on a noisy MNIST task for which the positive pair distribution has a closed form, and compare the properties of the true eigenfunctions with their learned approximations. Link » Daniel D. Johnson · Daniel D. Johnson · Ayoub El Hanchi · Ayoub El Hanchi · Chris Maddison · Chris Maddison 🔗 - Memorization in NLP Fine-tuning Methods (Poster)  link » Large language models are shown to present privacy risks through memorization of training data, and several recent works have studied such risks for the pre-training phase. Little attention, however, has been given to the fine-tuning phase and it is not well understood how different fine-tuning methods (such as fine-tuning the full model, the model head, and adapter) compare in terms of memorization risk. This presents increasing concern as the pre-train and fine-tune'' paradigm proliferates. In this paper, we empirically study memorization of fine-tuning methods using membership inference and extraction attacks, and show that their susceptibility to attacks is very different. We observe that fine-tuning the head of the model has the highest susceptibility to attacks, whereas fine-tuning smaller adapters appears to be less vulnerable to known extraction attacks. Link » FatemehSadat Mireshghallah · FatemehSadat Mireshghallah · Archit Uniyal · Archit Uniyal · Tianhao Wang · Tianhao Wang · David Evans · David Evans · Taylor Berg-Kirkpatrick · Taylor Berg-Kirkpatrick 🔗 - Feed-Forward Source-Free Latent Domain Adaptation via Cross-Attention (Poster)  link » We study the highly practical but comparatively under-studied problem of latent-domain adaptation, where a source model should be adapted to a target dataset that contains a mixture of unlabelled domain-relevant and domain-irrelevant examples. Motivated by the requirements for data privacy and the need for embedded and resource-constrained devices of all kinds to adapt to local data distributions, we further focus on the setting of feed-forward source-free domain adaptation, where adaptation should not require access to the source dataset, and also be back propagation-free. Our solution is to meta-learn a network capable of embedding the mixed-relevance target dataset and dynamically adapting inference for target examples using cross-attention. The resulting framework leads to consistent strong improvements. Link » Ondrej Bohdal · Da Li · Xu Hu · Timothy Hospedales 🔗 - On the Subspace Structure of Gradient-Based Meta-Learning (Poster)  link » In this work we provide an analysis of the distribution of the post-adaptation parameters of Gradient-Based Meta-Learning (GBML) methods. Previous work has noticed how, for the case of image-classification, this adaption only takes place on the last layers of the network. We propose the more general notion that parameters are updated over a low-dimensional subspace of the same dimensionality as the task-space and show that this holds for regression as well. Furthermore, the induced subspace structure provides a method to estimate the intrinsic dimension of the space of tasks of common few-shot learning datasets. Link » Gustaf Tegnér · Alfredo Reichlin · Hang Yin · Mårten Björkman · Danica Kragic 🔗 - Reinforcement Learning Assisted Layer-wise Fine-Tuning for Transfer Learning (Poster)  link » Data scarcity is one of the major challenges in many real-world applications. To handle low-data regimes, practitioners often take an existing pre-trained network and fine-tune it on a data-deficient target task. In this setup, a network is pre-trained on a source dataset and fine-tuned on a different, potentially smaller, target dataset. We address two critical challenges with transfer learning via fine-tuning: (1) The required amount of fine-tuning greatly depends on the distribution shift from source to target dataset. (2) Layer-wise adjustments allow for the model to adapt to this distribution shift while also preserving the pre-trained network’s feature extractor. To overcome the challenges, we propose RL-Tune, a layer-wise fine-tuning framework for transfer learning which leverages reinforcement learning to adjust learning rates as a function of the target data shift. In our RL framework, the state is a collection of the intermediate feature activations generated with training samples. To accommodate different abstraction levels of layers, the agent generates layer-wise learning rates as actions for fine-tuning based on the current state and obtains the sample accuracy as a reward. RL-Tune outperforms other state-of-the-art approaches on standard transfer learning benchmarks by a large margin, e.g., 6.2% mean accuracy improvement on CUBS-200-2011 with 15% data. Link » Tanvir Mahmud · Natalia Frumkin · Diana Marculescu 🔗 - Improved Generalization Bounds for Transfer Learning via Neural Collapse (Poster)  link » Using representations learned by large, pretrained models, also called foundation models, in new tasks with fewer data has been successful in a wide range of machine learning problems. Recently, Galanti et al. (2022) introduced a theoretical framework for studying this transfer learning setting for classification. Their analysis is based on the recently observed phenomenon that the features learned by overparameterized deep classification networks show an interesting clustering property, called neural collapse (Papyan et al., 2020). A cornerstone of their analysis demonstrates that neural collapse generalizes from the source classes to new target classes. However, this analysis is limited as it relies on several unrealistic assumptions. In this work, we provide an improved theoretical analysis significantly relaxing these modeling assumptions. Link » Tomer Galanti · András György · Marcus Hutter 🔗 - Predicting Human Similarity Judgments Using Large Language Models (Poster)  link » Similarity judgments provide a well-established method for accessing mental representations, with applications in psychology, neuroscience and machine learning. However, collecting similarity judgments can be prohibitively expensive for naturalistic datasets as the number of comparisons grows quadratically in the number of stimuli. We leverage recent advances in language models and online recruitment, proposing an efficient domain-general procedure for predicting human similarity judgments based on text descriptions. Crucially, the number of descriptions required grows only linearly with the number of stimuli, drastically reducing the amount of data required. We test this procedure on six datasets of naturalistic images and show that our models outperform previous approaches based on visual information. Link » Raja Marjieh · Ilia Sucholutsky · Theodore R Sumers · Nori Jacoby · Thomas Griffiths · Thomas Griffiths 🔗 - Federated Learning from Pre-Trained Models: A Contrastive Learning Approach (Poster)  link » Excessive computation and communication demands pose challenges to current FL frameworks, especially when training large-scale models. To prevent these issues from hindering the deployment of FL systems, we propose a lightweight framework where clients jointly learn to fuse the representations generated by multiple fixed pre-trained models rather than training a large-scale model from scratch. To capture more client-specific and class-relevant information from the pre-trained models and jointly improve each client's ability to exploit those off-the-shelf models, we design a Federated Prototype-wise Contrastive Learning (FedPCL) approach which shares knowledge across clients through their class prototypes and builds client-specific representations in a prototype-wise contrastive manner. We perform a thorough evaluation of the proposed FedPCL in the lightweight framework, measuring its ability to fuse various pre-trained models on popular FL datasets. Link » Yue Tan · Yue Tan · Guodong Long · Guodong Long · Jie Ma · Jie Ma · LU LIU · LU LIU · Tianyi Zhou · Tianyi Zhou · Jing Jiang · Jing Jiang 🔗 - Similarity of Pre-trained and Fine-tuned Representations (Poster)  link » Most often, in transfer learning, only the last part of the networks - the so-called head - is fine-tuned. Representation similarity analysis shows that the most significant change still occurs in the head even if all weights are updatable. However, recent results from few-shot learning have shown that, especially in the case of cross-domain adaption, representation change in the early layers, which are mostly convolutional, is beneficial. In our paper, we find out if that also holds true for transfer learning. In addition, we analyze the change of representation in transfer learning, both during pre-training and fine-tuning and find out that pre-trained structure is unlearned if not usable. Link » Thomas Goerttler · Thomas Goerttler · Klaus Obermayer 🔗 - Hyper-Representation for Pre-Training and Transfer Learning (Poster)  link » Learning representations of neural network weights given a model zoo is an emerging and challenging area with many potential applications from model inspection, to neural architecture search or knowledge distillation. Recently, an autoencoder trained on a model zoo was able to learn a hyper-representation, which captures intrinsic and extrinsic properties of the models in the zoo. In this work, we extend hyper-representations for generative use to sample new model weights as pre-training. We propose layer-wise loss normalization which we demonstrate is key to generate high-performing models and a sampling method based on the empirical density of hyper-representations. The models generated using our methods are diverse, performant and capable to outperform conventional baselines for transfer learning. Our results indicate the potential of knowledge aggregation from model zoos to new models via hyper-representations thereby paving the avenue for novel research directions. Link » Konstantin Schürholt · Konstantin Schürholt · Boris Knyazev · Boris Knyazev · Xavier Giro-i-Nieto · Damian Borth · Damian Borth 🔗 - What Do We Maximize In Self-Supervised Learning? (Poster)  link » This paper analyses self-supervised learning (SSL) methods, VICReg in particular, to provide for the first time an information-theoretical understanding of its construction. As a first step, we demonstrate how information-theoretic quantities can be obtained for a deterministic network, offering a possible alternative to prior works that rely on stochastic models. This enables us to demonstrate how VICReg can be (re-)discovered from first principles and its assumptions about data distribution. Furthermore, we demonstrated the validity of our assumptions empirically, confirming our novel understanding of VICReg. Finally, we believe that the derivation and insights we obtain can be generalized to many other SSL methods, opening new avenues for theoretical and practical understanding of SSL and transfer learning. Link » Ravid Shwartz-Ziv · Ravid Shwartz-Ziv · Randall Balestriero · Yann LeCun · Yann LeCun 🔗 - ECLIP: Efficient Contrastive Language-Image Pretraining via Ensemble Confidence Learning and Masked Language Modeling (Poster)  link » While large scale pre-training has achieved great achievements in bridging the gap between vision and language, it still faces three challenges. First, the cost for pre-training is expensive. Second, there is no efficient way to handle the data noise which degrades model performance. Third, previous methods only leverage limited image-text paired data, while ignoring richer single-modal data, which may result in poor generalization to single-modal downstream tasks. In this work, we propose \textbf{E}fficient \textbf{C}ontrastive \textbf{L}anguage-\textbf{I}mage \textbf{P}retraining (ECLIP) via Ensemble Confidence Learning and Masked Language Modeling. Specifically, We adaptively filter out noisy samples in the training process by means of Ensemble Confidence Learning strategy, and add a Masked Language Modeling objective to utilize extra non-paired text data. ECLIP achieves the state-of-the-art performance on Chinese cross-modal retrieval tasks with only 1/10 training resources compared with CLIP and WenLan, while showing excellent generalization to single-modal tasks including text retrieval and text classification. Link » Jue Wang · Jue Wang · Haofan Wang · Haofan Wang · Weijia Wu · Weijia Wu · Jincan Deng · Jincan Deng · Yu Lu · Xiaofeng Guo · Xiaofeng Guo · Debing Zhang · Debing Zhang 🔗 - Boosting Monolingual Sentence Representation with Large-scale Parallel Translation Datasets (Poster)  link » Although contrastive learning greatly improves sentence representation, its performance is still limited by the size of existing monolingual datasets. So can semantically highly correlated massively parallel translation pairs be used for pre-training of monolingual models? This paper proposes an exploration of this. We leverage parallel translated sentence pairs to learn single-sentence sentence embeddings and demonstrate superior performance in balancing alignment and consistency. We achieve new state-of-the-art performance on the mean score of Standard Semantic Text Similarity (STS), outperforming both SimCSE and Sentence-T5. Link » Jue Wang · Jue Wang · Haofan Wang · Haofan Wang · Xing Wu · Xing Wu · Chaochen Gao · Chaochen Gao · Debing Zhang 🔗 - Knowledge Distillation for Efficient Sequences of Training Runs (Poster)  link » In many practical scenarios - like hyperparameter search or continual retraining with new data - related training runs are performed many times in sequence. Current practice is to train each of these models independently from scratch. We study the problem of exploiting the computation invested in previous runs to reduce the cost of future runs using knowledge distillation (KD). We find that augmenting future runs with KD from previous runs dramatically reduces the time necessary to train these models, even taking into account the overhead of KD. We improve on these results with two strategies that reduce the overhead of KD by 80-90% with minimal effect on accuracy and vast pareto-improvements in overall cost. We conclude that KD is a promising avenue for reducing the cost of the expensive preparatory work that precedes training final models in practice. Link » Xingyu Liu · Xingyu Liu · Alexander Leonardi · Alexander Leonardi · Lu Yu · Lu Yu · Christopher Gilmer-Hill · Christopher Gilmer-Hill · Matthew Leavitt · Matthew Leavitt · Jonathan Frankle · Jonathan Frankle 🔗 - Energy-Inspired Self-Supervised Pretraining for Vision Models (Poster)  link » Motivated by the fact that forward and backward passes of a deep network naturally form symmetric mappings between input and output representations, we introduce a simple yet effective self-supervised vision model pretraining framework inspired by energy-based models (EBMs). In the proposed framework, we model energy estimation and data restoration as the forward and backward passes of a single network without any auxiliary components, e.g., an extra decoder. For the forward pass, we fit a network to an energy function that assigns low energy scores to samples that belong to an unlabeled dataset, and high energy otherwise. For the backward pass, we restore data from corrupted versions iteratively using gradient-based optimization along the direction of energy minimization in as few as one step. Our framework accepts a wide range of pretext tasks with different data corruption methods, and permits models to be pretrained from masked image modeling and image restoration. We support our findings with extensive experiments, and show the proposed method delivers comparable and even better performance with remarkably fewer epochs of training compared to the state-of-the-art self-supervised vision model pretraining methods. Our findings shed light on further exploring self-supervised vision model pretraining pretext tasks beyond masked image modeling. Link » Ze Wang · Ze Wang · Jiang Wang · Jiang Wang · Zicheng Liu · Zicheng Liu · Qiang Qiu · Qiang Qiu 🔗 - On the Connection between Pre-training Data Diversity and Robustness (Poster)  link » Our work studies the implications of transfer learning on model behavior beyond accuracy: how does the pre-training distribution affect the downstream robustness of a fine-tuned model? We analyze model robustness using the framework proposed by Taori et al. (2020), which demonstrates that in-distribution and out-of-distribution performances are highly correlated along a robustness linear trend. We explore various interventions that significantly alter the pre-training distribution, including label space, label semantics, and the pre-training dataset itself. In most cases, changes during pre-training have minimal impact on the original linear trend produced by models pre-trained on the full ImageNet dataset. We demonstrate these findings on pre-training distributions constructed from ImageNet and iNaturalist, and fine-tuning data from the iWildCams-WILDS benchmark. Link » Vivek Ramanujan · Vivek Ramanujan · Thao Nguyen · Thao Nguyen · Ludwig Schmidt · Ali Farhadi · Ali Farhadi 🔗 - Pre-Trained Image Encoder for Generalizable Visual Reinforcement Learning (Poster)  link » Learning generalizable policies that can adapt to unseen environments remains challenging in visual Reinforcement Learning (RL). Existing approaches try to acquire a robust representation via diversifying the appearances of in-domain observations for better generalization. Limited by the specific observations of the environment, these methods ignore the possibility of exploring diverse real-world image datasets. In this paper, we investigate how a visual RL agent would benefit from the off-the-shelf visual representations. Surprisingly, we find that the early layers in an ImageNet pre-trained ResNet model could provide rather generalizable representations for visual RL. Hence, we propose Pre-trained Image Encoder for Generalizable visual reinforcement learning (PIE-G), a simple yet effective framework that can generalize to the unseen visual scenarios in a zero-shot manner. Extensive experiments are conducted on DMControl Generalization Benchmark, DMControl Manipulation Tasks, and Drawer World to verify the effectiveness of PIE-G. Empirical evidence suggests PIE-G can significantly outperforms previous state-of-the-art methods in terms of generalization performance. In particular, PIE-G boasts a 55% generalization performance gain on average in the challenging video background setting. Link » Zhecheng Yuan · Zhecheng Yuan · Zhengrong Xue · Zhengrong Xue · Bo Yuan · Bo Yuan · Xueqian Wang · Xueqian Wang · Yi Wu · Yi Wu · Yang Gao · Yang Gao · Huazhe Xu · Huazhe Xu 🔗 - Self-Supervised Time Series Representation Learning with Temporal-Instance Similarity Distillation (Poster)  link » We propose a self-supervised method for pre-training universal time series representations in which we learn contrastive representations using similarity distillation along the temporal and instance dimensions. We analyze the effectiveness of both dimensions, and evaluate our pre-trained representations on three downstream tasks: time series classification, anomaly detection, and forecasting. Link » Ainaz Hajimoradlou · Ainaz Hajimoradlou · Leila Pishdad · Leila Pishdad · Frederick Tung · Frederick Tung · Maryna Karpusha · Maryna Karpusha 🔗 - Protein Representation Learning by Geometric Structure Pretraining (Poster)  link » Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein function or structure. Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences and then finetune the models with some labeled data in downstream tasks. Despite the effectiveness of sequence-based approaches, the power of pretraining on known protein structures, which are available in smaller numbers only, has not been explored for protein property prediction, though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We pretrain the protein graph encoder by leveraging multiview contrastive learning and different self-prediction tasks. Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods, while using much less data. All codes and models will be published upon acceptance. Link » Zuobai Zhang · Zuobai Zhang · Minghao Xu · Minghao Xu · Arian Jamasb · Arian Jamasb · Vijil Chenthamarakshan · Vijil Chenthamarakshan · Aurelie Lozano · Payel Das · Payel Das · Jian Tang · Jian Tang 🔗 - Manifold Characteristics That Predict Downstream Task Performance (Poster)  link » Pretraining methods are typically compared by evaluating the accuracy of linear classifiers, transfer learning performance, or visually inspecting the representation manifold's (RM) lower-dimensional projections. We show that the differences between methods can be understood more clearly by investigating the RM directly, which allows for a more detailed comparison. To this end, we propose a framework and new metric to measure and compare different RMs. We also investigate and report on the RM characteristics for various pretraining methods. These characteristics are measured by applying sequentially larger local alterations to the input data, using white noise injections and Projected Gradient Descent (PGD) adversarial attacks, and then tracking each datapoint. We calculate the total distance moved for each datapoint and the relative change in distance between successive alterations. We show that self-supervised methods learn an RM where alterations lead to large but constant size changes, indicating a smoother RM than fully supervised methods. We then combine these measurements into one metric, the Representation Manifold Quality Metric (RMQM), where larger values indicate larger and less variable step sizes, and show that RMQM correlates positively with performance on downstream tasks. Link » Ruan van der Merwe · Ruan van der Merwe · Gregory Newman · Gregory Newman · Etienne Barnard · Etienne Barnard 🔗 - PARS-Push: Personalized, Asynchronous and Robust Decentralized Optimization (Poster)  link » We study the decentralized multi-step Model-Agnostic Meta-Learning (MAML) framework where a group of $n$ agents seeks to find a common point that enables few-shot'' learning (personalization) via local stochastic gradient steps on their local functions. We formulate the personalized optimization problem under the MAML framework and propose PARS-Push, a decentralized asynchronous algorithm robust to message failures, communication delays, and directed message sharing. We characterize the convergence rate of PARS-Push for smooth and strongly convex and smooth and non-convex functions under arbitrary multi-step personalization. Moreover, we provide numerical experiments showing its performance under heterogeneous data setups. Link » Mohammad Taha Toghani · Mohammad Taha Toghani · Soomin Lee · Soomin Lee · Cesar Uribe 🔗 - Evaluating Self-Supervised Learned Molecular Graphs (Poster)  link » Because of data scarcity in real-world scenarios, obtaining pre-trained representations via self-supervised learning (SSL) has attracted increasing interest. Although various methods have been proposed, it is still under-explored what knowledge the networks learn from the pre-training tasks and how it relates to downstream properties. In this work, with an emphasis on chemical molecular graphs, we fill in this gap by devising a range of node-level, pair-level, and graph-level probe tasks to analyse the representations from pre-trained graph neural networks (GNNs). We empirically show that: 1. Pre-trained models have better downstream performance compared to randomly-initialised models due to their improved the capability of capturing global topology and recognising substructures. 2. However, randomly initialised models outperform pre-trained models in terms of retaining local topology. Such information gradually disappears from the early layers to the last layers for pre-trained models. Link » Hanchen Wang · Hanchen Wang · Shengchao Liu · Shengchao Liu · Jean Kaddour · Jean Kaddour · Qi Liu · Qi Liu · Jian Tang · Jian Tang · Matt Kusner · Matt Kusner · Joan Lasenby · Joan Lasenby 🔗 - PSP-HDRI+: A Synthetic Dataset Generator for Pre-Training of Human-Centric Computer Vision Models (Poster)  link » We introduce a new synthetic data generator PSP-HDRI$+$ that proves to be a superior pre-training alternative to ImageNet and other large-scale synthetic data counterparts. We demonstrate that pre-training with our synthetic data will yield a more general model that performs better than alternatives even when tested on out-of-distribution (OOD) sets. Furthermore, we show how our synthetic data generator has the potential for improvement by employing basic ablation strategies. Link » Salehe Erfanian Ebadi · Salehe Erfanian Ebadi · Saurav Dhakad · Saurav Dhakad · Sanjay Vishwakarma · Sanjay Vishwakarma · Chunpu Wang · Chunpu Wang · You-Cyuan Jhang · Maciek Chociej · Maciek Chociej · Adam Crespi · Adam Crespi · Alex Thaman · Alex Thaman · Sujoy Ganguly · Sujoy Ganguly 🔗 - Generative Self-training Improves Pre-training for Visual Dialog (Poster)  link » Visual dialog (VisDial) is a task of answering a series of questions grounded in an image, using the dialog history as context. Prior work has trained the dialog models solely on VisDial data via supervised learning or leveraged pre-training on related vision-and-language datasets. This paper presents a semi-supervised learning approach for VisDial, called \textit{Generative Self-Training} (GST), to enhance the pre-training. Specifically, GST generates synthetic dialog data for unlabeled images via multimodal conditional text generation and trains the dialog model on the synthetic and the original VisDial data. Moreover, we also propose perplexity-based data selection and multimodal consistency regularization for robust training of the synthetic data. Evaluation on VisDial v1.0 dataset shows that GST improves the pre-training and achieves new state-of-the-art results. Link » Gi-Cheon Kang · Gi-Cheon Kang · Sungdong Kim · Sungdong Kim · Jin-Hwa Kim · Jin-Hwa Kim · Donghyun Kwak · Donghyun Kwak · Byoung-Tak Zhang · Byoung-Tak Zhang 🔗 - The Trade-off between Label Efficiency and Universality of Representations from Contrastive Learning (Poster)  link » The pre-train representation learning paradigm is a recent popular approach to address distribution shift and dataset limitation. This approach first pre-trains a representation function using large unlabeled datasets by self-supervised learning (e.g., contrastive learning), and then learns a classifier on the representation using small labeled datasets for downstream target tasks. The representation should have two key properties: label efficiency (i.e., learning an accurate classifier with a small amount of labeled data) and universality (i.e., useful for a wide range of downstream tasks). In this paper, we focus on contrastive learning and systematically study a trade-off between label efficiency and universality both empirically and theoretically. We empirically show that the trade-off exists in different models and datasets. Theoretically, we propose a data model with hidden representation and provide analysis in a simplified setting with linear models. The analysis shows that compared with pre-training on the target data directly, pre-training on diverse tasks can lead to a larger sample complexity for learning the classifier and thus worse prediction performance. Link » Zhenmei Shi · Zhenmei Shi · Jiefeng Chen · Jiefeng Chen · Kunyang Li · Kunyang Li · Jayaram Raghuram · Jayaram Raghuram · Xi Wu · Xi Wu · Yingyiu Liang · Yingyiu Liang · Somesh Jha · Somesh Jha 🔗 - Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision (Poster)  link » Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm to learn visual models from language supervision. While researchers continue to push the frontier of CLIP, reproducing these works remains challenging. This is because researchers do not choose consistent training recipes and even use different data, hampering the fair comparison between different methods. In this work, we propose CLIP-benchmark, a first attempt to evaluate, analyze, and benchmark CLIP and its variants. We conduct a comprehensive analysis of three key factors: data, supervision, and model architecture. We find considerable intuitive or counter-intuitive insights: (1). Data quality has a significant impact on performance. (2). Certain supervision has different effects for Convolutional Networks (ConvNets) and Vision Transformers (ViT). Applying more proper supervision can effectively improve the performance of CLIP. (3). Curtailing the text encoder reduces the training cost but not much affect the final performance. Moreover, we further combine DeCLIP with FILIP, bringing us the strongest variant DeFILIP. The CLIPbenchmark is released at: https://github.com/Sense-GVT/DeCLIP for future CLIP research. Link » Yufeng Cui · Yufeng Cui · Lichen Zhao · Lichen Zhao · Feng Liang · Feng Liang · Yangguang Li · Yangguang Li · Jing Shao · Jing Shao 🔗 - LAVA: Language Audio Vision Alignment for Pre-Training Transformers on Video Data (Poster)  link » Generating representations of video data is of key importance in advancing the field of machine perception. Most current techniques rely on hand-annotated data, which can be difficult to work with, expensive to generate, and hard to scale. In this work, we propose a novel learning approach based on contrastive learning, LAVA, which is capable of learning joint language, audio, and video representations in a self-supervised manner. We pre-train LAVA on the Kinetics 700 dataset using transformer encoders to learn representations for each modality. We then demonstrate that LAVA performs competitively with the current state-of-the-art self-supervised and weakly-supervised techniques on UCF-101 and HMDB-51 video action recognition while using a fraction of the unlabeled data. Link » Sumanth Gurram · Sumanth Gurram · David Chan · David Chan · Andy Fang · Andy Fang · John Canny · John Canny 🔗