Deep networks have shown outstanding scaling properties in terms of both data and model size: larger does better. Unfortunately, the computational cost of current state-of-the-art methods is prohibitive. A number of new techniques have recently arisen to address and improve this fundamental quality-cost trade-off, including conditional computation, adaptive computation, dynamic model sparsification, and early-exit approaches. This workshop explores such exciting and practically relevant research avenues.

More specifically, as part of the contributed content we will invite high-quality papers on the following topics: dynamic routing, mixture-of-experts models, early-exit methods, conditional computation, capsules and object-oriented learning, reusable components, online network growing and pruning, online neural architecture search, and applications of dynamic networks (continual learning, wireless/embedded devices, and similar).

The workshop is planned as a whole-day event and will feature two keynote talks, a mix of panel discussion, contributed and invited talks, and a poster session. The invited speakers cover a diverse range of research fields (machine learning, computer vision, neuroscience, natural language processing) and backgrounds (academic, industry), and include speakers from underrepresented groups. All speakers have confirmed their talks, and the list ranges from senior faculty members (Gao Huang, Tinne Tuytelaars) to applied and theoretical research scientists (Weinan Sun, Francesco Locatello). The workshop builds on a series of previous workshops run at premier venues such as CVPR, NeurIPS, and ICLR.
Fri 6:00 a.m. - 6:15 a.m.
|
Welcome note
(
Welcome note by workshop organizers
)
|
Marco Levorato · Bradley McDanel 🔗 |
Fri 6:15 a.m. - 7:00 a.m.
|
Spatially and Temporally Adaptive Neural Networks
(
Virtual Keynote
)
SlidesLive Video » Discriminative features in an image or video usually correspond to only a subset of pixels or frames, while the remaining regions/intervals are less relevant to the task at hand. The prevalent deep learning approaches in computer vision, e.g., CNNs and Vision Transformers, are static models, which generally allocate an equal amount of computation to all the pixels/frames, leading to considerable redundancy. This talk will introduce dynamic neural networks that can spend computation unevenly both spatially and temporally. Two notable challenges in developing such models are that 1) the optimization becomes nondifferentiable; and 2) the inference stage may involve sparse computation which is practically inefficient. The talk will present effective and efficient approaches for developing spatially and temporally adaptive networks, and show their excellent performance on image and video recognition benchmarks. |
Gao Huang 🔗 |
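To make the idea of spatially adaptive computation concrete, here is a minimal sketch (not code from the talk): a cheap scorer ranks patches and only the top fraction is processed by an expensive branch. The module layout, the keep ratio, and the simple hard top-k selection are illustrative assumptions; the non-differentiability issue discussed in the talk is ignored here.

```python
import torch
import torch.nn as nn

class SpatiallyAdaptiveBlock(nn.Module):
    """Process only the top-k most salient patches with the heavy branch;
    the remaining patches are passed through unchanged."""

    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)                  # cheap per-patch saliency score
        self.heavy = nn.Sequential(                      # expensive computation
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.keep_ratio = keep_ratio

    def forward(self, x):                                # x: (batch, num_patches, dim)
        b, n, d = x.shape
        k = max(1, int(n * self.keep_ratio))
        scores = self.scorer(x).squeeze(-1)              # (b, n)
        topk = scores.topk(k, dim=1).indices             # indices of salient patches
        gathered = x.gather(1, topk.unsqueeze(-1).expand(-1, -1, d))
        refined = self.heavy(gathered)                   # heavy compute on k << n patches
        out = x.clone()
        out.scatter_(1, topk.unsqueeze(-1).expand(-1, -1, d), refined)
        return out

y = SpatiallyAdaptiveBlock(64)(torch.randn(2, 196, 64))  # (2, 196, 64)
```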
Fri 7:00 a.m. - 7:30 a.m.
|
Where to look next? Different strategies for image exploration under partial observability.
(
Virtual invited talk
)
SlidesLive Video » |
Tinne Tuytelaars 🔗 |
Fri 7:30 a.m. - 8:00 a.m.
|
Incorporating Dynamic Structures into Pre-trained Language Models
(
Virtual invited talk
)
SlidesLive Video » Recent years have witnessed the great success of large-scale pre-trained language models. However, running the entire language model for each sample can be computationally uneconomical. Hence, dynamic networks, which can adapt their structures or parameters to the input samples during inference, are attracting a lot of attention in the NLP community. In contrast to static language models, dynamic ones enjoy favorable properties such as efficiency, adaptiveness, accuracy, etc. In this talk, I will review recent advances in dynamic networks for NLP and discuss the prospects and challenges of applying dynamic structures to pre-trained language models. |
Xuanjing Huang 🔗 |
Fri 8:00 a.m. - 8:15 a.m.
|
Does Continual Learning Equally Forget All Parameters?
(
Spotlight
)
SlidesLive Video » Continual learning (CL) on neural networks suffers from catastrophic forgetting due to distribution or task shift. In this paper, we study which parts of neural nets are more prone to forgetting by investigating their training dynamics during CL. We discover that only a few modules (e.g., batch-norm, the last layer, earlier convolutional layers) are more task-specific and sensitively alter between tasks, while others can be shared across tasks as common knowledge. Hence, we attribute forgetting mainly to the former and find that finetuning them on only a small buffer at the end of any CL method can bring non-trivial improvement. Due to their few parameters, such |
Haiyan Zhao · Tianyi Zhou · Guodong Long · Jing Jiang · Chengqi Zhang 🔗 |
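A minimal sketch of the selective-finetuning idea from the abstract above, assuming a PyTorch CNN whose classifier head is named fc (an assumption, as are the hyperparameters): only batch-norm layers and the final layer are updated on a small replay buffer, everything else stays frozen. This illustrates the idea, not the authors' code.

```python
import torch
import torch.nn as nn

def finetune_task_specific(model, buffer_loader, lr=1e-3, steps=100):
    """Freeze shared modules and update only the forgetting-prone ones
    (here: batch-norm layers and a head named 'fc') on a small buffer."""
    for p in model.parameters():
        p.requires_grad_(False)
    trainable = []
    for name, module in model.named_modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d)) or name == "fc":
            for p in module.parameters():
                p.requires_grad_(True)
                trainable.append(p)
    opt = torch.optim.SGD(trainable, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    it = iter(buffer_loader)
    for _ in range(steps):
        try:
            x, y = next(it)
        except StopIteration:              # cycle through the small buffer
            it = iter(buffer_loader)
            x, y = next(it)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
```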
Fri 8:15 a.m. - 8:30 a.m.
|
PA-GNN: Parameter-Adaptive Graph Neural Networks
(
Spotlight
)
SlidesLive Video » Many influential areas require effective extraction and processing of graph information. Graph neural networks (GNNs) have become powerful tools for obtaining informative representations regarding both topology and node features. With an increasing number of graph properties being proposed and analyzed (such as homophily/heterophily, edge density, motifs, and feature distribution), numerous specific GNNs have been designed to capture them individually. However, most existing GNNs assume the entire graph shares the same property and enforce parameter sharing across all regions of the graph. In this work, we introduce a novel class of GNNs which adopt a node-specific aggregation scheme with adaptive parameters. The node-specific parameters are generated according to the node's neighborhood pattern and global position. By testing our model on semi-supervised node classification tasks on synthetic graphs and real-world benchmarks, we show its superiority over fixed-parameter models. The underlying idea could be applied as a flexible extension to different GNNs to solve a wide range of graph tasks. |
Yuxin Yang · Yitao Liang · Muhan Zhang 🔗 |
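The following toy sketch illustrates the general idea of node-adaptive parameters (it is not the PA-GNN architecture): a small hypernetwork maps a simple neighborhood descriptor (here just the node degree) to per-node mixing weights. The descriptor, the dense adjacency, and the two-way mixing are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class NodeAdaptiveLayer(nn.Module):
    """GNN layer whose mixing weights are generated per node by a small
    hypernetwork from a descriptor of that node's neighborhood."""

    def __init__(self, dim):
        super().__init__()
        self.hyper = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 2))
        self.transform = nn.Linear(dim, dim)

    def forward(self, x, adj):                           # x: (n, dim), adj: dense 0/1 (n, n)
        deg = adj.sum(dim=1, keepdim=True)               # neighborhood descriptor
        mean_neigh = adj @ x / deg.clamp(min=1)          # mean of neighbor features
        alpha = torch.softmax(self.hyper(deg), dim=-1)   # per-node adaptive parameters
        mixed = alpha[:, :1] * x + alpha[:, 1:] * mean_neigh
        return torch.relu(self.transform(mixed))

adj = (torch.rand(10, 10) > 0.7).float()
out = NodeAdaptiveLayer(8)(torch.randn(10, 8), adj)      # (10, 8)
```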
Fri 8:30 a.m. - 8:45 a.m.
|
Triangular Dropout: Variable Network Width without Retraining
(
Spotlight
)
SlidesLive Video » One of the most fundamental choices in neural network design is layer width: it affects the capacity of what a network can learn and determines the complexity of the solution. The latter is often exploited when introducing information bottlenecks, forcing a network to learn compressed representations. Unfortunately, network architecture is typically immutable once training begins; switching to a more compressed architecture requires retraining. In this paper we present a new training strategy, Triangular Dropout, that allows effective compression without retraining. It provides for ordered removal of parameters by the user after training, enabling an explicit trade-off between performance and computational efficiency. We demonstrate the construction and utility of the approach through two examples. First, we formulate Triangular Dropout for autoencoders, creating models with configurable compression after training. Second, we apply Triangular Dropout to retrain the fully connected top layer of VGG19 on ImageNet. In both cases, we find only minimal degradation in the performance of the pruned network for even dramatic reductions in its number of parameters. |
Ted Staley · Jared Markowitz 🔗 |
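A minimal sketch of ordered width dropout in the spirit of the abstract above (not the authors' implementation): during training a random prefix of output units is kept, so the trained layer can later be truncated to any width without retraining.

```python
import torch
import torch.nn as nn

class TriangularDropoutLinear(nn.Module):
    """Linear layer whose output units become ordered by importance: training
    keeps a random prefix width, so at test time the layer can be truncated
    to any chosen width."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.out_features = out_features

    def forward(self, x, width=None):
        h = self.linear(x)
        if self.training:                                  # sample a prefix width per step
            width = torch.randint(1, self.out_features + 1, (1,)).item()
        if width is not None:                              # zero out units beyond the width
            mask = torch.zeros(self.out_features, device=h.device)
            mask[:width] = 1.0
            h = h * mask
        return h

layer = TriangularDropoutLinear(32, 16)
layer.eval()
compressed = layer(torch.randn(4, 32), width=4)            # keep only the first 4 units
```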
Fri 8:45 a.m. - 9:00 a.m.
|
A Theoretical View on Sparsely Activated Networks
(
Spotlight
)
SlidesLive Video » Deep and wide neural networks successfully fit very complex functions today, but dense models are starting to be prohibitively expensive. To mitigate this, one promising research direction is networks that activate a sparse subgraph of the network. The subgraph is chosen by a data-dependent routing function, enforcing a fixed mapping of inputs to subnetworks (e.g., the Mixture of Experts (MoE) paradigm). However, there is little theoretical grounding for these sparsely activated models. As our first contribution, we present a formal model of such sparse networks that captures salient aspects of popular MoE architectures. Then, we show how to construct sparse networks that provably match the approximation power and total size of dense networks on Lipschitz functions. The sparse networks use exponentially fewer inference operations than dense networks, leading to a faster forward pass. This offers a theoretical insight into why sparse networks work well in practice. Finally, we present empirical findings that support our theory; compared to dense networks, sparse networks give a favorable trade-off between number of active units and approximation quality. |
Cenk Baykal · Nishanth Dikkala · Rina Panigrahy · Cyrus Rashtchian · Xin Wang 🔗 |
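For readers unfamiliar with the sparsely activated models studied here, the following is a minimal top-1 mixture-of-experts layer in PyTorch, a generic illustration of data-dependent routing rather than the paper's formal model.

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    """Sparsely activated layer: a routing function picks one small expert per
    input, so only a fraction of the total parameters is used per example."""

    def __init__(self, dim, num_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(num_experts)])

    def forward(self, x):                                # x: (batch, dim)
        expert_idx = self.router(x).argmax(dim=-1)       # data-dependent routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = expert_idx == i
            if sel.any():
                out[sel] = expert(x[sel])                # only the chosen expert runs
        return out

y = Top1MoE(64)(torch.randn(16, 64))                     # (16, 64)
```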
Fri 9:00 a.m. - 10:30 a.m.
|
Lunch break
|
🔗 |
Fri 10:30 a.m. - 11:30 a.m.
|
Dynamic neural networks: Present and Future
(
Discussion Panel
)
SlidesLive Video » Guests: Tal Schuster (MIT), Andrea Gesmundo (Google), Razvan Pascanu (DeepMind), Yanzhi Wang (Northeastern University). Moderators: Francesco Restuccia (Northeastern University), Neil Houlsby (Google Brain) |
Neil Houlsby 🔗 |
Fri 11:30 a.m. - 12:15 p.m.
|
Organizing memories for generalization in complementary learning systems
(
Keynote
)
SlidesLive Video » Our ability to remember the past is essential for guiding our future behavior. Psychological and neurobiological features of declarative memories are known to transform over time in a process known as systems consolidation. While many theories have sought to explain the time-varying role of hippocampal and neocortical brain areas, the computational principles that govern these transformations remain unclear. Here we propose a theory of systems consolidation in which hippocampal-cortical interactions serve to optimize generalizations that guide future adaptive behavior. We use mathematical analysis of neural network models to characterize fundamental performance tradeoffs in systems consolidation, revealing that memory components should be organized according to their predictability. The theory shows that multiple interacting memory systems can outperform just one, normatively unifying diverse experimental observations and making novel experimental predictions. Our results suggest that the psychological taxonomy and neurobiological organization of declarative memories reflect a system optimized for behaving well in an uncertain future. |
Weinan Sun 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
The Spike Gating Flow: A Hierarchical Structure Based Spiking Neural Network for Spatiotemporal Computing
(
Poster
)
Current deep learning faces major challenges in action recognition tasks because of: 1) the huge computational cost and 2) inefficient learning. Hence, we develop a novel Spiking Neural Network (SNN) titled Spiking Gating Flow (SGF) for such a dilemma. The developed system consists of multiple SGF units which are assembled in a hierarchical manner. A single SGF unit involves three layers: a feature extraction layer, an event-driven layer, and a histogram-based training layer. By employing a dynamic vision sensor gesture dataset, the results indicate that we can achieve 87.5% accuracy, which is comparable with Deep Learning (DL), but at a smaller training/inference data ratio of 1.5:1. And only a single training epoch is required during the learning process. Finally, we summarize the few-shot learning paradigm of the developed network: 1) a hierarchical structure-based network design that involves human prior knowledge; 2) SNNs for content-based global dynamic feature detection. |
Zihao Zhao · Yanhong Wang · Qiaosha Zou · Xiaoan Wang · C.-J. Richard Shi · Junwen Luo 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
Back to the Source: Test-Time Diffusion-Driven Adaptation
(
Poster
)
Test-time adaptation harnesses test inputs to improve the accuracy of a model trained on source data when tested on shifted target data. Existing methods update the source model by (re-)training on each target domain. While effective, re-training is sensitive to the amount and order of the data and the hyperparameters for optimization. We instead update the target data, by projecting all test inputs toward the source domain with a generative diffusion model. Our diffusion-driven adaptation method, DDA, shares its models for classification and generation across all domains. Both models are trained on the source domain, then fixed during testing. We augment diffusion with image guidance and self-ensembling to automatically decide how much to adapt. Input adaptation by DDA is more robust than prior model adaptation approaches across a variety of corruptions, architectures, and data regimes on the ImageNet-C benchmark. With its input-wise updates, DDA succeeds where model adaptation degrades on too little data (small batches), on dependent data (non-random order), or on mixed data (multiple corruptions). |
Jin Gao · Jialing Zhang · Xihui Liu · Trevor Darrell · Evan Shelhamer · Dequan Wang 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
Dynamic Split Computing for Efficient Deep Edge Intelligence
(
Poster
)
Deploying deep neural networks (DNNs) on IoT and mobile devices is a challenging task due to their limited computational resources. Thus, demanding tasks are often entirely offloaded to edge servers, which can accelerate inference; however, this incurs communication cost and raises privacy concerns. In addition, this approach leaves the computational capacity of end devices unused. Split computing is a paradigm where a DNN is split into two sections; the first section is executed on the end device, and the output is transmitted to the edge server where the final section is executed. Here, we introduce dynamic split computing, where the optimal split location is dynamically selected based on the state of the communication channel. By using natural bottlenecks that already exist in modern DNN architectures, dynamic split computing avoids retraining and hyperparameter optimization, and does not have any negative impact on the final accuracy of DNNs. Through extensive experiments, we show that dynamic split computing achieves faster inference in edge computing environments where the data rate and server load vary over time. |
Arian Bakhtiarnia · Nemanja Milosevic · Qi Zhang · Dragana Bajovic · Alexandros Iosifidis 🔗 |
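A minimal sketch of how a dynamic split point could be chosen (illustrative, not the paper's algorithm): estimate, for each candidate bottleneck, device compute plus transmission plus server compute, and re-evaluate whenever the measured data rate changes. The latency model and all numbers are assumptions.

```python
def choose_split(bottlenecks, device_flops, server_flops, data_rate):
    """Pick the split point minimizing estimated end-to-end latency:
    head compute on the device + transmission of the bottleneck tensor
    + tail compute on the server.

    bottlenecks: list of (head_flops, tail_flops, transmitted_bytes) tuples,
                 one per candidate split location already present in the DNN.
    """
    best_idx, best_latency = None, float("inf")
    for idx, (head_flops, tail_flops, n_bytes) in enumerate(bottlenecks):
        latency = (head_flops / device_flops
                   + n_bytes / data_rate
                   + tail_flops / server_flops)
        if latency < best_latency:
            best_idx, best_latency = idx, latency
    return best_idx, best_latency

# Re-run whenever the measured channel data rate changes.
candidates = [(1e8, 9e8, 150_000), (4e8, 6e8, 40_000), (8e8, 2e8, 10_000)]
print(choose_split(candidates, device_flops=1e9, server_flops=1e11, data_rate=1e6))
```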
Fri 12:15 p.m. - 1:15 p.m.
|
Learning Modularity for Generalizable Robotic Behaviors
(
Poster
)
Modularity in deep neural networks has provided scaling efficiencies leading to state-of-the-art performance across multiple domains. A critical challenge for these networks is how to build and maintain a library of generalizable behavior modules. In this work, we propose a novel framework for building and maintaining a library of behavior primitives called Primitive Imitation for Control (PICO). Unlabeled demonstrations are automatically decomposed into existing or missing sub-behaviors, which allows the framework to identify novel behaviors while not duplicating existing behaviors. We compare our results to several related approaches across two environments and achieve both better label accuracy and better reconstruction accuracy as measured by action prediction mean squared error. |
Corban Rivera · Chace Ashcraft · Katie Popek · Edward Staley · Kapil Katyal 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
Inductive Biases for Object-Centric Representations in the Presence of Complex Textures
(
Poster
)
Understanding which inductive biases could be helpful for the unsupervised learning of object-centric representations of natural scenes is challenging. In this paper, we use neural style transfer to generate datasets where objects have complex textures while still retaining ground-truth annotations. We find that methods that use a single module to reconstruct both the shape and visual appearance of each object learn more useful representations and achieve better object separation. In addition, we observe that adjusting the latent space size is insufficient to improve segmentation performance. Finally, the downstream usefulness of the representations is significantly more strongly correlated with segmentation quality than with reconstruction accuracy. |
Samuele Papa · Ole Winther · Andrea Dittadi 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
Noisy Heuristics NAS: A Network Morphism based Neural Architecture Search using Heuristics
(
Poster
)
Network Morphism based Neural Architecture Search (NAS) is one of the most efficient methods; however, knowing where and when to add new neurons or remove dysfunctional ones is generally left to black-box Reinforcement Learning models. In this paper, we present a new Network Morphism based NAS called Noisy Heuristics NAS, which uses heuristics learned from manually developing neural network models and inspired by biological neuronal dynamics. Firstly, we add new neurons randomly and prune away some to select only the best-fitting neurons. Secondly, we control the number of layers in the network using the relationship of hidden units to the number of input-output connections. Our method can increase or decrease the capacity or non-linearity of models online, which is specified with a few meta-parameters by the user. Our method generalizes both on toy datasets and on real-world datasets such as MNIST, CIFAR-10, and CIFAR-100. The performance is comparable to the hand-engineered architecture ResNet-18 with a similar number of parameters. |
Suman Sapkota · Binod Bhattarai 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
FLOWGEN: Fast and slow graph generation
(
Poster
)
We present FLOWGEN, a graph-generation model inspired by the dual-process theory of mind that generates large graphs incrementally. Depending on the difficulty of completing the graph at the current step, graph generation is routed to either a weak or a strong model. Weak and strong models have identical architectures, but vary in the number of parameters and consequently in strength. Experiments on three diverse, real-world graphs show that FLOWGEN can successfully generate graphs similar to those generated by a single large model, in a fraction of the time. |
Aman Madaan · Yiming Yang 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
Fault-Tolerant Collaborative Inference through the Edge-PRUNE Framework
(
Poster
)
Collaborative inference has received significant research interest in machine learning as a vehicle for distributing computation load, reducing latency, as well as addressing privacy preservation in communications. Recent collaborative inference frameworks have adopted dynamic inference methodologies such as early-exit and run-time partitioning of neural networks. However, as machine learning frameworks scale in the number of inference inputs, e.g., in surveillance applications, fault tolerance related to device failure needs to be considered. This paper presents the Edge-PRUNE distributed computing framework, built on a formally defined model of computation, which provides a flexible infrastructure for fault tolerant collaborative inference. The experimental section of this work shows results on achievable inference time savings by collaborative inference, presents fault tolerant system topologies and analyzes their cost in terms of execution time overhead. |
Jani Boutellier · Bo Tan · Jari Nurmi 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
Vote for Nearest Neighbors Meta-Pruning of Self-Supervised Networks
(
Poster
)
Pruning plays an essential role in deploying deep neural nets (DNNs) on hardware with limited memory or computation. However, current high-quality iterative pruning can create a terrible carbon footprint when compressing a large DNN for a wide variety of devices and tasks. Can we reuse the pruning results on previous tasks to accelerate the pruning for a new task? Can we find a better initialization for a new task? We study this |
Haiyan Zhao · Tianyi Zhou · Guodong Long · Jing Jiang · Chengqi Zhang 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
A Product of Experts Approach to Early-Exit Ensembles
(
Poster
)
Ensembles are often expensive to evaluate since they require running multiple models—each of which is costly in the case of neural networks. Using ensembles in compute-constrained applications would be much more practical if just a subset of the models could be evaluated. We address this issue with a novel product-of-experts-based method for early-exit ensembling. We rely on the fact that the product of finite-support probability distributions (e.g., the continuous uniform) has support less than or equal to that of the multiplicands. Thus, by setting a confidence threshold, we can stop evaluating ensemble members once the size of the support has been sufficiently reduced. We demonstrate our methodology for both real-valued regression and multi-class classification. |
James Allingham · Eric Nalisnick 🔗 |
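A toy sketch of the early-exit mechanism described above, using interval (uniform) supports: intersecting supports can only shrink, so evaluation stops once the interval is narrow enough. The member API and the threshold are illustrative assumptions, not the paper's formulation.

```python
def early_exit_product(members, x, width_threshold=0.5):
    """Evaluate ensemble members one by one; each returns an interval (lo, hi)
    of plausible outputs (a finite-support distribution). The product's support
    is the intersection, which can only shrink, so we stop as soon as it is
    narrow enough."""
    lo, hi = float("-inf"), float("inf")
    used = 0
    for member in members:
        m_lo, m_hi = member(x)
        lo, hi = max(lo, m_lo), min(hi, m_hi)    # intersect supports
        used += 1
        if hi - lo <= width_threshold:           # confident enough: early exit
            break
    return (lo + hi) / 2.0, used

members = [lambda x: (x - 1.0, x + 1.0),
           lambda x: (x - 0.2, x + 0.2),
           lambda x: (x - 0.3, x + 0.3)]
pred, n_evaluated = early_exit_product(members, 2.0)
print(pred, n_evaluated)                         # exits after two of the three members
```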
Fri 12:15 p.m. - 1:15 p.m.
|
Neural Architecture Search with Loss Flatness-aware Measure
(
Poster
)
We propose a new proxy measure for Neural Architecture Search (NAS) focusing on the flatness of the loss surface. Going a step beyond existing NAS studies that utilize validation-set accuracy or the angle measuring convergence speed during training, we claim that the flatness of the loss surface can be a promising proxy for predicting the generalization capability of neural network architectures. |
Joonhyun Jeong · Joonsang Yu · Dongyoon Han · YoungJoon Yoo 🔗 |
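One simple way to compute a flatness proxy (a stand-in for, not necessarily identical to, the measure proposed above) is to average the loss increase under small random weight perturbations, as sketched below.

```python
import copy
import torch

def flatness_score(model, loss_fn, data, target, sigma=0.01, n_samples=8):
    """Estimate loss-surface flatness for a candidate architecture by measuring
    the average loss increase under small random weight perturbations
    (a lower increase indicates a flatter minimum)."""
    with torch.no_grad():
        base = loss_fn(model(data), target).item()
        increases = []
        for _ in range(n_samples):
            noisy = copy.deepcopy(model)
            for p in noisy.parameters():
                p.add_(sigma * torch.randn_like(p))      # perturb a copy of the weights
            increases.append(loss_fn(noisy(data), target).item() - base)
    return sum(increases) / n_samples
```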
Fri 12:15 p.m. - 1:15 p.m.
|
Is a Modular Architecture Enough?
(
Poster
)
Inspired by human cognition, machine learning systems are now revealing advantages of sparser and more modular architectures. Recent works demonstrate that not only do some modular architectures generalize well, but they also lead to better out-of-distribution generalization, scaling properties, learning speed, and interpretability. A key intuition behind the success of such systems is that the data generating system for most real-world settings is considered to consist of sparsely interacting parts, promoting the use of similar inductive biases in the models. However, the field has been lacking in a rigorous quantitative assessment of such systems because these real-world data distributions are complex and unknown. Hence, we provide a thorough assessment of common modular architectures, through the lens of simple and known modular data distributions. We highlight the benefits of modularity and sparsity and reveal insights on the challenges faced while optimizing modular systems. We also propose evaluation metrics that highlight the regimes in which these benefits of modularity are substantial, as well as the sub-optimality of current end-to-end learned modular systems as opposed to their claimed potential. |
Sarthak Mittal · Yoshua Bengio · Guillaume Lajoie 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
Parameter efficient dendritic-tree neurons outperform perceptrons
(
Poster
)
Biological neurons are more powerful than artificial perceptrons, in part due to complex dendritic input computations. Inspired to empower the perceptron with biologically inspired features, we explore the effect of adding and tuning input branching factors along with input dropout. This allows for parameter-efficient non-linear input architectures to be discovered and benchmarked. Furthermore, we developed an encapsulated PyTorch module to tune and replace multi-layer perceptron layers in existing architectures. Our initial experiments on MNIST classification demonstrate the accuracy and generalization improvement of artificial neurons with dendritic features compared to existing perceptron architectures. |
Ziwen Han · Evgeniya Gorobets · Pan Chen 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
Single, Practical and Fast Dynamic Truncation Kernel Multiplication
(
Poster
)
Computing the product of a kernel matrix and a vector is the most basic and important operation in high-performance machine learning and scientific computing. The speed of this calculation plays a critical role in the overall performance of machine learning training and inference. As dataset sizes rapidly increase, the dimension of the kernel matrix increases accordingly, and this product computation is increasingly a performance bottleneck. In the meantime, our observation is that many popular kernel matrices are inherently sparse, due to natural data distributions. In this paper, we design an efficient data structure to approximate kernel matrix-vector multiplication. Our data structure is a search tree which enables us to quickly extract the significant entries and calculate the multiplication results. |
Lianke Qin · Somdeb Sarkhel · Zhao Song · Danyang Zhuo 🔗 |
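A rough sketch of the underlying idea, assuming a Gaussian kernel and using a standard k-d tree (scipy's cKDTree) in place of the paper's data structure: only entries within a radius where the kernel is non-negligible are extracted and summed.

```python
import numpy as np
from scipy.spatial import cKDTree

def approx_kernel_matvec(points, v, bandwidth=0.5, radius=2.0):
    """Approximate y = K @ v for the Gaussian kernel
    K[i, j] = exp(-||x_i - x_j||^2 / bandwidth^2) by summing only over
    neighbors within `radius`, where the kernel is non-negligible; a spatial
    search tree finds those entries quickly."""
    tree = cKDTree(points)
    y = np.zeros(len(points))
    for i, x in enumerate(points):
        neighbors = tree.query_ball_point(x, r=radius)   # indices of nearby points
        diffs = points[neighbors] - x
        weights = np.exp(-np.sum(diffs ** 2, axis=1) / bandwidth ** 2)
        y[i] = weights @ v[neighbors]
    return y

points = np.random.randn(1000, 3)
v = np.random.randn(1000)
y = approx_kernel_matvec(points, v)
```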
Fri 12:15 p.m. - 1:15 p.m.
|
Confident Adaptive Language Modeling
(
Poster
)
We introduce Confident Adaptive Language Modeling (CALM), a framework for dynamically allocating different amounts of compute per input and generation timestep. Early exit decoding involves several challenges that we address here, such as: (1) what confidence measure to use; (2) connecting sequence-level constraints to local per-token exit decisions; and (3) attending back to missing hidden representations due to early exits in previous tokens. Through theoretical analysis and empirical experiments on three diverse text generation tasks, we demonstrate the efficacy of our framework in reducing compute---potential speedup of up to ×3---while provably maintaining high performance. |
Tal Schuster · Adam Fisch · Jai Gupta · Mostafa Dehghani · Dara Bahri · Vinh Tran · Yi Tay · Don Metzler 🔗 |
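A highly simplified sketch of per-token early-exit decoding with a max-softmax confidence threshold; the actual CALM framework additionally calibrates the threshold against sequence-level constraints and handles the missing hidden states of early-exited tokens, which are omitted here. The shapes, the shared classifier, and the threshold are illustrative assumptions.

```python
import torch

def early_exit_forward(layers, classifier, h, threshold=0.9):
    """Per-token early exit: after each layer, read out a prediction and its
    confidence (here the max softmax probability); stop once the confidence
    exceeds the threshold."""
    for depth, layer in enumerate(layers, start=1):
        h = layer(h)
        probs = torch.softmax(classifier(h), dim=-1)
        confidence, token = probs.max(dim=-1)
        if confidence.item() >= threshold:
            return token, depth                  # exited early at this depth
    return token, len(layers)                    # fell through to the full depth

layers = torch.nn.ModuleList([torch.nn.Linear(16, 16) for _ in range(12)])
classifier = torch.nn.Linear(16, 100)
token, exit_depth = early_exit_forward(layers, classifier, torch.randn(16))
```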
Fri 12:15 p.m. - 1:15 p.m.
|
Provable Hierarchical Lifelong Learning with a Sketch-based Modular Architecture
(
Poster
)
We propose a modular architecture for lifelong learning of multiple hierarchically structured tasks. Specifically, we prove that our architecture is theoretically able to learn tasks that can be solved by functions that are learnable given access to functions for other, previously learned tasks as subroutines. We empirically show that some tasks that we can learn in this way are not learned by current modular lifelong learning or end-to-end training methods in practice; indeed, prior work suggests that some such tasks cannot be learned by \emph{any} efficient method without the aid of the simpler tasks. We also consider methods for identifying the tasks automatically, without relying on explicitly given indicators. |
ZIHAO DENG · Zee Fryer · Brendan Juba · Rina Panigrahy · Xin Wang 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
SnapStar Algorithm: a new way to ensemble Neural Networks
(
Poster
)
We propose a new neural network ensemble algorithm based on Audibert's empirical star algorithm and the snapshot technique. We provide an optimal theoretical minimax bound on the excess squared risk. Additionally, we empirically study this algorithm on regression and classification tasks and show that it can be successfully applied to ensemble construction under a budget. |
Sergey Zinchenko · Dmitry Lishudi 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
HARNAS: Neural Architecture Search Jointly Optimizing for Hardware Efficiency and Adversarial Robustness of Convolutional and Capsule Networks
(
Poster
)
Neural Architecture Search (NAS) methodologies aim at finding efficient Deep Neural Network (DNN) models for a given application under given system constraints. DNNs are compute-intensive as well as vulnerable to adversarial attack threats. To address multiple design objectives, we propose HARNAS, a novel NAS framework that jointly optimizes for hardware-efficiency and adversarial-robustness of DNNs executed on specialized hardware accelerators. Besides the traditional convolutional DNNs, HARNAS extends the search for complex types of DNNs such as Capsule Networks. For reducing the exploration time, HARNAS selects appropriate values of adversarial perturbations to employ in the NAS algorithm. Our evaluations provide a set of Pareto-optimal solutions leveraging the tradeoffs between the above-discussed design objectives. |
Alberto Marchisio · Vojtech Mrazek · Andrea Massa · Beatrice Bussolino · Maurizio Martina · Muhammad Shafique 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
Dynamic Transformer Networks
(
Poster
)
Deep neural networks have been very successful in recent years, some of which can be attributed to the introduction of Transformers. Dynamic neural networks on the other hand are being studied for better efficiency in various circumstances such as resource-constrained environments. Enabling transformers to be dynamic lets them only execute the needed layers of the models. In this work, we present a simple way of the oracle function that enables the model to determine the dependency of layers in transformers, just like soft attention. It can then be used as a strategy to skip layers without an RL agent. We show that such a model learns to skip on average, half of its layer for each sample in a batch input. |
Amanuel Mersha 🔗 |
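A minimal sketch of gated layer skipping in a transformer stack (an illustration of the general idea, not the paper's oracle function): a tiny per-layer gate produces a soft score used as a residual weight during training and a hard skip decision at inference. For simplicity, the skipped computation is still performed batch-wide here.

```python
import torch
import torch.nn as nn

class SkippableEncoder(nn.Module):
    """Transformer-style stack in which a small gate scores, per sample,
    whether each layer is needed; layers with a low score are skipped at
    inference."""

    def __init__(self, dim, depth=6):
        super().__init__()
        self.layers = nn.ModuleList([nn.TransformerEncoderLayer(dim, nhead=4,
                                     batch_first=True) for _ in range(depth)])
        self.gates = nn.ModuleList([nn.Linear(dim, 1) for _ in range(depth)])

    def forward(self, x):                                # x: (batch, seq, dim)
        for layer, gate in zip(self.layers, self.gates):
            score = torch.sigmoid(gate(x.mean(dim=1)))   # (batch, 1) per-sample score
            if self.training:                            # soft gate keeps things differentiable
                x = score.unsqueeze(1) * layer(x) + (1 - score).unsqueeze(1) * x
            else:                                        # hard skip decision at inference
                keep = score.squeeze(-1) > 0.5
                if keep.any():
                    x = torch.where(keep[:, None, None], layer(x), x)
        return x

out = SkippableEncoder(64)(torch.randn(2, 10, 64))       # (2, 10, 64)
```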
Fri 12:15 p.m. - 1:15 p.m.
|
Just-in-Time Sparsity: Learning Dynamic Sparsity Schedules
(
Poster
)
Sparse neural networks have various computational benefits while often being able to maintain or improve the generalization performance of their dense counterparts. Popular sparsification methods have focused on what to sparsify, i.e. which redundant components to remove from neural networks, while when to sparsify, has received less attention and is usually handled using heuristics or simple schedules. In this work, we focus on learning sparsity schedules from scratch using reinforcement learning. In simple CNNs and ResNet-18, we show that our learned schedules are diverse across layers and training steps, while achieving competitive performance when compared to naive handcrafted schedules. Our methodology is general-purpose and can be applied to learning effective sparsity schedules across any pruning implementation. |
· Chiratidzo Matowe · Arnu Pretorius · Benjamin Rosman · Sara Hooker 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
FedHeN: Federated Learning in Heterogeneous Networks
(
Poster
)
We propose a novel training recipe for federated learning with heterogeneous networks where each device can have different architectures. We introduce training with a side objective to the devices of higher complexities which allows different architectures to jointly train in a federated setting. We empirically show that our approach improves training of different architectures and leads to high communication savings compared to state-of-the-art methods. |
Durmus Alp Emre Acar · Venkatesh Saligrama 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
APP: Anytime Progressive Pruning
(
Poster
)
With the latest advances in deep learning, several methods have been investigated for optimal learning settings in scenarios where the data stream is continuous over time. However, sparse network training in such settings has often been overlooked. In this paper, we explore the problem of training a neural network with a target sparsity in a particular case of online learning: the anytime learning at macroscale paradigm (ALMA). We propose a novel way of progressive pruning, referred to as \textit{Anytime Progressive Pruning} (APP); the proposed approach significantly outperforms the baseline dense and Anytime OSP models across multiple architectures and datasets under short, moderate, and long-sequence training. Our method, for example, shows an improvement in accuracy of $\approx 7\%$ and a reduction in the generalisation gap by $\approx 22\%$, while being $\approx 1/3$ the size of the dense baseline model in few-shot restricted ImageNet training. The code and experiment dashboards can be accessed at \url{https://github.com/landskape-ai/Progressive-Pruning} and \url{https://wandb.ai/landskape/APP}, respectively.
|
Diganta Misra · Bharat Runwal · Tianlong Chen · Zhangyang “Atlas” Wang · Irina Rish 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
Deep Policy Generators
(
Poster
)
Traditional Reinforcement Learning (RL) learns policies that maximize expected return. Here we study neural nets (NNs) that learn to generate policies in the form of context-specific weight matrices, similar to Fast Weight Programmers and other methods from the 1990s. Using context commands of the form ``generate a policy that achieves a desired expected return,'' our NN generators combine powerful exploration of parameter space with greedy command choices to iteratively find better and better policies. A form of weight-sharing HyperNetworks and policy embeddings scales our method to generate deep NNs. Experiments show how a single learned policy generator can produce policies that achieve any return seen during training. Finally, we evaluate our algorithm on a set of continuous control tasks where it exhibits competitive performance. |
Francesco Faccio · Vincent Herrmann · Aditya Ramesh · Louis Kirsch · Jürgen Schmidhuber 🔗 |
Fri 12:15 p.m. - 1:15 p.m.
|
Connectivity Properties of Neural Networks Under Performance-Resources Trade-off
(
Poster
)
We analyze the structure of network architectures obtained when trained under a performance-resources trade-off for various datasets. To this end, we use a flexible setup allowing a neural network to learn both its size and topology during the course of standard gradient-based training. The resulting network has the structure of a graph tailored to the particular learning task and dataset. We explore the properties of the resulting network architectures for a number of datasets of varying difficulty, observing systematic regularities. The obtained graphs can therefore be understood as encoding nontrivial characteristics of the particular classification tasks. |
Aleksandra I. Nowak · Romuald A. Janik 🔗 |
Fri 1:15 p.m. - 1:45 p.m.
|
Deriving modular inductive biases from the principle of independent mechanisms
(
Invited talk
)
SlidesLive Video » Causal representation learning tackles the problem of discovering high-level variables from low-level observations. In this talk, I will discuss how modular architectures such as Neural Interpreters and Neural Attentive Circuits implement inductive biases from the causal principle of independent mechanisms. Leveraging dynamic connectivity graphs and conditional computation, I will showcase their scalability and interesting properties for robust recognition, efficient transfer, and reasoning. |
Francesco Locatello 🔗 |
Fri 1:45 p.m. - 2:00 p.m.
|
Supernet Training for Federated Image Classification
(
Spotlight
)
SlidesLive Video » Efficient deployment of deep neural networks across many devices and resource constraints, especially on edge devices, is one of the most challenging problems in the presence of data-privacy preservation issues. Conventional approaches have evolved to either improve a single global model while keeping each local training dataset decentralized (i.e., data-heterogeneity) or to train a once-for-all (OFA) network that supports diverse architectural settings to address heterogeneous clients equipped with different computational capabilities (i.e., model-heterogeneity). However, little research has considered both directions simultaneously. In this work, we propose a novel framework to consider both scenarios, namely Federation of Supernet Training (FedSup), where clients send and receive a supernet that contains all possible architectures sampled from itself. It is inspired by the observation that averaging parameters in the model aggregation step of Federated Learning is very similar to weight sharing in supernet training. Specifically, in the FedSup framework, a weight-sharing approach widely used in training single-shot models is combined with the averaging of Federated Learning (FedAvg). Under our framework, we present a communication-efficient algorithm (CE-FedSup) by sending the sub-model to clients in the broadcast stage. We demonstrate several strategies to enhance supernet training in the FL environment and conduct extensive empirical evaluations. The resulting framework is shown to provide robustness to both data- and model-heterogeneity on several standard benchmarks and a medical dataset. |
Taehyeon Kim · Se-Young Yun 🔗 |
Fri 2:00 p.m. - 2:15 p.m.
|
Achieving High TinyML Accuracy through Selective Cloud Interactions
(
Spotlight
)
SlidesLive Video »
Edge devices provide inference on predictive tasks to many end-users. However, deploying neural networks that achieve state-of-the-art accuracy on edge is infeasible due to resource constraints. Nevertheless, cloud-only processing is also problematic since uploading large amounts of data imposes severe communication bottlenecks. We propose a novel end-to-end hybrid learning framework that allows the edge to selectively query only those hard examples that the cloud classifies correctly. It trains the edge and cloud predictors, and the routing, to maximize accuracy while minimizing latency. Training a hybrid learner is difficult since we lack annotations of hard edge-examples. We introduce a novel proxy supervision in this context and show that our method adapts near-optimally across different latency regimes. On the ImageNet dataset, our proposed method deployed on a micro-controller unit exhibits a $25\%$ reduction in latency compared to cloud-only processing while suffering no excess loss.
|
Anil Kag · Igor Fedorov · Aditya Gangrade · Paul Whatmough · Venkatesh Saligrama 🔗 |
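A minimal sketch of the hybrid edge/cloud routing pattern (names, the router form, and the 0.5 threshold are illustrative; the paper's proxy supervision for training the router is not shown): the small edge model always runs, and only inputs flagged as hard are sent to the large cloud model.

```python
import torch
import torch.nn as nn

class HybridEdgeCloud(nn.Module):
    """Run the cheap edge model on every input; a learned router flags the hard
    examples that are worth the round trip to the large cloud model."""

    def __init__(self, edge_model, cloud_model, router):
        super().__init__()
        self.edge_model = edge_model       # small, runs on device
        self.cloud_model = cloud_model     # large, runs remotely
        self.router = router               # scores how "hard" an input is

    def forward(self, x):                  # x: (batch, features)
        edge_logits = self.edge_model(x)
        send_to_cloud = torch.sigmoid(self.router(x)).squeeze(-1) > 0.5
        if send_to_cloud.any():
            edge_logits[send_to_cloud] = self.cloud_model(x[send_to_cloud])
        return edge_logits

model = HybridEdgeCloud(nn.Linear(32, 10),
                        nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10)),
                        nn.Linear(32, 1))
logits = model(torch.randn(8, 32))         # (8, 10)
```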
Fri 2:15 p.m. - 2:30 p.m.
|
Slimmable Quantum Federated Learning
(
Spotlight
)
SlidesLive Video » Quantum federated learning (QFL) has recently received increasing attention, where quantum neural networks (QNNs) are integrated into federated learning (FL). In contrast to existing static QFL methods, we propose slimmable QFL (SlimQFL) in this article, which is a dynamic QFL framework that can cope with time-varying communication channels and computing energy limitations. This is made viable by leveraging the unique nature of a QNN, where its angle parameters and pole parameters can be separately trained and dynamically exploited. Simulation results corroborate that SlimQFL achieves higher classification accuracy than Vanilla QFL, particularly under poor channel conditions on average. |
Won Joon Yun · Jae Pyoung Kim · Soyi Jung · Jihong Park · Mehdi Bennis · Joongheon Kim 🔗 |
Fri 2:30 p.m. - 2:45 p.m.
|
Sparse Relational Reasoning with Object-centric Representations
(
Spotlight
)
SlidesLive Video » We investigate the composability of soft-rules learned by relational neural architectures when operating over object-centric (slot-based) representations, under a variety of sparsity-inducing constraints. We find that increasing sparsity, especially on features, improves the performance of some models and leads to simpler relations. Additionally, we observe that object-centric representations can be detrimental when not all objects are fully captured; a failure mode to which simple CNNs are less vulnerable. These findings highlight the trade-offs between interpretability and performance, even for models designed to tackle relational tasks. |
Alex Spies 🔗 |
Fri 2:45 p.m. - 3:00 p.m.
|
Play It Cool: Dynamic Shifting Prevents Thermal Throttling
(
Spotlight
)
SlidesLive Video » Machine learning (ML) has entered the mobile era where an enormous number of ML models are deployed on edge devices. However, running common ML models on edge devices continuously may generate excessive heat from the computation, forcing the device to “slow down” to prevent overheating, a phenomenon called thermal throttling. This paper studies the impact of thermal throttling on mobile phones: when it occurs, the CPU clock frequency is reduced, and the model inference latency may increase dramatically. This unpleasant inconsistent behavior has a substantial negative effect on user experience, but it has been overlooked for a long time. To counter thermal throttling, we propose to utilize dynamic networks with shared weights and dynamically shift between large and small ML models seamlessly according to their thermal profile, i.e., shifting to a small model when the system is about to throttle. With the proposed dynamic shifting, the application runs consistently without experiencing CPU clock frequency degradation and latency increase. In addition, we also study the resulting accuracy when dynamic shifting is deployed and show that our approach provides a reasonable trade-off between model latency and model accuracy. |
Yang Zhou · Feng Liang · Ting-Wu Chin · Diana Marculescu 🔗 |
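A toy sketch of temperature-driven model shifting for a weight-shared (slimmable) network; the temperature thresholds and width multipliers are illustrative assumptions, not the paper's policy.

```python
def pick_model_width(temperature_c, widths=(1.0, 0.75, 0.5, 0.25),
                     throttle_temp=45.0, margin=5.0):
    """Choose a width multiplier for a weight-shared (slimmable) network from
    the current device temperature: shift toward smaller sub-models as the
    device approaches its throttling point."""
    headroom = throttle_temp - temperature_c
    if headroom > 3 * margin:
        return widths[0]         # cool: run the full model
    if headroom > 2 * margin:
        return widths[1]
    if headroom > margin:
        return widths[2]
    return widths[3]             # about to throttle: smallest sub-model

for temp in (25, 33, 38, 44):
    print(temp, pick_model_width(temp))   # 1.0, 0.75, 0.5, 0.25
```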
Fri 3:00 p.m. - 3:15 p.m.
|
Efficient Sparsely Activated Transformers
(
Spotlight
)
SlidesLive Video » Transformer-based neural networks have achieved state-of-the-art task performance in a number of machine learning domains including natural language processing and computer vision. To further improve their accuracy, recent work has explored the integration of dynamic behavior into these networks in the form of mixture-of-expert (MoE) layers. In this paper, we explore the introduction of such layers to optimize a different metric: inference latency. We introduce a novel system named PLANER that takes an existing Transformer-based network and a user-defined latency target and produces an optimized, sparsely activated version of the original network that tries to meet the latency target while maintaining baseline accuracy. We evaluate PLANER on two real-world language modeling tasks using the Transformer-XL network and achieve inference latency reductions of over 2x at iso-accuracy. |
Salar Latifi · Saurav Muralidharan · Michael Garland 🔗 |
Fri 3:15 p.m. - 5:00 p.m.
|
Networking & happy hour
|
🔗 |