Session

Deep Learning (Neural Network Architectures) 12

Fri 13 July 7:00 - 7:20 PDT

Progress & Compress: A scalable framework for continual learning

Jonathan Schwarz · Wojciech Czarnecki · Jelena Luketina · Agnieszka Grabska-Barwinska · Yee Teh · Razvan Pascanu · Raia Hadsell

We introduce a conceptually simple and scalable framework for continual learning domains where tasks are learned sequentially. Our method is constant in the number of parameters and is designed to preserve performance on previously encountered tasks while accelerating learning progress on subsequent problems. This is achieved by training a network with two components: A knowledge base, capable of solving previously encountered problems, which is connected to an active column that is employed to efficiently learn the current task. After learning a new task, the active column is distilled into the knowledge base, taking care to protect any previously acquired skills. This cycle of active learning (progression) followed by consolidation (compression) requires no architecture growth, no access to or storing of previous data or tasks, and no task-specific parameters. We demonstrate the progress & compress approach on sequential classification of handwritten alphabets as well as two reinforcement learning domains: Atari games and 3D maze navigation.
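
As a rough illustration of the progress/compress cycle described in the abstract, the sketch below alternates training an active column on the current task with distilling it into a fixed-size knowledge base. It is a minimal, hedged reconstruction: the MLP columns, KL distillation loss, and all names are illustrative assumptions, and the paper's lateral connections between columns and its online-EWC protection of old skills are omitted.

```python
# Minimal sketch of the progress/compress cycle (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_column(in_dim=784, hidden=256, out_dim=10):
    # simple MLP standing in for either column
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

knowledge_base = make_column()   # consolidated network, constant in size
active_column = make_column()    # learns the current task

def progress_phase(task_loader, epochs=1):
    """Progress: train only the active column on the new task."""
    opt = torch.optim.Adam(active_column.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in task_loader:
            loss = F.cross_entropy(active_column(x), y)
            opt.zero_grad(); loss.backward(); opt.step()

def compress_phase(task_loader, epochs=1, temperature=2.0):
    """Compress: distill the active column into the knowledge base.
    (The paper additionally protects previously acquired skills with
    online EWC; that regularizer is omitted here for brevity.)"""
    opt = torch.optim.Adam(knowledge_base.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, _ in task_loader:
            with torch.no_grad():
                teacher = F.log_softmax(active_column(x) / temperature, dim=-1)
            student = F.log_softmax(knowledge_base(x) / temperature, dim=-1)
            loss = F.kl_div(student, teacher, log_target=True, reduction='batchmean')
            opt.zero_grad(); loss.backward(); opt.step()

# Toy data standing in for one task, just to make the snippet runnable.
fake_task = [(torch.randn(32, 784), torch.randint(0, 10, (32,))) for _ in range(4)]
progress_phase(fake_task)
compress_phase(fake_task)
```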

Fri 13 July 7:20 - 7:40 PDT

Overcoming Catastrophic Forgetting with Hard Attention to the Task

Joan Serrà · Didac Suris · Marius Miron · Alexandros Karatzoglou

Catastrophic forgetting occurs when a neural network loses the information learned in a previous task after training on subsequent tasks. This problem remains a hurdle for artificial intelligence systems with sequential learning capabilities. In this paper, we propose a task-based hard attention mechanism that preserves previous tasks' information without affecting the current task's learning. A hard attention mask is learned concurrently with every task, through stochastic gradient descent, and previous masks are exploited to condition such learning. We show that the proposed mechanism is effective for reducing catastrophic forgetting, cutting current rates by 45 to 80%. We also show that it is robust to different hyperparameter choices, and that it offers a number of monitoring capabilities. The approach makes it possible to control both the stability and compactness of the learned knowledge, which we believe also makes it attractive for online learning or network compression applications.
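
A hedged sketch of the core mechanism follows: a per-task embedding pushed through a steep sigmoid yields a near-binary mask that gates a layer's activations, and accumulated masks from earlier tasks can then be used to block gradient updates that would overwrite them. The class names, scale parameter `s`, and the simplified gradient-masking comment are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative task-conditioned hard-attention mask on a single layer.
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, in_dim, out_dim, n_tasks, s=50.0):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        # one learnable embedding per task; a steep sigmoid makes it ~binary
        self.task_embedding = nn.Embedding(n_tasks, out_dim)
        self.s = s

    def mask(self, task_id):
        return torch.sigmoid(self.s * self.task_embedding(task_id))

    def forward(self, x, task_id):
        # gate the layer's units with the current task's (near-)hard mask
        return self.fc(x) * self.mask(task_id)

layer = MaskedLinear(784, 256, n_tasks=5)
task_id = torch.tensor(0)
out = layer(torch.randn(8, 784), task_id)

# After finishing a task, accumulate its mask; during later tasks, gradients
# on weights feeding units whose cumulative mask is close to 1 are attenuated
# before each optimizer step, which is what preserves earlier tasks.
cumulative_mask = torch.zeros(256)
cumulative_mask = torch.maximum(cumulative_mask, layer.mask(task_id).detach())
```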

Fri 13 July 7:40 - 7:50 PDT

Rapid Adaptation with Conditionally Shifted Neurons

Tsendsuren Munkhdalai · Xingdi Yuan · Soroush Mehri · Adam Trischler

We describe a mechanism by which artificial neural networks can learn rapid adaptation - the ability to adapt on the fly, with little data, to new tasks - that we call conditionally shifted neurons. We apply this mechanism in the framework of metalearning, where the aim is to replicate some of the flexibility of human learning in machines. Conditionally shifted neurons modify their activation values with task-specific shifts retrieved from a memory module, which is populated rapidly based on limited task experience. On metalearning benchmarks from the vision and language domains, models augmented with conditionally shifted neurons achieve state-of-the-art results.
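
The following sketch illustrates the shifting idea at the level of one layer: activations receive an additive shift retrieved by soft attention over a small memory populated from a few support examples. The retrieval rule, shapes, and names here are assumptions chosen to make a runnable toy example, not the paper's exact equations.

```python
# Illustrative "conditionally shifted" layer with a memory-retrieved shift.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftedLinear(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x, mem_keys, mem_shifts):
        h = self.fc(x)                              # (B, out_dim) base activations
        attn = F.softmax(h @ mem_keys.t(), dim=-1)  # (B, M) attention over memory slots
        shift = attn @ mem_shifts                   # (B, out_dim) retrieved task shift
        return torch.relu(h + shift)                # activations modified by the shift

# The memory would be written rapidly from the current task's support set,
# e.g. one (key, shift) pair per support example; random placeholders here.
layer = ShiftedLinear(64, 32)
mem_keys, mem_shifts = torch.randn(5, 32), torch.randn(5, 32)
out = layer(torch.randn(8, 64), mem_keys, mem_shifts)
```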

Fri 13 July 7:50 - 8:00 PDT

Gradient-Based Meta-Learning with Learned Layerwise Metric and Subspace

Yoonho Lee · Seungjin Choi

Gradient-based meta-learning methods leverage gradient descent to learn the commonalities among various tasks. While previous such methods have been successful in meta-learning tasks, they resort to simple gradient descent during meta-testing. Our primary contribution is the MT-net, which enables the meta-learner to learn, on each layer's activation space, a subspace in which the task-specific learner performs gradient descent. Additionally, a task-specific learner of an MT-net performs gradient descent with respect to a meta-learned distance metric, which warps the activation space to be more sensitive to task identity. We demonstrate that the dimension of this learned subspace reflects the complexity of the task-specific learner's adaptation task, and also that our model is less sensitive to the choice of initial learning rates than previous gradient-based meta-learning methods. Our method achieves state-of-the-art or comparable performance on few-shot classification and regression tasks.
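
As a rough sketch of the idea, the toy inner-loop step below adapts only a masked part of a layer's weights (standing in for the learned subspace) while a separate meta-learned matrix that warps the activation space stays fixed during adaptation. The mask, the toy readout, and all variable names are placeholders for illustration, not the authors' implementation.

```python
# Simplified illustration of a subspace-restricted inner-loop update.
import torch
import torch.nn.functional as F

# Per layer: W is adapted per task, T is meta-learned and frozen during
# adaptation, and `mask` selects the subspace of W that actually moves.
W = torch.randn(64, 32, requires_grad=True)
T = torch.randn(32, 32, requires_grad=True)
mask = (torch.rand(32) > 0.5).float()        # placeholder for the learned mask

def layer(x, W, T):
    # activations warped by the meta-learned matrix T
    return torch.relu(x @ W @ T)

def inner_step(x, y, W, T, lr=0.1):
    """One task-specific gradient step restricted to the masked subspace of W."""
    pred = layer(x, W, T) @ torch.ones(32, 1)  # toy readout producing a scalar target
    loss = F.mse_loss(pred.squeeze(-1), y)
    grad_W, = torch.autograd.grad(loss, W, create_graph=True)
    return W - lr * grad_W * mask              # T is left untouched in the inner loop

x, y = torch.randn(8, 64), torch.randn(8)
W_adapted = inner_step(x, y, W, T)
```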