Object-Oriented Learning: Perception, Representation, and Reasoning

Sungjin Ahn, Adam Kosiorek, Jessica Hamrick, Sjoerd van Steenkiste, Yoshua Bengio

Keywords: Object-Oriented, Representation, Reasoning, Perception, Learning


Objects, and the interactions between them, are the foundations on which our understanding of the world is built. Similarly, abstractions centered around the perception and representation of objects play a key role in building human-like AI, supporting high-level cognitive abilities like causal reasoning, object-centric exploration, and problem solving. Indeed, prior works have shown how relational reasoning and control problems can greatly benefit from having object descriptions. Yet, many of the current methods in machine learning focus on a less structured approach in which objects are only implicitly represented, posing a challenge for interpretability and the reuse of knowledge across tasks. Motivated by the above observations, there has been a recent effort to reinterpret various learning problems from the perspective of object-oriented representations.

In this workshop, we will showcase a variety of approaches in object-oriented learning, with three particular emphases. Our first interest is in learning object representations in an unsupervised manner. Although computer vision has made an enormous amount of progress in learning about objects via supervised methods, we believe that learning about objects with little to no supervision is preferable: it minimizes labeling costs, and also supports adaptive representations that can be changed depending on the particular situation and goal. The second primary interest of this workshop is to explore how object-oriented representations can be leveraged for downstream tasks such as reinforcement learning and causal reasoning. Lastly, given the central importance of objects in human cognition, we will highlight interdisciplinary perspectives from cognitive science and neuroscience on how people perceive and understand objects.

We have invited speakers whose research programs cover unsupervised and supervised 2-D and 3-D perception, reasoning, concept learning, reinforcement learning, as well as psychology and neuroscience. We will additionally source contributed works focusing on unsupervised object-centric representations, applications of such object-oriented representations (such as in reinforcement learning), and object-centric aspects of human cognition. To highlight and support research from a range of different perspectives, our invited speakers vary in their domain of expertise, institution, seniority, and gender. We will also encourage participation from underrepresented groups by providing travel grants courtesy of DeepMind and Kakao Brain. We are also planning to coordinate with the main conference and the speakers to provide remote access to the workshop.



Fri 6:15 a.m. - 6:30 a.m.
Opening Remarks (Talk)
Sungjin Ahn
Fri 6:30 a.m. - 7:10 a.m.

Recently, there has been a surge of interest in object-centric learning in neural network research. To many researchers, it seems clear that objects hold great potential for enabling more systematic generalisation, building compositional models of the world, and grounding language and symbolic reasoning. However, despite strong intuitions, a general definition of what constitutes an object is still lacking, and the precise notion of objects remains largely elusive. In this talk I aim to challenge some common intuitive conceptions about objects, and point to some of their subtle complexity. After that, I will present a few relevant findings from cognitive psychology regarding human object perception, and conclude by discussing a few challenges and promising approaches for incorporating objects into neural networks.

Klaus Greff
Fri 7:10 a.m. - 7:50 a.m.

To enable explicit representation of objects in neural architectures, a core challenge lies in defining a mapping from input features (e.g., an image encoded by a CNN) to a set of abstract object representations. In this talk, I will discuss how attention mechanisms can be used in an iterative, competitive fashion to (a) efficiently group visual features into object slots and (b) segment temporal representations. I will further highlight how graph neural networks can be utilized to learn about interactions between objects and how object-centric models can be trained in a self-supervised fashion using contrastive losses.

Thomas Kipf
Fri 7:50 a.m. - 8:10 a.m.

Given visual observations of a reaching task together with a stick-like tool, we propose a novel approach that learns to exploit task-relevant object affordances by combining generative modelling with a task-based performance predictor. The embedding learned by the generative model captures the factors of variation in object geometry, e.g. length, width, and configuration. The performance predictor identifies sub-manifolds correlated with task success in a weakly supervised manner. Using a 3D simulation environment, we demonstrate that traversing the latent space in this task-driven way results in appropriate tool geometries for the task at hand. Our results suggest that affordances are encoded along smooth trajectories in the learned latent space. Given only high-level performance criteria (such as task success), accessing these emergent affordances via gradient descent enables the agent to manipulate learned object geometries in a targeted and deliberate way.

Yizhe Wu
Fri 8:10 a.m. - 8:15 a.m.

We propose a method for autonomously learning an object-centric representation of a high-dimensional environment that is suitable for planning. Such abstractions can be immediately transferred between tasks that share the same types of objects, resulting in agents that require fewer samples to learn a model of a new task. We demonstrate our approach on a series of Minecraft tasks to learn object-centric representations - directly from pixel data - that can be leveraged to quickly solve new tasks. The resulting learned representations enable the use of a task-level planner, resulting in an agent capable of forming complex, long-term plans.

Steve James
Fri 8:15 a.m. - 8:20 a.m.

'Capsule' models try to explicitly represent the poses of objects, enforcing a linear relationship between an object's pose and those of its constituent parts. This modelling assumption should lead to robustness to viewpoint changes, since the object-component relationships are invariant to the pose of the object. We describe a probabilistic generative model that encodes these assumptions. Our probabilistic formulation separates the generative assumptions of the model from the inference scheme, which we derive from a variational bound. We experimentally demonstrate the applicability of our unified objective, and the use of test-time optimisation to solve problems inherent to amortised inference.

Lewis Smith
Fri 8:20 a.m. - 8:25 a.m.

We propose a novel approach to representation learning based on object keypoints. It leverages the predictability of local image regions from spatial neighborhoods to identify salient regions that correspond to object parts, which are then converted to keypoints. Unlike prior approaches, this does not overly bias the keypoints to focus on a particular property of objects. We demonstrate the efficacy of our approach on Atari where we find that it learns keypoints corresponding to the most salient object parts and is more robust to certain visual distractors.

Anand Gopalakrishnan
Fri 8:25 a.m. - 8:30 a.m.

Groups of entities are naturally represented as sets, but generative models usually treat them as independent from each other or as sequences. This either over-simplifies the problem, or imposes an order to the otherwise unordered collections, which has to be accounted for in loss computation. We therefore introduce GAST - a GAN for sets capable of generating variable-sized sets in a permutation-equivariant manner, while accounting for dependencies between set elements. It avoids the problem of formulating a distance metric between sets by using a permutation-invariant discriminator. When evaluated on a dataset of regular polygons and on MNIST point clouds, GAST outperforms graph-convolution-based GANs in sample fidelity, while showing good generalization to novel set sizes.

Karl Stelzner
Fri 8:30 a.m. - 9:30 a.m.

Please access the posters via the workshop website using Zoom room password: w00l

Fri 9:30 a.m. - 10:30 a.m.

Suggest questions via the link below.

Jessica Hamrick
Fri 10:30 a.m. - 11:10 a.m.

Understanding causes and effects in mechanical systems is an essential component of reasoning in the physical world. This work poses a new problem of counterfactual learning of object mechanics from visual input. We develop the CoPhy benchmark to assess the capacity of state-of-the-art models for causal physical reasoning in a synthetic 3D environment and propose a model for learning the physical dynamics in a counterfactual setting. Having observed a mechanical experiment that involves, for example, a falling tower of blocks, a set of bouncing balls or colliding objects, we learn to predict how its outcome is affected by an arbitrary intervention on its initial conditions, such as displacing one of the objects in the scene. The alternative future is predicted given the altered past and a latent representation of the confounders learned by the model in an end-to-end fashion with no supervision. We compare against feedforward video prediction baselines and show how observing alternative experiences allows the network to capture latent physical properties of the environment, which results in significantly more accurate predictions, at the level of superhuman performance.

Fabien Baradel
Fri 11:10 a.m. - 11:50 a.m.

Two-dimensional images are commonly used to study and model perceptual and cognitive processes because of the convenience and ease of experimental control they provide. However, real objects differ from pictures in many ways, including the potential for interaction and richer information about distance and thus size. Across a series of neuroimaging studies and behavioral experiments in adults, we have shown different responses to real objects than pictures. Moreover, we have found behavioral differences between real objects and pictures even in infants, suggesting that realness plays an important role in learning about objects. These results can inform the next generation of computational models as to how human brains learn to process objects in the real world.

Jody Culham
Fri 11:50 a.m. - 12:10 p.m.

Many dynamic processes, including common scenarios in robotic control and reinforcement learning (RL), involve a set of interacting subprocesses. Though the subprocesses are not independent, their interactions are often sparse, and the dynamics at any given time step can often be decomposed into locally independent causal mechanisms. Such local causal structures can be leveraged to improve the sample efficiency of sequence prediction and off-policy reinforcement learning. We formalize this by introducing local causal models (LCMs), which are induced from a global causal model by conditioning on a subset of the state space. We propose an approach to inferring these structures given an object-oriented state representation, as well as a novel algorithm for model-free Counterfactual Data Augmentation (CoDA). CoDA uses local structures and an experience replay to generate counterfactual experiences that are causally valid in the global model. We find that CoDA significantly improves the performance of RL agents in locally factored tasks, including the batch-constrained and goal-conditioned settings.

Silviu Pitis
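
The counterfactual recombination described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the transition layout (dictionaries keyed by object name), the `independent` field, and the function name `coda_swap` are all hypothetical, and the sketch assumes the caller already knows which groups of objects are locally independent in each transition. A fuller version would also re-check that the recombined state still has the same local structure.

```python
def coda_swap(t1, t2, component):
    """Build a counterfactual transition by taking `component` from t1 and the rest from t2.

    Each transition is a dict (hypothetical layout, assumed for illustration):
      "state":       {object_name: value}
      "next_state":  {object_name: value}
      "independent": list of frozensets of object names that evolve
                     independently of everything else in this transition.
    Returns None when the swap is not justified by the given local structure.
    """
    # The swap is only attempted if the component is locally independent in BOTH
    # transitions; otherwise its dynamics may depend on objects we are replacing.
    if component not in t1["independent"] or component not in t2["independent"]:
        return None

    state = dict(t2["state"])
    next_state = dict(t2["next_state"])
    for name in component:
        state[name] = t1["state"][name]
        next_state[name] = t1["next_state"][name]

    return {
        "state": state,
        "next_state": next_state,
        "independent": list(t2["independent"]),
    }


if __name__ == "__main__":
    ball = frozenset({"ball"})
    t1 = {"state": {"ball": 0, "robot": 5},
          "next_state": {"ball": 1, "robot": 5},
          "independent": [ball]}
    t2 = {"state": {"ball": 9, "robot": 2},
          "next_state": {"ball": 9, "robot": 3},
          "independent": [ball]}
    print(coda_swap(t1, t2, ball))
```

In this toy usage the ball's motion comes from the first transition and the robot's motion from the second, giving a transition that was never observed but is consistent with both under the stated independence assumption.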
Fri 12:10 p.m. - 12:40 p.m.
Fri 12:40 p.m. - 1:20 p.m.

Objects elicit attention in many everyday contexts, even from infancy. Objects also serve as the referents for humans’ earliest symbolic learning: language. In this talk, I’ll present my lab’s recent work with young children suggesting that objects are also prioritized in another early emerging and uniquely human symbolic expression: drawing. I’ll conclude my talk by suggesting that researchers interested in artificial intelligence may look for inspiration in human intelligence, especially when it comes to the way that humans attend to and represent objects.

Moira R Dillon
Fri 1:20 p.m. - 2:00 p.m.

Learning depends on both the learning mechanism and the structure of the training data, yet most research in human learning and most efforts in machine learning concentrate on the learning mechanisms. I will present evidence on the everyday egocentric visual experiences of infants. The regularities in these experiences differ fundamentally, and in multiple inter-related ways, from current approaches to training in machine learning, and may offer inspiration for more powerful, more incremental, and more autonomous machine learning.

Linda Smith
Fri 2:00 p.m. - 2:40 p.m.

How we represent signals has major implications for the algorithms we build to analyze them. Today, most signals are represented discretely: Images as grids of pixels, shapes as point clouds, audio as grids of amplitudes, etc. If images weren't pixel grids - would we be using convolutional neural networks today? What makes a good or bad representation? Can we do better? I will talk about leveraging emerging implicit neural representations for complex & large signals, such as room-scale geometry, images, audio, video, and physical signals defined via partial differential equations. By embedding an implicit scene representation in a neural rendering framework and learning a prior over these representations, I will show how we can enable 3D reconstruction from only a single posed 2D image. Finally, I will show how gradient-based meta-learning can enable fast inference of implicit representations, and how the features we learn in the process are already useful to the downstream task of semantic segmentation.

Vincent Sitzmann
Fri 2:40 p.m. - 3:20 p.m.

Energy-based models are undergoing a resurgence of interest, but their applications have largely focused on generative modeling and density estimation. In this talk I will discuss the application of energy-based models to object- or concept-oriented learning and reasoning. These models offer an elegant approach to concept composition, continual and unsupervised learning, and the use of concepts in multiple contexts. I will show examples of these advantages, and conclude with a set of future research directions.

Igor Mordatch
Fri 3:20 p.m. - 3:40 p.m.

The physical world can be decomposed into discrete 3D objects. Reasoning about the world in terms of these objects may provide a number of advantages to learning agents. For example, objects interact compositionally, and this can support a strong form of generalization. Knowing properties of individual objects and rules for how those properties interact, one can predict the effects that objects will have on one another even if one has never witnessed an interaction between the types of objects in question. The promise of object-level reasoning has fueled a recent surge of interest in systems capable of learning to extract object-oriented representations from perceptual input without supervision. However, the vast majority of such systems treat objects as 2D entities, effectively ignoring their 3D nature. In the current work, we propose a probabilistic, object-oriented model equipped with the inductive bias that the world is made up of 3D objects moving through a 3D world, and make a number of structural adaptations which take advantage of that bias. In a series of experiments we show that this system is capable not only of segmenting objects from the perceptual stream, but also of extracting 3D information about objects (e.g. depth) and of tracking them through 3D space.

Eric Crawford
Fri 3:40 p.m. - 3:45 p.m.

The ability to build a wide array of physical structures, from sand castles to skyscrapers, is a hallmark of human intelligence. What computational mechanisms enable humans to reason about how such structures are built? Here we conduct an empirical investigation of how people solve challenging physical assembly problems and update their policies across repeated attempts. Participants viewed silhouettes of 8 unique towers in a 2D virtual environment simulating rigid-body physics, and aimed to reconstruct each one using a fixed inventory of rectangular blocks. We found that people learned to build each target tower more accurately across repeated attempts, and that these gains reflect both group-level convergence upon a smaller set of viable policies and error-dependent updating of each individual's policy. Taken together, our study provides a novel benchmark for evaluating how well algorithmic models of physical reasoning and planning correspond to human behavior.

Will P McCarthy
Fri 3:45 p.m. - 3:50 p.m.

Learning-based 3D object reconstruction enables single- or few-shot estimation of 3D object models. For robotics, this holds the potential to allow model-based methods to rapidly adapt to novel objects and scenes. Existing 3D reconstruction techniques optimize for visual reconstruction fidelity, typically measured by chamfer distance or voxel IoU. We find that when applied to realistic, cluttered robotics environments, these systems produce reconstructions with low physical realism, resulting in poor task performance when used for model-based control. We propose ARM, an amodal 3D reconstruction system that introduces (1) an object stability prior over the shapes of groups of objects, (2) an object connectivity prior over object shapes, and (3) a multi-channel input representation and reconstruction objective that allows for reasoning over relationships between groups of objects. By using these priors over the physical properties of objects, our system not only improves reconstruction quality by standard visual metrics, but also improves the performance of model-based control on a variety of robotic manipulation tasks in challenging, cluttered environments.

William Agnew
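
For readers unfamiliar with the two visual-fidelity metrics named in the abstract above, here is a minimal pure-Python sketch. Conventions for chamfer distance vary (squared or unsquared distances, sums or means); this version assumes mean squared nearest-neighbour distance in both directions, and the function names are illustrative rather than taken from the paper.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance between two non-empty point sets.

    Assumed convention: for each point, take the squared Euclidean distance
    to its nearest neighbour in the other set, average per set, and add the
    two directions together.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


def voxel_iou(voxels_a, voxels_b):
    """Intersection over union of two sets of occupied voxel coordinates."""
    a, b = set(voxels_a), set(voxels_b)
    union = a | b
    # Two empty grids are treated as identical (an assumption of this sketch).
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    target = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
    print(chamfer_distance(predicted, target))
    print(voxel_iou([(0, 0, 0), (1, 0, 0)], [(1, 0, 0), (2, 0, 0)]))
```

Both metrics compare geometry only, which is consistent with the abstract's point that a reconstruction can score well visually while still being physically implausible.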
Fri 3:50 p.m. - 3:55 p.m.

Compositional structures between parts and objects are inherent in natural scenes. Recent work on representation learning has succeeded in modeling scenes as compositions of objects, but further decomposition of objects into parts and subparts has largely been overlooked. In this paper, we propose RICH, the first deep latent variable model for learning Representation of Interpretable Compositional Hierarchies. At the core of RICH is a latent scene graph representation that organizes the entities of a scene into a tree according to their compositional relationships. During inference, RICH takes a top-down approach, allowing higher-level representations to guide lower-level decomposition when there is compositional ambiguity. In experiments on images containing multiple compositional objects, we demonstrate that RICH is able to learn the latent compositional hierarchy, generate imaginary scenes, and improve data efficiency in downstream tasks.

Fei Deng
Fri 3:55 p.m. - 4:00 p.m.

A set is an unordered collection of unique elements, and yet many machine learning models that generate sets impose an implicit or explicit ordering. Since model performance can depend on the choice of ordering, any particular ordering can lead to sub-optimal results. An alternative solution is to use a permutation-equivariant set generator, which does not specify an ordering. An example of such a generator is the DeepSet Prediction Network (DSPN). We introduce the Transformer Set Prediction Network (TSPN), a flexible permutation-equivariant model for set prediction based on the transformer, which builds upon and outperforms DSPN in the quality of predicted set elements and in the accuracy of their predicted sizes. We test our model on MNIST-as-point-clouds (SET-MNIST) for point-cloud generation and on CLEVR for object detection.

Adam Kosiorek
Fri 4:00 p.m. - 4:55 p.m.

Please access the posters via the workshop website using Zoom room password: w00l