Formalizing the Binding Problem
Abstract
Representations of the world arguably contain not only information about features (e.g., something is blue, something is a circle) but also information about which features are part of the same object (e.g., the circle is blue), which we call binding information. Any system that understands scenes with multiple objects must solve the binding problem: it needs to know which features belong together. However, although prior work shows that Vision Transformers (ViTs) know which patches belong together, it is not known whether current deep learning models learn binding information for features. One might expect representations to contain little binding information; after all, misattributing features to the wrong objects is a common failure mode of ViT-based architectures, especially in scenes with distracting objects. Here we formalize the binding problem with an information-theoretic approach and introduce a probing method to measure binding information in model representations. We measure binding information across datasets with different numbers of features, varying levels of object occlusion, synthetic (e.g., red, circle) versus natural (e.g., bikes, running) features, and out-of-distribution feature combinations (e.g., blue penguins), evaluating several pre-trained ViTs. Our results demonstrate that binding is a key ingredient of strong visual recognition and reasoning.
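To make the measurement concrete, the sketch below illustrates one way a binding probe could work; it is a minimal illustration under stated assumptions, not the paper's actual method. It uses a V-information-style estimate: the usable binding information in a representation Z about a binding variable Y is approximated as H(Y) minus the cross-entropy of a linear probe trained to predict Y from Z. The function name, the uniform-assignment assumption, and the choice of a linear probe are all assumptions made here for illustration.

```python
# Illustrative sketch only: a linear probe estimating usable binding
# information (in nats) in frozen representations. The names and the
# uniform-label assumption below are hypothetical, not from the paper.
import math
import torch
import torch.nn as nn

def binding_information_probe(reps, binding_labels, n_assignments,
                              epochs=200, lr=1e-2):
    """Estimate usable binding information as H(Y) - CE(probe(Z), Y).

    reps:            (N, D) frozen model representations, e.g. pooled
                     ViT patch tokens (preprocessing is assumed).
    binding_labels:  (N,) index of the true feature-to-object assignment,
                     e.g. 0 = "circle is blue, square is red",
                          1 = "circle is red, square is blue".
    n_assignments:   number of possible assignments K; assuming labels
                     are uniform, H(Y) = log K.
    """
    reps = reps.detach()  # keep gradients from flowing into the backbone
    probe = nn.Linear(reps.shape[1], n_assignments)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # cross-entropy in nats
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(reps), binding_labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        ce = loss_fn(probe(reps), binding_labels).item()
    # Positive when the probe beats chance, i.e., when the
    # representation carries linearly decodable binding information.
    return math.log(n_assignments) - ce
```

In practice the final cross-entropy would be measured on held-out data rather than the training set, to avoid overestimating the binding information that the representation actually carries.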