Poster in Workshop: Knowledge and Logical Reasoning in the Era of Data-driven Learning
Does End-to-End Visual Pretraining Help Reasoning?
Chen Sun · Calvin Luo · Xingyi Zhou · Anurag Arnab · Cordelia Schmid
We aim to investigate whether end-to-end learning of visual reasoning can be achieved with general-purpose neural networks, with the help of visual pretraining. A positive result would refute the common belief that explicit visual abstraction (e.g., object detection) is essential for compositional generalization on visual reasoning, and would confirm the feasibility of a neural network "generalist" that solves visual recognition and reasoning tasks. We propose a simple and general self-supervised framework that "compresses" each video frame into a small set of tokens with a transformer network and reconstructs the remaining frames from the compressed temporal context. To minimize the reconstruction loss, the network must learn a compact representation for each image and capture temporal dynamics and object permanence from the temporal context. We evaluate the proposed approach on two visual reasoning benchmarks, CATER and ACRE. We observe that end-to-end pretraining is essential to achieve compositional generalization for visual reasoning. Our proposed framework outperforms traditional supervised pretraining, whether with classification or with explicit object detection, by large margins.
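The compress-then-reconstruct objective described above can be illustrated with a toy NumPy sketch. This is not the authors' implementation: a single untrained dot-product attention layer stands in for the transformer, and every name here (`compress_frames`, `reconstruct_targets`, `slot_queries`, `target_queries`), the choice of alternating context and target frames, and the mean-squared-error loss are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Single-head dot-product attention (stand-in for a transformer layer)."""
    d = queries.shape[-1]
    scores = queries @ np.swapaxes(keys_values, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def compress_frames(patches, slot_queries):
    """Compress each frame's P patch embeddings into K tokens.

    patches: (T, P, D), slot_queries: (K, D) -> (T, K, D)
    """
    return attend(slot_queries[None, :, :], patches)

def reconstruct_targets(context_tokens, target_queries):
    """Predict target-frame patches from the compressed context tokens.

    context_tokens: (Tc, K, D), target_queries: (Tt, P, D) -> (Tt, P, D)
    """
    memory = context_tokens.reshape(-1, context_tokens.shape[-1])  # (Tc*K, D)
    return attend(target_queries, memory[None, :, :])

def reconstruction_loss(patches, slot_queries, target_queries,
                        context_idx, target_idx):
    """Mean squared error between predicted and true target-frame patches."""
    tokens = compress_frames(patches[context_idx], slot_queries)
    predicted = reconstruct_targets(tokens, target_queries[target_idx])
    return float(np.mean((predicted - patches[target_idx]) ** 2))

# Hypothetical sizes: 6 frames, 16 patches per frame, 8-dim embeddings, 4 tokens.
rng = np.random.default_rng(0)
T, P, D, K = 6, 16, 8, 4
patches = rng.normal(size=(T, P, D))         # stand-in for patch embeddings
slot_queries = rng.normal(size=(K, D))       # would be learned in practice
target_queries = rng.normal(size=(T, P, D))  # e.g. positional queries per patch
context_idx, target_idx = [0, 2, 4], [1, 3, 5]

tokens = compress_frames(patches[context_idx], slot_queries)
predicted = reconstruct_targets(tokens, target_queries[target_idx])
loss = reconstruction_loss(patches, slot_queries, target_queries,
                           context_idx, target_idx)
```

In this sketch the only route from the context frames to the predicted target frames is the K tokens per frame, which is the bottleneck the abstract credits with forcing a compact per-image representation. In a trained model the loss would be minimized by gradient descent over the network parameters, which this example omits.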