Track: Deep Learning Algorithms 3

Tue 20 July 7:00 - 7:20 PDT

Oral

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

Wonjae Kim · Bokyung Son · Ildoo Kim

Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches to VLP heavily rely on image feature extraction processes, most of which involve region supervision (e.g., object detection) and the convolutional architecture (e.g., ResNet). Although disregarded in the literature, we find this problematic in terms of both (1) efficiency/speed, in that simply extracting input features requires much more computation than the multimodal interaction steps; and (2) expressive power, as it is upper bounded by the expressive power of the visual embedder and its predefined visual vocabulary. In this paper, we present a minimal VLP model, Vision-and-Language Transformer (ViLT), monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner in which we process textual inputs. We show that ViLT is up to tens of times faster than previous VLP models, yet with competitive or better downstream task performance. Our code and pre-trained weights are available at https://github.com/dandelin/vilt.
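As a rough illustration of what a convolution-free visual embedding can look like, the sketch below cuts an image into fixed-size patches, flattens each patch, and applies one shared linear projection, much as a text model maps tokens to vectors. It is a minimal sketch and not the authors' implementation: the function names, the toy sizes, and the random weights are all assumptions, and position or modality embeddings are left out.

```python
import random

def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            vec = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    vec.extend(image[r][c])  # append the C channel values
            patches.append(vec)
    return patches

def linear(vec, weight, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def embed_patches(image, patch, weight, bias):
    """Project every flattened patch with the same linear layer."""
    return [linear(p, weight, bias) for p in patchify(image, patch)]

# Toy sizes (hypothetical): 8x8 RGB image, 4x4 patches, 5-dim embeddings.
rng = random.Random(0)
H, W, C, P, D = 8, 8, 3, 4, 5
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0.0, 0.02) for _ in range(P * P * C)] for _ in range(D)]
bias = [0.0] * D

tokens = embed_patches(image, P, weight, bias)
print(len(tokens), len(tokens[0]))  # 4 5
```

The image becomes (H/P) x (W/P) visual tokens, which could then be concatenated with text token embeddings and fed to a single transformer.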

Tue 20 July 7:20 - 7:25 PDT

Spotlight

Learning Curves for Analysis of Deep Networks

Derek Hoiem · Tanmay Gupta · Zhizhong Li · Michal Shlapentokh-Rothman

Learning curves model a classifier's test error as a function of the number of training samples. Prior works show that learning curves can be used to select model parameters and extrapolate performance. We investigate how to use learning curves to evaluate design choices, such as pretraining, architecture, and data augmentation. We propose a method to robustly estimate learning curves, abstract their parameters into error and data-reliance, and evaluate the effectiveness of different parameterizations. Our experiments exemplify use of learning curves for analysis and yield several interesting observations.
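To make the idea of a fitted learning curve concrete, the sketch below assumes a power-law form e(n) = alpha + eta * n^gamma with gamma fixed at -0.5 and fits alpha and eta by ordinary least squares. The parametric form, the fixed exponent, and the synthetic data are assumptions chosen for illustration; the abstract does not specify the estimation procedure.

```python
def fit_learning_curve(ns, errs, gamma=-0.5):
    """Least-squares fit of err = alpha + eta * n**gamma for a fixed gamma."""
    xs = [n ** gamma for n in ns]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(errs) / len(errs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, errs))
    eta = sxy / sxx
    alpha = mean_y - eta * mean_x
    return alpha, eta

def predict_error(alpha, eta, n, gamma=-0.5):
    """Extrapolate the fitted curve to a training-set size n."""
    return alpha + eta * n ** gamma

# Synthetic, noise-free measurements generated from alpha=0.1, eta=2.0.
ns = [100, 400, 1600, 6400]
errs = [0.1 + 2.0 * n ** -0.5 for n in ns]

alpha, eta = fit_learning_curve(ns, errs)
extrapolated = predict_error(alpha, eta, 25600)
print(round(alpha, 4), round(eta, 4), round(extrapolated, 4))  # 0.1 2.0 0.1125
```

Under this assumed form, alpha is the error the curve approaches with unlimited data, and eta controls how strongly the error depends on the amount of data.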

Tue 20 July 7:25 - 7:30 PDT

Spotlight

GLSearch: Maximum Common Subgraph Detection via Learning to Search

Yunsheng Bai · Derek Xu · Yizhou Sun · Wei Wang

Detecting the Maximum Common Subgraph (MCS) between two input graphs is fundamental for applications in drug synthesis, malware detection, cloud computing, etc. However, MCS computation is NP-hard, and state-of-the-art MCS solvers rely on heuristic search algorithms which in practice cannot find good solutions for large graph pairs given a limited computation budget. We propose GLSearch, a Graph Neural Network (GNN) based learning to search model. Our model is built upon the branch and bound algorithm, which selects one pair of nodes from the two input graphs to expand at a time. We propose a novel GNN-based Deep Q-Network (DQN) to select the node pair, making the search process much faster. Experiments on synthetic and real-world graph pairs demonstrate that our model learns a search strategy that is able to detect significantly larger common subgraphs than existing MCS solvers given the same computation budget. GLSearch can be potentially extended to solve many other combinatorial problems with constraints on graphs.
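The branch and bound search described above can be sketched on small graphs as follows. This is a plain exhaustive search for a maximum common induced subgraph, not GLSearch itself: the `select` argument is a hypothetical hook where a learned policy could choose what to expand next, and here it defaults to a simple highest-degree heuristic. For simplicity the policy picks only the node from the first graph, and every compatible partner in the second graph is then tried.

```python
def max_common_subgraph(adj1, adj2, select=None):
    """Return a largest node mapping g1 -> g2 that preserves (non-)adjacency."""
    if select is None:
        # Stand-in policy: expand the remaining g1 node with highest degree.
        select = lambda remaining, mapping: max(remaining, key=lambda u: len(adj1[u]))
    nodes1, nodes2 = sorted(adj1), sorted(adj2)
    best = {}

    def consistent(u, v, mapping):
        # u-v may be added only if edges to all mapped pairs agree.
        return all((a in adj1[u]) == (b in adj2[v]) for a, b in mapping.items())

    def search(mapping, remaining1, used2):
        nonlocal best
        if len(mapping) > len(best):
            best = dict(mapping)
        # Bound: even matching every remaining node cannot beat the incumbent.
        upper = len(mapping) + min(len(remaining1), len(nodes2) - len(used2))
        if upper <= len(best):
            return
        u = select(remaining1, mapping)
        rest = [x for x in remaining1 if x != u]
        for v in nodes2:
            if v not in used2 and consistent(u, v, mapping):
                mapping[u] = v
                search(mapping, rest, used2 | {v})
                del mapping[u]
        search(mapping, rest, used2)  # branch where u stays unmatched

    search({}, nodes1, frozenset())
    return best

# g1: triangle 0-1-2 with a pendant node 3; g2: 4-cycle a-b-c-d with chord a-c.
g1 = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
g2 = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"a", "c"}}

result = max_common_subgraph(g1, g2)
print(len(result))  # 3
```

The two graphs share a triangle but differ in edge count, so no four-node match exists. The quality of the selection policy does not change the answer of an exhaustive search, only how quickly good solutions are found under a limited budget.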

Tue 20 July 7:30 - 7:35 PDT

Spotlight

Learning Intra-Batch Connections for Deep Metric Learning

Jenny Seidenschwarz · Ismail Elezi · Laura Leal-Taixé

The goal of metric learning is to learn a function that maps samples to a lower-dimensional space where similar samples lie closer than dissimilar ones. Particularly, deep metric learning utilizes neural networks to learn such a mapping. Most approaches rely on losses that only take the relations between pairs or triplets of samples into account, which either belong to the same class or two different classes. However, these methods do not explore the embedding space in its entirety. To this end, we propose an approach based on message passing networks that takes all the relations in a mini-batch into account. We refine embedding vectors by exchanging messages among all samples in a given batch allowing the training process to be aware of its overall structure. Since not all samples are equally important to predict a decision boundary, we use an attention mechanism during message passing to allow samples to weigh the importance of each neighbor accordingly. We achieve state-of-the-art results on clustering and image retrieval on the CUB-200-2011, Cars196, Stanford Online Products, and In-Shop Clothes datasets. To facilitate further research, we make available the code and the models at https://github.com/dvl-tum/intrabatchconnections.
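One attention-weighted message passing step over a mini-batch might look like the sketch below. It is an assumption-laden simplification: attention scores are scaled dot products of the raw embeddings, there are no learned projections, and the update is a plain residual sum, none of which is claimed to match the authors' architecture.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def message_passing_step(batch):
    """Refine each embedding with attention-weighted messages from all others."""
    dim = len(batch[0])
    refined, attention = [], []
    for i, xi in enumerate(batch):
        neighbors = [j for j in range(len(batch)) if j != i]
        # Scaled dot-product scores decide how much each neighbor contributes.
        weights = softmax([dot(xi, batch[j]) / math.sqrt(dim) for j in neighbors])
        message = [sum(w * batch[j][k] for w, j in zip(weights, neighbors))
                   for k in range(dim)]
        refined.append([x + m for x, m in zip(xi, message)])  # residual update
        attention.append(dict(zip(neighbors, weights)))
    return refined, attention

# Toy batch: samples 0 and 1 are similar, sample 2 is different.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
refined, attention = message_passing_step(batch)
print(attention[0][1] > attention[0][2])  # True
```

Because every sample attends to every other sample in the batch, the refined embeddings depend on the whole batch rather than on isolated pairs or triplets.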

Tue 20 July 7:35 - 7:40 PDT

Spotlight

Simultaneous Similarity-based Self-Distillation for Deep Metric Learning

Karsten Roth · Timo Milbich · Bjorn Ommer · Joseph Paul Cohen · Marzyeh Ghassemi

Deep Metric Learning (DML) provides a crucial tool for visual similarity and zero-shot retrieval applications by learning generalizing embedding spaces, although recent work in DML has shown strong performance saturation across training objectives. However, generalization capacity is known to scale with the embedding space dimensionality. Unfortunately, high dimensional embeddings also create higher retrieval cost for downstream applications. To remedy this, we propose S2SD - Simultaneous Similarity-based Self-distillation. S2SD extends DML with knowledge distillation from auxiliary, high-dimensional embedding and feature spaces to leverage complementary context during training while retaining test-time cost and with negligible changes to the training time. Experiments and ablations across different objectives and standard benchmarks show S2SD offering highly significant improvements of up to 7% in Recall@1, while also setting a new state-of-the-art.
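A similarity-based distillation term can be sketched as a comparison between the batch similarity structure of a low-dimensional embedding and that of a high-dimensional one. The specific choices below (cosine similarity, a row-wise softmax with a temperature, and a KL divergence averaged over rows) are assumptions made for illustration; the abstract does not give the exact loss.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(embeddings):
    """Pairwise cosine similarities between all samples in a batch."""
    unit = [normalize(e) for e in embeddings]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def row_softmax(matrix, temperature):
    out = []
    for row in matrix:
        m = max(row)
        exps = [math.exp((x - m) / temperature) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def similarity_distillation_loss(student_emb, teacher_emb, temperature=1.0):
    """Mean row-wise KL(teacher || student) between similarity distributions."""
    p = row_softmax(similarity_matrix(teacher_emb), temperature)
    q = row_softmax(similarity_matrix(student_emb), temperature)
    total = 0.0
    for p_row, q_row in zip(p, q):
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p_row, q_row))
    return total / len(p)

# Toy batch of three samples: 2-dim student space, 4-dim auxiliary space.
student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

loss = similarity_distillation_loss(student, teacher)
self_loss = similarity_distillation_loss(student, student)
print(loss > 0.0, abs(self_loss) < 1e-12)  # True True
```

Since only similarity matrices are compared, the two embedding spaces may have different dimensionality, and the high-dimensional branch can be dropped at test time.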

Tue 20 July 7:40 - 7:45 PDT

Spotlight

Unifying Vision-and-Language Tasks via Text Generation

Jaemin Cho · Jie Lei · Hao Tan · Mohit Bansal

Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task. Examples include a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning. To alleviate these hassles, in this work, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have been previously modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization ability on questions that have rare answers. Also, we show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving similar performance to separately optimized single-task models. Our code is publicly available at: https://github.com/j-min/VL-T5
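The idea of casting different tasks as text generation can be illustrated by how training examples might be formatted. In the sketch below, the prefix strings, the `<region_3>` token, and the function name are hypothetical, and the visual input that would accompany each example is omitted; only the text side is shown.

```python
# Hypothetical task prefixes; the real system may use different strings.
TASK_PREFIXES = {
    "vqa": "vqa",
    "grounding": "visual grounding",
    "caption": "caption",
}

def to_text_example(task, text, label):
    """Turn a task instance into an (input text, target text) pair."""
    prefix = TASK_PREFIXES[task]
    source = f"{prefix}: {text}" if text else f"{prefix}:"
    return {"input_text": source, "target_text": str(label)}

examples = [
    to_text_example("vqa", "what color is the cat?", "black"),
    to_text_example("grounding", "the cat on the left", "<region_3>"),
    to_text_example("caption", "", "a black cat sitting on a sofa"),
]
for ex in examples:
    print(ex["input_text"], "->", ex["target_text"])
```

Every task shares the same input and output type, so one encoder-decoder model with one language modeling loss can be trained on all of them, with labels such as answers or region identifiers emitted as text.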

Tue 20 July 7:45 - 7:50 PDT

Spotlight

DeepWalking Backwards: From Embeddings Back to Graphs

Sudhanshu Chanpuriya · Cameron Musco · Konstantinos Sotiropoulos · Charalampos Tsourakakis

Low-dimensional node embeddings play a key role in analyzing graph datasets. However, little work studies exactly what information is encoded by popular embedding methods, and how this information correlates with performance in downstream learning tasks. We tackle this question by studying whether embeddings can be inverted to (approximately) recover the graph used to generate them. Focusing on a variant of the popular DeepWalk method (Perozzi et al., 2014; Qiu et al., 2018), we present algorithms for accurate embedding inversion: from the low-dimensional embedding of a graph $G$, we can find a graph $\tilde G$ with a very similar embedding. We perform numerous experiments on real-world networks, observing that significant information about $G$, such as specific edges and bulk properties like triangle density, is often lost in $\tilde G$. However, community structure is often preserved or even enhanced. Our findings are a step towards a more rigorous understanding of exactly what information embeddings encode about the input graph, and why this information is useful for learning tasks.
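Comparing an original graph with a reconstructed one, in terms of specific edges and triangles, can be sketched with two small helpers. The inversion algorithm itself is not shown; the function names and the two toy graphs are assumptions, with the second graph simply standing in for a reconstruction.

```python
from itertools import combinations

def edge_set(edges):
    """Normalize undirected edges so (u, v) and (v, u) compare equal."""
    return {tuple(sorted(e)) for e in edges}

def edge_overlap(original, reconstructed):
    """Fraction of the original graph's edges present in the reconstruction."""
    orig, recon = edge_set(original), edge_set(reconstructed)
    return len(orig & recon) / len(orig)

def triangle_count(edges):
    """Count node triples whose three connecting edges all exist."""
    es = edge_set(edges)
    nodes = sorted({n for e in es for n in e})
    return sum(
        1
        for a, b, c in combinations(nodes, 3)
        if (a, b) in es and (b, c) in es and (a, c) in es
    )

# Toy graphs: G has one triangle; the stand-in reconstruction is a 4-cycle.
G = [(0, 1), (1, 2), (0, 2), (2, 3)]
G_tilde = [(0, 1), (1, 2), (2, 3), (0, 3)]

print(edge_overlap(G, G_tilde))                        # 0.75
print(triangle_count(G), triangle_count(G_tilde))      # 1 0
```

In this toy case most edges survive but the triangle does not, which is the kind of difference between $G$ and $\tilde G$ that such measurements are meant to expose.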

Tue 20 July 7:50 - 7:55 PDT

Q&A
