Workshop: The First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward

Generative Self-training Improves Pre-training for Visual Dialog

Gi-Cheon Kang · Sungdong Kim · Jin-Hwa Kim · Donghyun Kwak · Byoung-Tak Zhang


Visual dialog (VisDial) is the task of answering a series of questions grounded in an image, using the dialog history as context. Prior work has either trained dialog models solely on VisDial data via supervised learning or leveraged pre-training on related vision-and-language datasets. This paper presents a semi-supervised learning approach for VisDial, called Generative Self-Training (GST), to enhance the pre-training. Specifically, GST generates synthetic dialog data for unlabeled images via multimodal conditional text generation and trains the dialog model on both the synthetic and the original VisDial data. We also propose perplexity-based data selection and multimodal consistency regularization for robust training on the synthetic data. Evaluation on the VisDial v1.0 dataset shows that GST improves the pre-training and achieves new state-of-the-art results.
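As a rough illustration of what perplexity-based data selection could look like, the sketch below scores each synthetic answer by the perplexity implied by its per-token log-probabilities and keeps only the samples below a cutoff. The abstract does not specify the selection rule, so the function names, the `(text, token_logprobs)` sample format, and the `max_ppl` threshold are all assumptions made for this example and not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical sample format: (generated text, per-token log-probabilities
# assigned by the generator). Not taken from the paper.
Sample = Tuple[str, Sequence[float]]


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_by_perplexity(samples: List[Sample], max_ppl: float) -> List[Sample]:
    """Keep synthetic samples whose perplexity is at most max_ppl.

    max_ppl is an assumed, user-chosen cutoff; the abstract gives no value.
    """
    return [s for s in samples if perplexity(s[1]) <= max_ppl]


if __name__ == "__main__":
    synthetic = [
        ("yes, it is sunny", [math.log(0.5)] * 4),   # confident generation
        ("purple maybe dog", [math.log(0.01)] * 3),  # uncertain generation
    ]
    kept = select_by_perplexity(synthetic, max_ppl=10.0)
    print([text for text, _ in kept])
```

Low perplexity means the generator was confident in its own output, so filtering on it is one simple way to discard noisy synthetic dialogs before they are mixed with the original VisDial data.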
