Improving multimodal datasets with image captioning
Thao Nguyen · Gabriel Ilharco · Sewoong Oh · Ludwig Schmidt

Massive web datasets play a key role in the success of large vision-language models such as CLIP and Flamingo. However, the raw data is noisy, and existing methods to reduce noise often come at the expense of data diversity. Our work focuses on caption quality as one major source of noise, and studies the effectiveness of synthetic captions in increasing the utility of web-scraped datapoints with poorly aligned captions. By exploring different strategies for mixing raw and synthetic captions, we achieve state-of-the-art performance at the small and medium scales of the DataComp benchmark (Gadre et al., 2023), improving ImageNet accuracy by 2% and average accuracy (over 38 tasks) by 4% compared to the previous best baseline, given a candidate pool of 128M image-text pairs. The best-performing approach is also 2x better at retrieval on Flickr and MS-COCO. We then analyze what makes synthetic captions effective, and explore the impact of the image captioning model and sampling temperature on the resulting training set. Overall, our findings demonstrate the potential of leveraging image captioning models as a way to improve multimodal datasets: (i) we show that progress in image captioning models can translate to better captions and boost accuracy, and (ii) this unlocks a plethora of web images without accompanying captions that can now be used for training.

Author Information

Thao Nguyen (University of Washington)
Gabriel Ilharco (University of Washington)
Sewoong Oh (University of Washington)
Ludwig Schmidt (University of Washington)
