
Unifying Modalities, Tasks, and Architectures Through a Simple Sequence-to-Sequence Learning Framework
Peng Wang · An Yang · Rui Men · Junyang Lin · Shuai Bai · Zhikang Li · Jianxin Ma · Chang Zhou · Jingren Zhou · Hongxia Yang

Thu Jul 21 03:00 PM -- 05:00 PM (PDT) @ #321

In this work, we pursue a unified paradigm for multimodal pretraining that breaks the scaffolds of complex task- and modality-specific customization. We propose OFA, a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (e.g., image generation, visual grounding, image captioning, image classification, text generation, etc.) into a simple sequence-to-sequence learning framework based on the encoder-decoder architecture. OFA performs pretraining and finetuning with task instructions and introduces no extra task-specific layers for finetuning. Experimental results show that OFA achieves new state-of-the-art results on a series of multimodal tasks, including image captioning (COCO test CIDEr: 149.6), text-to-image generation (COCO test FID: 10.5), VQA (test-std acc.: 80.02), SNLI-VE (test acc.: 90.20), and referring expression comprehension (RefCOCO / RefCOCO+ / RefCOCOg test acc.: 92.93 / 90.10 / 85.20). Through extensive analyses, we demonstrate that OFA reaches performance comparable to uni-modal pretrained models (e.g., BERT, MAE, MoCo v3, SimCLR v2, etc.) on uni-modal tasks, including NLU, NLG, and image classification, and that it transfers effectively to unseen tasks and domains. Code shall be released soon.
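The abstract's central idea is that every task, regardless of modality, is expressed as a plain-text instruction plus a serialized input, all consumed by one shared encoder-decoder model with no task-specific heads. A minimal sketch of that interface is below; the instruction strings, the `Example` type, and the `build_sequence` helper are illustrative assumptions, not OFA's actual preprocessing code.

```python
# Sketch of instruction-based task unification: each task is reduced to
# (instruction text, serialized input) and flattened into one source
# sequence for a single shared seq2seq model. The "<image tokens>"
# placeholder stands in for discretized image representations.
from dataclasses import dataclass


@dataclass
class Example:
    instruction: str  # natural-language description of the task
    source: str       # serialized input (text and/or image-token placeholder)


def build_sequence(example: Example) -> str:
    """Concatenate instruction and input into one source sequence.

    The same shared model decodes the answer (caption text, region
    coordinates, class label, ...) as a plain token sequence, so no
    task-specific output layer is needed.
    """
    return f"{example.instruction} {example.source}".strip()


# Different tasks, one interface: only the instruction changes.
captioning = Example("what does the image describe?", "<image tokens>")
vqa = Example("what color is the car? <image tokens>", "")
grounding = Example('which region does the text "a red car" describe?',
                    "<image tokens>")

for ex in (captioning, vqa, grounding):
    print(build_sequence(ex))
```

The design point this sketch captures is that finetuning a new task requires only writing a new instruction, since both input and output are ordinary token sequences for the same encoder-decoder.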

Author Information

Peng Wang (Alibaba Group)
An Yang (Alibaba Group)
Rui Men (Alibaba Group)
Junyang Lin (Alibaba Group)
Shuai Bai (Alibaba Group)
Zhikang Li (DAMO Academy, Alibaba Group)
Jianxin Ma (Alibaba Group)
Chang Zhou (Alibaba Group)
Jingren Zhou (Alibaba Group)
Hongxia Yang (Alibaba Group)
