In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization. We propose OFA, a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (e.g., image generation, visual grounding, image captioning, image classification, text generation, etc.) into a simple sequence-to-sequence learning framework based on the encoder-decoder architecture. OFA performs pretraining and finetuning with task instructions and introduces no extra task-specific layers for finetuning. Experimental results show that OFA achieves new state-of-the-art results on a series of multimodal tasks, including image captioning (COCO test CIDEr: 149.6), text-to-image generation (COCO test FID: 10.5), VQA (test-std acc.: 80.02), SNLI-VE (test acc.: 90.20), and referring expression comprehension (RefCOCO / RefCOCO+ / RefCOCOg test acc.: 92.93 / 90.10 / 85.20). Through extensive analyses, we demonstrate that OFA reaches performance comparable to uni-modal pretrained models (e.g., BERT, MAE, MoCo v3, SimCLR v2, etc.) on uni-modal tasks, including NLU, NLG, and image classification, and that it transfers effectively to unseen tasks and domains. Code shall be released soon.
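The core idea in the abstract is that every task, cross-modal or uni-modal, is cast as plain sequence-to-sequence generation conditioned on a natural-language task instruction, so one encoder-decoder handles them all with no task-specific heads. A minimal sketch of that I/O formulation is below; the function name, template strings, and the `<img>` placeholder are illustrative assumptions, not OFA's actual code or exact instruction wording.

```python
# Toy illustration of an instruction-based, task-agnostic seq2seq
# formulation: every task is rendered as one token sequence
# (instruction + inputs), so a single encoder-decoder model needs
# no task-specific layers. Template strings are assumptions for
# illustration, not OFA's verbatim instructions.

def format_example(task: str, inputs: dict) -> str:
    """Render any task as a plain instruction + input sequence."""
    templates = {
        # Cross-modal tasks
        "caption": "What does the image describe? {image}",
        "vqa": "{question} {image}",
        "grounding": "Which region does the text '{text}' describe? {image}",
        # Uni-modal tasks share the same interface
        "summarize": "What is the summary of the article '{text}'?",
    }
    return templates[task].format(**inputs)

# All tasks collapse into one input space: a token sequence.
print(format_example("vqa", {"question": "How many birds?", "image": "<img>"}))
```

Targets are likewise plain token sequences (answers, captions, region coordinates serialized as location tokens, or discrete image codes for generation), which is what lets pretraining and finetuning share a single seq2seq objective.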
Author Information
Peng Wang (Alibaba Group)
An Yang (Alibaba Group)
Rui Men (Alibaba Group)
Junyang Lin (Alibaba Group)
Shuai Bai (Alibaba Group)
Zhikang Li (DAMO Academy, Alibaba Group)
Jianxin Ma (Alibaba Group)
Chang Zhou (Alibaba Group)
Jingren Zhou (Alibaba Group)
Hongxia Yang (Alibaba Group)
Related Events (a corresponding poster, oral, or spotlight)
- 2022 Poster: Unifying Modalities, Tasks, and Architectures Through a Simple Sequence-to-Sequence Learning Framework »
  Thu. Jul 21st through Fri. Jul 22nd
More from the Same Authors
- 2022 Poster: Principled Knowledge Extrapolation with GANs »
  Ruili Feng · Jie Xiao · Kecheng Zheng · Deli Zhao · Jingren Zhou · Qibin Sun · Zheng-Jun Zha
- 2022 Poster: Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably) »
  Yu Huang · Junyang Lin · Chang Zhou · Hongxia Yang · Longbo Huang
- 2022 Spotlight: Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably) »
  Yu Huang · Junyang Lin · Chang Zhou · Hongxia Yang · Longbo Huang
- 2022 Spotlight: Principled Knowledge Extrapolation with GANs »
  Ruili Feng · Jie Xiao · Kecheng Zheng · Deli Zhao · Jingren Zhou · Qibin Sun · Zheng-Jun Zha
- 2021 Poster: Learning to Rehearse in Long Sequence Memorization »
  Zhu Zhang · Chang Zhou · Jianxin Ma · Zhijie Lin · Jingren Zhou · Hongxia Yang · Zhou Zhao
- 2021 Spotlight: Learning to Rehearse in Long Sequence Memorization »
  Zhu Zhang · Chang Zhou · Jianxin Ma · Zhijie Lin · Jingren Zhou · Hongxia Yang · Zhou Zhao
- 2021 Poster: Uncertainty Principles of Encoding GANs »
  Ruili Feng · Zhouchen Lin · Jiapeng Zhu · Deli Zhao · Jingren Zhou · Zheng-Jun Zha
- 2021 Spotlight: Uncertainty Principles of Encoding GANs »
  Ruili Feng · Zhouchen Lin · Jiapeng Zhu · Deli Zhao · Jingren Zhou · Zheng-Jun Zha
- 2021 Poster: KNAS: Green Neural Architecture Search »
  Jingjing Xu · Liang Zhao · Junyang Lin · Rundong Gao · Xu Sun · Hongxia Yang
- 2021 Spotlight: KNAS: Green Neural Architecture Search »
  Jingjing Xu · Liang Zhao · Junyang Lin · Rundong Gao · Xu Sun · Hongxia Yang