OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Peng Wang · An Yang · Rui Men · Junyang Lin · Shuai Bai · Zhikang Li · Jianxin Ma · Chang Zhou · Jingren Zhou · Hongxia Yang

Thu Jul 21 10:40 AM -- 10:45 AM (PDT) @ Ballroom 1 & 2

In this work, we pursue a unified paradigm for multimodal pretraining to break the shackles of complex task/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows instruction-based learning in both the pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with recent state-of-the-art vision & language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new state-of-the-art results in a series of cross-modal tasks while attaining highly competitive performance on unimodal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains. Our code and models are publicly available at https://github.com/OFA-Sys/OFA.
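
To make the idea of instruction-based sequence-to-sequence unification more concrete, the following is a minimal sketch of how several tasks might be cast into one (source, target) format. It is not the OFA implementation: the names `TEMPLATES`, `Seq2SeqExample`, and `build_example`, the wording of every instruction, and the target strings are assumptions made for illustration and have not been checked against the paper or the repository.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical instruction templates (illustrative wording only; assumed for
# this sketch, not verified against the OFA paper or repository).
TEMPLATES = {
    "image_captioning": "What does the image describe?",
    "visual_grounding": 'Which region does the text "{text}" describe?',
    "image_classification": "What is the category of the image?",
    "language_modeling": 'Continue the text "{text}".',
}


@dataclass
class Seq2SeqExample:
    source: str                   # instruction text given to the encoder
    target: str                   # sequence the decoder is trained to emit
    image: Optional[str] = None   # placeholder image reference (path or id)


def build_example(task: str, target: str, image: Optional[str] = None,
                  **fields: str) -> Seq2SeqExample:
    """Cast one task instance into the shared (source, target) format."""
    source = TEMPLATES[task].format(**fields)
    return Seq2SeqExample(source=source, target=target, image=image)


# Every task yields the same kind of object, so a single encoder-decoder
# could consume all of them without task-specific output layers.
caption = build_example("image_captioning", target="a dog on a beach",
                        image="img_001.jpg")
grounding = build_example("visual_grounding", target="<hypothetical box tokens>",
                          image="img_001.jpg", text="the dog")
```

In this sketch, only the instruction string distinguishes one task from another; the model interface stays the same, which is the property the abstract attributes to the framework.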

Author Information

Peng Wang (Alibaba Group)
An Yang (Alibaba Group)
Rui Men (Alibaba Group)
Junyang Lin (Alibaba Group)
Shuai Bai (Alibaba Group)
Zhikang Li (DAMO Academy, Alibaba Group)
Jianxin Ma (Alibaba Group)
Chang Zhou (Alibaba Group)
Jingren Zhou (Alibaba Group)
Hongxia Yang (Alibaba Group)
