Poster
Generative Pretraining From Pixels
Mark Chen · Alec Radford · Rewon Child · Jeffrey K Wu · Heewoo Jun · David Luan · Ilya Sutskever

Tue Jul 14 10:00 AM -- 10:45 AM & Tue Jul 14 09:00 PM -- 09:45 PM (PDT)

Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. We are also competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0% top-1 accuracy on a linear probe of our features.
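As a rough illustration of the setup described in the abstract, the sketch below shows how an image can be treated as a 1D sequence for next-pixel prediction with no 2D structure given to the model. It is not the authors' code. The function names, the 4-value palette, the start-token id, and the uniform baseline are all assumptions made for this example, and the raster (row-by-row) ordering is likewise an assumed choice.

```python
import math

def flatten_raster(image):
    """Flatten a 2D grid of pixel values into a 1D sequence, row by row."""
    return [pixel for row in image for pixel in row]

def next_pixel_pairs(sequence, start_token):
    """Build (inputs, targets) for next-pixel prediction.

    inputs[i] is the most recent token seen when predicting targets[i];
    a real autoregressive model would condition on all of inputs[:i + 1].
    """
    inputs = [start_token] + sequence[:-1]
    targets = list(sequence)
    return inputs, targets

def uniform_nll(targets, vocab_size):
    """Average negative log-likelihood (nats per pixel) of a uniform predictor."""
    return sum(-math.log(1.0 / vocab_size) for _ in targets) / len(targets)

# Toy 2x3 "image" whose pixels index a hypothetical 4-value palette.
image = [[0, 1, 2],
         [3, 2, 1]]
VOCAB_SIZE = 4
START = VOCAB_SIZE  # hypothetical start-of-sequence token id, outside the palette

seq = flatten_raster(image)
inputs, targets = next_pixel_pairs(seq, START)
baseline = uniform_nll(targets, VOCAB_SIZE)  # equals log(VOCAB_SIZE)
```

A trained sequence model would replace the uniform predictor and aim for a per-pixel loss below this baseline. The linear probe mentioned in the abstract would then fit a linear classifier on the model's frozen intermediate features.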

Author Information

Mark Chen (OpenAI)
Alec Radford (OpenAI)
Rewon Child (OpenAI)
Jeffrey K Wu (OpenAI)
Heewoo Jun (OpenAI)
David Luan (OpenAI)
Ilya Sutskever (OpenAI)
