Scaling Vision Transformers to 22 Billion Parameters
Mostafa Dehghani · Josip Djolonga · Basil Mustafa · Piotr Padlewski · Jonathan Heek · Justin Gilmer · Andreas Steiner · Mathilde Caron · Robert Geirhos · Ibrahim Alabdulmohsin · Rodolphe Jenatton · Lucas Beyer · Michael Tschannen · Anurag Arnab · Xiao Wang · Carlos Riquelme · Matthias Minderer · Joan Puigcerver · Utku Evci · Manoj Kumar · Sjoerd van Steenkiste · Gamaleldin Elsayed · Aravindh Mahendran · Fisher Yu · Avital Oliver · Fantine Huot · Jasmijn Bastings · Mark Collier · Alexey Gritsenko · Vighnesh N Birodkar · Cristina Vasconcelos · Yi Tay · Thomas Mensink · Alexander Kolesnikov · Filip Pavetic · Dustin Tran · Thomas Kipf · Mario Lucic · Xiaohua Zhai · Daniel Keysers · Jeremiah Harmsen · Neil Houlsby

Wed Jul 26 02:00 PM -- 03:30 PM (PDT) @ Exhibit Hall 1 #219

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.

Author Information

Mostafa Dehghani
Josip Djolonga (Google)
Basil Mustafa (Google)
Piotr Padlewski (Google DeepMind)
Jonathan Heek (Google)
Justin Gilmer (Google Brain)
Andreas Steiner (Google)

Computer vision research engineer at Google DeepMind. Previously worked in tropical medicine. Educational background: MD, bioelectronics.

Mathilde Caron (Google)
Robert Geirhos (Google DeepMind)
Ibrahim Alabdulmohsin (Google)
Rodolphe Jenatton (Google Research)
Lucas Beyer (Google Brain (Zürich))
Michael Tschannen (Google Brain)
Anurag Arnab (University of Oxford)
Xiao Wang (Google)
Carlos Riquelme (Google Brain)
Matthias Minderer (Google Research)
Joan Puigcerver (Google DeepMind)
Utku Evci (Google)
Manoj Kumar (Google Brain)
Sjoerd van Steenkiste (IDSIA)
Gamaleldin Elsayed (Google DeepMind)

Gamaleldin F. Elsayed is a Research Scientist at Google DeepMind interested in deep learning and computational neuroscience research. In particular, his research focuses on studying properties and problems of artificial neural networks and designing better machine learning models with inspiration from neuroscience. In 2017, he completed his PhD in Neuroscience at Columbia University, at the Center for Theoretical Neuroscience. During his PhD, he contributed to the field of computational neuroscience by designing machine learning methods for identifying and validating structures in complex neural data. Prior to that, he completed his B.S. at The American University in Cairo with a major in Electronics Engineering and a minor in Computer Science, and earned M.S. degrees in electrical engineering from KAUST and Washington University in St. Louis. Before his graduate studies, he was also a professional athlete and Olympic fencer. He competed at the 2008 Olympic Games in Beijing with the Egyptian saber team.

Aravindh Mahendran (Google)
Fisher Yu (ETH Zurich)
Avital Oliver (Bar Ilan University)
Fantine Huot (Google)
Jasmijn Bastings (Google)
Mark Collier (Google)
Alexey Gritsenko (Google)
Vighnesh N Birodkar (Google)
Cristina Vasconcelos
Yi Tay (Google)
Thomas Mensink (Google Research / University of Amsterdam)
Alexander Kolesnikov (Google Brain)
Filip Pavetic (Google)
Dustin Tran (Google Brain)
Thomas Kipf (Google DeepMind)
Mario Lucic (Google Brain)
Xiaohua Zhai (Google Brain)
Daniel Keysers (Google)
Jeremiah Harmsen (Google)

Jeremiah Harmsen joined Google in 2005, where he has founded efforts such as TensorFlow Hub, TensorFlow Serving and the Machine Learning Ninja Rotation. He focuses on creating the ideas, tools and people to help the world use machine learning. He currently leads the Applied Machine Intelligence group at Google AI Zurich. The team increases the impact of machine learning through consultancy, state-of-the-art infrastructure development, research and education. Jeremiah received a B.S. degree in electrical engineering and computer engineering (2001), an M.S. degree in electrical engineering (2003), an M.S. degree in mathematics (2005) and a Ph.D. in electrical engineering (2005) from Rensselaer Polytechnic Institute, Troy, NY. Jeremiah lives by the lake with his wife, son and daughter in Zurich, Switzerland.

Neil Houlsby (Google)
