Abstract
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera.
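The abstract's key ingredient is MAE-style pretraining: randomly mask out a large fraction of patch tokens and train the model to reconstruct them. Below is a minimal sketch of that random masking step, assuming a flat array of patch embeddings; the function name and interface are illustrative, not taken from the Hiera codebase.

```python
import numpy as np

def random_mask(tokens: np.ndarray, mask_ratio: float = 0.6, seed: int = 0):
    """MAE-style random masking: keep a random subset of patch tokens.

    tokens: (N, D) array of patch embeddings.
    Returns (kept tokens, kept indices, masked indices).
    """
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))          # number of visible tokens
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)                   # random ordering of patches
    keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]
    return tokens[keep_idx], keep_idx, mask_idx

# Example: a 14x14 grid of patches (196 tokens), embedding dim 8.
tokens = np.zeros((196, 8))
kept, keep_idx, mask_idx = random_mask(tokens, mask_ratio=0.6)
print(kept.shape)  # (78, 8): only the visible 40% is encoded
```

Only the kept tokens are fed through the encoder during pretraining, which is what makes MAE pretraining cheap; the decoder then reconstructs the masked patches from the encoded visible ones.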
Author Information
Chaitanya Ryali (FAIR, Meta AI)
Yuan-Ting Hu (UIUC)
Daniel Bolya (Georgia Tech / Meta)
Chen Wei (Johns Hopkins University)
Haoqi Fan (Facebook AI Research)
Po-Yao Huang (Facebook)
Vaibhav Aggarwal (Facebook)
Arkabandhu Chowdhury (Facebook)
Omid Poursaeed (Meta AI)
Judy Hoffman (Georgia Institute of Technology)
Jitendra Malik (University of California at Berkeley)
Yanghao Li
Christoph Feichtenhofer (Facebook)
Related Events (a corresponding poster, oral, or spotlight)
- 2023 Poster: Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
  Thu. Jul 27th, 12:00 -- 01:30 AM, Exhibit Hall 1 #219
More from the Same Authors
- 2023: ConceptEvo: Interpreting Concept Evolution in Deep Learning Training
  Haekyu Park · Seongmin Lee · Benjamin Hoover · Austin Wright · Omar Shaikh · Rahul Duggal · Nilaksh Das · Kevin Li · Judy Hoffman · Polo Chau
- 2023 Poster: Surface Snapping Optimization Layer for Single Image Object Shape Reconstruction
  Yuan-Ting Hu · Alex Schwing · Raymond A. Yeh
- 2023 Tutorial: Self-Supervised Learning in Vision: from Research Advances to Best Practices
  Xinlei Chen · Ishan Misra · Randall Balestriero · Mathilde Caron · Christoph Feichtenhofer · Mark Ibrahim
- 2022 Poster: Image-to-Image Regression with Distribution-Free Uncertainty Quantification and Applications in Imaging
  Anastasios Angelopoulos · Amit Pal Kohli · Stephen Bates · Michael Jordan · Jitendra Malik · Thayer Alshaabi · Srigokul Upadhyayula · Yaniv Romano
- 2022 Spotlight: Image-to-Image Regression with Distribution-Free Uncertainty Quantification and Applications in Imaging
  Anastasios Angelopoulos · Amit Pal Kohli · Stephen Bates · Michael Jordan · Jitendra Malik · Thayer Alshaabi · Srigokul Upadhyayula · Yaniv Romano
- 2021 Poster: Differentiable Spatial Planning using Transformers
  Devendra Singh Chaplot · Deepak Pathak · Jitendra Malik
- 2021 Spotlight: Differentiable Spatial Planning using Transformers
  Devendra Singh Chaplot · Deepak Pathak · Jitendra Malik
- 2020 Poster: Which Tasks Should Be Learned Together in Multi-task Learning?
  Trevor Standley · Amir Zamir · Dawn Chen · Leonidas Guibas · Jitendra Malik · Silvio Savarese
- 2020 Poster: Deep Isometric Learning for Visual Recognition
  Haozhi Qi · Chong You · Xiaolong Wang · Yi Ma · Jitendra Malik
- 2017 Poster: Fast k-Nearest Neighbour Search via Prioritized DCI
  Ke Li · Jitendra Malik
- 2017 Talk: Fast k-Nearest Neighbour Search via Prioritized DCI
  Ke Li · Jitendra Malik