The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs---we call the results “model soups.” When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieved a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at https://github.com/mlfoundations/model-soups.
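The recipe sketched in the abstract can be illustrated in a few lines. Below is a minimal, hedged sketch of a uniform soup (plain weight averaging) and a greedy soup (add models best-first, keeping only those that improve held-out validation accuracy). It assumes models are represented as dicts mapping parameter names to numpy arrays, and `evaluate` is a hypothetical callable returning held-out validation accuracy; it is not the authors' released implementation.

```python
import numpy as np

def average_weights(state_dicts):
    """Uniformly average a list of model state dicts (a 'model soup')."""
    keys = state_dicts[0].keys()
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}

def greedy_soup(state_dicts, evaluate):
    """Greedy soup: visit models in order of individual validation
    accuracy, adding each to the soup only if the averaged weights
    do not hurt held-out accuracy."""
    ranked = sorted(state_dicts, key=evaluate, reverse=True)
    soup = [ranked[0]]
    best_acc = evaluate(average_weights(soup))
    for sd in ranked[1:]:
        candidate_acc = evaluate(average_weights(soup + [sd]))
        if candidate_acc >= best_acc:
            soup.append(sd)
            best_acc = candidate_acc
    return average_weights(soup)
```

Because the soup is a single set of weights, inference cost is identical to one model, unlike a logit ensemble that must run every member.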
Author Information
Mitchell Wortsman (University of Washington)
Gabriel Ilharco (University of Washington)
Samir Gadre (Columbia University)
Becca Roelofs (Google Research)
Raphael Gontijo Lopes (Google Brain)
Ari Morcos (Facebook AI Research (FAIR))
Hongseok Namkoong (Columbia University)
Ali Farhadi (University of Washington, Allen Institute for AI)
Yair Carmon (Tel Aviv University)
Simon Kornblith (Google Brain)
Ludwig Schmidt (University of Washington)
Related Events (a corresponding poster, oral, or spotlight)
-
2022 Spotlight: Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time »
Thu. Jul 21st 08:00 -- 08:05 PM Room Hall F
More from the Same Authors
-
2021 : On the Origins of the Block Structure Phenomenon in Neural Network Representations »
Thao Nguyen · Maithra Raghu · Simon Kornblith -
2022 : When does dough become a bagel? Analyzing the remaining mistakes on ImageNet »
Vijay Vasudevan · Benjamin Caine · Raphael Gontijo Lopes · Sara Fridovich-Keil · Becca Roelofs -
2022 : Diagnosing Model Performance Under Distribution Shift »
Tianhui Cai · Hongseok Namkoong · Steve Yadlowsky -
2022 : How well do contrastively trained models transfer? »
M. Moein Shariatnia · Rahim Entezari · Mitchell Wortsman · Olga Saukh · Ludwig Schmidt -
2022 : On the Connection between Pre-training Data Diversity and Robustness »
Vivek Ramanujan · Thao Nguyen · Ludwig Schmidt · Ali Farhadi -
2023 : Improving multimodal datasets with image captioning »
Thao Nguyen · · Gabriel Ilharco · Sewoong Oh · Ludwig Schmidt -
2023 : SemDeDup: Data-efficient learning at web-scale through semantic deduplication »
Amro Abbas · Daniel Simig · Surya Ganguli · Ari Morcos · Kushal Tirumala -
2023 : D4: Document Deduplication and Diversification »
Kushal Tirumala · Daniel Simig · Armen Aghajanyan · Ari Morcos -
2023 : Dynamic Control of Queuing Networks via Differentiable Discrete-Event Simulation »
Ethan Che · Hongseok Namkoong · Jing Dong -
2023 Poster: DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule »
Maor Ivgi · Oliver Hinder · Yair Carmon -
2023 Poster: On the Relationship Between Explanation and Prediction: A Causal View »
Amir-Hossein Karimi · Krikamol Muandet · Simon Kornblith · Bernhard Schölkopf · Been Kim -
2023 Poster: Gradient Descent Monotonically Decreases the Sharpness of Gradient Flow Solutions in Scalar Networks and Beyond »
Itai Kreisler · Mor Shpigel Nacson · Daniel Soudry · Yair Carmon -
2022 : Contributed Talk 1: When does dough become a bagel? Analyzing the remaining mistakes on ImageNet »
Vijay Vasudevan · Benjamin Caine · Raphael Gontijo Lopes · Sara Fridovich-Keil · Becca Roelofs -
2022 Poster: COAT: Measuring Object Compositionality in Emergent Representations »
Sirui Xie · Ari Morcos · Song-Chun Zhu · Shanmukha Ramakrishna Vedantam -
2022 Poster: Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP) »
Alex Fang · Gabriel Ilharco · Mitchell Wortsman · Yuhao Wan · Vaishaal Shankar · Achal Dave · Ludwig Schmidt -
2022 Poster: RECAPP: Crafting a More Efficient Catalyst for Convex Optimization »
Yair Carmon · Arun Jambulapati · Yujia Jin · Aaron Sidford -
2022 Spotlight: COAT: Measuring Object Compositionality in Emergent Representations »
Sirui Xie · Ari Morcos · Song-Chun Zhu · Shanmukha Ramakrishna Vedantam -
2022 Spotlight: Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP) »
Alex Fang · Gabriel Ilharco · Mitchell Wortsman · Yuhao Wan · Vaishaal Shankar · Achal Dave · Ludwig Schmidt -
2022 Spotlight: RECAPP: Crafting a More Efficient Catalyst for Convex Optimization »
Yair Carmon · Arun Jambulapati · Yujia Jin · Aaron Sidford -
2021 Poster: CURI: A Benchmark for Productive Concept Learning Under Uncertainty »
Shanmukha Ramakrishna Vedantam · Arthur Szlam · Maximilian Nickel · Ari Morcos · Brenden Lake -
2021 Spotlight: CURI: A Benchmark for Productive Concept Learning Under Uncertainty »
Shanmukha Ramakrishna Vedantam · Arthur Szlam · Maximilian Nickel · Ari Morcos · Brenden Lake -
2021 Poster: Accuracy on the Line: on the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization »
John Miller · Rohan Taori · Aditi Raghunathan · Shiori Sagawa · Pang Wei Koh · Vaishaal Shankar · Percy Liang · Yair Carmon · Ludwig Schmidt -
2021 Spotlight: Accuracy on the Line: on the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization »
John Miller · Rohan Taori · Aditi Raghunathan · Shiori Sagawa · Pang Wei Koh · Vaishaal Shankar · Percy Liang · Yair Carmon · Ludwig Schmidt -
2021 Poster: Generalised Lipschitz Regularisation Equals Distributional Robustness »
Zac Cranko · Zhan Shi · Xinhua Zhang · Richard Nock · Simon Kornblith -
2021 Spotlight: Generalised Lipschitz Regularisation Equals Distributional Robustness »
Zac Cranko · Zhan Shi · Xinhua Zhang · Richard Nock · Simon Kornblith -
2021 Poster: Learning Neural Network Subspaces »
Mitchell Wortsman · Maxwell Horton · Carlos Guestrin · Ali Farhadi · Mohammad Rastegari -
2021 Spotlight: Learning Neural Network Subspaces »
Mitchell Wortsman · Maxwell Horton · Carlos Guestrin · Ali Farhadi · Mohammad Rastegari -
2021 Poster: ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases »
Stéphane d'Ascoli · Hugo Touvron · Matthew Leavitt · Ari Morcos · Giulio Biroli · Levent Sagun -
2021 Spotlight: ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases »
Stéphane d'Ascoli · Hugo Touvron · Matthew Leavitt · Ari Morcos · Giulio Biroli · Levent Sagun -
2020 Poster: Soft Threshold Weight Reparameterization for Learnable Sparsity »
Aditya Kusupati · Vivek Ramanujan · Raghav Somani · Mitchell Wortsman · Prateek Jain · Sham Kakade · Ali Farhadi -
2020 Poster: Revisiting Spatial Invariance with Low-Rank Local Connectivity »
Gamaleldin Elsayed · Prajit Ramachandran · Jon Shlens · Simon Kornblith -
2020 : Engagement and Solidarity with Global Queer Communities »
Raphael Gontijo Lopes · Bisi Alimi · Faris Gezahegn · Ida Momennejad · Tan Zhi-Xuan -
2020 Poster: A Simple Framework for Contrastive Learning of Visual Representations »
Ting Chen · Simon Kornblith · Mohammad Norouzi · Geoffrey Hinton -
2020 Affinity Workshop: Queer in AI »
ST John · William Agnew · Anja Meunier · Alex Markham · Manu Saraswat · Andrew McNamara · Raphael Gontijo Lopes -
2019 Workshop: Identifying and Understanding Deep Learning Phenomena »
Hanie Sedghi · Samy Bengio · Kenji Hata · Aleksander Madry · Ari Morcos · Behnam Neyshabur · Maithra Raghu · Ali Rahimi · Ludwig Schmidt · Ying Xiao -
2019 : Spotlight »
Tyler Scott · Kiran Thekumparampil · Jonathan Aigrain · Rene Bidart · Priyadarshini Panda · Dian Ang Yap · Yaniv Yacoby · Raphael Gontijo Lopes · Alberto Marchisio · Erik Englesson · Wanqian Yang · Moritz Graule · Yi Sun · Daniel Kang · Mike Dusenberry · Min Du · Hartmut Maennel · Kunal Menda · Vineet Edupuganti · Luke Metz · David Stutz · Vignesh Srinivasan · Timo Sämann · Vineeth N Balasubramanian · Sina Mohseni · Rob Cornish · Judith Butepage · Zhangyang Wang · Bai Li · Bo Han · Honglin Li · Maksym Andriushchenko · Lukas Ruff · Meet P. Vadera · Yaniv Ovadia · Sunil Thulasidasan · Disi Ji · Gang Niu · Saeed Mahloujifar · Aviral Kumar · SANGHYUK CHUN · Dong Yin · Joyce Xu Xu · Hugo Gomes · Raanan Rohekar -
2019 Poster: Similarity of Neural Network Representations Revisited »
Simon Kornblith · Mohammad Norouzi · Honglak Lee · Geoffrey Hinton -
2019 Oral: Similarity of Neural Network Representations Revisited »
Simon Kornblith · Mohammad Norouzi · Honglak Lee · Geoffrey Hinton -
2018 Poster: Measuring abstract reasoning in neural networks »
Adam Santoro · Felix Hill · David GT Barrett · Ari S Morcos · Timothy Lillicrap -
2018 Oral: Measuring abstract reasoning in neural networks »
Adam Santoro · Felix Hill · David GT Barrett · Ari S Morcos · Timothy Lillicrap