<!-- .slide: data-text-color="black" data-transition="zoom" -->

## Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

[`Mitchell Wortsman`](https://mitchellnw.github.io), [`Gabriel Ilharco`](https://gabrielilharco.com), _et al_.

presented by [Albert M. Orozco Camacho](https://twitter.com/alorozco53)

----

<iframe width="900" height="500" src="https://www.youtube.com/embed/Zv71G3yXPi4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

---

<!-- .slide: data-transition="zoom" data-background="red"-->

# Preliminaries

----

- <!-- .element: class="fragment" --> Finetuning large models <i>often</i> works well...
- <!-- .element: class="fragment" --> but it's not everything we need!
- <!-- .element: class="fragment" --> Finetuning roughly consists of two steps:
  1. finetune (train) models using many hyperparameter configurations, and
  2. select the model that achieves the highest accuracy.

----

## What happens to discarded models when finetuning?

- _Ensembling_ models can outperform the best finetuned model.
- It is known that finetuning may reduce OoD performance ([Radford et al., 2021](https://arxiv.org/abs/2103.00020); [Wortsman et al., 2021](https://arxiv.org/abs/2109.01903)).
  - i.e., zero-shot tasks are quite challenging

----

<!-- .slide: data-background="white"-->

## In this paper...

<p style="text-align:left;">
The authors take an alternative approach to the <b>finetuning recipe</b>, which is claimed to be <i>more robust</i> and <i>more accurate</i>.
</p>

> **Finetuning (Soups) Recipe**
> :male-cook: :female-cook:
> <font size="5">
> 1. finetune (train) models using many hyperparameter configurations, and
> 2. _average the weights of all the finetuned models_
> </font>

Note:
- averaging weights seems like a surprising method to improve performance
- yet the literature indicates many success stories, due to LMC (linear mode connectivity)
- but don't forget about model merging...

----

<!-- .slide: data-background="yellow"-->

### Paper Contributions

----

- <!-- .element: class="fragment" --> <b>Greedy Soups</b>
  - a sequential recipe where models are added to the soup only if they improve accuracy on held-out data

----

- <!-- .element: class="fragment" --> <b>Experiments</b>
  - using models like CLIP, ALIGN, ViT-G;
  - the authors achieve 90.94% top-1 accuracy on ImageNet, beating the previous SOTA,
  - while requiring 25% fewer FLOPs;
  - soups can approach ensembling performance with no additional computational cost;
  - soups also help in various distribution-shift regimes.

----

<!-- .slide: data-background="white" .slide: data-transition="zoom"-->

<img src="https://www.moma.org/d/assets/W1siZiIsIjIwMTUvMTAvMjEvOTY0aWFsdm96Yl9zb3VwY2FuLmpwZyJdLFsicCIsImNvbnZlcnQiLCItcXVhbGl0eSA5MCAtcmVzaXplIDIwMDB4MjAwMFx1MDAzZSJdXQ/soupcan.jpg?sha=9a38fb887eb28928" alt="campbell's" width="50%" height="50%">

---

<!-- .slide: data-transition="zoom" data-background="blue"-->

# Method

----

<!-- .slide: data-background="white"-->

<p style="text-align:left;">
The authors propose 3 recipes for <i>model souping</i>:
</p>

1. :martial_arts_uniform: **Uniform**
2. :money_mouth_face: **Greedy**
3. :books: **Learned**

Note:
1. uniform soup is naïve averaging
2. greedy soup is the central, incremental algorithm
3. learned soup optimizes model interpolation weights with a gradient-based approach

----

<!-- .slide: data-transition="fade" .slide: data-background="brown" -->

## Setting

----

<!-- .slide: data-background="white"-->

- Consider a neural network $f(x, \theta)$ with input data $x$ and parameters $\theta \in \mathbb{R}^d$.
- Let $\theta = \text{FineTune}(\theta_0, h)$ be the parameters obtained by finetuning pretrained weights $\theta_0$ with hyperparameter configuration $h$.
- We can list all hyperparameter configurations $h_1,\ldots,h_k$ and distinguish each $\theta_i = \text{FineTune}(\theta_0, h_i)$.

----

<!-- .slide: data-transition="zoom" data-background="white"-->

- **Model Soups**:

$$
\theta_S = \frac{1}{|S|} \sum_{i \in S} \theta_i, \quad \text{where } S \subseteq \{1,\ldots,k\}
$$

- :martial_arts_uniform: Uniform Soup: average for $S = \{1,\ldots,k\}$
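----

<!-- .slide: data-transition="zoom" data-background="white"-->

As a minimal sketch (mine, not the authors' code), the uniform soup is just a parameter-wise average of the finetuned checkpoints, assuming they all share the same architecture and pretrained initialization:

```python
import torch

def uniform_soup(state_dicts):
    """Average the weights of all finetuned models (uniform soup)."""
    soup = {}
    for key in state_dicts[0]:
        # stack the k copies of each parameter and average them
        # (cast to float, assuming all checkpoints share keys and shapes)
        soup[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return soup

# hypothetical usage, with `paths` pointing at k finetuned checkpoints:
# soup = uniform_soup([torch.load(p, map_location="cpu") for p in paths])
# model.load_state_dict(soup)
```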
----

<!-- .slide: data-transition="zoom" data-background="white"-->

### :money_mouth_face: Greedy Soup :female-cook: :male-cook:

![](https://i.imgur.com/Xl904DW.png)
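----

<!-- .slide: data-transition="zoom" data-background="white"-->

The same algorithm as a sketch in code (my paraphrase of Algorithm 1 above, reusing `uniform_soup` from the earlier slide; `evaluate` is an assumed helper that returns held-out accuracy for a given set of averaged weights):

```python
def greedy_soup(state_dicts, val_accuracies, evaluate):
    """Greedily add ingredients, keeping each one only if it helps held-out accuracy."""
    # visit checkpoints in order of decreasing individual validation accuracy
    order = sorted(range(len(state_dicts)), key=lambda i: -val_accuracies[i])
    ingredients = [state_dicts[order[0]]]
    best_acc = val_accuracies[order[0]]
    for i in order[1:]:
        candidate = uniform_soup(ingredients + [state_dicts[i]])
        acc = evaluate(candidate)
        if acc >= best_acc:  # keep the ingredient only if the soup improves
            ingredients.append(state_dicts[i])
            best_acc = acc
    return uniform_soup(ingredients)
```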
----

<!-- .slide: data-transition="zoom" data-background="white"-->

### :books: Learned Soup

- Learn a mixing coefficient for each of the ingredients on the held-out validation set.
- Given loss $\ell$ and validation set $\{x_j, y_j\}_{j=1}^n$, the authors find coefficients $\alpha \in \mathbb{R}^k$ and a temperature-scaling parameter $\beta$ by gradient-based optimization:

$$
\text{argmin}_{\alpha \in \mathbb{R}^k, \beta \in \mathbb{R}} \sum_{j=1}^n \ell \left(\beta \cdot f\left(x_j, \sum_{i=1}^k \alpha_i \theta_i \right), y_j \right)
$$

----

<!-- .slide: data-transition="zoom" data-background="white"-->

![](https://i.imgur.com/1veiWT3.png)

----

<!-- .slide: data-transition="fade" data-background="orange"-->

![pozole](https://blog.amigofoods.com/wp-content/uploads/2021/03/pozole-rojo-mexican-soup.jpg)

---

<!-- .slide: data-background="purple"-->

# Experiments

----

<!-- .slide: data-transition="fade" -->

## Models

- _CLIP_ ([Radford et al., 2021](https://arxiv.org/abs/2103.00020)), _ALIGN_ ([Jia et al., 2021](https://arxiv.org/abs/2102.05918)), _BASIC_ ([Pham et al., 2021](https://arxiv.org/abs/2111.10050))
  - trained with contrastive supervision from image-text pairs

<img src="https://i0.wp.com/www.thepanthertech.com/wp-content/uploads/2023/01/Screenshot-2023-01-05-130218.png?resize=1296%2C747&ssl=1" alt="google vs openai" width="50%" height="50%">

----

<!-- .slide: data-transition="fade" -->

- _ViT-G/14_ ([Zhai et al., 2021](https://arxiv.org/abs/2106.04560v2))
  - pretrained on JFT-3B ([Zhai et al., 2021](https://arxiv.org/abs/2106.04560v2))
- Transformer models for text classification, such as BERT ([Devlin et al., 2019](https://aclanthology.org/N19-1423/)) and T5 ([Raffel et al., 2020](https://jmlr.org/papers/v21/20-074.html))

----

### Initialization of the final layer

1. Via _linear probe_ (LP): freeze almost all layers and finetune only the model's head.
2. Zero-shot: use the classifier produced by the text portion of CLIP or ALIGN.

The authors claim that both methods produce similar trends, but LP is preferred in most experiments.

----

### Error Landscapes

<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

![](https://i.imgur.com/rIoeRG7.png)

----

<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

### A Geometric Observation

<font size="5"><i>Models that form an angle closer to $90^\circ$ tend to lead to higher accuracy.</i></font>

![](https://i.imgur.com/pYiAgcF.png)

----

### Ensemble performance is correlated with _Soup_ performance

<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

![](https://i.imgur.com/VYJ90lP.png)

----

### Comparison of :stew::stew::stew: vs individual models

<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

<img src="https://i.imgur.com/e6u06bY.png" alt="comparison soups" width="50%" height="70%"><img src="https://i.imgur.com/Y66V3wt.png" alt="comparison soups" width="50%" height="70%">

Note:
- Model soups improve accuracy over the best individual model when performing a large, random hyperparameter search for finetuning a CLIP ViT-B/32 model on ImageNet
- greedy soup outperforms the best single model in both settings

----

### Ablation Study

<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

<img src="https://i.imgur.com/uaK3EFY.png" alt="comparison soups" width="60%" height="70%">

Note:
- Model soups improve accuracy when finetuning on the diverse classification task WILDS-FMoW

----

### Greedy Soup does better than the best individual ViT-G models

<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

![](https://i.imgur.com/hZiPXjY.png)

----

<img src="https://dishingouthealth.com/wp-content/uploads/2022/01/SpicyMisoRamen_Square.jpg" alt="ramen" width="60%" height="70%">

---

<!-- .slide: data-transition="zoom" data-background="pink"-->

# Epilogue & Interesting Stuff

----

<!-- .slide: data-background="cyan"-->

<img src="https://i.imgur.com/PTnkkRI.jpg" alt="dog" width="70%" height="70%">

----

<!-- .slide: data-transition="fade" data-background="white"-->

### Expected Soup vs Ensemble loss comparison

- The authors derive an analytic expression for the expected difference between the loss of the soup and the loss of the corresponding ensemble.

![](https://i.imgur.com/Ldp9Dts.png)

----

<!-- .slide: data-transition="fade" data-background="white"-->

### ~~Scope and Limitations~~ Caveats :necktie:

- Experiments were done on large models pretrained on heterogeneous datasets;
  - _what about low-resource regimes?_
- Ensembles improve model calibration, i.e., they provide better confidence estimates for predictions;
  - model soups don't show the same benefit.

----

<!-- .slide: data-transition="zoom" data-background="yellow"-->

<img src="https://i.imgur.com/JB8gSM2.jpg" alt="cereal" width="70%" height="70%">

<b>#controversial</b>
{"metaMigratedAt":"2023-06-17T22:29:37.892Z","metaMigratedFrom":"YAML","title":"Model soups","breaks":true,"contributors":"[{\"id\":\"adb0403f-b4e6-4ebc-be17-cc638e9f5cfe\",\"add\":23593,\"del\":13569}]","description":"Mitchell Wortsman,Gabriel Ilharco, et al."}