<!-- .slide: data-text-color="black" data-transition="zoom" -->

## Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

[`Mitchell Wortsman`](https://mitchellnw.github.io), [`Gabriel Ilharco`](https://gabrielilharco.com), _et al_.

presented by [Albert M. Orozco Camacho](https://twitter.com/alorozco53)

----

<iframe width="900" height="500" src="https://www.youtube.com/embed/Zv71G3yXPi4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

---

<!-- .slide: data-transition="zoom" data-background="red"-->

# Preliminaries

----

- <!-- .element: class="fragment" --> Finetuning large models <i>often</i> works well...
- <!-- .element: class="fragment" --> but it's not everything we need!
- <!-- .element: class="fragment" --> Finetuning roughly consists of two steps:
  1. finetune (train) models using many hyperparameter configurations, and
  2. select the model that achieves the highest accuracy.

----

## What happens to discarded models when finetuning?

- _Ensembling_ models can outperform the best finetuned model.
- It is known that finetuning may reduce OoD performance ([Radford et al., 2021](https://arxiv.org/abs/2103.00020); [Wortsman et al., 2021](https://arxiv.org/abs/2109.01903)).
  - i.e., zero-shot tasks are quite challenging

----

<!-- .slide: data-background="white"-->

## In this paper...

<p style="text-align:left;">
The authors take an alternative approach to the <b>finetuning recipe</b>, which is claimed to be <i>more robust</i> and <i>more accurate</i>.
</p>

> **Finetuning (Soups) Recipe**
> :male-cook: :female-cook:
> <font size="5">
> 1. finetune (train) models using many hyperparameter configurations, and
> 2. _average the weights of all the finetuned models_
> </font>

Note:
- averaging weights seems like a surprising method to improve performance
- yet the literature indicates many success stories, due to LMC (linear mode connectivity)
- but don't forget about model merging...

----

<!-- .slide: data-background="yellow"-->

### Paper Contributions

----

- <!-- .element: class="fragment" --> <b>Greedy Soups</b>
  - a sequential recipe where models are added to the soup only if they improve accuracy on held-out data

----

- <!-- .element: class="fragment" --> <b>Experiments</b>
  - using models like CLIP, ALIGN, ViT-G;
  - the authors achieve 90.94% top-1 accuracy on ImageNet, beating the previous SOTA,
  - while requiring 25% fewer FLOPs;
  - soups can approach ensembling performance with no additional computational cost;
  - soups also help in various distribution-shift regimes.

----

<!-- .slide: data-background="white" .slide: data-transition="zoom"-->

<img src="https://www.moma.org/d/assets/W1siZiIsIjIwMTUvMTAvMjEvOTY0aWFsdm96Yl9zb3VwY2FuLmpwZyJdLFsicCIsImNvbnZlcnQiLCItcXVhbGl0eSA5MCAtcmVzaXplIDIwMDB4MjAwMFx1MDAzZSJdXQ/soupcan.jpg?sha=9a38fb887eb28928" alt="campbell's" width="50%" height="50%">

---

<!-- .slide: data-transition="zoom" data-background="blue"-->

# Method

----

<!-- .slide: data-background="white"-->

<p style="text-align:left;">
The authors propose 3 recipes for <i>model souping</i>:
</p>

1. :martial_arts_uniform: **Uniform**
2. :money_mouth_face: **Greedy**
3. :books: **Learned**

Note:
1. uniform soup is naïve averaging
2. greedy soup is the central, incremental algorithm
3. learned soup optimizes model interpolation weights with a gradient-based approach

----

<!-- .slide: data-transition="fade" .slide: data-background="brown" -->

## Setting

----

<!-- .slide: data-background="white"-->

- Consider a neural network $f(x, \theta)$ with input data $x$ and parameters $\theta \in \mathbb{R}^d$.
- Let $\theta = \text{FineTune}(\theta_0, h)$ be the parameters obtained by finetuning pretrained weights $\theta_0$ with hyperparameter configuration $h$.
- We can list all hyperparameter configurations $h_1,\ldots,h_k$ and distinguish each $\theta_i = \text{FineTune}(\theta_0, h_i)$.

----

<!-- .slide: data-transition="zoom" data-background="white"-->

- **Model Soups**:

$$
\theta_S = \frac{1}{|S|} \sum_{i \in S} \theta_i, \quad \text{where } S \subseteq \{1,\ldots,k\}
$$

- :martial_arts_uniform: Uniform Soup: average for $S = \{1,\ldots,k\}$
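----

<!-- .slide: data-transition="zoom" data-background="white"-->

As a minimal sketch (mine, not the authors' code), the uniform soup is just a parameter-wise average of the finetuned checkpoints, assuming they all share the same architecture and pretrained initialization:

```python
import torch

def uniform_soup(state_dicts):
    """Average the weights of all finetuned models (uniform soup)."""
    soup = {}
    for key in state_dicts[0]:
        # stack the k copies of each parameter and average them
        # (cast to float, assuming all checkpoints share keys and shapes)
        soup[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return soup

# hypothetical usage, with `paths` pointing at k finetuned checkpoints:
# soup = uniform_soup([torch.load(p, map_location="cpu") for p in paths])
# model.load_state_dict(soup)
```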
----

<!-- .slide: data-transition="zoom" data-background="white"-->

### :money_mouth_face: Greedy Soup :female-cook: :male-cook:

![](https://i.imgur.com/Xl904DW.png)
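----

<!-- .slide: data-transition="zoom" data-background="white"-->

The same algorithm as a sketch in code (my paraphrase of Algorithm 1 above, reusing `uniform_soup` from the earlier slide; `evaluate` is an assumed helper that returns held-out accuracy for a given set of averaged weights):

```python
def greedy_soup(state_dicts, val_accuracies, evaluate):
    """Greedily add ingredients, keeping each one only if it helps held-out accuracy."""
    # visit checkpoints in order of decreasing individual validation accuracy
    order = sorted(range(len(state_dicts)), key=lambda i: -val_accuracies[i])
    ingredients = [state_dicts[order[0]]]
    best_acc = val_accuracies[order[0]]
    for i in order[1:]:
        candidate = uniform_soup(ingredients + [state_dicts[i]])
        acc = evaluate(candidate)
        if acc >= best_acc:  # keep the ingredient only if the soup improves
            ingredients.append(state_dicts[i])
            best_acc = acc
    return uniform_soup(ingredients)
```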
----

<!-- .slide: data-transition="zoom" data-background="white"-->

### :books: Learned Soup

- Learn a mixing coefficient for each of the ingredients on the held-out validation set.
- Given loss $\ell$ and validation set $\{x_j, y_j\}_{j=1}^n$, the authors find coefficients $\alpha \in \mathbb{R}^k$ and a temperature-scaling parameter $\beta$ by gradient-based optimization:

$$
\text{argmin}_{\alpha \in \mathbb{R}^k, \beta \in \mathbb{R}} \sum_{j=1}^n \ell \left(\beta \cdot f\left(x_j, \sum_{i=1}^k \alpha_i \theta_i \right), y_j \right)
$$

----

<!-- .slide: data-transition="zoom" data-background="white"-->

![](https://i.imgur.com/1veiWT3.png)

----

<!-- .slide: data-transition="fade" data-background="orange"-->

![pozole](https://blog.amigofoods.com/wp-content/uploads/2021/03/pozole-rojo-mexican-soup.jpg)

---

<!-- .slide: data-background="purple"-->

# Experiments

----

<!-- .slide: data-transition="fade" -->

## Models

- _CLIP_ ([Radford et al., 2021](https://arxiv.org/abs/2103.00020)), _ALIGN_ ([Jia et al., 2021](https://arxiv.org/abs/2102.05918)), _BASIC_ ([Pham et al., 2021](https://arxiv.org/abs/2111.10050))
  - trained with contrastive supervision from image-text pairs

<img src="https://i0.wp.com/www.thepanthertech.com/wp-content/uploads/2023/01/Screenshot-2023-01-05-130218.png?resize=1296%2C747&ssl=1" alt="google vs openai" width="50%" height="50%">

----

<!-- .slide: data-transition="fade" -->

- _ViT-G/14_ ([Zhai et al., 2021](https://arxiv.org/abs/2106.04560v2))
  - pretrained on JFT-3B ([Zhai et al., 2021](https://arxiv.org/abs/2106.04560v2))
- Transformer models for text classification, such as BERT ([Devlin et al., 2019](https://aclanthology.org/N19-1423/)) and T5 ([Raffel et al., 2020](https://jmlr.org/papers/v21/20-074.html))

----

### Initialization of the final layer

1. Via _linear probe_ (LP): freeze almost all layers and finetune only the model's head.
2. Zero-shot: use the classifier produced by the text portion of CLIP or ALIGN.

The authors claim that both methods produce similar trends, but LP is preferred in most experiments.

----

### Error Landscapes

<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

![](https://i.imgur.com/rIoeRG7.png)

----

<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

### A Geometric Observation

<font size="5"><i>Models that form an angle closer to $90^\circ$ tend to lead to higher accuracy.</i></font>

![](https://i.imgur.com/pYiAgcF.png)

----

### Ensemble performance is correlated with _Soup_ performance

<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

![](https://i.imgur.com/VYJ90lP.png)

----

### Comparison of :stew::stew::stew: vs individual models

<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

<img src="https://i.imgur.com/e6u06bY.png" alt="comparison soups" width="50%" height="70%"><img src="https://i.imgur.com/Y66V3wt.png" alt="comparison soups" width="50%" height="70%">

Note:
- Model soups improve accuracy over the best individual model when performing a large, random hyperparameter search for finetuning a CLIP ViT-B/32 model on ImageNet
- greedy soup outperforms the best single model in both settings

----

### Ablation Study

<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

<img src="https://i.imgur.com/uaK3EFY.png" alt="comparison soups" width="60%" height="70%">

Note:
- Model soups improve accuracy when finetuning on the diverse classification task WILDS-FMoW

----

### Greedy Soup does better than the best individual ViT-G models

<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

![](https://i.imgur.com/hZiPXjY.png)

----

<img src="https://dishingouthealth.com/wp-content/uploads/2022/01/SpicyMisoRamen_Square.jpg" alt="ramen" width="60%" height="70%">

---

<!-- .slide: data-transition="zoom" data-background="pink"-->

# Epilogue & Interesting Stuff

----

<!-- .slide: data-background="cyan"-->

<img src="https://i.imgur.com/PTnkkRI.jpg" alt="dog" width="70%" height="70%">

----

<!-- .slide: data-transition="fade" data-background="white"-->

### Expected Soup vs Ensemble loss comparison

- The authors derive an analytic expression for the expected difference between the loss of the soup and the loss of the corresponding ensemble.

![](https://i.imgur.com/Ldp9Dts.png)

----

<!-- .slide: data-transition="fade" data-background="white"-->

### ~~Scope and Limitations~~ Caveats :necktie:

- Experiments were done on large models pretrained on heterogeneous datasets;
  - _what about low-resource regimes?_
- Ensembles improve model calibration, i.e., they provide better confidence estimates for predictions;
  - model soups don't show the same benefit.

----

<!-- .slide: data-transition="zoom" data-background="yellow"-->

<img src="https://i.imgur.com/JB8gSM2.jpg" alt="cereal" width="70%" height="70%">

<b>#controversial</b>
{"metaMigratedAt":"2023-06-17T22:29:37.892Z","metaMigratedFrom":"YAML","title":"Model soups","breaks":true,"contributors":"[{\"id\":\"adb0403f-b4e6-4ebc-be17-cc638e9f5cfe\",\"add\":23593,\"del\":13569}]","description":"Mitchell Wortsman,Gabriel Ilharco, et al."}