---
title: Model soups
tags: merging, distributed learning, continual learning
---
<!--
<img src="https://i.imgur.com/PP10FkM.png"
alt="clip performance"
width="40%"
height="50%"> <img src="https://thdaily.s3-us-west-1.amazonaws.com/gif_20200719232646.gif"
alt="clip performance"
width="40%"
height="50%">

-->
<!-- .slide: data-text-color="black" data-transition="zoom" -->
## Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
[`Mitchell Wortsman`](https://mitchellnw.github.io),
[`Gabriel Ilharco`](https://gabrielilharco.com), _et al_.
presented by [Albert M. Orozco Camacho](https://twitter.com/alorozco53)
----
<iframe width="900" height="500" src="https://www.youtube.com/embed/Zv71G3yXPi4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
---
<!-- .slide: data-transition="zoom" data-background="red"-->
# Preliminaries
----
- <!-- .element: class="fragment" --> Finetuning large models <i>often</i> works well...
- <!-- .element: class="fragment" --> but it is not the whole story!
- <!-- .element: class="fragment" --> The standard finetuning recipe roughly consists of two steps:
1. finetune (train) models using many hyperparameter configurations, and
2. select the model that achieves the highest held-out accuracy, discarding the rest.
----
## What happens to the models discarded during finetuning?
- _Ensembling_ them can outperform the single best finetuned model.
- It is known that finetuning may reduce out-of-distribution (OoD) performance ([Radford et al., 2021](https://arxiv.org/abs/2103.00020), [Wortsman et al., 2021](https://arxiv.org/abs/2109.01903)).
- i.e., the finetuned model can lose the robustness to distribution shift that the zero-shot model had
----
<!-- .slide: data-background="white"-->
## In this paper...
<p style="text-align:left;">
The authors take an alternative approach to the <b>finetuning recipe</b>, which they claim
is <i>more robust</i> and <i>more accurate</i>.
</p>
> **Finetuning (Soups) Recipe**
> :male-cook: :female-cook:
<font size="5">
1. finetune (train) models using many hyperparameter configurations, and
2. _average the weights of all the finetuned models_
</font>
Note:
- averaging weights may seem like a surprising way to improve performance
- yet the literature reports many success stories, thanks to linear mode connectivity (LMC)
- but don't forget about model merging...
----
<!-- .slide: data-background="yellow"-->
### Paper Contributions
----
- <!-- .element: class="fragment" --> <b>Greedy Soups</b>
- a sequential recipe where models are added to the soup only if they improve accuracy on held-out data
----
- <!-- .element: class="fragment" --> <b>Experiments</b>
- using models like CLIP, ALIGN, and ViT-G;
- the authors achieve 90.94% top-1 accuracy on ImageNet, beating the previous SOTA,
- while requiring 25% fewer FLOPs;
- soups can approach ensemble performance at no extra inference cost;
- soups also help across various distribution-shift regimes.
----
<!-- .slide: data-background="white" .slide: data-transition="zoom"-->
<img src="https://www.moma.org/d/assets/W1siZiIsIjIwMTUvMTAvMjEvOTY0aWFsdm96Yl9zb3VwY2FuLmpwZyJdLFsicCIsImNvbnZlcnQiLCItcXVhbGl0eSA5MCAtcmVzaXplIDIwMDB4MjAwMFx1MDAzZSJdXQ/soupcan.jpg?sha=9a38fb887eb28928"
alt="campbell's"
width="50%"
height="50%">
---
<!-- .slide: data-transition="zoom" data-background="blue"-->
# Method
----
<!-- .slide: data-background="white"-->
<p style="text-align:left;">
The authors propose 3 recipes for <i>model souping</i>:
</p>
1. :martial_arts_uniform: **Uniform**
2. :money_mouth_face: **Greedy**
3. :books: **Learned**
Note:
1. uniform soup is naïve averaging
2. greedy soup is the central, incremental algorithm
3. learned soup optimizes per-model interpolation weights with a gradient-based approach
----
<!-- .slide: data-transition="fade" .slide: data-background="brown" -->
## Setting
----
<!-- .slide: data-background="white"-->
- Consider a neural network $f(x, \theta)$ with input data $x$ and parameters $\theta \in \mathbb{R}^d$.
- Let $\theta = \text{FineTune}(\theta_0, h)$ denote the parameters obtained by finetuning pretrained weights $\theta_0$ under hyperparameter configuration $h$.
- Enumerating hyperparameter configurations $h_1,\ldots,h_k$ yields the corresponding finetuned solutions $\theta_i = \text{FineTune}(\theta_0, h_i)$.
----
<!-- .slide: data-transition="zoom" data-background="white"-->
- **Model Soups**:
$$
\theta_S = \frac{1}{|S|} \sum_{i \in S} \theta_i
\text{ ; where } S \subseteq \{1,\ldots,k\}
$$
- :martial_arts_uniform: Uniform Soup: average for $S = \{1,\ldots,k\}$
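- A minimal sketch (not the authors' released code) of the uniform soup in PyTorch: an element-wise average of the finetuned checkpoints' state dicts.

```python
import torch

def uniform_soup(state_dicts):
    """Element-wise average of k finetuned checkpoints (the uniform soup).

    Assumes all state dicts share the same keys and shapes, i.e. every model
    was finetuned from the same pretrained initialization.
    """
    keys = state_dicts[0].keys()
    return {
        k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
        for k in keys
    }

# hypothetical usage:
# model.load_state_dict(uniform_soup([torch.load(p) for p in checkpoint_paths]))
```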
----
<!-- .slide: data-transition="zoom" data-background="white"-->
### :money_mouth_face: Greedy Soup :female-cook: :male-cook:
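- Sort the ingredients by individual held-out accuracy, then add each one to the soup only if it does not hurt held-out accuracy.
- A minimal sketch (not the authors' code), reusing `uniform_soup` from the previous slide; `evaluate` is a hypothetical helper that loads averaged weights into the model and returns held-out accuracy.

```python
def greedy_soup(state_dicts, val_accs, evaluate):
    """Greedy soup: try ingredients best-first, keep only those that help.

    `state_dicts`: finetuned checkpoints; `val_accs`: their individual
    held-out accuracies; `evaluate(state_dict) -> float`: hypothetical helper
    returning held-out accuracy of the averaged weights.
    """
    order = sorted(range(len(state_dicts)), key=lambda i: val_accs[i], reverse=True)
    soup = [state_dicts[order[0]]]                 # start from the best model
    best_acc = evaluate(uniform_soup(soup))
    for i in order[1:]:
        acc = evaluate(uniform_soup(soup + [state_dicts[i]]))
        if acc >= best_acc:                        # keep the ingredient only if it helps
            soup.append(state_dicts[i])
            best_acc = acc
    return uniform_soup(soup)
```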

----
<!-- .slide: data-transition="zoom" data-background="white"-->
### :books: Learned Soup
- Learn a mixing coefficient for each ingredient on the held-out validation set.
- Given a loss $\ell$ and validation set $\{x_j, y_j\}_{j=1}^n$, the authors find coefficients $\alpha \in \mathbb{R}^k$ and a temperature-scaling parameter $\beta$ via gradient-based optimization of
$$
\text{argmin}_{\alpha \in \mathbb{R}^k, \beta \in \mathbb{R}}
\sum_{j=1}^n \ell \left(\beta \cdot f\left(x_j, \sum_{i=1}^k \alpha_i \theta_i
\right), y_j \right)
$$
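A minimal sketch of a learned soup, assuming PyTorch 2.x's `torch.func.functional_call`, a softmax parameterization of $\alpha$ (so the mix stays a convex combination), and plain Adam; the authors' exact parameterization and optimizer may differ.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def learned_soup(model, state_dicts, val_loader, steps=5, lr=0.05):
    """Learn mixing coefficients alpha and temperature beta on held-out data."""
    keys = list(state_dicts[0].keys())
    # stack the k ingredients: one (k, ...) tensor per parameter name
    stacked = {k: torch.stack([sd[k].float() for sd in state_dicts]) for k in keys}
    alpha = torch.zeros(len(state_dicts), requires_grad=True)  # mixing logits
    beta = torch.ones((), requires_grad=True)                  # temperature scaling
    opt = torch.optim.Adam([alpha, beta], lr=lr)

    def mix():
        w = torch.softmax(alpha, dim=0)  # convex combination of ingredients
        return {k: torch.einsum("k,k...->...", w, stacked[k]) for k in keys}

    for _ in range(steps):
        for x, y in val_loader:
            logits = beta * functional_call(model, mix(), (x,))
            loss = F.cross_entropy(logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return {k: v.detach() for k, v in mix().items()}
```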
---
<!-- .slide: data-background="purple"-->
# Experiments
----
<!-- .slide: data-transition="fade" -->
## Models
- _CLIP_ ([Radford, et al 2021](https://arxiv.org/abs/2103.00020)), _ALIGN_ ([Jia, et al 2021](https://arxiv.org/abs/2102.05918)), _BASIC_ ([Pham, et al 2021](https://arxiv.org/abs/2111.10050)).
- trained with contrastive supervision from image-text pairs
<img src="https://i0.wp.com/www.thepanthertech.com/wp-content/uploads/2023/01/Screenshot-2023-01-05-130218.png?resize=1296%2C747&ssl=1"
alt="google vs openai"
width="50%"
height="50%">
----
<!-- .slide: data-transition="fade" -->
- _ViT-G/14_ ([Zhai, et al 2021](https://arxiv.org/abs/2106.04560v2))
- pretrained on JFT-3B ([Zhai, et al 2021](https://arxiv.org/abs/2106.04560v2))
- Transformer models for text classification, such as BERT ([Devlin, et al 2019](https://aclanthology.org/N19-1423/), [Raffel, et al 2020](https://jmlr.org/papers/v21/20-074.html))
----
### Initialization of the final layer
1. Via _linear probe_ (LP): freeze the backbone and train only a linear head, then use it to initialize the final layer for end-to-end finetuning.
2. Zero-shot: use the classifier produced by the text encoder of CLIP or ALIGN.
The authors report that both initializations produce similar trends; LP initialization is used in most experiments (see the sketch below).
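A minimal sketch of the LP step under generic assumptions (a separate `backbone`/`head` split and cross-entropy training; not the authors' code):

```python
import torch
import torch.nn as nn

def linear_probe_init(backbone, feat_dim, num_classes, loader, epochs=10, lr=1e-3):
    """Train a linear head on frozen features; its weights can then be used
    to initialize the final layer before end-to-end finetuning."""
    for p in backbone.parameters():
        p.requires_grad_(False)              # freeze the backbone
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = backbone(x)          # frozen features
            loss = nn.functional.cross_entropy(head(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head                              # attach as the model's final layer
```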
----
### Error Landscapes
<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

----
<!-- .slide: data-transition="zoom" .slide: data-background="white"-->
### A Geometric Observation
<font size="5"><i>Pairs of finetuned solutions whose displacements from the shared initialization form an angle closer to $90^\circ$ tend to yield a more accurate soup.</i></font>
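A rough sketch of how this angle can be measured (not the authors' code): flatten each model's weights and compute the angle between the two finetuning displacements from the shared initialization $\theta_0$.

```python
import math
import torch

def angle_deg(theta_0, theta_1, theta_2):
    """Angle (degrees) between theta_1 - theta_0 and theta_2 - theta_0,
    with each state dict flattened into a single vector."""
    def flat(sd):
        return torch.cat([p.reshape(-1).float() for p in sd.values()])
    u = flat(theta_1) - flat(theta_0)
    v = flat(theta_2) - flat(theta_0)
    cos = torch.dot(u, v) / (u.norm() * v.norm())
    return math.degrees(torch.acos(cos.clamp(-1, 1)).item())
```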

----
### Ensemble performance is correlated with _Soup_ performance
<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

----
### Comparison of :stew::stew::stew: vs individual models
<!-- .slide: data-transition="zoom" .slide: data-background="white"-->
<img src="https://i.imgur.com/e6u06bY.png"
alt="comparison soups"
width="50%"
height="70%"><img src="https://i.imgur.com/Y66V3wt.png"
alt="comparison soups"
width="50%"
height="70%">
Note:
- Model soups improve accuracy over the best individual model when performing a large, random hyperparameter search for finetuning a CLIP ViT-B/32 model on ImageNet
- the greedy soup outperforms the best single model in both settings
----
### Ablation Study
<!-- .slide: data-transition="zoom" .slide: data-background="white"-->
<img src="https://i.imgur.com/uaK3EFY.png"
alt="comparison soups"
width="60%"
height="70%">
Note:
- Model soups also improve accuracy when finetuning on the diverse classification task WILDS-FMoW
----
### Greedy soup outperforms the best individual ViT-G models
<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

----
<img src="https://dishingouthealth.com/wp-content/uploads/2022/01/SpicyMisoRamen_Square.jpg"
alt="ramen"
width="60%"
height="70%">
---
<!-- .slide: data-transition="zoom" data-background="pink"-->
# Epilogue & Interesting Stuff
----
<!-- .slide: data-background="cyan"-->
<img src="https://i.imgur.com/PTnkkRI.jpg"
alt="dog"
width="70%"
height="70%">
----
<!-- .slide: data-transition="fade" data-background="white"-->
### Expected Soup vs Ensemble loss comparison
- The authors derive an analytic expression for the expected difference in loss between souping and ensembling.

----
<!-- .slide: data-transition="fade" data-background="white"-->
### ~~Scope and Limitations~~ Caveats :necktie:
- Experiments use large models pretrained on large, heterogeneous datasets;
- _what about low-resource regimes?_
- Ensembles improve model calibration, i.e., how well prediction confidence matches actual accuracy;
- model soups do not show the same benefit.
----
<!-- .slide: data-transition="zoom" data-background="yellow"-->
<img src="https://i.imgur.com/JB8gSM2.jpg"
alt="cereal"
width="70%"
height="70%">
<b>#controversial</b>