<!-- .slide: data-background-image="" -->
## Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis
Stefan Horoi,
Albert M. Orozco Camacho,
Eugene Belilovsky,
Guy Wolf.
presented by [Albert](https://alorozco53.github.io)
----
<!-- .slide: data-transition="fade" data-background="white"-->
<img src="https://hackmd.io/_uploads/rJA50osmC.png" alt="" width="625">
---
<!-- .slide: data-transition="zoom" data-background="red"-->
## Motivation
----
- Merging models via permutation alignment seems cool... but is there really a one-to-one correspondence between learned features, even with different initializations?
- Our paper introduces **CCA Merge**, a method based on Canonical Correlation Analysis (CCA) for merging neural networks.
- We maximize correlations between linear combinations of features and show how to extend the approach to merging more than two models.
----
<!-- .slide: data-transition="fade" data-background="white"-->
## Linear Mode Connectivity
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/BJbXmRj70.png" style="width:70%;">
<p><em><small>Taken from <a href="https://arxiv.org/abs/1802.10026" target="_blank">Garipov et al., 2018</a></small>
</em></p>
</div>
<div style="display: flex; justify-content: space-between;">
<div style="width: 45%; text-align: left;">
<img src="https://hackmd.io/_uploads/r1lRL0s70.png" style="width:50%;">
<p><em><small>Taken from <a href="https://arxiv.org/abs/1803.00885" target="_blank">Draxler et al., 2018</a></small>
</em></p>
</div>
<div style="width: 45%; text-align: right;">
<img src="https://hackmd.io/_uploads/BkM0dRi7C.png" style="width:50%;">
<p><em><small>Taken from <a href="https://arxiv.org/abs/2209.04836" target="_blank">Ainsworth et al., 2022</a></small>
</em></p>
</div>
</div>
----
<!-- .slide: data-background="yellow"-->
### Paper Contributions
- <!-- .element: class="fragment" --> <b>Flexible Model Merging</b>
- CCA Merge maximizes correlations between neuron activations.
- <!-- .element: class="fragment" --> <b>Improved Performance</b>
- Outperforms previous methods in various scenarios.
- <!-- .element: class="fragment" --> <b>Scalable to Multiple Models</b>
- Effective in merging more than two models without performance degradation.
---
<!-- .slide: data-transition="zoom" data-background="blue"-->
## Theoretical Framework
----
<!-- .slide: data-transition="zoom" data-background="gray" -->
### CCA Merge: Canonical Correlation Analysis
- **CCA**: Finds linear transformations that maximize correlations between two sets of variables.
- Aligns features from different models to a common representation space.
Note: This method is more flexible than traditional permutation-based methods.
----
<!-- .slide: data-background="white" -->
### CCA Merging Method
1. **Align Activations**: Compute CCA projection matrices $P_A$ and $P_B$ for models A and B.
2. **Transform Parameters**: Apply these matrices to model parameters to align them.
3. **Merge Parameters**: Average the transformed parameters to merge the models.
Note: This approach captures complex relationships between neurons better than permutations.
----
<!-- .slide: data-transition="fade" data-background="white" -->
### Merging Models: Problem Definition
Consider two neural networks A and B with the same architecture.
- **Layer $L_i$ in model $M$**:
- Weights: $W_i^M \in \mathbb{R}^{n_i \times n_{i-1}}$
- Bias: $b_i^M \in \mathbb{R}^{n_i}$
- Activation: $x_i^M = \sigma\left(W_i^M x_{i-1}^M + b_i^M\right)$
To merge models, we align parameters layer by layer using invertible linear transformations $T_i$.
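To fix shapes, here is a minimal NumPy sketch of one layer's forward pass in this notation (dimensions are illustrative, not from the paper):

```python
# Minimal sketch of the layer notation above; dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_i = 64, 32                      # n_{i-1} and n_i
W_i = rng.standard_normal((n_i, n_prev))  # W_i^M
b_i = rng.standard_normal(n_i)            # b_i^M
x_prev = rng.standard_normal(n_prev)      # x_{i-1}^M

sigma = lambda z: np.maximum(z, 0.0)      # elementwise nonlinearity, e.g. ReLU
x_i = sigma(W_i @ x_prev + b_i)           # x_i^M = sigma(W_i^M x_{i-1}^M + b_i^M)
```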
----
<!-- .slide: data-transition="fade" data-background="white" -->
### Transforming and Merging
1. **Transform layer $L_i$ in model B**:
- $x_i^B = \sigma\left(T_i W_i^B x_{i-1}^B + T_i b_i^B\right)$
2. **Adjust the next layer to maintain consistency**:
- $x_{i+1}^B = \sigma\left(W_{i+1}^B T_i^{-1} x_i^B + b_{i+1}^B\right)$
3. After finding $T_i$ for each layer, merge the parameters:
$W_i = \frac{1}{2} \left(W_i^A + T_i W_i^B T_{i-1}^{-1}\right)$
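A hedged NumPy sketch of steps 1–2: for a permutation matrix $T_i$ the transformed network computes exactly the same function, since elementwise nonlinearities commute with permutations; for a general invertible $T_i$, as used by CCA Merge, the identity holds only approximately.

```python
# Sanity check: insert T_i at layer i and undo it at layer i+1.
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda z: np.maximum(z, 0.0)

n0, n1, n2 = 8, 6, 4
W1, b1 = rng.standard_normal((n1, n0)), rng.standard_normal(n1)
W2, b2 = rng.standard_normal((n2, n1)), rng.standard_normal(n2)
x0 = rng.standard_normal(n0)

T = np.eye(n1)[rng.permutation(n1)]  # a permutation as T_i (exact case)

pre_orig = W2 @ sigma(W1 @ x0 + b1) + b2
pre_tran = (W2 @ np.linalg.inv(T)) @ sigma(T @ W1 @ x0 + T @ b1) + b2
assert np.allclose(pre_orig, pre_tran)  # network function is unchanged
```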
----
<!-- .slide: data-transition="zoom" data-background="white" -->
## Merging Multiple Models
_All-to-one_
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/S1ed1lnmA.png" style="width:80%;">
</div>
----
<!-- .slide: data-transition="fade" data-background="white" -->
### Canonical Correlation Analysis (CCA)
- **Objective**: Find linear transformations that maximize the correlation between the projections of two datasets.
- **Datasets**:
- $X \in \mathbb{R}^{n \times d_1}$: First dataset
- $Y \in \mathbb{R}^{n \times d_2}$: Second dataset
The goal is to find vectors $w_X$ and $w_Y$ such that the projections $X w_X$ and $Y w_Y$ are maximally correlated.
----
<!-- .slide: data-transition="fade" data-background="white" -->
### Computing CCA Projections
1. **Scatter Matrices** (assuming mean-centered data):
- $S_{XX} = X^\top X$
- $S_{YY} = Y^\top Y$
- $S_{XY} = X^\top Y$
2. **Optimization Objective**:
$$\max_{w_X, w_Y} \frac{w_X^\top S_{XY} w_Y}{\sqrt{w_X^\top S_{XX} w_X} \sqrt{w_Y^\top S_{YY} w_Y}}$$
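For concreteness, a small NumPy sketch that evaluates this objective for candidate directions $w_X$, $w_Y$ (assuming $X$ and $Y$ are mean-centered):

```python
# Evaluate the CCA objective above for given projection directions.
import numpy as np

def cca_objective(X, Y, w_X, w_Y):
    Sxx, Syy, Sxy = X.T @ X, Y.T @ Y, X.T @ Y
    return (w_X @ Sxy @ w_Y) / np.sqrt((w_X @ Sxx @ w_X) * (w_Y @ Syy @ w_Y))
```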
----
<!-- .slide: data-background="white" -->
3. **Solution via Singular Value Decomposition (SVD)**:
- Perform SVD on $S_{XX}^{-1/2} S_{XY} S_{YY}^{-1/2}$:
$$U, \Sigma, V^\top = \text{SVD}\left(S_{XX}^{-1/2} S_{XY} S_{YY}^{-1/2}\right)$$
- Projections:
$P_X = S_{XX}^{-1/2} U$
$P_Y = S_{YY}^{-1/2} V$
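The whole computation fits in a few lines of NumPy; the small ridge term `eps` below is an assumption we add for numerical stability and is not part of the derivation above:

```python
# CCA projections via SVD, following the steps on this slide.
import numpy as np

def cca_projections(X, Y, eps=1e-6):
    X = X - X.mean(axis=0)  # center columns so the scatter matrices are covariances
    Y = Y - Y.mean(axis=0)
    Sxx = X.T @ X + eps * np.eye(X.shape[1])
    Syy = Y.T @ Y + eps * np.eye(Y.shape[1])
    Sxy = X.T @ Y

    def inv_sqrt(S):  # S^{-1/2} for a symmetric positive-definite S
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, corrs, Vt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is)
    P_X = Sxx_is @ U     # P_X = S_XX^{-1/2} U
    P_Y = Syy_is @ Vt.T  # P_Y = S_YY^{-1/2} V
    return P_X, P_Y, corrs  # corrs: the canonical correlations
```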
----
<!-- .slide: data-transition="zoom"-->
### Transformation Matrix for Model Merging
Using CCA to align and merge neural network models.
1. **Compute Projections**:
- For activations $X_i^A$ and $X_i^B$ from models $A$ and $B$, compute $P_A, P_B$
----
2. **Transformation Matrix**:
- Transform model $B$'s parameters to align with model $A$:
$T_i = \left(P_B P_A^{-1}\right)^\top$
3. **Merge Parameters**:
- Merge the aligned parameters of models A and B:
$W_i = \frac{1}{2} \left(W_i^A + T_i W_i^B T_{i-1}^{-1}\right)$
Note: This method captures complex relationships between neurons better than permutation-based methods.
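Putting the pieces together, a hedged sketch of merging a single layer: `cca_projections` is the helper sketched two slides back, and the pseudo-inverse is our substitution for $P_A^{-1}$ in case $P_A$ is ill-conditioned.

```python
# Merge one layer of models A and B with CCA Merge (sketch, fully connected case;
# conv activations would first be flattened over spatial dimensions).
import numpy as np

def cca_merge_layer(W_A, W_B, acts_A, acts_B, T_prev):
    # acts_A, acts_B: (n samples x n_i neurons) activations at layer i
    P_A, P_B, _ = cca_projections(acts_A, acts_B)
    T_i = (P_B @ np.linalg.pinv(P_A)).T                    # T_i = (P_B P_A^{-1})^T
    W_merged = 0.5 * (W_A + T_i @ W_B @ np.linalg.inv(T_prev))
    return W_merged, T_i  # T_i becomes T_prev for layer i+1

# For the first layer, T_prev is the identity: both models see the same inputs.
```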
----
<!-- .slide: data-transition="zoom" .slide: data-background="white"-->

---
<!-- .slide: data-transition="zoom" data-background="pink"-->
### Experiments
----
<!-- .slide: data-background="white" -->
### Datasets
CIFAR-10, CIFAR-100, and ImageNet-200

----
<!-- .slide: data-transition="fade" data-background="white" -->
#### VGG and ResNet Models
<div style="display: flex; justify-content: center; align-items: center;">
<img src="https://production-media.paperswithcode.com/methods/vgg_7mT4DML.png" style="width:45%;">
<div style="margin-left: 10px;">
<em><small>Taken from <a href="https://paperswithcode.com/method/vgg" target="_blank">here</a></small>
</em>
</div>
</div>
<div style="display: flex; justify-content: center; align-items: center;">
<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*C8jf92MeHZnxnbpMkz6jkQ.png" style="width:45%;">
<div style="margin-left: 10px;">
<em><small>Taken from <a href="https://medium.com/@siddheshb008/resnet-architecture-explained-47309ea9283d" target="_blank">here</a></small>
</em>
</div>
</div>
----
<!-- .slide: data-transition="zoom" data-background="brown"-->
## Results
----
<!-- .slide: data-transition="zoom" data-background="white" -->
### Accuracy Improvement
<div style="text-align: center;">
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/Bkg4Ve27C.png" style="width:70%;">
</div>
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/H1yAre3XA.png" style="width:50%;">
<em><small><p>Top and bottom: CCA Merge significantly outperforms other baselines. Bottom: ResNet18x4 on ImageNet-200.</p></small>
</em>
</div>
</div>
----
<!-- .slide: data-transition="fade" data-background="white" -->
### CCA Merging Finds Better Common Representations Between Many Models

----
<!-- .slide: data-transition="zoom" -->
### Flexibility of CCA
Consider merging two models, $A$ and $B$, at a specific layer with neuron sets $\left\{z_i^M\right\}_{i=1}^n$ for $M \in \left\{A, B\right\}$.
- **For our correlation matrix $C$**:
- Element $C_{ij}$: correlation between neuron $z_i^A$ in model A and neuron $z_j^B$ in model B.
- We would like to analyze the distribution of correlations for each neuron $z_i^A$, as sketched below.
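A short NumPy sketch of this analysis; `C[i, j]` is the Pearson correlation over a shared batch of inputs:

```python
# Cross-correlation matrix between neurons of models A and B (sketch).
import numpy as np

def cross_correlations(acts_A, acts_B, eps=1e-8):
    # acts_*: (n samples x n neurons) activations on the same inputs
    A = (acts_A - acts_A.mean(0)) / (acts_A.std(0) + eps)
    B = (acts_B - acts_B.mean(0)) / (acts_B.std(0) + eps)
    C = (A.T @ B) / acts_A.shape[0]              # C_ij
    top = np.sort(np.abs(C), axis=1)[:, ::-1]    # per-row correlations, descending
    return C, top                                # top[:, :k] gives the top-k values
```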
----
- **Permutation Hypothesis**:
- One-to-one mapping between neurons of models $A$ and $B$.
- High correlation for a single neuron and near-zero for others.
- **CCA Merge Hypothesis**:
- Captures relationships where multiple neurons from model $B$ are highly correlated with $z_i^A$. Features are learned across multiple neurons in model $B$.
----
<!-- .slide: data-transition="fade" data-background="white" -->
### Empirical Evidence of CCA’s Flexibility

----
<!-- .slide: data-transition="zoom" data-background="white" -->
## Distribution of Correlation and CCA Merge Coefficients
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/SycIJbhmC.png" style="width:55%;">
<em><small>Figure 3: Distributions of top-1 (left) and top-2 (right) correlations (blue) and CCA Merge transformation coefficients (orange) across neurons from model A at two merging layers. Wasserstein distances between the distributions of top-$k$ correlations and top-$k$ CCA Merge coefficients are reported.</small>
</em>
</div>
----
<!-- .slide: data-transition="zoom" data-background="white" -->
## CCA Merge is better in OoD Settings

---
<!-- .slide: data-transition="zoom" data-background="green" -->
# Current Work: Pairwise Merging
----
- <!-- .element: class="fragment" --> <b>LMC</b> implies a low-loss linear path between two minima, given an adequate permutation.
- <!-- .element: class="fragment" --> Interpolating via equal contribution per model might not be optimal.
- <!-- .element: class="fragment" --> Individual features from Model A might be lost after averaging with Model B.
----
<!-- .slide: data-transition="zoom" data-background="white" -->
<div style="text-align: center;">
<img src="https://github.com/samuela/git-re-basin/blob/main/mnist_video.gif?raw=true" style="width:90%;">
</div>
----
<!-- .slide: data-transition="zoom" data-background="cyan" -->
## Non-uniform Parameter-wise model merging
$$W_i = \mathbf{\alpha}_i \odot W_i^A + (\mathbf{1} - \mathbf{\alpha}_i) \odot W_i^{B'}$$
where $\odot$ is the elementwise product and $\mathbf{1}$ is the all-ones matrix of the appropriate size. We use gradient-based optimization to learn the elements of $\mathbf{\alpha} = \bigcup_i \mathbf{\alpha}_i$, as sketched below.
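A minimal PyTorch sketch of this idea, under our assumptions about the setup: `W_A` and `W_B_aligned` are one layer's already-aligned weight tensors, and `forward_with_weights`, `loss_fn`, and `loader` are hypothetical stand-ins for a functional forward pass, the task loss, and the data.

```python
# Learn per-parameter merging coefficients alpha by gradient descent (sketch).
import torch

alpha = torch.full_like(W_A, 0.5, requires_grad=True)  # start at uniform averaging
opt = torch.optim.Adam([alpha], lr=1e-2)

for x, y in loader:
    W = alpha * W_A + (1 - alpha) * W_B_aligned    # elementwise (odot) merge
    loss = loss_fn(forward_with_weights(W, x), y)  # hypothetical helpers
    opt.zero_grad()
    loss.backward()
    opt.step()
```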
----
<!-- .slide: data-transition="fade" data-background="white" -->
<div style="text-align: center;">
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/BkWEQSnQR.png" style="width:60%;">
</div>
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/BkkBQr2QA.png" style="width:60%;">
</div>
</div>
----
<!-- .slide: data-transition="fade" data-background="white" -->
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/ByREVB3XA.png" style="width:90%;">
</div>
----
<!-- .slide: data-transition="zoom" data-background="white" -->
## Merging Multiple Models
_Hierarchical Pairwise_

----
<!-- .slide: data-transition="fade" data-background="white" -->

----
<!-- .slide: data-transition="fade" data-background="white" -->
## Feature Interpretation

----
## What about initialization?
- Idea: Fisher Merging ([Matena & Raffel, 2021](https://arxiv.org/abs/2111.09832); [Dhawan et al., 2023](https://arxiv.org/abs/2311.10291))
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/Sk38JLhQ0.png" style="width:80%;">
</div>
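For reference, a hedged sketch of diagonal Fisher merging as we understand it from Matena & Raffel (2021): parameters (assumed already aligned across models) are averaged with weights given by an estimate of their Fisher information, here approximated by summed squared gradients over the data.

```python
# Diagonal Fisher merging (sketch).
import torch

def diag_fisher(model, loader, loss_fn):
    fisher = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for f, p in zip(fisher, model.parameters()):
            f += p.grad.detach() ** 2  # squared-gradient Fisher estimate
    return fisher

def fisher_merge(params_A, params_B, fish_A, fish_B, eps=1e-8):
    # Per-parameter weighted average; eps guards against zero Fisher mass.
    return [(fa * pa + fb * pb) / (fa + fb + eps)
            for pa, pb, fa, fb in zip(params_A, params_B, fish_A, fish_B)]
```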
---
<!-- .slide: data-transition="zoom" data-background="purple"-->
## Conclusion and Takeaways
----
<!-- .slide: data-transition="fade" data-background="white" -->
### Key Takeaways
----
<!-- .slide: data-transition="fade" -->
### Potential Research Paths
Future work could address:
----
<!-- .slide: data-transition="fade" data-background="aquamarine" -->
### Points for Reflection
(Courtesy of ChatGPT)
- The ethical implications of more accessible models.
- How CCA Merge's parameter efficiency might influence the democratization of AI.
- The long-term impact of such techniques on the development cycle of models.
{"description":"Le Yu,Bowen Yu,Haiyang Yu,Fei Huang,Yongbin Li.","showTags":"false","contributors":"[{\"id\":\"adb0403f-b4e6-4ebc-be17-cc638e9f5cfe\",\"add\":23751,\"del\":10554}]","title":"Harmony in Diversity"}