## Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis

[`Stefan Horoi`](https://your_link_here), [`Albert M. Orozco Camacho`](https://your_link_here), [`Eugene Belilovsky`](https://your_link_here), [`Guy Wolf`](https://your_link_here)

presented by [Albert](https://alorozco53.github.io)

----

<!-- .slide: data-transition="fade" data-background="white" -->

<img src="https://hackmd.io/_uploads/rJA50osmC.png" alt="" width="625">

---

<!-- .slide: data-transition="zoom" data-background="red" -->

## Motivation

----

- Merging models via permutation alignment seems cool... but is there really a one-to-one correspondence between learned features, even across different initializations?
- Our paper introduces **CCA Merge**, a method based on Canonical Correlation Analysis (CCA) for merging neural networks.
- We maximize correlations between linear combinations of features and show how to extend the method to merging more than two models.

----

<!-- .slide: data-transition="fade" data-background="white" -->

## Linear Mode Connectivity

<div style="text-align: center;">
  <img src="https://hackmd.io/_uploads/BJbXmRj70.png" style="width:70%;">
  <p><em><small>Taken from <a href="https://arxiv.org/abs/1802.10026" target="_blank">Garipov et al., 2018</a></small></em></p>
</div>
<div style="display: flex; justify-content: space-between;">
  <div style="width: 45%; text-align: left;">
    <img src="https://hackmd.io/_uploads/r1lRL0s70.png" style="width:50%;">
    <p><em><small>Taken from <a href="https://arxiv.org/abs/1803.00885" target="_blank">Draxler et al., 2018</a></small></em></p>
  </div>
  <div style="width: 45%; text-align: right;">
    <img src="https://hackmd.io/_uploads/BkM0dRi7C.png" style="width:50%;">
    <p><em><small>Taken from <a href="https://arxiv.org/abs/2209.04836" target="_blank">Ainsworth et al., 2022</a></small></em></p>
  </div>
</div>

----

<!-- .slide: data-background="yellow" -->

### Paper Contributions

- <!-- .element: class="fragment" --> <b>Flexible Model Merging</b>
  - CCA Merge maximizes correlations between neuron activations.
- <!-- .element: class="fragment" --> <b>Improved Performance</b>
  - Outperforms previous methods in a variety of scenarios.
- <!-- .element: class="fragment" --> <b>Scalable to Multiple Models</b>
  - Merges more than two models without degrading performance.

---

<!-- .slide: data-transition="zoom" data-background="blue" -->

## Theoretical Framework

----

<!-- .slide: data-transition="zoom" data-background="gray" -->

### CCA Merge: Canonical Correlation Analysis

- **CCA**: Finds linear transformations that maximize the correlations between two sets of variables.
- Aligns features from different models into a common representation space.

Note: This method is more flexible than traditional permutation-based methods.

----

<!-- .slide: data-background="white" -->

### CCA Merging Method

1. **Align Activations**: Compute CCA projection matrices $P_A$ and $P_B$ for models A and B.
2. **Transform Parameters**: Apply these matrices to the model parameters to align them.
3. **Merge Parameters**: Average the transformed parameters to merge the models.

Note: This approach captures complex relationships between neurons better than permutations do.

----

<!-- .slide: data-transition="fade" data-background="white" -->

### Merging Models: Problem Definition

Consider two neural networks A and B with the same architecture.

- **Layer $L_i$ in model $M$**:
  - Weights: $W_i^M \in \mathbb{R}^{n_i \times n_{i-1}}$
  - Bias: $b_i^M \in \mathbb{R}^{n_i}$
  - Activation: $x_i^M = \sigma\left(W_i^M x_{i-1}^M + b_i^M\right)$

To merge models, we align parameters layer by layer using invertible linear transformations $T_i$.

----

<!-- .slide: data-transition="fade" data-background="white" -->

### Transforming and Merging

1. **Transform layer $L_i$ in model B**:
   - $x_i^B = \sigma\left(T_i W_i^B x_{i-1}^B + T_i b_i^B\right)$
2. **Adjust the next layer to maintain consistency**:
   - $x_{i+1}^B = \sigma\left(W_{i+1}^B T_i^{-1} x_i^B + b_{i+1}^B\right)$
3. After finding $T_i$ for each layer, merge the parameters:
   $W_i = \frac{1}{2} \left(W_i^A + T_i W_i^B T_{i-1}^{-1}\right)$

----

<!-- .slide: data-transition="zoom" data-background="white" -->

## Merging Multiple Models

_All-to-one_

<div style="text-align: center;">
  <img src="https://hackmd.io/_uploads/S1ed1lnmA.png" style="width:80%;">
</div>

----

<!-- .slide: data-transition="fade" data-background="white" -->

### Canonical Correlation Analysis (CCA)

- **Objective**: Find linear transformations that maximize the correlation between the projections of two datasets.
- **Datasets**:
  - $X \in \mathbb{R}^{n \times d_1}$: first dataset
  - $Y \in \mathbb{R}^{n \times d_2}$: second dataset

The goal is to find vectors $w_X$ and $w_Y$ such that the projections $X w_X$ and $Y w_Y$ are maximally correlated.

----

<!-- .slide: data-transition="fade" data-background="white" -->

### Computing CCA Projections

1. **Scatter Matrices** (with $X$ and $Y$ mean-centered):
   - $S_{XX} = X^\top X$
   - $S_{YY} = Y^\top Y$
   - $S_{XY} = X^\top Y$
2. **Optimization Objective**:
   $$\max_{w_X, w_Y} \frac{w_X^\top S_{XY} w_Y}{\sqrt{w_X^\top S_{XX} w_X} \sqrt{w_Y^\top S_{YY} w_Y}}$$

----

<!-- .slide: data-background="white" -->

3. **Solution via Singular Value Decomposition (SVD)**:
   - Perform SVD on $S_{XX}^{-1/2} S_{XY} S_{YY}^{-1/2}$:
     $$U, \Sigma, V^\top = \text{SVD}\left(S_{XX}^{-1/2} S_{XY} S_{YY}^{-1/2}\right)$$
   - Projections:
     $P_X = S_{XX}^{-1/2} U$
     $P_Y = S_{YY}^{-1/2} V$
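----

<!-- .slide: data-background="white" -->

### CCA Projections in Code

A minimal NumPy sketch of the computation above (an illustration, not the paper's released code; the ridge term `eps` is our addition for numerical stability):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mat_pow

def cca_projections(X, Y, eps=1e-4):
    # Center the features (the scatter matrices above assume this).
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    S_xx = X.T @ X + eps * np.eye(X.shape[1])
    S_yy = Y.T @ Y + eps * np.eye(Y.shape[1])
    S_xy = X.T @ Y
    # Whitening matrices S_XX^{-1/2} and S_YY^{-1/2}.
    S_xx_inv_sqrt = mat_pow(S_xx, -0.5)
    S_yy_inv_sqrt = mat_pow(S_yy, -0.5)
    U, sigma, Vt = np.linalg.svd(S_xx_inv_sqrt @ S_xy @ S_yy_inv_sqrt)
    P_X = S_xx_inv_sqrt @ U     # projections for the first dataset
    P_Y = S_yy_inv_sqrt @ Vt.T  # projections for the second dataset
    return P_X, P_Y, sigma      # sigma holds the canonical correlations
```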
----

<!-- .slide: data-transition="zoom" -->

### Transformation Matrix for Model Merging

Using CCA to align and merge neural network models.

1. **Compute Projections**:
   - For activations $X_i^A$ and $X_i^B$ from models $A$ and $B$, compute $P_A, P_B$.

----

2. **Transformation Matrix**:
   - Transform model $B$'s parameters to align them with model $A$'s:
     $T_i = \left(P_B P_A^{-1}\right)^\top$
3. **Merge Parameters**:
   - Merge the aligned parameters of models A and B:
     $W_i = \frac{1}{2} \left(W_i^A + T_i W_i^B T_{i-1}^{-1}\right)$

Note: This method captures complex relationships between neurons better than permutation-based methods. A code sketch of these two steps follows.
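----

<!-- .slide: data-background="white" -->

### Merging One Layer in Code

A hedged NumPy sketch of steps 2 and 3, reusing `cca_projections` from the earlier slide (function and variable names are ours, not the paper's):

```python
import numpy as np

def cca_merge_layer(W_A, W_B, P_A, P_B, T_prev):
    # T_i = (P_B P_A^{-1})^T maps model B's features onto model A's.
    T = (P_B @ np.linalg.inv(P_A)).T
    # Undo the previous layer's transform on the input side, then average.
    W_merged = 0.5 * (W_A + T @ W_B @ np.linalg.inv(T_prev))
    return W_merged, T
```

Iterating this layer by layer, starting from $T_0 = I$, yields the merged network.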
----

<!-- .slide: data-transition="zoom" data-background="white" -->

![](https://hackmd.io/_uploads/ByWyisoXC.png)

---

<!-- .slide: data-transition="zoom" data-background="pink" -->

### Experiments

----

<!-- .slide: data-background="white" -->

### Datasets

CIFAR-10, CIFAR-100, and ImageNet-200

![cifar](https://datasets.activeloop.ai/wp-content/uploads/2022/09/CIFAR-100-dataset-Activeloop-Platform-visualization-image.webp)

----

<!-- .slide: data-transition="fade" data-background="white" -->

#### VGG and ResNet Models

<div style="display: flex; justify-content: center; align-items: center;">
  <img src="https://production-media.paperswithcode.com/methods/vgg_7mT4DML.png" style="width:45%;">
  <div style="margin-left: 10px;">
    <em><small>Taken from <a href="https://paperswithcode.com/method/vgg" target="_blank">here</a></small></em>
  </div>
</div>
<div style="display: flex; justify-content: center; align-items: center;">
  <img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*C8jf92MeHZnxnbpMkz6jkQ.png" style="width:45%;">
  <div style="margin-left: 10px;">
    <em><small>Taken from <a href="https://medium.com/@siddheshb008/resnet-architecture-explained-47309ea9283d" target="_blank">here</a></small></em>
  </div>
</div>

----

<!-- .slide: data-transition="zoom" data-background="brown" -->

## Results

----

<!-- .slide: data-transition="zoom" data-background="white" -->

### Accuracy Improvement

<div style="text-align: center;">
  <div style="text-align: center;">
    <img src="https://hackmd.io/_uploads/Bkg4Ve27C.png" style="width:70%;">
  </div>
  <div style="text-align: center;">
    <img src="https://hackmd.io/_uploads/H1yAre3XA.png" style="width:50%;">
    <em><small><p>Top and bottom: CCA Merge significantly outperforms the other baselines. Bottom: ResNet18x4 on ImageNet-200.</p></small></em>
  </div>
</div>

----

<!-- .slide: data-transition="fade" data-background="white" -->

### CCA Merging Finds Better Common Representations Between Many Models

![image](https://hackmd.io/_uploads/rk3edlhQR.png)

----

<!-- .slide: data-transition="zoom" -->

### Flexibility of CCA

Consider merging two models, $A$ and $B$, at a specific layer with neuron sets $\left\{z_i^M\right\}_{i=1}^n$ for $M \in \left\{A, B\right\}$.

- **Correlation matrix $C$**:
  - Element $C_{ij}$: correlation between neuron $z_i^A$ in model A and neuron $z_j^B$ in model B.
- We analyze the distribution of correlations for each neuron $z_i^A$.

----

- **Permutation Hypothesis**:
  - One-to-one mapping between the neurons of models $A$ and $B$.
  - High correlation with a single neuron and near-zero correlations with the others.
- **CCA Merge Hypothesis**:
  - Multiple neurons from model $B$ can be highly correlated with $z_i^A$: features are distributed across multiple neurons of model $B$.

----

<!-- .slide: data-transition="fade" data-background="white" -->

### Empirical Evidence of CCA's Flexibility

![image](https://hackmd.io/_uploads/SJlcnl27A.png)

----

<!-- .slide: data-transition="zoom" data-background="white" -->

## Distribution of Correlation and CCA Merge Coefficients

<div style="text-align: center;">
  <img src="https://hackmd.io/_uploads/SycIJbhmC.png" style="width:55%;">
  <em><small>Figure 3: Distributions of top-1 (left) and top-2 (right) correlations (blue) and CCA Merge transformation coefficients (orange) across neurons from model A at two merging layers. The Wasserstein distances between the distributions of top-$k$ correlations and top-$k$ CCA Merge coefficients are reported.</small></em>
</div>

----

<!-- .slide: data-transition="zoom" data-background="white" -->

## CCA Merge is Better in OoD Settings

![image](https://hackmd.io/_uploads/HyX1-ZnQR.png)

---

<!-- .slide: data-transition="zoom" data-background="green" -->

# Current Work: Pairwise Merging

----

- <!-- .element: class="fragment" --> <b>LMC</b> implies a low-loss linear path between two minima, given adequate permutations.
- <!-- .element: class="fragment" --> Interpolating with equal contributions from each model might not be optimal.
- <!-- .element: class="fragment" --> Individual features from model A might be lost after averaging with model B.

----

<!-- .slide: data-transition="zoom" data-background="white" -->

<div style="text-align: center;">
  <img src="https://github.com/samuela/git-re-basin/blob/main/mnist_video.gif?raw=true" style="width:90%;">
</div>

----

<!-- .slide: data-transition="zoom" data-background="cyan" -->

## Non-Uniform Parameter-Wise Model Merging

$$W_i = \mathbf{\alpha}_i \odot W_i^A + (\mathbf{1} - \mathbf{\alpha}_i) \odot W_i^{B'}$$

where $\odot$ is the elementwise product, $\mathbf{1}$ is the matrix of ones of the appropriate size, and $W_i^{B'}$ denotes model $B$'s parameters after alignment. We use gradient-based optimization to learn the elements of $\mathbf{\alpha} = \bigcup_i \mathbf{\alpha}_i$, as sketched on the next slide.
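----

<!-- .slide: data-background="white" -->

### Learning the Merging Coefficients

A PyTorch sketch of this idea under our assumptions (this is work in progress, so names like `loss_fn` and the training setup are hypothetical):

```python
import torch

def merge_with_alphas(weights_A, weights_B, alphas):
    # W_i = alpha_i * W_i^A + (1 - alpha_i) * W_i^B', elementwise.
    return [a * wa + (1 - a) * wb
            for a, wa, wb in zip(alphas, weights_A, weights_B)]

def learn_alphas(weights_A, weights_B, loss_fn, steps=100, lr=1e-2):
    # One learnable coefficient per parameter, initialized at 0.5
    # (which recovers plain averaging).
    alphas = [torch.full_like(w, 0.5, requires_grad=True)
              for w in weights_A]
    opt = torch.optim.Adam(alphas, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        merged = merge_with_alphas(weights_A, weights_B, alphas)
        loss = loss_fn(merged)  # task loss of the merged network
        loss.backward()
        opt.step()
    return [a.detach() for a in alphas]
```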
----

<!-- .slide: data-transition="fade" data-background="white" -->

<div style="text-align: center;">
  <div style="text-align: center;">
    <img src="https://hackmd.io/_uploads/BkWEQSnQR.png" style="width:60%;">
  </div>
  <div style="text-align: center;">
    <img src="https://hackmd.io/_uploads/BkkBQr2QA.png" style="width:60%;">
  </div>
</div>

----

<!-- .slide: data-transition="fade" data-background="white" -->

<div style="text-align: center;">
  <img src="https://hackmd.io/_uploads/ByREVB3XA.png" style="width:90%;">
</div>

----

<!-- .slide: data-transition="zoom" data-background="white" -->

## Merging Multiple Models

_Hierarchical Pairwise_

![image](https://hackmd.io/_uploads/SJpNml2XR.png)

----

<!-- .slide: data-transition="fade" data-background="white" -->

![image](https://hackmd.io/_uploads/Syb6VHhmA.png)

----

<!-- .slide: data-transition="fade" data-background="white" -->

## Feature Interpretation

![image](https://hackmd.io/_uploads/HJbJBS3QA.png)

----

## What about initialization?

- Idea: Fisher Merging ([Matena & Raffel, 2021](https://arxiv.org/abs/2111.09832); [Dhawan et al., 2023](https://arxiv.org/abs/2311.10291))

<div style="text-align: center;">
  <img src="https://hackmd.io/_uploads/Sk38JLhQ0.png" style="width:80%;">
</div>

---

<!-- .slide: data-transition="zoom" data-background="purple" -->

## Conclusion and Takeaways

----

<!-- .slide: data-transition="fade" data-background="white" -->

### Key Takeaways

----

<!-- .slide: data-transition="fade" -->

### Potential Research Paths

Future work could address:

----

<!-- .slide: data-transition="fade" data-background="aquamarine" -->

### Points for Reflection (Courtesy of ChatGPT)

- The ethical implications of more accessible models.
- How CCA Merge's parameter efficiency might influence the democratization of AI.
- The long-term impact of such techniques on the development cycle of models.
{"description":"Le Yu,Bowen Yu,Haiyang Yu,Fei Huang,Yongbin Li.","showTags":"false","contributors":"[{\"id\":\"adb0403f-b4e6-4ebc-be17-cc638e9f5cfe\",\"add\":23751,\"del\":10554}]","title":"Harmony in Diversity"}