# BYOL

###### tags: `self-sup`

## Overview

**BYOL** is a method for **self-supervised image representation learning**. It uses two networks, an **online** network and a **target** network. The online network is trained to predict the target network's representation of a differently augmented view of the same image, and the target network is updated as a slow-moving average of the online network. BYOL does not need **negative pairs**.

## Introduction

The authors argue that the following techniques keep BYOL from converging to a collapsed solution:

1. Adding a predictor to the online network.
2. Updating the target network with a slow-moving average of the online network.

Under linear evaluation on ImageNet, BYOL reaches 74.3% top-1 accuracy with a ResNet-50.

## Method

1. **Motivation**: Without negative samples, directly training a siamese network leads to a collapsed solution. Experiments show that **training the online network to predict a fixed, randomly initialized target network reaches 18.8% top-1 accuracy on ImageNet (the randomly initialized network by itself reaches only 1.4%)**. The authors therefore propose **a method in which the online network learns from the target network while the target network's representation quality gradually improves**.
2. **Steps**
    + Given an input image $x$, apply two different augmentations $t$ and $t^{\prime}$ to obtain the views $v = t(x)$ and $v^{\prime} = t^{\prime}(x)$.
    + Feed $v$ to the online network to get the representation $y_{\theta} = f_{\theta}(v)$ and the projection $z_{\theta} = g_{\theta}(y_{\theta})$.
    + Feed $v^{\prime}$ to the target network to get the representation $y_{\xi}^{\prime} = f_{\xi}(v^{\prime})$ and the projection $z_{\xi}^{\prime} = g_{\xi}(y_{\xi}^{\prime})$.
    + Apply the predictor $q_{\theta}$ to $z_{\theta}$ (the predictor is applied only to the online branch).
    + L2-normalize $q_{\theta}(z_{\theta})$ and $z_{\xi}^{\prime}$.
    $$
    \bar{q}_{\theta}(z_{\theta}) = \frac{q_{\theta}(z_{\theta})}{\| q_{\theta}(z_{\theta}) \|_2}, \quad \bar{z}_{\xi}^{\prime} = \frac{z_{\xi}^{\prime}}{\| z_{\xi}^{\prime} \|_2}
    $$
    + Compute the loss.
    $$
    \mathcal{L}_{\theta, \xi} = \| \bar{q}_{\theta}(z_{\theta}) - \bar{z}_{\xi}^{\prime} \|_2^2 = 2 - 2 \cdot \frac{\langle q_{\theta}(z_{\theta}), z_{\xi}^{\prime} \rangle}{\| q_{\theta}(z_{\theta}) \|_2 \cdot \| z_{\xi}^{\prime} \|_2}.
    $$
    + Swap the roles of $v$ and $v^{\prime}$ (feed $v^{\prime}$ to the online network and $v$ to the target network) and repeat the steps above to obtain $\tilde{\mathcal{L}}_{\theta, \xi}$.
    + Minimize $\mathcal{L}_{\theta, \xi}^{BYOL} = \mathcal{L}_{\theta, \xi} + \tilde{\mathcal{L}}_{\theta, \xi}$ with respect to $\theta$ only, not $\xi$. Here $\eta$ is the learning rate and $\tau \in [0, 1]$ is the target decay rate.
    $$
    \begin{align}
    &\theta \leftarrow \mathrm{optimizer}(\theta, \nabla_{\theta} \mathcal{L}_{\theta, \xi}^{BYOL}, \eta), \\
    &\xi \leftarrow \tau \xi + (1-\tau) \theta.
    \end{align}
    $$
    + After training, only the encoder $f_{\theta}$ is kept.

![](https://hackmd.io/_uploads/Skke5OVSh.png)

:::warning
The authors do not give a concrete explanation of why adding the predictor helps.
:::
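
The loss and the target update can be checked numerically with a small sketch. This is a toy illustration on plain Python lists, not the authors' implementation: the function names (`l2_normalize`, `byol_loss`, `cosine_form`, `ema_update`) and all numeric values are hypothetical. There are no networks and no autograd here, so the restriction that gradients flow only through $\theta$ is not modeled.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def byol_loss(prediction, target_projection):
    """Squared L2 distance between the normalized prediction q(z) and target projection z'."""
    p = l2_normalize(prediction)
    z = l2_normalize(target_projection)
    return sum((a - b) ** 2 for a, b in zip(p, z))


def cosine_form(prediction, target_projection):
    """Equivalent form of the same loss: 2 - 2 * cosine similarity."""
    dot = sum(a * b for a, b in zip(prediction, target_projection))
    norm_p = math.sqrt(sum(a * a for a in prediction))
    norm_z = math.sqrt(sum(b * b for b in target_projection))
    return 2.0 - 2.0 * dot / (norm_p * norm_z)


def ema_update(target_params, online_params, tau):
    """Element-wise target update: xi <- tau * xi + (1 - tau) * theta."""
    return [tau * xi + (1.0 - tau) * th
            for xi, th in zip(target_params, online_params)]


# Hypothetical toy vectors standing in for q_theta(z_theta) and z'_xi.
q_z = [1.0, 2.0, 2.0]
z_target = [2.0, 0.0, 1.0]

# Both forms of the loss agree up to floating-point error.
loss = byol_loss(q_z, z_target)
assert math.isclose(loss, cosine_form(q_z, z_target))

# Symmetrized loss: add the term with the two views swapped.
# Here the swapped pair is just another pair of made-up vectors.
q_z_swapped = [0.5, 1.0, 0.0]
z_target_swapped = [1.0, 1.0, 0.0]
total_loss = loss + byol_loss(q_z_swapped, z_target_swapped)

# One target update with a decay rate close to 1 moves xi only slightly toward theta.
xi = [0.0, 1.0]
theta = [1.0, 3.0]
xi = ema_update(xi, theta, tau=0.99)
```

With $\tau$ close to 1 the target parameters change slowly, which is the "slow-moving average" described above. With $\tau = 1$ the target never changes, which corresponds to the fixed, randomly initialized target from the motivation.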