# data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language - ICML2022

###### tags: `Yang`

### Introduction

Research in self-supervised algorithms has focused on individual modalities, which results in modality-specific designs and learning biases (the operations differ from domain to domain). While learning biases are certainly helpful, it is often unclear whether they will generalize to other modalities. Moreover, leading theories on the biology of learning imply that humans likely use similar learning processes to understand the visual world as they do for language.

In an effort to get closer to machines that learn in general ways about the environment, we designed data2vec, a framework for general self-supervised learning that works for images, speech and text, where the learning objective is identical in each modality. The core idea is to **predict latent representations** of the full input data based on a masked view of the input, in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech, which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input.

### Method

#### Model Architecture

![](https://i.imgur.com/xRODlmc.png)

#### Training targets

The model is trained to predict the representations of the original, unmasked training sample based on an encoding of the masked sample. The representations we predict are ***contextualized representations***: they encode the particular time-step but also other information from the sample, due to the use of self-attention in the Transformer network.

- Teacher parameterization

  The encoding of the unmasked training sample is parameterized by an exponentially moving average (EMA) of the model parameters:

  $$ \Delta \leftarrow \tau \Delta+(1-\tau) \theta $$

  We use a schedule that linearly increases $\tau$ from $\tau_0$ to the target value $\tau_e$ over the first $\tau_n$ updates (see the EMA sketch after this section). We found it more efficient and slightly more accurate to share the parameters of the feature encoder and the positional encoder between the teacher and student networks.

- Target

  Training targets are constructed from the outputs of the top $K$ blocks of the teacher network, for the time-steps that are masked in student mode. The output of block $l$ at time-step $t$ is denoted $a_t^l$. We apply a normalization to each block to obtain $\hat{a}_t^l$ and average over the top $K$ blocks (see the target-construction sketch after this section):

  $$ y_t = \frac{1}{K}\sum_{l=L-K+1}^{L} \hat{a}_t^l $$

#### Objective

Given contextualized training targets $y_t$, we use a Smooth L1 loss to regress these targets:

$$
\mathcal{L}\left(y_{t}, f_{t}(x)\right)=
\begin{cases}
\frac{1}{2}\left(y_{t}-f_{t}(x)\right)^{2} / \beta & \left|y_{t}-f_{t}(x)\right| \leq \beta \\
\left|y_{t}-f_{t}(x)\right|-\frac{1}{2} \beta & \text{otherwise}
\end{cases}
$$

where $\beta$ controls the transition from a squared loss to an L1 loss, depending on the size of the gap between the target $y_t$ and the model prediction $f_t(x)$ at time-step $t$. The advantage of this loss is that it is less sensitive to outliers; however, $\beta$ needs to be tuned.
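The teacher parameterization above is simple enough to write out directly. Below is a minimal PyTorch sketch, assuming `teacher` and `student` are modules with identical parameter layouts; the function names (`tau_schedule`, `ema_update`) and the schedule helper are illustrative, not the authors' fairseq implementation.

```python
import torch


def tau_schedule(step: int, tau_0: float, tau_e: float, tau_n: int) -> float:
    """Linearly anneal tau from tau_0 to tau_e over the first tau_n updates."""
    if step >= tau_n:
        return tau_e
    return tau_0 + (tau_e - tau_0) * step / tau_n


@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, tau: float) -> None:
    """Delta <- tau * Delta + (1 - tau) * theta, applied parameter-wise."""
    for p_teacher, p_student in zip(teacher.parameters(), student.parameters()):
        p_teacher.mul_(tau).add_(p_student, alpha=1.0 - tau)
```

In training, `ema_update(teacher, student, tau_schedule(step, tau_0, tau_e, tau_n))` would be called once per optimizer step; the shared feature and positional encoders mentioned above would simply be reused by both networks rather than averaged.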
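Target construction averages the normalized outputs of the top $K$ teacher blocks. The sketch below assumes each block output has shape `(batch, time, dim)` and uses instance normalization over the time dimension (the paper discusses instance normalization and parameter-free layer normalization as the normalization choices; instance norm is shown here as one concrete option, and `build_targets` is an illustrative name).

```python
import torch
import torch.nn.functional as F


def build_targets(block_outputs: list[torch.Tensor], k: int) -> torch.Tensor:
    """y_t = mean of the normalized outputs of the top-K teacher blocks.

    block_outputs: length-L list of tensors shaped (batch, time, dim),
    ordered from the bottom block to the top block.
    """
    top_k = block_outputs[-k:]  # blocks L-K+1, ..., L
    normalized = [
        # instance norm expects (batch, channels, length), so swap time and dim
        F.instance_norm(a.transpose(1, 2)).transpose(1, 2)
        for a in top_k
    ]
    return torch.stack(normalized).mean(dim=0)
```

Only the time-steps that were masked in student mode are regressed onto these targets.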
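The Smooth L1 objective also maps directly onto a short function. The explicit piecewise form below mirrors the equation above, and PyTorch's built-in `smooth_l1_loss` with the same `beta` is equivalent; `data2vec_loss` is an illustrative name, not the paper's code.

```python
import torch
import torch.nn.functional as F


def data2vec_loss(y: torch.Tensor, f_x: torch.Tensor, beta: float) -> torch.Tensor:
    """Smooth L1 regression of student predictions f_x onto teacher targets y."""
    diff = (y - f_x).abs()
    loss = torch.where(
        diff <= beta,
        0.5 * diff.pow(2) / beta,  # squared region: |y - f(x)| <= beta
        diff - 0.5 * beta,         # linear region: otherwise
    )
    return loss.mean()


# Equivalent library call:
# F.smooth_l1_loss(f_x, y, beta=beta)
```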
### Results

- Computer vision

  ![](https://i.imgur.com/E0zLadw.png)

- Speech processing

  ![](https://i.imgur.com/LXC1r03.png)

- Natural language processing

  ![](https://i.imgur.com/Kvwo7Rn.png)

- Ablation study

  ![](https://i.imgur.com/PQJesvn.png)

### Conclusion

- Recent work showed that uniform model architectures can be effective for multiple modalities.
- In contrast, we show that a single self-supervised learning regime can be effective for vision, speech and language.
- The key idea is to regress contextualized representations based on a partial view of the input.