> [Paper link](https://arxiv.org/pdf/2106.07139.pdf) | [Note link](https://zhuanlan.zhihu.com/p/429634259) | Aug 2021

## Abstract

It is now the consensus of the AI community to **adopt PTMs as the backbone for downstream tasks rather than learning models from scratch.** These breakthroughs are driven by the surge of computational power and the increasing availability of data, and point towards four important directions:

1. Designing effective architectures
2. Utilizing rich contexts
3. Improving computational efficiency
4. Conducting interpretation and theoretical analysis

The paper presents the authors' views on these directions, hoping to inspire and advance the future study of PTMs.

## Introduction

Deep neural networks are data-hungry: they easily overfit without sufficient training data, and manually annotating large-scale data is expensive and time-consuming. One milestone for this issue is the introduction of **transfer learning**, which allows new problems to be solved with very few labeled samples.

Transfer learning has two phases:

1. A pre-training phase to capture knowledge from one or more source tasks
2. A fine-tuning phase to transfer the captured knowledge to target tasks

<center> <img src = "https://i.imgur.com/dNYicOI.png"> </center> <br>

Since this method alleviates the data-hunger problem, it was soon widely applied in the field of computer vision (CV), e.g. to CNN models. This triggered the first wave of exploring pre-trained models (PTMs) in the era of deep learning, and the NLP community then also became aware of the potential of PTMs.

<center> <img src = "https://i.imgur.com/jpZWPFX.png"> </center> <br>

To take full advantage of large-scale unlabeled corpora and provide versatile linguistic knowledge for NLP tasks, the NLP community adopts **self-supervised learning**. The motivation of self-supervised learning is to leverage intrinsic correlations in the text as supervision signals instead of human annotation. Through self-supervised learning, tremendous amounts of unlabeled textual data can be utilized to capture versatile linguistic knowledge without a labor-intensive annotation workload. In short, self-supervised learning obtains supervisory signals from the data itself, often by leveraging its underlying structure.

:::info
**History of NLP with PTMs**
1. Pre-training shallow networks to capture the semantic meanings of words (Word2Vec/GloVe), which are limited in representing polysemous words in different contexts
2. Pre-training RNNs to provide contextualized word embeddings, still limited by their model size and depth
3. Transformers make it feasible to train very deep neural models for NLP tasks (BERT/GPT)
:::

Although PTMs have improved model performance on various AI tasks, several fundamental issues about PTMs still remain:

- The nature of the knowledge hidden in the huge number of model parameters is still unclear
- The computational cost of training is huge

## Background

In this section, the paper introduces the development of pre-training in the AI spectrum, from early **supervised pre-training** to current **self-supervised pre-training**.

### Transfer Learning and Supervised Pre-Training

The early efforts of pre-training are mainly involved in **transfer learning**, which aims to capture important knowledge from multiple source tasks and then apply the knowledge to a target task. Generally, two pre-training approaches are widely explored in transfer learning:

1. **Feature transfer**: pre-train effective feature representations to pre-encode knowledge across domains and tasks
2. **Parameter transfer**: follows the intuitive assumption that source tasks and target tasks can share model parameters or prior distributions of hyper-parameters, and transfers knowledge by fine-tuning the pre-trained parameters with the data of target tasks
### Self-Supervised Learning and Self-Supervised Pre-Training

<center> <img src = "https://i.imgur.com/t1PXmoS.png"> Transfer learning can be categorized under four sub-settings (colored terms) </center> <br>

Although supervised pre-training methods such as CoVe have achieved promising results on NLP tasks, it is nearly impossible to annotate a textual dataset as large as ImageNet. Hence, applying self-supervised learning to utilize unlabeled data becomes the best choice to pre-train models for NLP tasks.

After **Transformers** were proposed to deal with sequential data, PTMs for NLP tasks entered a new stage, because it became possible to train much deeper language models than with conventional CNNs and RNNs.

## Transformer and Representative PTMs

This section introduces two landmark Transformer-based PTMs, GPT and BERT; nearly all subsequent PTMs are variants of these two models. The final part of this section gives a brief review of typical variants after GPT and BERT to reveal the recent development of PTMs.

<center> <img src = "https://i.imgur.com/GUQykZ5.png"> </center>

### Transformer

Each **encoder** block is composed of a multi-head self-attention layer and a position-wise feed-forward layer. Each **decoder** block has an additional cross-attention layer, since the decoder needs to consider the output of the encoder as context for generation. Between neural layers, **residual connections** and **layer normalization** are employed, making it possible to train a deep Transformer.

<center> <img src = "https://i.imgur.com/lHUGkCD.png"> <a href = "https://pytorch.org/tutorials/beginner/transformer_tutorial.html">LANGUAGE MODELING WITH NN.TRANSFORMER AND TORCHTEXT</a> </center>

#### Attention Layer

**Self-attention layers are the key to the success of Transformer.** Intuitively, $\mathcal{Q}$ is the set of vectors to calculate the attention for, and $\mathcal{K}$ is the set of vectors to calculate the attention against. Given a query set $\mathcal{Q} = \{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$, a key set $\mathcal{K} = \{\mathbf{k}_1, \ldots, \mathbf{k}_m\}$, and a value set $\mathcal{V} = \{\mathbf{v}_1, \ldots, \mathbf{v}_m\}$,

$$
\begin{aligned}
& \left\{\mathbf{h}_1, \ldots, \mathbf{h}_n\right\}=\operatorname{ATT}(\mathcal{Q}, \mathcal{K}, \mathcal{V}) \\
& \mathbf{h}_i=\sum_{j=1}^m a_{i j} \mathbf{v}_j \\
& a_{i j}=\frac{\exp \left(\operatorname{ATT-Mask}\left(\frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}}\right)\right)}{\sum_{l=1}^m \exp \left(\operatorname{ATT-Mask}\left(\frac{\mathbf{q}_i \cdot \mathbf{k}_l}{\sqrt{d_k}}\right)\right)}
\end{aligned}
$$

Note that the masking function $\operatorname{ATT-Mask}(\cdot)$ is used to restrict which key-value pairs each query vector can attend to. If we do not want $\mathbf{q}_i$ to attend to $\mathbf{k}_j$, $\operatorname{ATT-Mask}(x)=-\infty$; otherwise $\operatorname{ATT-Mask}(x)=x$.
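As a concrete illustration of the masked attention defined above, here is a minimal sketch (assuming PyTorch; the function name, shapes, and the causal-mask example are illustrative, not from the paper):

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(Q, K, V, mask=None):
    """Scaled dot-product attention following the ATT-Mask convention above.

    Q: (n, d_k), K: (m, d_k), V: (m, d_v);
    mask: (n, m) boolean, True where query i must NOT attend to key j.
    """
    # Scores q_i . k_j / sqrt(d_k)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    if mask is not None:
        # ATT-Mask: set disallowed positions to -inf so softmax gives them weight 0
        scores = scores.masked_fill(mask, float("-inf"))
    A = F.softmax(scores, dim=-1)   # a_ij, normalized over the keys
    return A @ V                    # h_i = sum_j a_ij v_j

# Example: a causal (decoder-style) mask that forbids attending to future positions
x = torch.randn(4, 8)
causal = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
H = masked_attention(x, x, x, causal)   # shape (4, 8)
```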
In matrix form, the attention can be simplified to

$$
\begin{aligned}
& \mathbf{H}=\operatorname{ATT}(\mathbf{Q}, \mathbf{K}, \mathbf{V})=\mathbf{A} \mathbf{V} \\
& \mathbf{A}=\operatorname{Softmax}\left(\operatorname{ATT-Mask}\left(\frac{\mathbf{Q} \mathbf{K}^{\top}}{\sqrt{d_k}}\right)\right)
\end{aligned}
$$

where $\operatorname{Softmax}(\cdot)$ is applied in a row-wise manner, $\mathbf{A} \in \mathbb{R}^{n \times m}$ is the attention matrix, and $\mathbf{H} \in \mathbb{R}^{n \times d_v}$ is the result.

<center> <img src = "https://i.imgur.com/pSjX6G9.png" width="100%"> <a href = "https://mkh800.medium.com/%E7%AD%86%E8%A8%98-attention-%E5%8F%8A-transformer-%E6%9E%B6%E6%A7%8B%E7%90%86%E8%A7%A3-c9c5479fdc8a">Scaled Dot-Product Attention & Multi-Head Attention</a> </center> <br>

Transformer applies a multi-head attention layer, defined as follows:

$$
\begin{aligned}
\mathbf{H} & =\operatorname{MH-ATT}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) \\
& =\operatorname{Concat}\left(\mathbf{H}_1, \ldots, \mathbf{H}_h\right) \mathbf{W}^O \\
\mathbf{H}_i & =\operatorname{ATT}\left(\mathbf{Q} \mathbf{W}_i^Q, \mathbf{K} \mathbf{W}_i^K, \mathbf{V} \mathbf{W}_i^V\right)
\end{aligned}
$$

#### Position-Wise Feed-Forward Layer

Each block of Transformer also contains a position-wise feed-forward layer. Given the packed input matrix $\mathbf{X} \in \mathbb{R}^{n \times d_i}$ representing a set of input vectors, where $d_i$ is the vector dimension, a position-wise feed-forward layer is defined as

$$
\mathbf{H}=\operatorname{FFN}(\mathbf{X})=\sigma\left(\mathbf{X} \mathbf{W}_1+\mathbf{b}_1\right) \mathbf{W}_2+\mathbf{b}_2,
$$

where $\sigma(\cdot)$ is the activation function (usually the ReLU function).

#### Residual Connection and Normalization

Residual connections and layer normalization make it possible for the Transformer architecture to be deep. Formally, given a neural layer $f(\cdot)$, the **residual connection** and **normalization** layer is defined as

$$
\mathbf{H}=\operatorname{A \& N}(\mathbf{X})=\operatorname{LayerNorm}(f(\mathbf{X})+\mathbf{X}),
$$

where $\operatorname{LayerNorm}(\cdot)$ denotes the layer normalization operation.

:::info
**Applications of Attention in the Model**
* In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence, mimicking the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].
* The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
* Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. Leftward information flow is prevented in the decoder to preserve the auto-regressive property. This is implemented inside scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax which correspond to illegal connections.
:::

**Transformer also serves as the backbone neural structure for the subsequently derived PTMs.** Next, the paper introduces two landmarks that completely opened the door to the era of large-scale self-supervised PTMs: GPT and BERT.
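Before moving on to GPT and BERT, the components above can be assembled into a single encoder block. The following is a minimal sketch (assuming PyTorch and its built-in `nn.MultiheadAttention`; the dimensions are placeholders, not values from the paper):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: multi-head self-attention + position-wise FFN,
    each wrapped in a residual connection followed by layer normalization."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(            # FFN(X) = ReLU(X W1 + b1) W2 + b2
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention: queries, keys, and values all come from the previous layer
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)         # A&N(X) = LayerNorm(f(X) + X)
        x = self.norm2(x + self.ffn(x))
        return x

block = EncoderBlock()
tokens = torch.randn(2, 16, 512)             # (batch, sequence length, d_model)
print(block(tokens).shape)                   # torch.Size([2, 16, 512])
```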
**In general, GPT is good at natural language generation, while BERT focuses more on natural language understanding.**

### GPT

GPT uses the Transformer decoder as its backbone. Since GPT adopts autoregressive language modeling as the pre-training objective, the cross-attention in the original Transformer decoder is removed. Formally, given a corpus consisting of tokens $\mathcal{X}=\left\{x_0, x_1, \ldots, x_n, x_{n+1}\right\}$, GPT applies a standard language modeling objective by maximizing the following log-likelihood:

$$
\mathcal{L}(\mathcal{X})=\sum_{i=1}^{n+1} \log P\left(x_i \mid x_{i-k}, \ldots, x_{i-1} ; \Theta\right)
$$

where $k$ is the window size, the probability $P$ is modeled by the Transformer decoder with parameters $\Theta$, $x_0$ is the special token $[\mathrm{CLS}]$, and $x_{n+1}$ is the special token $[\mathrm{SEP}]$.

<center> <img src = "https://i.imgur.com/MsIH2k6.png"> </center><br>

The adaptation procedure of GPT to specific tasks is fine-tuning, using the pre-trained parameters of GPT as the starting point for downstream tasks.

### BERT

There are also two separate stages to adapt BERT to specific tasks: **pre-training** and **fine-tuning**.

<center> <img src = "https://i.imgur.com/F1EaYFL.png"> </center><br>

In the **pre-training phase**, BERT applies **autoencoding** language modeling rather than the autoregressive language modeling used in GPT: tokens are randomly replaced with a special token $[\mathrm{MASK}]$ and the model is trained to recover them. In this way, BERT learns a deep bidirectional representation of all tokens. Formally, given a corpus consisting of tokens $\mathcal{X}=\left\{x_0, x_1, \ldots, x_n, x_{n+1}\right\}$, BERT randomly masks $m$ tokens in $\mathcal{X}$ and then maximizes the following log-likelihood:

$$
\mathcal{L}(\mathcal{X})=\sum_{i=1}^m \log P\left([\mathrm{Mask}]_i=y_i \mid \tilde{\mathcal{X}} ; \Theta\right)
$$

where the probability $P$ is modeled by the Transformer encoder with parameters $\Theta$, $\tilde{\mathcal{X}}$ is the result of masking some tokens in $\mathcal{X}$, $[\mathrm{Mask}]_i$ is the $i$-th masked position, and $y_i$ is the original token at this position.

Besides MLM, the objective of **next sentence prediction (NSP)** is also adopted to capture discourse relationships between sentences for downstream tasks involving multiple sentences.

The input sequence consists of sentences concatenated with the special token $[\mathrm{SEP}]$ and could represent:

1. Sentence pairs in paraphrase
2. Hypothesis-premise pairs in entailment
3. Question-passage pairs in question answering
4. A single sentence for text classification or sequence tagging

The representation of $[\mathrm{CLS}]$ can be fed into an extra layer for classification.

### After GPT and BERT

* [RoBERTa](https://blog.csdn.net/fengxinlinux/article/details/109447004)
    * Removes the NSP task (argued to be unhelpful for training BERT)
    * More training steps, with a bigger batch size and more data
    * Training on longer sequences
    * Dynamically changing the $[\mathrm{MASK}]$ pattern
* [ALBERT](https://zhuanlan.zhihu.com/p/84273154)
    * Factorizes the input word embedding matrix into two smaller ones
    * Enforces parameter-sharing across all Transformer layers to significantly reduce parameters
    * Proposes the sentence order prediction (SOP) task to substitute BERT's NSP task

<center> <img src = "https://i.imgur.com/2X7GfZF.png" width = "100%"> </center>

## Designing Effective Architectures

In this section, the paper takes a deeper look at the PTMs developed after BERT.
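Many of the architectures below mix the two pre-training objectives introduced above, so the following toy sketch contrasts GPT-style autoregressive language modeling with BERT-style masked language modeling (assuming PyTorch; the random logits merely stand in for a real model's outputs):

```python
import torch
import torch.nn.functional as F

vocab_size, batch, seq_len = 1000, 2, 8
logits = torch.randn(batch, seq_len, vocab_size)        # stand-in for model outputs
tokens = torch.randint(0, vocab_size, (batch, seq_len))

# GPT-style autoregressive LM: the logits at position i predict token i+1,
# so drop the last position's logits and the first target token.
ar_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)

# BERT-style masked LM: only masked positions contribute to the loss
# (in practice ~15% of tokens are masked at random; here positions 2 and 5).
mask = torch.zeros(batch, seq_len, dtype=torch.bool)
mask[:, [2, 5]] = True
mlm_targets = tokens.masked_fill(~mask, -100)           # -100 = ignore_index
mlm_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), mlm_targets.reshape(-1), ignore_index=-100
)
print(ar_loss.item(), mlm_loss.item())
```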
### Unified Sequence Modeling

Downstream tasks and applications are versatile:

- Natural language understanding
- Open-ended language generation
- Non-open-ended language generation

Recently, however, the boundary between understanding and generation has become vague.

* Combining Autoregressive and Autoencoding Modeling
    * [XLNet](https://blog.csdn.net/u012526436/article/details/93196139): proposes permutated language modeling
    * [MPNet](https://zhuanlan.zhihu.com/p/197675066): amends XLNet's discrepancy that during pre-training XLNet does not know the sentence length, while in downstream tasks it does
    * [UniLM](https://www.cnblogs.com/gczr/p/12113434.html)
    * [GLM](https://zhuanlan.zhihu.com/p/579645487)
* Applying Generalized Encoder-Decoder
    Filling in blanks of variable length is a problem that neither a pure encoder nor a pure decoder could solve before
    * [MASS](https://blog.csdn.net/ljp1919/article/details/90312229): introduces the masked-prediction strategy
    * [T5](https://zhuanlan.zhihu.com/p/88377084): masks a variable-length span in the text with a single mask token and asks the decoder to recover the whole masked sequence
    * [BART](https://blog.csdn.net/u011150266/article/details/117742695): corrupts the source sequence with multiple operations such as truncation, deletion, etc.

:::info
**Challenges**
1. An encoder-decoder architecture introduces many more parameters than a single encoder/decoder
2. Encoder-decoder models do not perform very well on natural language understanding
:::

### Cognitive-Inspired Architectures

To move models closer to the human cognitive system, we can equip them with a maintainable working memory and a sustainable long-term memory.

* Maintainable Working Memory: a natural problem of Transformer is its fixed window size and quadratic space complexity, which significantly hinders its application to long document understanding and generation.
    * [Transformer-XL](https://zhuanlan.zhihu.com/p/70745925): segment-level recurrence and relative positional encoding
    * [CogQA](https://blog.csdn.net/m0_46522688/article/details/114338979): maintains a cognitive graph during multi-hop reading
    * [CogLTX](https://zhuanlan.zhihu.com/p/304764328): leverages a MemRecall language model to select sentences that should be maintained in the working memory, plus task-specific modules for answering or classification
* Sustainable Long-Term Memory: the success of GPT-3 shows that Transformers can memorize, but how?
    * Replace the feed-forward networks in a Transformer layer with large key-value memory networks
    * [REALM](https://zhuanlan.zhihu.com/p/360635601): constructs a sustainable external memory for Transformers
    * [RAG](https://blog.csdn.net/qq_40212975/article/details/109046150): extends masked pre-training to autoregressive generation

### More Variants of Existing PTMs

Besides the efforts to unify sequence modeling and build cognitive-inspired architectures, most current studies focus on optimizing BERT's architecture to **boost language models' performance on natural language understanding.**

## Utilizing Multi-Source Data

In this section, the paper introduces some typical PTMs that take advantage of multi-source heterogeneous data.

### Multilingual Pre-Training

Some researchers found that training one model on several languages can achieve even better benchmark performance than training several monolingual models. Representative early works include:
* Multilingual LSTMs: learn through parameter sharing
* WGAN: learns language-agnostic constraints by decoupling language representations into language-specific and language-agnostic parts

However, the above works only focus on specific tasks. To generalize across tasks, pre-training with self-supervised tasks and then fine-tuning on specific downstream tasks is a feasible approach.

* Multilingual tasks can be grouped by task objective into:
    * Understanding tasks: sentence-level or word-level classification
    * Generation tasks

- Related models
    - [multilingual BERT (mBERT)](https://zhuanlan.zhihu.com/p/353514133): pre-trained with the multilingual masked language modeling (MMLM) task; using multilingual data enables the model to learn cross-lingual representations
    - [XLM-R](https://blog.csdn.net/ljp1919/article/details/103206663): builds a bigger non-parallel dataset than previous ones, called CC-100, and achieves better performance

However, the MMLM task cannot make good use of parallel corpora, which are quite important for some NLP tasks such as machine translation. Therefore, XLM leverages bilingual sentence pairs to perform the **translation language modeling (TLM)** task. Compared with MLM, TLM requires models to predict the masked tokens depending on bilingual contexts.

- Similar works to TLM
    - [Unicoder](https://blog.csdn.net/gjh1716718326/article/details/122085422): CLWR / CLPC; this model is able to learn word-level alignments between different languages
    - ALM: automatically generates code-switched sequences from parallel sentences and performs MLM on them
    - InfoXLM, HICTL, ERNIE-M

### Multimodal Pre-Training

Modalities, such as audio, video, image and text, refer to how something happens or is experienced. Existing cross-modal PTMs mainly focus on

1. Improving the model architecture
2. Utilizing more data
3. Designing better pre-training tasks

Image-text PTMs map visual and textual content into a unified semantic space:

- Two-stream: [ViLBERT](https://blog.csdn.net/csdn_tclz/article/details/109448343), [LXMERT](https://blog.csdn.net/xiasli123/article/details/104166051)
- Single-stream: VisualBERT, Unicoder-VL, B2T2

> Some related works on image PTMs are skipped here.

For video and audio PTMs: VideoBERT, [SpeechBERT](https://www.jianshu.com/p/6a6fe370a964)

### Knowledge-Enhanced Pre-Training

This subsection shows how external knowledge can be incorporated according to the knowledge format, and introduces several methods that attempt to combine knowledge with PTMs (e.g., integrating entity and relation embeddings, or their alignments with the text).

## Improving Computational Efficiency

In this section, the paper introduces how to improve computational efficiency from the following three aspects.

### System-Level Optimization

System-level optimization methods are often model-agnostic and do not change the underlying learning algorithms. Therefore, they are widely used in training large-scale PTMs.
* Single-Device Optimization: reduce redundant representations of floating-point numbers (e.g., mixed-precision training with lower-precision formats such as FP16)
* Multi-Device Optimization: data parallelism is preferred as long as it can cope with the excessive memory requirements

### Efficient Pre-Training

* Training Methods
* Model Architectures

### Model Compression

* Parameter Sharing
* Model Pruning: cuts off the useless parts of PTMs to accelerate inference while maintaining performance
* Knowledge Distillation: trains a small student model to reproduce the behavior of a large teacher model
* Model Quantization: compresses higher-precision floating-point parameters into lower-precision ones

## Interpretation and Theoretical Analysis

Beyond the superior performance of PTMs on various NLP tasks, researchers also explore how to interpret the behaviors of PTMs.

### Knowledge of PTMs

* Linguistic knowledge:
    * Representation Probing: fix the parameters of PTMs and train a new linear layer on the hidden representations of PTMs for a specific probing task
    * Representation Analysis: use the hidden representations of PTMs to compute statistics such as distances or similarities
    * Attention Analysis: compute statistics about attention matrices; more suitable for discovering the hierarchical structure of texts
    * Generation Analysis: use language models to directly estimate the probabilities of different sequences or words
* World knowledge: PTMs learn rich world knowledge from pre-training, mainly including
    * Commonsense knowledge
    * Factual knowledge

### Robustness of PTMs

Recent works have identified severe robustness problems in PTMs using adversarial examples. Current works mainly utilize model predictions, prediction probabilities, and model gradients to search for adversarial examples.

### Structural Sparsity of PTMs

Transformers suffer from over-parameterization: removing part of the attention heads can even achieve better performance. Some papers also show that performance can be improved by simply duplicating some hidden layers to increase model capacity.

### Theoretical Analysis of PTMs

It is effective to train a deep belief network by greedy layer-wise unsupervised pre-training followed by supervised fine-tuning, and **contrastive learning**, including language modeling, has become a mainstream pre-training approach. Some papers introduce the concept of latent classes, where semantically similar pairs come from the same latent class.

## Future Directions

* Architectures and pre-training methods
    * Architectures: we may need to carefully design task-specific architectures according to the type of downstream task
    * Pre-Training Tasks: a more practical direction is to design more efficient self-supervised pre-training tasks and training methods according to the capabilities of existing hardware and software
* Multilingual and multimodal pre-training
* Computational efficiency
* Theoretical foundation
* Modeledge learning: the knowledge stored in PTMs can be referred to as "modeledge", which is distinguished from the discrete symbolic knowledge formalized by human beings
    * Knowledge-Aware Tasks
    * Modeledge Storage and Management
* Cognitive and knowledgeable learning: making PTMs more knowledgeable is an important topic for the future of PTMs
    * Knowledge Augmentation
    * Knowledge Support
    * Knowledge Supervision
    * Cognitive Architecture
    * Explicit and Controllable Reasoning
* Novel applications

## Conclusion

The knowledge stored in PTMs is represented as real-valued vectors, which is quite different from the discrete symbolic knowledge formalized by human beings.
The authors name this continuous and machine-friendly knowledge **"modeledge"** and believe it is promising to capture modeledge in a more effective and efficient way, and to stimulate the modeledge stored in PTMs for specific tasks.