Machine_S
> [Paper link](https://arxiv.org/pdf/2106.07139.pdf) | [Note link](https://zhuanlan.zhihu.com/p/429634259) | Aug 2021

## Abstract

It is now the consensus of the AI community to **adopt PTMs as the backbone for downstream tasks rather than learning models from scratch.** These breakthroughs are driven by the surge in computational power and the increasing availability of data, and they point towards four important directions:

1. Designing effective architectures
2. Utilizing rich contexts
3. Improving computational efficiency
4. Conducting interpretation and theoretical analysis

The paper presents the authors' views on these directions, in the hope of inspiring and advancing future studies of PTMs.

## Introduction

Deep neural networks are data hungry: without sufficient training data they overfit easily, and manually annotating large-scale data is expensive and time-consuming.

One milestone in addressing this issue is **transfer learning**, which makes it possible to solve new problems with very few samples. Transfer learning has two phases:

1. A pre-training phase that captures knowledge from one or more source tasks
2. A fine-tuning phase that transfers the captured knowledge to target tasks

<center>
<img src = "https://i.imgur.com/dNYicOI.png">
</center>
<br>

Because this approach eases the data-hunger problem, it was soon widely applied in computer vision (CV), e.g. to CNN models. This triggered the first wave of exploring pre-trained models (PTMs) in the era of deep learning. The NLP community then also became aware of the potential of PTMs.

<center>
<img src = "https://i.imgur.com/jpZWPFX.png">
</center>
<br>

To take full advantage of large-scale unlabeled corpora to provide versatile linguistic knowledge for NLP tasks, the NLP community adopts **self-supervised learning.**

The motivation of self-supervised learning is to leverage intrinsic correlations in the text as supervision signals instead of human supervision. Through self-supervised learning, tremendous amounts of unlabeled textual data can be used to capture versatile linguistic knowledge without a labor-intensive workload. Self-supervised learning obtains supervisory signals from the data itself, often leveraging the underlying structure in the data.

:::info
**History of NLP with PTMs**
1. Pre-training shallow networks to capture the semantic meanings of words (Word2Vec/GloVe), which are limited in representing polysemous words in different contexts
2. Pre-training RNNs to provide contextualized word embeddings, which are still limited by model size and depth
3. Transformers make it feasible to train very deep neural models for NLP tasks (BERT/GPT)
:::

Although PTMs have improved model performance on various AI tasks, several fundamental issues remain:
- It is still unclear what is hidden in the huge number of model parameters
- The computational cost of training is huge

## Background

In this section, the paper introduces the development of pre-training in the AI spectrum, from early **supervised pre-training** to current **self-supervised pre-training**.

### Transfer Learning and Supervised Pre-Training

Early pre-training efforts mainly involve **transfer learning**, which aims to capture important knowledge from multiple source tasks and then apply that knowledge to a target task.

Two pre-training approaches are widely explored in transfer learning:
1. **Feature transfer**: pre-trains effective feature representations to pre-encode knowledge across domains and tasks
2. **Parameter transfer**: follows the intuitive assumption that source tasks and target tasks can share model parameters or prior distributions of hyper-parameters, and then transfers the knowledge by fine-tuning the pre-trained parameters on the data of the target tasks

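
As a rough illustration of parameter transfer, the sketch below pre-trains a one-variable linear model on a source task and then fine-tunes the same parameters on two target samples. The two tasks, the `train` and `mse` helpers, and all hyper-parameters are made up for illustration; none of them come from the paper.

```python
def mse(data, w, b):
    """Mean squared error of the linear model y = w * x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(data, w, b, lr, steps):
    """Plain gradient descent on the MSE, starting from the given (w, b)."""
    n = len(data)
    for _ in range(steps):
        grad_w = 2 / n * sum((w * x + b - y) * x for x, y in data)
        grad_b = 2 / n * sum((w * x + b - y) for x, y in data)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical source task with plenty of data: y = 2x.
source = [(x, 2.0 * x) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
# Hypothetical target task with only two labeled samples: y = 2x + 0.5.
target_train = [(1.0, 2.5), (2.0, 4.5)]
target_test = [(0.0, 0.5), (3.0, 6.5)]

# Pre-training phase: learn parameters on the source task.
w_pre, b_pre = train(source, 0.0, 0.0, lr=0.1, steps=200)

# Fine-tuning phase: a few steps on the target data, starting either from
# the pre-trained parameters or from scratch.
w_ft, b_ft = train(target_train, w_pre, b_pre, lr=0.05, steps=5)
w_sc, b_sc = train(target_train, 0.0, 0.0, lr=0.05, steps=5)

loss_transfer = mse(target_test, w_ft, b_ft)
loss_scratch = mse(target_test, w_sc, b_sc)
print(loss_transfer, loss_scratch)
```

With the same five fine-tuning steps, the transferred start reaches a lower held-out error in this toy setup, because the source and target tasks share the same slope.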
### Self-Supervised Learning and Self-Supervised Pre-Training

<center>
<img src = "https://i.imgur.com/t1PXmoS.png">
Transfer learning can be categorized under four sub-settings (colored terms)
</center>
<br>

Although supervised pre-training methods such as CoVe have achieved promising results on NLP tasks, it is nearly impossible to annotate a textual dataset as large as ImageNet. Hence, applying self-supervised learning to unlabeled data becomes the best choice for pre-training models for NLP tasks.

After **Transformers** were proposed to deal with sequential data, PTMs for NLP tasks entered a new stage, because Transformers make it possible to train deeper language models than conventional CNNs and RNNs.

## Transformer and Representative PTMs

This section introduces two landmark Transformer-based PTMs, GPT and BERT. All subsequent PTMs are variants of these two models. The final part of this section gives a brief review of typical variants after GPT and BERT to show the recent development of PTMs.

<center>
<img src = "https://i.imgur.com/GUQykZ5.png">
</center>

### Transformer

Each **encoder** block is composed of a multi-head self-attention layer and a position-wise feed-forward layer. Each **decoder** block has an additional cross-attention layer, since the decoder needs to consider the output of the encoder as the context for generation.

Between neural layers, **residual connections** and **layer normalization** are employed, making it possible to train a deep Transformer.

<center>
<img src = "https://i.imgur.com/lHUGkCD.png">
<a href = "https://pytorch.org/tutorials/beginner/transformer_tutorial.html">LANGUAGE MODELING WITH NN.TRANSFORMER AND TORCHTEXT</a>
</center>

#### Attention Layer

**Self-attention layers are the key to the success of Transformer.**

Intuitively, $\mathcal{Q}$ is the set of vectors to calculate the attention for, and $\mathcal{K}$ is the set of vectors to calculate the attention against.

Given a query set $\mathcal{Q} = \{ q_1, \ldots, q_n \}$, a key set $\mathcal{K} = \{ k_1, \ldots, k_m \}$, and a value set $\mathcal{V} = \{ v_1, \ldots, v_m \}$, the attention is defined as

$$
\begin{aligned}
& \left\{\mathbf{h}_1, \ldots, \mathbf{h}_n\right\}=\operatorname{ATT}(\mathcal{Q}, \mathcal{K}, \mathcal{V}) \\
& \mathbf{h}_i=\sum_{j=1}^m a_{ij} \mathbf{v}_j \\
& a_{ij}=\frac{\exp \left(\operatorname{ATT-Mask}\left(\frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}}\right)\right)}{\sum_{l=1}^m \exp \left(\operatorname{ATT-Mask}\left(\frac{\mathbf{q}_i \cdot \mathbf{k}_l}{\sqrt{d_k}}\right)\right)}
\end{aligned}
$$

where $d_k$ is the dimension of the query and key vectors. The masking function $\operatorname{ATT-Mask}(\cdot)$ restricts which key-value pairs each query vector can attend to: if we do not want $\mathbf{q}_i$ to attend to $\mathbf{k}_j$, then $\operatorname{ATT-Mask}(x)=-\infty$; otherwise $\operatorname{ATT-Mask}(x)=x$.

Packing the vectors into matrices, the attention can be simplified to

$$
\begin{aligned}
& \mathbf{H}=\operatorname{ATT}(\mathbf{Q}, \mathbf{K}, \mathbf{V})=\mathbf{A V} \\
& \mathbf{A}=\operatorname{Softmax}\left(\operatorname{ATT-Mask}\left(\frac{\mathbf{Q K}^{\top}}{\sqrt{d_k}}\right)\right)
\end{aligned}
$$

where $\operatorname{Softmax}(\cdot)$ is applied in a row-wise manner, $\mathbf{A} \in \mathbb{R}^{n \times m}$ is the attention matrix, and $\mathbf{H} \in \mathbb{R}^{n \times d_v}$ is the result.

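
The masked attention above can be sketched in plain Python as follows. This is a minimal illustration rather than an efficient implementation: the `attention` and `softmax` helpers, the boolean `mask` convention (`False` means "do not attend"), and the toy inputs are assumptions made here, not part of the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row; masked entries are -inf."""
    m = max(row)  # assumes at least one entry in the row is not masked
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, mask=None):
    """Masked scaled dot-product attention on lists of lists.

    Q is n x d_k, K is m x d_k, V is m x d_v.
    mask[i][j] is False when q_i must not attend to k_j (ATT-Mask = -inf).
    Returns (H, A): the n x d_v result and the n x m attention matrix.
    """
    d_k = len(K[0])
    scores = [
        [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d_k) for k in K]
        for q in Q
    ]
    if mask is not None:
        scores = [
            [s if keep else -math.inf for s, keep in zip(row, mask_row)]
            for row, mask_row in zip(scores, mask)
        ]
    A = [softmax(row) for row in scores]  # row-wise softmax
    d_v = len(V[0])
    H = [[sum(a * v[c] for a, v in zip(row, V)) for c in range(d_v)] for row in A]
    return H, A


# Toy self-attention input with a causal mask: position 0 cannot see position 1.
Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
causal_mask = [[True, False], [True, True]]
H, A = attention(Q, K, V, causal_mask)
print(A)
print(H)
```

Because the only unmasked key for the first query is the first one, its attention row collapses to all weight on that key, and the first output row equals the first value vector.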
<center>
<img src = "https://i.imgur.com/pSjX6G9.png" width="100%">
<a href = "https://mkh800.medium.com/%E7%AD%86%E8%A8%98-attention-%E5%8F%8A-transformer-%E6%9E%B6%E6%A7%8B%E7%90%86%E8%A7%A3-c9c5479fdc8a">Scaled Dot-Product Attention & Multi-Head Attention</a>
</center>
<br>

Transformer applies a multi-head attention layer defined as follows:

$$
\begin{aligned}
\mathbf{H} & =\operatorname{MH-ATT}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) \\
& =\operatorname{Concat}\left(\mathbf{H}_1, \ldots, \mathbf{H}_h\right) \mathbf{W}^O \\
\mathbf{H}_i & =\operatorname{ATT}\left(\mathbf{Q} \mathbf{W}_i^Q, \mathbf{K} \mathbf{W}_i^K, \mathbf{V} \mathbf{W}_i^V\right)
\end{aligned}
$$

#### Position-Wise Feed-Forward Layer

Each Transformer block also contains a position-wise feed-forward layer. Given the packed input matrix $\mathbf{X} \in \mathbb{R}^{n \times d_i}$ representing a set of input vectors, where $d_i$ is the vector dimension, a position-wise feed-forward layer is defined as

$$
\mathbf{H}=\operatorname{FFN}(\mathbf{X})=\sigma\left(\mathbf{X} \mathbf{W}_1+\mathbf{b}_1\right) \mathbf{W}_2+\mathbf{b}_2,
$$

where $\sigma(\cdot)$ is the activation function (usually ReLU).

#### Residual Connection and Normalization

Residual connections and layer normalization make it possible for the Transformer architecture to be deep. Formally, given a neural layer $f(\cdot)$, the **residual connection** and **normalization** layer is defined as

$$
\mathbf{H}=\operatorname{A\&N}(\mathbf{X})=\operatorname{LayerNorm}(f(\mathbf{X})+\mathbf{X})
$$

where $\operatorname{LayerNorm}(\cdot)$ denotes the layer normalization operation.

:::info
**Applications of Attention in the Model**
* In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].
* The encoder contains self-attention layers. In a self-attention layer, all of the keys, values and queries come from the same place, in this case the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
* Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. Leftward information flow in the decoder must be prevented to preserve the auto-regressive property. This is implemented inside scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax that correspond to illegal connections.
:::

**Transformer also serves as the backbone neural structure for the subsequently derived PTMs.** Next, the paper introduces the two landmarks that opened the door to the era of large-scale self-supervised PTMs, GPT and BERT.

**In general, GPT is good at natural language generation, while BERT focuses more on natural language understanding.**

### GPT

GPT uses the Transformer decoder as its backbone. Since GPT uses autoregressive language modeling as the pre-training objective, the cross-attention in the original Transformer decoder is removed.

Formally, given a corpus consisting of tokens $\mathcal{X}=\left\{x_0, x_1, \ldots, x_n, x_{n+1}\right\}$, GPT applies a standard language modeling objective by maximizing the following log-likelihood:

$$
\mathcal{L}(\mathcal{X})=\sum_{i=1}^{n+1} \log P\left(x_i \mid x_{i-k}, \ldots, x_{i-1} ; \Theta\right)
$$

where $k$ is the window size, the probability $P$ is modeled by the Transformer decoder with parameters $\Theta$, $x_0$ is the special token $[\mathrm{CLS}]$, and $x_{n+1}$ is the special token $[\mathrm{SEP}]$.

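
To make the log-likelihood above concrete, here is a small sketch that sums $\log P(x_i \mid x_{i-k}, \ldots, x_{i-1})$ over a token sequence. The `log_likelihood` helper and the `uniform_model` stand-in are hypothetical: a real GPT would replace `uniform_model` with a Transformer decoder that actually conditions on the context.

```python
import math


def log_likelihood(tokens, next_token_probs, k):
    """Sum of log P(x_i | x_{i-k}, ..., x_{i-1}) for i = 1 .. len(tokens) - 1.

    next_token_probs(context) must return a dict mapping each vocabulary
    token to its probability given the context window.
    """
    total = 0.0
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - k):i]  # at most k previous tokens
        probs = next_token_probs(context)
        total += math.log(probs[tokens[i]])
    return total


VOCAB = ["[CLS]", "the", "cat", "[SEP]"]


def uniform_model(context):
    """Hypothetical stand-in for the decoder: ignores the context entirely."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}


corpus = ["[CLS]", "the", "cat", "[SEP]"]
ll = log_likelihood(corpus, uniform_model, k=2)
print(ll)
```

Pre-training adjusts $\Theta$ to increase this quantity; the uniform model is only a baseline in which every one of the three predicted tokens contributes $\log \frac{1}{4}$.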
<center>
<img src = "https://i.imgur.com/MsIH2k6.png">
</center><br>

GPT is adapted to specific tasks by fine-tuning, using the pre-trained parameters of GPT as the starting point for downstream tasks.

### BERT

BERT is also adapted to specific tasks in two separate stages, **pre-training** and **fine-tuning**.

<center>
<img src = "https://i.imgur.com/F1EaYFL.png">
</center><br>

In the **pre-training phase**, BERT applies **autoencoding** language modeling rather than the autoregressive language modeling used in GPT. Tokens are randomly replaced with a special token $[\mathrm{MASK}]$ and the model is trained to predict them (masked language modeling, MLM). In this way, BERT learns a deep bidirectional representation of all tokens.

Formally, given a corpus consisting of tokens $\mathcal{X}=\left\{x_0, x_1, \ldots, x_n, x_{n+1}\right\}$, BERT randomly masks $m$ tokens in $\mathcal{X}$ and then maximizes the following log-likelihood:

$$
\mathcal{L}(\mathcal{X})=\sum_{i=1}^m \log P\left([\mathrm{MASK}]_i=y_i \mid \tilde{\mathcal{X}} ; \Theta\right)
$$

where the probability $P$ is modeled by the Transformer encoder with parameters $\Theta$, $\tilde{\mathcal{X}}$ is the result of masking some tokens in $\mathcal{X}$, $[\mathrm{MASK}]_i$ is the $i$-th masked position, and $y_i$ is the original token at this position.

Besides MLM, the objective of **next sentence prediction (NSP)** is also adopted to capture discourse relationships between sentences for downstream tasks that involve multiple sentences.

The input schema is two sentences concatenated with the special token $[\mathrm{SEP}]$, which can represent:
1. Sentence pairs in paraphrase
2. Hypothesis-premise pairs in entailment
3. Question-passage pairs in question answering
4. A single sentence for text classification or sequence tagging

For the output, the representation of $[\mathrm{CLS}]$ can be fed into an extra layer for classification.

### After GPT and BERT

* [RoBERTa](https://blog.csdn.net/fengxinlinux/article/details/109447004)
    * Removes the NSP task (NSP was found to be of little use for training BERT)
    * More training steps, with a bigger batch size and more data
    * Longer training sentences
    * Dynamically changes the $[\mathrm{MASK}]$ pattern
* [ALBERT](https://zhuanlan.zhihu.com/p/84273154)
    * Factorizes the input word embedding matrix into two smaller ones
    * Enforces parameter sharing between all Transformer layers to significantly reduce the number of parameters
    * Proposes the sentence order prediction (SOP) task as a substitute for BERT's NSP task

<center>
<img src = "https://i.imgur.com/2X7GfZF.png" width = "100%">
</center>

## Designing Effective Architectures

In this section, the paper dives deeper into the PTMs that came after BERT.

### Unified Sequence Modeling

NLP covers versatile downstream tasks and applications:
- Natural language understanding
- Open-ended language generation
- Non-open-ended language generation

Recently, however, the boundary between understanding and generation has become vague.

* Combining Autoregressive and Autoencoding Modeling
    * [XLNet](https://blog.csdn.net/u012526436/article/details/93196139): proposes permuted language modeling
    * [MPNet](https://zhuanlan.zhihu.com/p/197675066): amends XLNet's discrepancy that the model does not know the sentence length during pre-training but does know it in downstream tasks
    * [UniLM](https://www.cnblogs.com/gczr/p/12113434.html)
    * [GLM](https://zhuanlan.zhihu.com/p/579645487)
* Applying Generalized Encoder-Decoder
  Earlier models could not fill in blanks of variable length.
    * [MASS](https://blog.csdn.net/ljp1919/article/details/90312229): introduces the masked-prediction strategy
    * [T5](https://zhuanlan.zhihu.com/p/88377084): masks a variable-length span of text with only one mask token and asks the decoder to recover the whole masked sequence
    * [BART](https://blog.csdn.net/u011150266/article/details/117742695): corrupts the source sequence with multiple operations such as truncation, deletion...

:::info
**Challenges**
1. An encoder-decoder introduces many more parameters than a single encoder or decoder
2. These models do not perform very well on natural language understanding
:::

### Cognitive-Inspired Architectures

To bring models closer to the human cognitive system, they can be improved with maintainable working memory and sustainable long-term memory.

* Maintainable Working Memory: A natural problem of Transformer is its fixed window size and quadratic space complexity, which significantly hinders its application to long document understanding and generation.
    * [Transformer-XL](https://zhuanlan.zhihu.com/p/70745925): segment-level recurrence and relative positional encoding
    * [CogQA](https://blog.csdn.net/m0_46522688/article/details/114338979): maintains a cognitive graph during multi-hop reading
    * [CogLTX](https://zhuanlan.zhihu.com/p/304764328): leverages a MemRecall language model to select the sentences that should be maintained in the working memory, plus task-specific modules for answering or classification
* Sustainable Long-Term Memory: The success of GPT-3 shows that Transformers can memorize, but how?
    * Replace the feed-forward networks in a Transformer layer with large key-value memory networks
    * [REALM](https://zhuanlan.zhihu.com/p/360635601): constructs a sustainable external memory for Transformers
    * [RAG](https://blog.csdn.net/qq_40212975/article/details/109046150): extends the masked pre-training to autoregressive generation

### More Variants of Existing PTMs

Besides unifying sequence modeling and constructing cognitive-inspired architectures, most current studies focus on optimizing BERT's architecture to **boost language models' performance on natural language understanding.**

## Utilizing Multi-Source Data

In this section, the paper introduces some typical PTMs that take advantage of multi-source heterogeneous data.

### Multilingual Pre-Training

Some researchers found that training one model on several languages can give even better benchmark performance than training several monolingual models.

* Multilingual LSTMs: learn through parameter sharing
* WGAN: learns language-agnostic constraints by decoupling language representations into language-specific and language-agnostic representations

However, the works above each focus on a specific task. To generalize, it is feasible to pre-train on self-supervised tasks and then fine-tune on specific downstream tasks.

* Multilingual tasks can be divided into two types according to task objectives:
    1. Understanding tasks: sentence-level or word-level classification
    2. Generation tasks
- Related models
    - [multilingual BERT (mBERT)](https://zhuanlan.zhihu.com/p/353514133): using multilingual data enables the model to learn cross-lingual representations
    - [XLM-R](https://blog.csdn.net/ljp1919/article/details/103206663): builds a non-parallel dataset called CC-100 that is bigger than previous ones, and achieves better performance

However, the multilingual masked language modeling (MMLM) task cannot make good use of parallel corpora, even though parallel corpora are quite important for some NLP tasks such as machine translation.
XLM therefore leverages bilingual sentence pairs to perform the **translation language modeling (TLM)** task. Compared with MLM, TLM requires models to predict the masked tokens based on the bilingual contexts.

- Similar works with TLM
    - [Unicoder](https://blog.csdn.net/gjh1716718326/article/details/122085422): proposes the CLWR and CLPC tasks, which enable the model to learn word-level alignments between different languages
    - ALM: automatically generates code-switched sequences from parallel sentences and performs MLM on them
    - InfoXLM, HICTL, ERNIE-M

### Multimodal Pre-Training

Modalities, such as audio, video, image and text, refer to how something happens or is experienced.

Existing cross-modal PTMs mainly focus on:
1. Improving model architecture
2. Utilizing more data
3. Designing better pre-training tasks

Image-text PTMs represent visual and textual content in a unified semantic space:
- Two-stream: [ViLBERT](https://blog.csdn.net/csdn_tclz/article/details/109448343), [LXMERT](https://blog.csdn.net/xiasli123/article/details/104166051)
- Single-stream: VisualBERT, Unicoder-VL, B2T2

> Some related works on PTMs with images are skipped here

For video and audio PTMs: VideoBERT, [SpeechBERT](https://www.jianshu.com/p/6a6fe370a964)

### Knowledge-Enhanced Pre-Training

This subsection categorizes external knowledge by its format and introduces several methods that attempt to combine knowledge with PTMs (integrating entity and relation embeddings, or their alignments with the text).

## Improving Computational Efficiency

In this section, the paper introduces how to improve computational efficiency from the following three aspects.

### System-Level Optimization

System-level optimization methods are often model-agnostic and do not change the underlying learning algorithms. Therefore, they are widely used in training large-scale PTMs.

* Single-Device Optimization: reduces the redundant representation of floating-point numbers
* Multi-Device Optimization: data parallelism is preferred as long as it can cope with the excessive memory requirement

### Efficient Pre-Training

* Training Methods
* Model Architectures

### Model Compression

* Parameter Sharing
* Model Pruning: cuts off some useless parts of PTMs to accelerate them while maintaining performance
* Knowledge Distillation: trains a small student model to reproduce the behavior of a large teacher model
* Model Quantization: compresses higher-precision floating-point parameters into lower-precision floating-point ones

## Interpretation and Theoretical Analysis

Beyond the superior performance of PTMs on various NLP tasks, researchers also try to interpret the behavior of PTMs.

### Knowledge of PTMs

* Linguistic knowledge
    * **Representation Probing**: fix the parameters of PTMs and train a new linear layer on the hidden representations of PTMs for a specific probing task
    * Representation Analysis: use the hidden representations of PTMs to compute statistics such as distances or similarities
    * Attention Analysis: compute statistics about attention matrices, which is more suitable for discovering the hierarchical structure of texts
    * Generation Analysis: use language models to directly estimate the probabilities of different sequences or words
* World knowledge
  PTMs learn rich world knowledge from pre-training, mainly including:
    * Commonsense knowledge
    * Factual knowledge

### Robustness of PTMs

Recent works have used adversarial examples to identify a severe robustness problem in PTMs. Current works mainly use the model predictions, prediction probabilities, and model gradients to search for adversarial examples.

### Structural Sparsity of PTMs

Transformer suffers from over-parameterization. Removing some of the attention heads can even lead to better performance. Some papers also show that performance can be improved by simply duplicating some hidden layers to increase the model capacity.

### Theoretical Analysis of PTMs

It has been shown to be effective to train a deep belief network by greedy layer-wise unsupervised pre-training followed by supervised fine-tuning. More recently, **contrastive learning**, which includes language modeling, has become the mainstream approach.

Some papers introduce the concept of latent classes, where semantically similar pairs come from the same latent class.

## Future Directions

* Architectures and pre-training methods
    * Architectures: we may need to carefully design task-specific architectures according to the type of downstream task
    * Pre-Training Tasks: a more practical direction is to design more efficient self-supervised pre-training tasks and training methods according to the capabilities of existing hardware and software
* Multilingual and multimodal pre-training
* Computational efficiency
* Theoretical foundation
* Modeledge learning
  The knowledge stored in PTMs can be referred to as “modeledge”, to distinguish it from the discrete symbolic knowledge formalized by human beings.
    * Knowledge-Aware Tasks
    * Modeledge Storage and Management
* Cognitive learning
  Making PTMs more knowledgeable is an important topic for the future of PTMs.
    * Knowledge Augmentation
    * Knowledge Support
    * Knowledge Supervision
    * Cognitive Architecture
    * Explicit and Controllable Reasoning
* Novel applications

## Conclusion

The knowledge stored in PTMs is represented as real-valued vectors, which is quite different from the discrete symbolic knowledge formalized by human beings. The authors name this continuous and machine-friendly knowledge **“modeledge”** and believe that it is promising to capture modeledge in a more effective and efficient way and to stimulate it for specific tasks.
