# Machine Translation and Encoder-Decoder Models

###### tags: `NLP` `preparatory`

From [**Chapter 11: Machine Translation and Encoder-Decoder Models**](https://web.stanford.edu/~jurafsky/slp3/11.pdf) in **Speech and Language Processing**, Dan Jurafsky and James H. Martin, *3rd ed.*

- **Machine Translation** (**MT**) is the use of computers to translate from one language to another.
- The most common current use of machine translation is for **information access**.
- Another common use of machine translation is to aid human translators.
  - MT is used to produce a draft that is then fixed up in a **post-editing** phase by human translators.
  - This task is also called **computer-aided translation** (**CAT**). CAT is commonly used in **localization**.
- The standard algorithm for MT is the **encoder-decoder network**, also called the **sequence-to-sequence** network, an architecture that can be implemented with RNNs or with Transformers.
- Recall that RNNs and Transformers can be used to do **classification** or **sequence labeling**.
- Encoder-decoder/seq2seq models are used for a different kind of sequence modeling:
  - The output sequence is a complex function of the entire input sequence;
  - We must map from a sequence of input words or tokens to a sequence of tags that are *not merely direct mappings from individual words*.
- MT is such a task: the words of the target language don't necessarily agree with the words of the source language in *number* or *order*.
- Encoder-decoder networks are very successful at handling such complicated cases of sequence mappings.
  - Aside from MT, they are also used in summarization, dialogue, semantic parsing, and many other tasks.

## Language Divergences and Typology

- Some aspects of human language seem to be universal, holding true for every language, or are statistical universals, holding true for most languages.
  - Many universals arise from the functional role of language as a communicative system by humans.
  - Every language, for example, seems to have words for referring to people, for talking about eating or drinking, for being polite or not.
  - There are also structural linguistic universals; for example, every language seems to have nouns and verbs, etc.
- Languages also **differ** in many ways, and an understanding of what causes such **translation divergences** will help us build better MT models.
- We often distinguish **idiosyncratic** and lexical differences that must be dealt with one by one
  - e.g., the word "dog" differs wildly from language to language
- from **systematic** differences that we can model in a general way
  - e.g., many languages put the verb before the direct object; others put the verb after the direct object.
- The study of these systematic cross-linguistic similarities and differences is called **linguistic typology**.
- The World Atlas of Language Structures gives many typological facts about languages.

### Word Order Typology

- Languages differ in the basic word order of verbs, subjects, and objects in simple declarative clauses.
  - For example, in **SVO** (**Subject-Verb-Object**) languages the verb tends to come between the subject and object.
    - e.g., German, French, English, and Mandarin.
  - By contrast, in **SOV** languages (Hindi and Japanese) the verb tends to come at the end of basic clauses, and Irish and Arabic are **VSO** languages.
- Two languages that share their basic word order type often have other similarities.
  - For example, **VO** languages generally have **prepositions** (English), whereas **OV** languages generally have **postpositions** (Japanese).
- *Fig. 11.1* shows examples of some word order differences. All these differences can cause problems for translation, requiring the system to do huge structural reorderings as it generates the output.

![](https://i.imgur.com/L3dV7WR.png)

### Lexical Divergences

- We also need to translate the individual words from one language to another. For any translation, the appropriate word can vary depending on the context.
  - For example, *bass* in English can appear in Spanish as the fish *lubina* or the musical instrument *bajo*.
  - In these cases, translating the words from English would require a kind of specialization, disambiguating the different uses of a word.
  - The fields of MT and Word Sense Disambiguation are closely linked.
- Sometimes one language places more grammatical constraints on word choice than another.
  - For example, English marks nouns for whether they are singular or plural, but Mandarin doesn't.
- The way that languages differ in lexically dividing up conceptual space may be more complex than this one-to-many translation problem, leading to many-to-many mappings.
  - For example, *Fig. 11.2* summarizes some of the complexities in translating English to French.

![](https://i.imgur.com/ILmcckI.png)

- Further, one language may have a **lexical gap**, where no word or phrase, short of an explanatory footnote, can express the exact meaning of a word in the other language.
  - For example, English does not have a word that corresponds neatly to Mandarin *xiao* or Japanese *oyakokoo* (English equivalents are awkward phrases like *filial piety* or *loving child*, or *good son/daughter* for both).
- Finally, languages may differ systematically in how the conceptual properties of an event are mapped onto specific words.
  - Languages can be characterized by whether direction of motion and manner of motion are marked on the verb or on the "satellites": particles, prepositional phrases, or adverbial phrases.
  - For example, a bottle floating out of a cave would be described in English with the direction marked on the particle *out*, while in Spanish the direction would be marked on the verb:

![](https://i.imgur.com/fs7besI.png)

- **Verb-framed** languages mark the direction of motion on the verb (leaving the satellites to mark the manner of motion).
  - Examples: Spanish *acercarse* 'approach', *alcanzar* 'reach', *entrar* 'enter', *salir* 'exit'
  - Languages: Japanese, Tamil, and languages in the Romance, Semitic, and Mayan language families.
- **Satellite-framed** languages mark the direction of motion on the satellite (leaving the verb to mark the manner of motion).
  - Examples: English *crawl out*, *float off*, *jump down*, *run after*
  - Languages: Chinese, English, Swedish, Russian, Hindi, and Farsi.

### Morphological Typology

- Morphologically, languages are often characterized along two dimensions of variation.
- The first is the number of **morphemes** per word, ranging from:
  - **Isolating** languages like Vietnamese and Cantonese, in which each word generally has one morpheme, to
  - **Polysynthetic** languages like Siberian Yupik ("Eskimo"), in which a single word may have very many morphemes, corresponding to a whole sentence in English.
- The second dimension is the degree to which morphemes are **segmentable**, ranging from:
  - **Agglutinative** languages like Turkish, in which morphemes have relatively clean boundaries, to
  - **Fusion** languages like Russian, in which a single affix may conflate multiple morphemes.
- Translating between languages with rich morphology requires dealing with structure below the word level.
  - For this reason, systems generally use subword models like the wordpiece or BPE models.

### Referential Density

- Languages vary along a typological dimension related to the things they tend to omit.
- Some languages, like English, require that we use an explicit pronoun when talking about a referent that is given in the discourse.
  - In other languages, however, we can sometimes omit pronouns altogether.
  - Languages that can omit pronouns are called pro-drop languages.
- Even among the pro-drop languages, there are marked differences in frequencies of omission.
  - For example, Japanese and Chinese tend to omit far more than does Spanish.
- This dimension of variation across languages is called the dimension of **referential density**.
  - Languages that tend to use more pronouns are more **referentially dense** than those that use more zeros.
  - Referentially sparse languages, like Chinese or Japanese, that require the hearer to do more inferential work to recover antecedents are also called **cold** languages.
  - Languages that are more explicit and make it easier for the hearer are called **hot** languages.
- Translating from languages with extensive pro-drop, like Chinese or Japanese, to non-pro-drop languages like English can be difficult,
  - since the model must somehow identify each zero and recover who or what is being talked about in order to insert the proper pronoun.

## The Encoder-Decoder Model

- **Encoder-decoder** networks, or **sequence-to-sequence** networks, are models capable of generating contextually appropriate, arbitrary-length output sequences.
- The key idea underlying these networks is the use of an **encoder** network that takes an input sequence and creates a contextualized representation of it, often called the **context**.
  - This representation is then passed to a decoder which generates a task-specific output sequence.
  - *Fig. 11.3* illustrates this architecture.

![](https://i.imgur.com/cXCCfhH.png)

- Encoder-decoder networks consist of three components (see the sketch after this list):
  - An **encoder** that accepts an input sequence, $x^n_1$, and generates a corresponding sequence of contextualized representations, $h^n_1$.
    - LSTMs, GRUs, convolutional networks, and Transformers can all be employed as encoders.
  - A **context vector**, $c$, which is a function of $h^n_1$, and conveys the essence of the input to the decoder.
  - A **decoder**, which accepts $c$ as input and generates an arbitrary-length sequence of hidden states $h^m_1$, from which a corresponding sequence of output states, $y^m_1$, can be obtained.
    - Just as with encoders, decoders can be realized by any kind of sequence architecture.
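A minimal NumPy sketch of the first two components, under simple assumptions: an Elman-style RNN plays the role of the encoder, and the context $c$ is just the final hidden state $h^e_n$. The function name, weights, and toy dimensions are illustrative, not taken from the chapter:

```python
import numpy as np

def rnn_encoder(x_embs, W_x, W_h, b):
    """Simple RNN encoder: h_t = tanh(W_x x_t + W_h h_{t-1} + b).
    Returns all hidden states h^e_1..h^e_n as an [n, d_hid] array."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x_embs:                 # one input embedding per source token
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

# Toy dimensions and random weights, for illustration only.
rng = np.random.default_rng(0)
d_emb, d_hid, n = 4, 5, 3
x_embs = rng.normal(size=(n, d_emb))      # embeddings of a 3-token source
W_x = rng.normal(size=(d_hid, d_emb))
W_h = rng.normal(size=(d_hid, d_hid))
b = np.zeros(d_hid)

H_enc = rnn_encoder(x_embs, W_x, W_h, b)  # h^e_1 .. h^e_n
c = H_enc[-1]                             # context vector c = h^e_n
```

In practice the encoder would be a stacked biLSTM or a Transformer, as the notes mention, but the interface is the same: a sequence in, a sequence of contextualized states plus a context vector out.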
## Encoder-Decoder with RNNs

- Recall the conditional RNN language model, in which the model is passed the prefix of $t-1$ tokens and uses the final hidden state to generate the next token at time $t$.
- Formally, if $g$ is an activation function like $\tanh$ or ReLU, a function of the input at time $t$ and the hidden state at time $t-1$, and $f$ is a softmax over the set of possible vocabulary items, then at time $t$ the output $y_t$ and the hidden state $h_t$ are computed as:

\begin{align}
h_t &= g(h_{t-1}, x_t) \\
y_t &= f(h_t)
\end{align}

- We only have to make one change to turn this language model with autoregressive generation into a translation model that can translate from a **source** text in one language to a **target** text in a second:
  - Add a sentence separation marker at the end of the source text, and then simply concatenate the target text.
- If we call the source text $x$ and the target text $y$, we are computing the probability $p(y|x)$ as follows:

\begin{align}
p(y|x) = p(y_1|x)\,p(y_2|y_1,x)\,p(y_3|y_1,y_2,x)\dots p(y_m|y_1,\dots,y_{m-1},x)
\end{align}

- *Fig. 11.4* shows the setup for a simplified version of the encoder-decoder model (without **attention**, which will be examined later).
- To translate a source text, we run it through the network, performing forward inference to generate hidden states until we get to the end of the source.
- Then we begin *autoregressive generation*, asking for a word in the context of the hidden layer from the end of the source input as well as the end-of-sentence marker.
- Subsequent words are conditioned on the previous hidden state and the embedding for the last word generated.

![](https://i.imgur.com/AO6t8Ip.png)

- We formalize and generalize this model in *Fig. 11.5*.
  - The superscripts $e$ and $d$ are used to distinguish the hidden states of the encoder and the decoder.

![](https://i.imgur.com/Lg5d07b.png)

- The elements of the network on the left process the input sequence $x$ and comprise the **encoder**.
  - Stacked architectures are the norm, where the output states from the top layer of the stack are taken as the final representation.
  - A widely used encoder design makes use of stacked biLSTMs.
- The entire purpose of the encoder is to generate a contextualized representation of the input.
  - This representation is embodied in the final hidden state of the encoder, $h^e_n$; also called $c$ for **context**, this representation is then passed to the decoder.

![](https://i.imgur.com/ANgZgyS.png)

- The **decoder** network on the right takes this state and uses it to initialize the first hidden state of the decoder.
  - That is, the first decoder RNN cell uses $c$ as its prior hidden state $h^d_0$.
- The decoder autoregressively generates a sequence of outputs, an element at a time, until an end-of-sequence marker is generated.
- The context vector $c$ is made available at each step in the decoding process by adding it as a parameter to the computation of the current hidden state, using the following equation (*Fig. 11.6*):

\begin{align}
h^d_t = g(\hat{y}_{t-1}, h^d_{t-1}, c)
\end{align}

- This is to make sure that the influence of the context vector $c$ is maintained.
- We look at the full equations for this version of the decoder in the basic encoder-decoder model, with context available at each decoding timestep.
  - Recall that $g$ is a stand-in for some flavor of RNN and $\hat{y}_{t-1}$ is the embedding for the output sampled from the softmax at the previous step:

\begin{align}
c &= h^e_n \\
h^d_0 &= c \\
h^d_t &= g(\hat{y}_{t-1}, h^d_{t-1}, c) \\
z_t &= f(h^d_t) \\
y_t &= \text{softmax}(z_t)
\end{align}

- We compute the most likely output at each time step by taking the argmax over the softmax output:

\begin{align}
\hat{y}_t = \text{argmax}_{w \in V} P(w|x, y_1 \dots y_{t-1})
\end{align}

- One way to make the model a bit more powerful is to condition the output layer $y$ not solely on the hidden state $h^d_t$ and the context $c$ but also on the output $y_{t-1}$ generated at the previous time step:

\begin{align}
y_t = \text{softmax}(\hat{y}_{t-1}, z_t, c)
\end{align}

### Training the Encoder-Decoder Model

- Encoder-decoder architectures are trained end-to-end, just as with the RNN language models.
- Each training example is a tuple of paired strings, a source and a target.
  - Concatenated with a separator token, these source-target pairs can now serve as training data.
- For MT, the training data typically consists of sets of sentences and their translations.
  - These can be drawn from standard datasets of aligned sentence pairs.
- Once we have a training set, the training itself proceeds as with any RNN-based language model.
  - The network is given the source text and then, starting with the separator token, is trained autoregressively to predict the next word (*Fig. 11.7*).

![](https://i.imgur.com/kuUs5gV.png)

- Note the differences between training (*Fig. 11.7*) and inference (*Fig. 11.4*) with respect to the outputs at each time step.
  - During inference, the decoder will tend to deviate more and more from the gold target sentence as it keeps generating more tokens,
  - since it uses its own estimated output $\hat{y}_t$ as the input for the next step, $x_{t+1}$.
- In training, therefore, it is more common to use **teacher forcing** in the decoder,
  - meaning we force the system to use the gold target token from training as the next input $x_{t+1}$, rather than the decoder output $\hat{y}_t$.
  - This speeds up training (see the decoding sketch below, which notes where teacher forcing would change the input).
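As a concrete illustration of the decoder equations above, here is a minimal NumPy sketch of greedy autoregressive decoding with the context vector fed into every step, and a comment marking where teacher forcing would change the input during training. The cell $g$ is realized here as a $\tanh$ of a linear combination, and the function name, weight shapes, and token ids are illustrative assumptions rather than anything specified in the chapter:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decode_greedy(c, E, W_y, W_h, W_c, b, U, sos_id, eos_id, max_len=20):
    """Greedy autoregressive decoding with context c at every step:
        h^d_t = g(yhat_{t-1}, h^d_{t-1}, c),   y_t = softmax(f(h^d_t)),
    where g is a tanh of a linear combination (an illustrative choice).

    c : context vector (final encoder hidden state), shape [d_hid]
    E : output-token embedding matrix, shape [|V|, d_emb]
    U : output projection f, shape [|V|, d_hid]
    """
    h = c.copy()                    # h^d_0 = c
    y_prev = E[sos_id]              # embedding of the start-of-sequence token
    output_ids = []
    for _ in range(max_len):
        h = np.tanh(W_y @ y_prev + W_h @ h + W_c @ c + b)  # h^d_t
        probs = softmax(U @ h)                             # y_t over the vocab
        y_hat = int(np.argmax(probs))                      # greedy argmax
        if y_hat == eos_id:
            break
        output_ids.append(y_hat)
        y_prev = E[y_hat]           # feed back the model's own prediction;
        # with teacher forcing (during training), we would instead feed the
        # embedding of the gold target token here, regardless of y_hat.
    return output_ids
```

In a real system all of these weight matrices and embeddings would be learned end-to-end, and beam search would typically replace the per-step argmax.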
## Attention

- The strength of the basic encoder-decoder model is its clean separation of the encoder, which builds a representation of the source text, from the decoder, which uses that representation to generate the target.
- The context vector, which is currently the final hidden state of the encoder, is thus acting as a **bottleneck** (*Fig. 11.8*).
  - It must represent absolutely everything about the meaning of the source text, since that is the only thing the decoder knows about the source text.
  - Information at the beginning of the sequence, especially for long sentences, may not be equally well represented in the context vector.

![](https://i.imgur.com/yFaKsMF.png)

- The **attention mechanism** is a solution to the bottleneck problem.
  - It is a way of allowing the decoder to get information from *all* the hidden states of the encoder, not just the last hidden state.
- In the attention mechanism, the context vector $c$ is a single vector that is a function of the hidden states of the encoder: $c = f(h^n_1)$.
  - The idea of attention is to create a single fixed-length vector $c$ by taking a weighted sum of all the encoder hidden states $h^n_1$.
  - We can't simply use all the encoder hidden state vectors directly, since their number varies with the size of the input.
  - The weights are used to focus on a particular part of the source text that is relevant for the token currently being produced by the decoder.
- The context vector produced by the attention mechanism is thus dynamic, different for each token in decoding.

### Attention Mechanism

- Attention replaces the static context vector with one that is dynamically derived from the encoder hidden states at each point during decoding.
  - This context vector, $c_i$, is generated anew with each decoding step $i$.
  - It takes all of the encoder hidden states into account in its derivation.
- We then make this context available during decoding by conditioning the computation of the current decoder hidden state on it (along with the prior hidden state and the previous decoder-generated output) (*Fig. 11.9*):

\begin{align}
h^d_i = g(\hat{y}_{i-1}, h^d_{i-1}, c_i)
\end{align}

![](https://i.imgur.com/LbBMyUk.png)

- The first step in computing $c_i$ is to compute how *relevant* each encoder state is to the decoder state captured in $h^d_{i-1}$.
  - We capture relevance by computing, at each step $i$ during decoding, a $\text{score}(h^d_{i-1}, h^e_j)$ for each encoder state $j$.
- The simplest score, called **dot-product attention**, implements relevance as similarity: it measures the similarity between the decoder and encoder hidden states by computing the dot product between them:

\begin{align}
\text{score}(h^d_{i-1}, h^e_j) = h^d_{i-1} \cdot h^e_j
\end{align}

- The vector of scores across all the encoder hidden states gives us the relevance of each encoder state to the current step of the decoder.
- To make use of these scores, we normalize them with a softmax to create a vector of weights, $\alpha_{ij}$, that tells us the proportional relevance of each encoder hidden state $j$ to the prior decoder hidden state, $h^d_{i-1}$:

\begin{align}
\alpha_{ij} &= \text{softmax}(\text{score}(h^d_{i-1}, h^e_j) \; \forall j \in e) \\
&= \frac{\exp(\text{score}(h^d_{i-1}, h^e_j))}{\sum_k \exp(\text{score}(h^d_{i-1}, h^e_k))}
\end{align}

- Finally, given the distribution in $\alpha$, we can compute a fixed-length context vector for the current decoder state by taking a weighted average over all the encoder hidden states:

\begin{align}
c_i = \sum_j \alpha_{ij} h^e_j
\end{align}

- With this, we finally have a fixed-length context vector that takes into account information from the entire encoder state and is dynamically updated to reflect the needs of the decoder at each step of decoding.
- *Fig. 11.10* illustrates an encoder-decoder network with attention, focusing on the computation of one context vector $c_i$.

![](https://i.imgur.com/L8nKU06.png)

- It's also possible to create more sophisticated scoring functions for attention models.
  - Instead of simple dot-product attention, we can get a more powerful scoring function that computes the relevance of each encoder hidden state to the decoder hidden state by parameterizing the score with its own set of weights, $W_s$:

\begin{align}
\text{score}(h^d_{i-1}, h^e_j) = h^d_{i-1} W_s h^e_j
\end{align}

- The weights $W_s$, which are then trained during normal end-to-end training, give the network the ability to learn which aspects of similarity between the decoder and encoder states are important to the current application.
- This bilinear model also allows the encoder and decoder to use vectors of different dimensionality.
- Simple dot-product attention, by contrast, requires that the encoder and decoder hidden states have the same dimensionality (both scoring functions are sketched below).
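To make the computation concrete, here is a minimal NumPy sketch of a single attention step following the equations above: scores, softmax weights $\alpha_{ij}$, and the weighted sum $c_i$, with an optional bilinear score via a learned matrix `W_s`. Function and variable names are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of scores."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(h_dec_prev, H_enc, W_s=None):
    """One attention step: context c_i for decoder state h^d_{i-1}.

    h_dec_prev : previous decoder hidden state h^d_{i-1}, shape [d_dec]
    H_enc      : all encoder hidden states h^e_1..h^e_n, shape [n, d_enc]
    W_s        : optional [d_dec, d_enc] matrix; if given, uses the bilinear
                 score h^d_{i-1} W_s h^e_j instead of the plain dot product.
    """
    if W_s is None:
        # Dot-product attention: requires d_dec == d_enc.
        scores = H_enc @ h_dec_prev
    else:
        # Bilinear score: allows different encoder/decoder dimensionalities.
        scores = H_enc @ (W_s.T @ h_dec_prev)
    alpha = softmax(scores)          # weights alpha_{ij}, one per encoder state
    c_i = alpha @ H_enc              # c_i = sum_j alpha_{ij} h^e_j
    return c_i, alpha

# Toy usage with made-up sizes:
rng = np.random.default_rng(0)
H_enc = rng.normal(size=(6, 8))      # 6 source positions, d_enc = 8
h_dec = rng.normal(size=8)           # d_dec = 8 (matches d_enc for dot product)
c_i, alpha = attention_context(h_dec, H_enc)
```

With `W_s=None` this reduces to pure dot-product attention, so the encoder and decoder states must share a dimensionality; passing a `W_s` of shape `[d_dec, d_enc]` lifts that restriction, as noted above.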
