# Machine Translation and Encoder-Decoder Models
###### tags: `NLP` `preparatory`
From [**Chapter 11: Machine Translation and Encoder-Decoder Models**](https://web.stanford.edu/~jurafsky/slp3/11.pdf) in **Speech and Language Processing**, Dan Jurafsky and James H. Martin, *3rd ed.*
- **Machine Translation** (**MT**) is the use of computers to translate from one language to another.
- The most common current use of machine translation is for **information access**.
- Another common use of machine translation is to aid human translators.
- MT is used to produce a draft that is then fixed up in a **post-editing** phase by human translators.
- The task is also called **computer-aided translation** (**CAT**). CAT is commonly used in **localization**.
- The standard algorithm for MT is the **encoder-decoder network**, also called the **sequence to sequence** network, an architecture that can be implemented with RNNs or with Transformers.
- Recall that RNNs and Transformers can be used to do **classification**, or **sequence labeling**.
- Encoder-decoder/seq2seq models are used for a different kind of sequence modeling:
- The output sequence is a complex function of the entire input sequence;
- We must map from a sequence of input words or tokens to a sequence of tags that are *not merely direct mappings from individual words*.
- MT is such a task: the words of the target language don't necessarily agree with the words of the source language in *number* or *order*.
- Encoder-decoder networks are very successful at handling complicated cases of sequence mappings.
- Aside from MT, they are also used in summarization, dialogue, semantic parsing, and many other tasks.
## Language Divergences and Typology
- Some aspects of human language seem to be universal, holding true for every language, or are statistical universals, holding true for most languages.
- Many universals arise from the functional role of language as a communicative system by humans.
- Every language, for example, seems to have words for referring to people, for talking about eating or drinking, for being polite or not.
- There are also structural linguistic universals;
- For example, every language seems to have nouns and verbs, etc.
- Languages also **differ** in many ways, and an understanding of what causes such **translation divergences** will help us build better MT models.
- We often distinguish the **idiosyncratic** and lexical differences that must be dealt with one by one,
- e.g., the word "dog" differs wildly from language to language
- From **systematic** differences that we can model in a general way
- e.g., some languages put the verb before the direct object; others put the verb after the direct object.
- The study of these systematic cross-linguistic similarities and differences is called **linguistic typology**.
- The World Atlas of Language Structures gives many typological facts about languages.
### Word Order Typology
- Languages differ in the basic word order of verbs, subjects, and objects in simple declarative clauses.
- For example, in **SVO** (**Subject-Verb-Object**) languages the verb tends to come between the subject and object.
- German, French, English, and Mandarin.
- By contrast, in **SOV** languages (Hindi and Japanese) the verb tends to come at the end of basic clauses, and Irish and Arabic are **VSO** languages.
- Two languages that share their basic word order type often have other similarities.
- For example, **VO** languages generally have **prepositions** (English), whereas **OV** languages generally have **postpositions** (Japanese).
- *Fig. 11.1* shows examples of some word order differences. All these differences can cause problems for translation, requiring the system to do huge structural reorderings as it generates the output.

### Lexical Divergences
- We also need to translate the individual words from one language to another. For any translation, the appropriate word can vary depending on the context.
- For example, *bass* in English can appear in Spanish as the fish *lubina* or the musical instrument *bajo*.
- In these cases, translating the words from English would require a kind of specialization, disambiguating the different uses of a word.
- The fields of MT and Word Sense Disambiguation are closely linked.
- Sometimes one language places more grammatical constraints on word choice than another.
- For example, English marks nouns for whether they are singular or plural, but Mandarin doesn't.
- The way that languages differ in lexically dividing up conceptual space may be more complex than the above one-to-many translation problem, leading to many-to-many mappings.
- For example, Fig. 11.2 summarizes some of the complexities in translating English to French.

- Further, one language may have a **lexical gap**, where no word or phrase, short of an explanatory footnote, can express the exact meaning of a word in the other language.
- For example, English does not have a word that corresponds neatly to Mandarin xiao or Japanese oyakokoo (English equivalents are awkward phrases like *filial piety* or *loving child*, or *good son/daughter* for both).
- Finally, languages may differ systematically in how the conceptual properties of an event are mapped onto specific words.
- Languages can be characterized by whether direction of motion and manner of motion are marked on the verb or on the "satellites": particles, prepositional phrases, or adverbial phrases.
- For example, a bottle floating out of a cave would be described in English with the direction marked on the particle *out*, while in Spanish the direction would be marked on the verb:

- Verb-framed languages mark the direction of motion on the verb (leaving the satellites to mark the manner of motion).
- Example: Spanish *acercarse* 'approach', *alcanzar* 'reach', *entrar* 'enter', *salir* 'exit'
- Languages: Japanese, Tamil, and languages in the Romance, Semitic, and Mayan language families.
- Satellite-framed languages mark the direction of motion on the satellite (leaving the verb to mark the manner of motion).
- Example: English *crawl out*, *float off*, *jump down*, *run after*
- Languages: Chinese, English, Swedish, Russian, Hindi, and Farsi.
### Morphological Typology
- Morphologically, languages are often characterized along two dimensions of variation.
- The first is the number of **morphemes** per word, ranging from:
- **Isolating** languages like Vietnamese and Cantonese, in which each word generally has one morpheme, to
- **Polysynthetic** languages like Siberian Yupik ("Eskimo"), in which a single word may have very many morphemes, corresponding to a whole sentence in English.
- The second dimension is the degree to which morphemes are **segmentable**, ranging from:
- **Agglutinative** languages like Turkish, in which morphemes have relatively clean boundaries, to
- **Fusion** languages like Russian, in which a single affix may conflate multiple morphemes.
- Translating between languages with rich morphology requires dealing with structure below the word level.
- For this reason, systems generally use subword models like the wordpiece or BPE models.
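As a rough illustration of how such subword vocabularies are built, below is a minimal sketch of BPE-style merge learning. The toy corpus, the `</w>` end-of-word convention, and the number of merges are made-up assumptions for illustration, not details from the chapter.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    corpus: dict mapping a word (tuple of symbols) to its frequency.
    Returns the list of learned merges, in order.
    """
    vocab = dict(corpus)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = {}
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Toy morphologically related word forms with hypothetical counts.
corpus = {("l", "o", "w", "</w>"): 5,
          ("l", "o", "w", "e", "r", "</w>"): 2,
          ("l", "o", "w", "e", "s", "t", "</w>"): 3}
print(learn_bpe_merges(corpus, 4))
```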
### Referential density
- Languages vary along a typological dimension related to the things they tend to omit.
- Some languages, like English, require that we use an explicit pronoun when talking about a referent that is given in the discourse.
- In other languages, however, we can sometimes omit pronouns altogether.
- Languages that can omit pronouns are called **pro-drop** languages.
- Even among the pro-drop languages, there are marked differences in frequencies of omission.
- For example, Japanese and Chinese tend to omit far more than does Spanish.
- This dimension of variation across languages is called the dimension of **referential density**.
- Languages that tend to use more pronouns are more **referentially dense** than those that use more zeros.
- Referentially sparse languages, like Chinese or Japanese, that require the hearer to do more inferential work to recover antecedents are also called **cold** languages.
- Languages that are more explicit and make it easier for the hearer are called **hot** languages.
- Translating from languages with extensive pro-drop, like Chinese or Japanese, to non-pro-drop languages like English can be difficult.
- Since the model must somehow identify each zero and recover who or what is being talked about in order to insert the proper pronoun.
## The Encoder-Decoder Model
- **Encoder-decoder** networks, or **sequence-to-sequence** networks, are models capable of generating contextually appropriate, arbitrary-length output sequences.
- The key idea underlying these networks is the use of an **encoder** network that takes an input sequence and creates a contextualized representation of it, often called the **context**.
- This representation is then passed to a decoder which generates a task-specific output sequence.
- *Fig. 11.3* illustrates this architecture.

- Encoder-decoder networks consist of three components:
- An **encoder** that accepts an input sequence, $x^n_1$, and generates a corresponding sequence of contextualized representations, $h^n_1$.
- LSTMs, GRUs, convolutional networks, and Transformers can all be employed as encoders.
- A **context vector**, $c$, which is a function of $h^n_1$, and conveys the essence of the input to the decoder.
- A **decoder**, which accepts $c$ as input and generates an arbitrary-length sequence of hidden states $h^m_1$, from which a corresponding sequence of output states, $y^m_1$, can be obtained.
- Just as with encoders, decoders can be realized by any kind of sequence architecture.
## Encoder-Decoder with RNNs
- Recall the conditional RNN language model, in which the model is passed a prefix of $t-1$ tokens and uses the final hidden state to generate the next token at time $t$.
- Formally, if $g$ is an activation function like $tanh$ or ReLU, a function of the input at time $t$ and the hidden state at time $t-1$, and $f$ is a softmax over the set of possible vocabulary items, then at time $t$ the output $y_t$ and the hidden state $h_t$ are computed as:
\begin{align}
h_t &= g(h_{t-1}, x_t) \\
y_t &= f(h_t)
\end{align}
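As a concrete, deliberately tiny instantiation of these two equations, the sketch below uses a plain tanh cell for $g$ and a linear projection plus softmax for $f$. The dimensions and random parameters are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, d_embed, vocab_size = 4, 3, 10     # toy sizes (assumptions)

# Parameters of g (a plain tanh RNN cell) and f (output projection + softmax).
W = rng.normal(size=(d_hidden, d_embed))     # input-to-hidden
U = rng.normal(size=(d_hidden, d_hidden))    # hidden-to-hidden
V = rng.normal(size=(vocab_size, d_hidden))  # hidden-to-vocab

def g(h_prev, x_t):
    """One RNN step: combine the previous hidden state and the current input."""
    return np.tanh(W @ x_t + U @ h_prev)

def f(h_t):
    """Softmax over the vocabulary given the current hidden state."""
    z = V @ h_t
    e = np.exp(z - z.max())
    return e / e.sum()

h_t = g(np.zeros(d_hidden), rng.normal(size=d_embed))
y_t = f(h_t)   # distribution over the next token
```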
- We only have to make one change to turn this language model with autoregressive generation into a translation model that can translate from a **source** text in one language to a **target** text in a second:
- Add a sentence separation marker at the end of the source text, and then simply concatenate the target text.
- If we call the source text $x$ and the target text $y$, we are computing the probability $p(y|x)$ as follows:
\begin{align}
p(y|x) = p(y_1|x)\,p(y_2|y_1,x)\,p(y_3|y_1,y_2,x) \cdots p(y_m|y_1,...,y_{m-1},x)
\end{align}
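In practice this product is computed as a sum of log conditional probabilities. A small hedged sketch, where `step_prob` is a hypothetical stand-in for the decoder's per-token conditional distribution:

```python
import numpy as np

def sequence_log_prob(step_prob, x, y):
    """log p(y|x) = sum_t log p(y_t | y_1..y_{t-1}, x).

    step_prob(x, prefix, token) is assumed to return the model's
    conditional probability of `token` given the source `x` and the
    previously generated `prefix`.
    """
    total = 0.0
    for t, token in enumerate(y):
        total += np.log(step_prob(x, y[:t], token))
    return total

# Toy usage with a uniform dummy model over a 10-word vocabulary:
print(sequence_log_prob(lambda x, prefix, tok: 0.1, ["source"], ["a", "b", "c"]))
```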
- *Fig. 11.4* shows the setup for a simplified version of the encoder-decoder model (without **attention**, which will be examined later).
- To translate a source text, we run it through the network performing forward inference to generate hidden states until we get to the end of the source.
- Then we begin *autoregressive generation*, asking for a word in the context of the hidden layer from the end of the source input as well as the end-of-sentence marker.
- Subsequent words are conditioned on the previous hidden state and the embedding for the last word generated.

- We formalize and generalize this model in *Fig. 11.5*.
- The superscripts $e$ and $d$ are used to distinguish the hidden states of the encoder and the decoder.

- The elements of the network on the left process the input sequence $x$ and comprise the encoder.
- Stacked architectures are the norm, where the output states from the top layer of the stack are taken as the final representation.
- A widely used encoder design makes use of stacked biLSTMs.
- The entire purpose of the encoder is to generate a contextualized representation of the input.
- This representation is embodied in the final hidden state of the encoder, $h^e_n$; also called $c$ for **context**, this representation is then passed to the decoder.

- The **decoder** network on the right takes this state and uses it to initialize the first hidden state of the decoder.
- That is, the first decoder RNN cell uses $c$ as its prior hidden state $h^d_0$.
- The decoder autoregressively generates a sequence of outputs, an element at a time, until an end-of-sequence marker is generated.
- The context vector $c$ is available at each step in the decoding process by adding it as a parameter to the computation of the current hidden state, using the following equation (*Fig. 11.6*):
\begin{align}
h^d_t = g(\hat{y}_{t-1}, h^d_{t-1}, c)
\end{align}
- This is to make sure that the influence of the context vector $c$ is maintained.
- We look at the full equations for this version of the decoder in the basic encoder-decoder model, with context available at each decoding timestep. Recall that $g$ is a stand-in for some flavor of RNN and $\hat{y}_{t-1}$ is the embedding for the output sampled from the softmax at the previous step:
\begin{align}
c &= h^e_n \\
h^d_0 &= c \\
h^d_t &= g(\hat{y}_{t-1}, h^d_{t-1}, c) \\
z_t &= f(h^d_t) \\
y_t &= \text{softmax}(z_t)
\end{align}
- We compute the most likely output at each time step by taking the argmax over the softmax output:
\begin{align}
\hat{y}_t = \text{argmax}_{w \in V} P(w|x, y_1...y_{t-1})
\end{align}
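Putting the decoder equations and the argmax together, here is a minimal numpy sketch of greedy decoding. The tanh cell standing in for $g$, the parameter matrices, the toy sizes, and the use of token `0` as the end-of-sentence/separator marker are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_emb, V_size, EOS = 4, 3, 10, 0        # toy sizes; token 0 plays <eos>
E   = rng.normal(size=(V_size, d_emb))     # output-token embeddings
W_y = rng.normal(size=(d, d_emb))          # embedding-to-hidden
W_h = rng.normal(size=(d, d))              # hidden-to-hidden
W_c = rng.normal(size=(d, d))              # context-to-hidden
W_o = rng.normal(size=(V_size, d))         # hidden-to-vocab

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_greedy(c, max_len=20):
    """Greedy decoding: h^d_0 = c, then pick the argmax token at each step."""
    h = c                                   # h^d_0 = c
    y_prev, output = EOS, []                # generation starts from the separator
    for _ in range(max_len):
        # h^d_t = g(ŷ_{t-1}, h^d_{t-1}, c), with g a simple tanh cell here
        h = np.tanh(W_y @ E[y_prev] + W_h @ h + W_c @ c)
        y_t = int(np.argmax(softmax(W_o @ h)))   # argmax over the vocabulary
        if y_t == EOS:
            break
        output.append(y_t)
        y_prev = y_t
    return output

# c would normally be the final encoder hidden state h^e_n:
print(decode_greedy(rng.normal(size=d)))
```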
- One way to make the model a bit more powerful is to condition the output layer $y_t$ not just on the hidden state $h^d_t$ and the context $c$ but also on the output $\hat{y}_{t-1}$ generated at the previous time step:
\begin{align}
y_t = \text{softmax}(\hat{y}_{t-1}, z_t, c)
\end{align}
### Training the Encoder-Decoder Model
- Encoder-decoder architectures are trained end-to-end, just as with the RNN language models.
- Each training example is a tuple of paired strings, a source and a target.
- Concatenated with a separator token, these source-target pairs can now serve as training data.
- For MT, the training data typically consists of sets of sentences and their translations.
- These can be drawn from standard datasets of aligned sentence pairs.
- Once we have a training set, the training itself proceeds as with any RNN-based language model.
- The network is given the source text and then, starting with the separator token, is trained autoregressively to predict the next word (*Fig. 11.7*).

- Note the differences between training (*Fig. 11.7*) and inference (*Fig. 11.4*) with respect to the outputs at each time step.
- During inference, the decoder will tend to deviate more and more from the gold target sentence as it keeps generating more tokens.
- Since it uses its own estimated output $\hat{y}_t$ as the input for the next step $x_{t+1}$.
- In training, therefore, it is more common to use **teacher forcing** in the decoder.
- Meaning we force the system to use the gold target token from training as the next input $x_{t+1}$ rather than the decoder output $\hat{y}_t$.
- This speeds up training.
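A minimal sketch of teacher forcing at training time; `step` is a hypothetical stand-in for one decoder step that returns the new hidden state and a probability distribution over the vocabulary, and the cross-entropy loss is averaged over the gold target tokens.

```python
import numpy as np

def teacher_forced_loss(step, c, gold, bos=0):
    """Average cross-entropy with teacher forcing.

    step(y_prev, h_prev, c) is assumed to return (h_t, probs), where
    probs is the softmax distribution over the vocabulary. At every
    timestep the *gold* token, not the model's own prediction, is fed
    in as the next input.
    """
    h, y_prev, loss = c, bos, 0.0
    for y_t in gold:
        h, probs = step(y_prev, h, c)
        loss += -np.log(probs[y_t])   # negative log-likelihood of the gold token
        y_prev = y_t                  # teacher forcing: feed the gold token forward
    return loss / len(gold)
```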
## Attention
- The simplicity of the encoder-decoder model lies in its clean separation of the encoder, which represents the source text, from the decoder, which uses that representation to generate the target.
- The context vector, which is currently the final hidden state of the encoder, is thus acting as a **bottleneck** (*Fig. 11.8*).
- It must represent absolutely everything about the meaning of the source text, since that is the only thing that the decoder knows about the source text.
- Information at the beginning of the sequence, especially for long sentences, may not be equally well represented in the context vector.

- The **attention mechanism** is a solution to the bottleneck problem.
- It is a way of allowing the decoder to get information from *all* the hidden states of the encoder, not just the last hidden state.
- In the attention mechanism, the context vector $c$ is a single vector that is a function of the hidden states of the encoder: $c = f(h^n_1)$.
- The idea of attention is to create a single fixed-length vector $c$ by taking a weighted sum of all the encoder hidden states $h^n_1$.
- We can't simply use all the encoder hidden state vectors directly, since their number varies with the length of the input.
- The weights are used to focus on a particular part of the source text that is relevant for the token currently being produced by the decoder.
- The context vector produced by the attention mechanism is thus dynamic, different for each token in decoding.
### Attention mechanism
- Attention replaces the static context vector with one that is dynamically derived from the encoder hidden states at each point during decoding.
- This context vector, $c_i$, is generated anew with each decoding step $i$.
- It takes all of the encoder hidden states into account in its derivation.
- We then make this context available during decoding by conditioning the computation of the current decoder hidden state on it (along with the prior hidden state and the previous decoder-generated output) (*Fig. 11.9*):
\begin{align}
h^d_i = g(\hat{y}_{i-1}, h^d_{i-1}, c_i)
\end{align}

- The first step in computing $c_i$ is to compute how *relevant* each encoder state is to the decoder state captured in $h^d_{i-1}$.
- We capture relevance by computing - at each state $i$ during decoding - a $score(h^d_{i-1}, h^e_j)$ for each encoder state $j$.
- The simplest score, called **dot-product attention**, implements relevance as similarity: measuring the similarity between the decoder and encoder hidden states by computing the dot product between them:
\begin{align}
score(h^d_{i-1}, h^e_j) = h^d_{i-1} \cdot h^e_j
\end{align}
- The vector of all scores across all the encoder hidden states gives us the relevance of each encoder state to the current step of the decoder.
- To make use of these scores, we normalize them with a softmax to create a vector of weights, $\alpha_{ij}$, that tells us the proportional relevance of each encoder hidden state $j$ to the prior hidden decoder state, $h^d_{i-1}$:
\begin{align}
\alpha_{ij} &= \text{softmax}(score(h^d_{i-1}, h^e_j) \; \forall j \in e) \\
&= \frac{\text{exp}(score(h^d_{i-1}, h^e_j))}{\sum_k \text{exp}(score(h^d_{i-1}, h^e_k))}
\end{align}
- Finally, given the distribution in $\alpha$, we can compute a fixed-length context vector for the current decoder state by taking a weighted average over all the encoder hidden states:
\begin{align}
c_i = \sum_j \alpha_{ij} h^e_j
\end{align}
- With this, we finally have a fixed-length context vector that takes into account information from all of the encoder hidden states and is dynamically updated to reflect the needs of the decoder at each step of decoding (see the sketch following *Fig. 11.10* below).
- *Fig. 11.10* illustrates an encoder-decoder network with attention, focusing on the computation of one context vector $c_i$.

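A minimal numpy sketch of this dot-product attention computation for a single decoder step; the hidden-state sizes and random values are made-up toy inputs.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dot_product_attention(h_dec_prev, H_enc):
    """Dot-product attention for one decoder step.

    h_dec_prev: previous decoder hidden state h^d_{i-1}, shape (d,)
    H_enc:      encoder hidden states h^e_1..h^e_n, shape (n, d)
    Returns the weights alpha_ij and the context vector c_i.
    """
    scores = H_enc @ h_dec_prev   # score(h^d_{i-1}, h^e_j) = h^d_{i-1} · h^e_j
    alpha = softmax(scores)       # normalize scores into weights
    c_i = alpha @ H_enc           # weighted sum of encoder hidden states
    return alpha, c_i

# Toy usage with made-up states (d = 4, n = 5):
rng = np.random.default_rng(2)
alpha, c_i = dot_product_attention(rng.normal(size=4), rng.normal(size=(5, 4)))
```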
- It's also possible to create more sophisticated scoring functions for attention models.
- Instead of simple dot-product attention, we can get a more powerful function that computes the relevance of each encoder hidden state to the decoder hidden state by parameterizing the score with its own set of weights, $W_s$:
\begin{align}
score(h^d_{i-1}, h^e_j) = h^d_{i-1} W_s h^e_j
\end{align}
- The weights $W_s$, which are then trained during normal end-to-end training, give the network the ability to learn which aspects of similarity between the decoder and encoder states are important to the current application.
- This bilinear model also allows the encoder and decoder to use different dimensional vectors.
- The simple dot-product attention requires that the encoder and decoder hidden states have the same dimensionality (see the sketch below).
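A minimal sketch of the bilinear score under these assumptions; note how $W_s$ lets the encoder and decoder hidden states have different sizes.

```python
import numpy as np

def bilinear_score(h_dec_prev, H_enc, W_s):
    """score(h^d_{i-1}, h^e_j) = h^d_{i-1} W_s h^e_j for every encoder state j.

    h_dec_prev: shape (d_dec,), H_enc: shape (n, d_enc), W_s: shape (d_dec, d_enc).
    Returns one score per encoder hidden state, shape (n,).
    """
    return H_enc @ (W_s.T @ h_dec_prev)

# Toy usage: decoder states of size 4, encoder states of size 6 (made-up sizes).
rng = np.random.default_rng(3)
scores = bilinear_score(rng.normal(size=4), rng.normal(size=(5, 6)),
                        rng.normal(size=(4, 6)))
```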