# Machine Translation and Encoder-Decoder Models
###### tags: `NLP` `preparatory`
From [**Chapter 11: Machine Translation and Encoder-Decoder Models**](https://web.stanford.edu/~jurafsky/slp3/11.pdf) in **Speech and Language Processing**, Dan Jurafsky and James H. Martin, *3rd ed.*
- **Machine Translation** (**MT**) is the use of computers to translate from one language to another.
- The most common current use of machine translation is for **information access**.
- Another common use of machine translation is to aid human translators.
- MT is used to produce a draft that is then fixed up in a **post-editing** phase by human translators.
- The task is also called **computer-aided translation** (**CAT**). CAT is commonly used in **localization**.
- The standard algorithm for MT is the **encoder-decoder network**, also called the **sequence to sequence** network, an architecture that can be implemented with RNNs or with Transformers.
- Recall that RNNs and Transformers can be used to do **classification**, or **sequence labeling**.
- Encoder-decoder/seq2seq models are used for a different kind of sequence modeling:
- The output sequence is a complex function of the entire input sequence;
- We must map from a sequence of input words or tokens to a sequence of tags that are *not merely direct mappings from individual words*.
- MT is such a task: the words of the target language don't necessarily agree with the words of the source language in *number* or *order*.
- Encoder-decoder networks are very successful at handling complicated cases of sequence mappings.
- Aside from MT, they are also used in summarization, dialogue, semantic parsing, and many other tasks.
## Language Divergences and Typology
- Some aspects of human language seem to be universal, holding true for every language, or are statistical universals, holding true for most languages.
- Many universals arise from the functional role of language as a communicative system by humans.
- Every language, for example, seems to have words for referring to people, for talking about eating or drinking, for being polite or not.
- There are also structural linguistic universals;
- For example, every language seems to have nouns and verbs, etc.
- Languages also **differ** in many ways, and an understanding of what causes such **translation divergences** will help us build better MT models.
- We often distinguish the **idiosyncratic** and lexical differences that must be dealt with one by one,
- e.g., the word "dog" differs wildly from language to language
- From **systematic** differences that we can model in a general way
- e.g., some languages put the verb before the direct object; others put the verb after the direct object.
- The study of these systematic cross-linguistic similarities and differences is called **linguistic typology**.
- The World Atlas of Language Structures gives many typological facts about languages.
### Word Order Typology
- Languages differ in the basic word order of verbs, subjects, and objects in simple declarative clauses.
- For example, in **SVO** (**Subject-Verb-Object**) languages the verb tends to come between the subject and object.
- German, French, English, and Mandarin.
- By contrast, in **SOV** languages (Hindi and Japanese) the verb tends to come at the end of basic clauses, and Irish and Arabic are **VSO** languages.
- Two languages that share their basic word order type often have other similarities.
- For example, **VO** languages generally have **prepositions** (English), whereas **OV** languages generally have **postpositions** (Japanese).
- *Fig. 11.1* shows examples of some word order differences. All these differences can cause problems for translation, requiring the system to do huge structural reorderings as it generates the output.

### Lexical Divergences
- We also need to translate the individual words from one language to another. For any translation, the appropriate word can vary depending on the context.
- For example, *bass* in English can appear in Spanish as the fish *lubina* or the musical instrument *bajo*.
- In these cases, translating the words from English would require a kind of specialization, disambiguating the different uses of a word.
- The fields of MT and Word Sense Disambiguation are closely linked.
- Sometimes one language places more grammatical constraints on word choice than another.
- For example, English marks nouns for whether they are singular or plural, but Mandarin doesn't.
- The way that languages differ in lexically dividing up conceptual space may be more complex than the above one-to-many translation problem, leading to many-to-many mappings.
- For example, Fig. 11.2 summarizes some of the complexities in translating English to French.

- Further, one language may have a **lexical gap**, where no word or phrase, short of an explanatory footnote, can express the exact meaning of a word in the other language.
- For example, English does not have a word that corresponds neatly to Mandarin xiao or Japanese oyakokoo (English equivalents are awkward phrases like *filial piety* or *loving child*, or *good son/daughter* for both).
- Finally, languages may differ systematically in how the conceptual properties of an event are mapped onto specific words.
- Languages can be characterized by whether direction of motion and manner of motion are marked on the verb or on the "satellites": particles, prepositional phrases, or adverbial phrases.
- For example, a bottle floating out of a cave would be described in English with the direction marked on the particle *out*, while in Spanish the direction would be marked on the verb:

- Verb-framed languages mark the direction of motion on the verb (leaving the satellites to mark the manner of motion).
- Example: Spanish *acercarse* 'approach', *alcanzar* 'reach', *entrar* 'enter', *salir* 'exit'
- Languages: Japanese, Tamil, and languages in the Romance, Semitic, and Mayan language families.
- Satellite-framed languages mark the direction of motion on the satellite (leaving the verb to mark the manner of motion).
- Example: English *crawl out*, *float off*, *jump down*, *run after*
- Languages: Chinese, English, Swedish, Russian, Hindi, and Farsi.
### Morphological Typology
- Morphologically, languages are often characterized along two dimensions of variation.
- The first is the number of **morphemes** per word, ranging from:
- **Isolating** languages like Vietnamese and Cantonese, in which each word generally has one morpheme, to
- **Polysynthetic** languages like Siberian Yupik ("Eskimo"), in which a single word may have very many morphemes, corresponding to a whole sentence in English.
- The second dimension is the degree to which morphemes are **segmentable**, ranging from:
- **Agglutinative** languages like Turkish, in which morphemes have relatively clean boundaries, to
- **Fusion** languages like Russian, in which a single affix may conflate multiple morphemes.
- Translating between languages with rich morphology requires dealing with structure below the word level.
- For this reason, systems generally use subword models like the wordpiece or BPE models.
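As a rough illustration of how such subword vocabularies are built, below is a minimal sketch of BPE-style merge learning. The toy corpus, the `</w>` end-of-word convention, and the number of merges are made-up assumptions for illustration, not details from the chapter.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    corpus: dict mapping a word (tuple of symbols) to its frequency.
    Returns the list of learned merges, in order.
    """
    vocab = dict(corpus)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = {}
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Toy morphologically related word forms with hypothetical counts.
corpus = {("l", "o", "w", "</w>"): 5,
          ("l", "o", "w", "e", "r", "</w>"): 2,
          ("l", "o", "w", "e", "s", "t", "</w>"): 3}
print(learn_bpe_merges(corpus, 4))
```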
### Referential density
- Languages vary along a typological dimension related to the things they tend to omit.
- Some languages, like English, require that we use an explicit pronoun when talking about a referent that is given in the discourse.
- In other languages, however, we can sometimes omit pronouns altogether.
- Languages that can omit pronouns are called **pro-drop** languages.
- Even among the pro-drop languages, there are marked differences in frequencies of omission.
- For example, Japanese and Chinese tend to omit far more than does Spanish.
- This dimension of variation across languages is called the dimension of **referential density**.
- Languages that tend to use more pronouns are more **referentially dense** than those that use more zeros.
- Referentially sparse languages, like Chinese or Japanese, that require the hearer to do more inferential work to recover antecedents are also called **cold** languages.
- Languages that are more explicit and make it easier for the hearer are called **hot** languages.
- Translating from languages with extensive pro-drop, like Chinese or Japanese, to non-pro-drop languages like English can be difficult.
- Since the model must somehow identify each zero and recover who or what is being talked about in order to insert the proper pronoun.
## The Encoder-Decoder Model
- **Encoder-decoder** networks, or **sequence-to-sequence** networks, are models capable of generating contextually appropriate, arbitrary-length output sequences.
- The key idea underlying these networks is the use of an **encoder** network that takes an input sequence and creates a contextualized representation of it, often called the **context**.
- This representation is then passed to a decoder which generates a task-specific output sequence.
- *Fig. 11.3* illustrates this architecture.

- Encoder-decoder networks consist of three components:
- An **encoder** that accepts an input sequence, $x^n_1$, and generates a corresponding sequence of contextualized representations, $h^n_1$.
- LSTMs, GRUs, convolutional networks, and Transformers can all be employed as encoders.
- A **context vector**, $c$, which is a function of $h^n_1$, and conveys the essence of the input to the decoder.
- A **decoder**, which accepts $c$ as input and generates an arbitrary-length sequence of hidden states $h^m_1$, from which a corresponding sequence of output states, $y^m_1$, can be obtained.
- Just as with encoders, decoders can be realized by any kind of sequence architecture.
## Encoder-Decoder with RNNs
- Recall the conditional RNN language model, in which the model is passed a prefix of $t-1$ tokens and uses the final hidden state to generate the next token at time $t$.
- Formally, if $g$ is an activation function like $tanh$ or ReLU, a function of the input at time $t$ and the hidden state at time $t-1$, and $f$ is a softmax over the set of possible vocabulary items, then at time $t$ the output $y_t$ and the hidden state $h_t$ are computed as:
\begin{align}
h_t &= g(h_{t-1}, x_t) \\
y_t &= f(h_t)
\end{align}
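As a concrete, deliberately tiny instantiation of these two equations, the sketch below uses a plain tanh cell for $g$ and a linear projection plus softmax for $f$. The dimensions and random parameters are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, d_embed, vocab_size = 4, 3, 10     # toy sizes (assumptions)

# Parameters of g (a plain tanh RNN cell) and f (output projection + softmax).
W = rng.normal(size=(d_hidden, d_embed))     # input-to-hidden
U = rng.normal(size=(d_hidden, d_hidden))    # hidden-to-hidden
V = rng.normal(size=(vocab_size, d_hidden))  # hidden-to-vocab

def g(h_prev, x_t):
    """One RNN step: combine the previous hidden state and the current input."""
    return np.tanh(W @ x_t + U @ h_prev)

def f(h_t):
    """Softmax over the vocabulary given the current hidden state."""
    z = V @ h_t
    e = np.exp(z - z.max())
    return e / e.sum()

h_t = g(np.zeros(d_hidden), rng.normal(size=d_embed))
y_t = f(h_t)   # distribution over the next token
```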
- We only have to make one change to turn this language model with autoregressive generation into a translation model that can translate from a **source** text in one language to a **target** text in a second:
- Add a sentence separation marker at the end of the source text, and then simply concatenate the target text.
- If we call the source text $x$ and the target text $y$, we are computing the probability $p(y|x)$ as follows:
\begin{align}
p(y|x) = p(y_1|x)\,p(y_2|y_1,x)\,p(y_3|y_1,y_2,x) \cdots p(y_m|y_1,...,y_{m-1},x)
\end{align}
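In practice this product is computed as a sum of log conditional probabilities. A small hedged sketch, where `step_prob` is a hypothetical stand-in for the decoder's per-token conditional distribution:

```python
import numpy as np

def sequence_log_prob(step_prob, x, y):
    """log p(y|x) = sum_t log p(y_t | y_1..y_{t-1}, x).

    step_prob(x, prefix, token) is assumed to return the model's
    conditional probability of `token` given the source `x` and the
    previously generated `prefix`.
    """
    total = 0.0
    for t, token in enumerate(y):
        total += np.log(step_prob(x, y[:t], token))
    return total

# Toy usage with a uniform dummy model over a 10-word vocabulary:
print(sequence_log_prob(lambda x, prefix, tok: 0.1, ["source"], ["a", "b", "c"]))
```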
- *Fig. 11.4* shows the setup for a simplified version of the encoder-decoder model (without **attention**, which will be examined later).
- To translate a source text, we run it through the network performing forward inference to generate hidden states until we get to the end of the source.
- Then we begin *autoregressive generation*, asking for a word in the context of the hidden layer from the end of the source input as well as the end-of-sentence marker.
- Subsequent words are conditioned on the previous hidden state and the embedding for the last word generated.

- We formalize and generalize this model in *Fig. 11.5*.
- The superscripts $e$ and $d$ are used to distinguish the hidden states of the encoder and the decoder.

- The elements of the network on the left process the input sequence $x$ and comprise the encoder.
- Stacked architectures are the norm, where the output states from the top layer of the stack are taken as the final representation.
- A widely used encoder design makes use of stacked biLSTMs.
- The entire purpose of the encoder is to generate a contextualized representation of the input.
- This representation is embodied in the final hidden state of the encoder, $h^e_n$; also called $c$ for **context**, this representation is then passed to the decoder.

- The **decoder** network on the right takes this state and uses it to initialize the first hidden state of the decoder.
- That is, the first decoder RNN cell uses $c$ as its prior hidden state $h^d_0$.
- The decoder autoregressively generates a sequence of outputs, an element at a time, until an end-of-sequence marker is generated.
- The context vector $c$ is available at each step in the decoding process by adding it as a parameter to the computation of the current hidden state, using the following equation (*Fig. 11.6*):
\begin{align}
h^d_t = g(\hat{y}_{t-1}, h^d_{t-1}, c)
\end{align}
- This is to make sure that the influence of the context vector $c$ is maintained.
- We look at the full equations for this version of the decoder in the basic encoder-decoder model, with context available at each decoding timestep. Recall that $g$ is a stand-in for some flavor of RNN and $\hat{y}_{t-1}$ is the embedding for the output sampled from the softmax at the previous step:
\begin{align}
c &= h^e_n \\
h^d_0 &= c \\
h^d_t &= g(\hat{y}_{t-1}, h^d_{t-1}, c) \\
z_t &= f(h^d_t) \\
y_t &= \text{softmax}(z_t)
\end{align}
- We compute the most likely output at each time step by taking the argmax over the softmax output:
\begin{align}
\hat{y}_t = \text{argmax}_{w \in V} P(w|x, y_1...y_{t-1})
\end{align}
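Putting the decoder equations and the argmax together, here is a minimal numpy sketch of greedy decoding. The tanh cell standing in for $g$, the parameter matrices, the toy sizes, and the use of token `0` as the end-of-sentence/separator marker are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_emb, V_size, EOS = 4, 3, 10, 0        # toy sizes; token 0 plays <eos>
E   = rng.normal(size=(V_size, d_emb))     # output-token embeddings
W_y = rng.normal(size=(d, d_emb))          # embedding-to-hidden
W_h = rng.normal(size=(d, d))              # hidden-to-hidden
W_c = rng.normal(size=(d, d))              # context-to-hidden
W_o = rng.normal(size=(V_size, d))         # hidden-to-vocab

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_greedy(c, max_len=20):
    """Greedy decoding: h^d_0 = c, then pick the argmax token at each step."""
    h = c                                   # h^d_0 = c
    y_prev, output = EOS, []                # generation starts from the separator
    for _ in range(max_len):
        # h^d_t = g(ŷ_{t-1}, h^d_{t-1}, c), with g a simple tanh cell here
        h = np.tanh(W_y @ E[y_prev] + W_h @ h + W_c @ c)
        y_t = int(np.argmax(softmax(W_o @ h)))   # argmax over the vocabulary
        if y_t == EOS:
            break
        output.append(y_t)
        y_prev = y_t
    return output

# c would normally be the final encoder hidden state h^e_n:
print(decode_greedy(rng.normal(size=d)))
```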
- One way to make the model a bit more powerful is to condition the output layer $y_t$ not just on the hidden state $h^d_t$ and the context $c$ but also on the output $\hat{y}_{t-1}$ generated at the previous time step:
\begin{align}
y_t = \text{softmax}(\hat{y}_{t-1}, z_t, c)
\end{align}
### Training the Encoder-Decoder Model
- Encoder-decoder architectures are trained end-to-end, just as with the RNN language models.
- Each training example is a tuple of paired strings, a source and a target.
- Concatenated with a separator token, these source-target pairs can now serve as training data.
- For MT, the training data typically consists of sets of sentences and their translations.
- These can be drawn from standard datasets of aligned sentence pairs.
- Once we have a training set, the training itself proceeds as with any RNN-based language model.
- The network is given the source text and then, starting with the separator token, is trained autoregressively to predict the next word (*Fig. 11.7*).

- Note the differences between training (*Fig. 11.7*) and inference (*Fig. 11.4*) with respect to the outputs at each time step.
- During inference, the decoder will tend to deviate more and more from the gold target sentence as it keeps generating more tokens.
- Since it uses its own estimated output $\hat{y}_t$ as the input for the next step $x_{t+1}$.
- In training, therefore, it is more common to use **teacher forcing** in the decoder.
- Meaning we force the system to use the gold target token from training as the next input $x_{t+1}$ rather than the decoder output $\hat{y}_t$.
- This speeds up training.
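A minimal sketch of teacher forcing at training time; `step` is a hypothetical stand-in for one decoder step that returns the new hidden state and a probability distribution over the vocabulary, and the cross-entropy loss is averaged over the gold target tokens.

```python
import numpy as np

def teacher_forced_loss(step, c, gold, bos=0):
    """Average cross-entropy with teacher forcing.

    step(y_prev, h_prev, c) is assumed to return (h_t, probs), where
    probs is the softmax distribution over the vocabulary. At every
    timestep the *gold* token, not the model's own prediction, is fed
    in as the next input.
    """
    h, y_prev, loss = c, bos, 0.0
    for y_t in gold:
        h, probs = step(y_prev, h, c)
        loss += -np.log(probs[y_t])   # negative log-likelihood of the gold token
        y_prev = y_t                  # teacher forcing: feed the gold token forward
    return loss / len(gold)
```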
## Attention
- The simplicity of the encoder-decoder model lies in its clean separation of the encoder, which represents the source text, from the decoder, which uses that representation to generate the target.
- The context vector, which is currently the final hidden state of the encoder, is thus acting as a **bottleneck** (*Fig. 11.8*).
- It must represent absolutely everything about the meaning of the source text, since that is the only thing that the decoder knows about the source text.
- Information at the beginning of the sequence, especially for long sentences, may not be equally well represented in the context vector.

- The **attention mechanism** is a solution to the bottleneck problem.
- It is a way of allowing the decoder to get information from *all* the hidden states of the encoder, not just the last hidden state.
- In the attention mechanism, the context vector $c$ is a single vector that is a function of the hidden states of the encoder: $c = f(h^n_1)$.
- The idea of attention is to create a single fixed-length vector $c$ by taking a weighted sum of all the encoder hidden states $h^n_1$.
- We can't simply use all the encoder hidden state vectors directly, since their number varies with the length of the input.
- The weights are used to focus on a particular part of the source text that is relevant for the token currently being produced by the decoder.
- The context vector produced by the attention mechanism is thus dynamic, different for each token in decoding.
### Attention mechanism
- Attention replaces the static context vector with one that is dynamically derived from the encoder hidden states at each point during decoding.
- This context vector, $c_i$, is generated anew with each decoding step $i$.
- It takes all of the encoder hidden states into account in its derivation.
- We then make this context available during decoding by conditioning the computation of the current decoder hidden state on it (along with the prior hidden state and the previous decoder-generated output) (*Fig. 11.9*):
\begin{align}
h^d_i = g(\hat{y}_{i-1}, h^d_{i-1}, c_i)
\end{align}

- The first step in computing $c_i$ is to compute how *relevant* each encoder state is to the decoder state captured in $h^d_{i-1}$.
- We capture relevance by computing - at each state $i$ during decoding - a $score(h^d_{i-1}, h^e_j)$ for each encoder state $j$.
- The simplest score, called **dot-product attention**, implements relevance as similarity: measuring the similarity between the decoder and encoder hidden states by computing the dot product between them:
\begin{align}
score(h^d_{i-1}, h^e_j) = h^d_{i-1} \cdot h^e_j
\end{align}
- The vector of all scores across all the encoder hidden states gives us the relevance of each encoder state to the current step of the decoder.
- To make use of these scores, we normalize them with a softmax to create a vector of weights, $\alpha_{ij}$, that tells us the proportional relevance of each encoder hidden state $j$ to the prior hidden decoder state, $h^d_{i-1}$:
\begin{align}
\alpha_{ij} &= \text{softmax}(score(h^d_{i-1}, h^e_j) \; \forall j \in e) \\
&= \frac{\text{exp}(score(h^d_{i-1}, h^e_j))}{\sum_k \text{exp}(score(h^d_{i-1}, h^e_k))}
\end{align}
- Finally, given the distribution in $\alpha$, we can compute a fixed-length context vector for the current decoder state by taking a weighted average over all the encoder hidden states:
\begin{align}
c_i = \sum_j \alpha_{ij} h^e_j
\end{align}
- With this, we finally have a fixed-length context vector that takes into account information from all of the encoder hidden states and is dynamically updated to reflect the needs of the decoder at each step of decoding (see the sketch following *Fig. 11.10* below).
- *Fig. 11.10* illustrates an encoder-decoder network with attention, focusing on the computation of one context vector $c_i$.

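A minimal numpy sketch of this dot-product attention computation for a single decoder step; the hidden-state sizes and random values are made-up toy inputs.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dot_product_attention(h_dec_prev, H_enc):
    """Dot-product attention for one decoder step.

    h_dec_prev: previous decoder hidden state h^d_{i-1}, shape (d,)
    H_enc:      encoder hidden states h^e_1..h^e_n, shape (n, d)
    Returns the weights alpha_ij and the context vector c_i.
    """
    scores = H_enc @ h_dec_prev   # score(h^d_{i-1}, h^e_j) = h^d_{i-1} · h^e_j
    alpha = softmax(scores)       # normalize scores into weights
    c_i = alpha @ H_enc           # weighted sum of encoder hidden states
    return alpha, c_i

# Toy usage with made-up states (d = 4, n = 5):
rng = np.random.default_rng(2)
alpha, c_i = dot_product_attention(rng.normal(size=4), rng.normal(size=(5, 4)))
```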
- It's also possible to create more sophisticated scoring functions for attention models.
- Instead of simple dot-product attention, we can get a more powerful function that computes the relevance of each encoder hidden state to the decoder hidden state by parameterizing the score with its own set of weights, $W_s$:
\begin{align}
score(h^d_{i-1}, h^e_j) = h^d_{i-1} W_s h^e_j
\end{align}
- The weights $W_s$, which are then trained during normal end-to-end training, give the network the ability to learn which aspects of similarity between the decoder and encoder states are important to the current application.
- This bilinear model also allows the encoder and decoder to use different dimensional vectors.
- The simple dot-product attention requires that the encoder and decoder hidden states have the same dimensionality (see the sketch below).
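A minimal sketch of the bilinear score under these assumptions; note how $W_s$ lets the encoder and decoder hidden states have different sizes.

```python
import numpy as np

def bilinear_score(h_dec_prev, H_enc, W_s):
    """score(h^d_{i-1}, h^e_j) = h^d_{i-1} W_s h^e_j for every encoder state j.

    h_dec_prev: shape (d_dec,), H_enc: shape (n, d_enc), W_s: shape (d_dec, d_enc).
    Returns one score per encoder hidden state, shape (n,).
    """
    return H_enc @ (W_s.T @ h_dec_prev)

# Toy usage: decoder states of size 4, encoder states of size 6 (made-up sizes).
rng = np.random.default_rng(3)
scores = bilinear_score(rng.normal(size=4), rng.normal(size=(5, 6)),
                        rng.normal(size=(4, 6)))
```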