# Deep Learning Architectures for Sequence Processing
###### tags: `NLP` `preparatory`
From **[Chapter 9: Deep Learning Architectures for Sequence Processing](https://web.stanford.edu/~jurafsky/slp3/9.pdf)** in **Speech and Language Processing**, Dan Jurafsky and James H. Martin, *3rd ed.*
- Language is an inherently temporal phenomenon, a sequence that unfolds in time.
- We process spoken language as continuous input streams of indefinite length.
- We normally process written text sequentially.
- This temporal nature is reflected in the algorithms we use to process language.
- For example, the Viterbi algorithm decodes POS tags by processing one input word at a time.
- However, architectures such as feedforward neural nets assume simultaneous access to their entire input, failing to capture important temporal aspects of language.
- A work-around for these problems is the sliding window approach employed with neural language models.
- These models accept fixed-sized windows of tokens as input, and output predictions accordingly.
- **Note**: Decisions made in one window have no impact on subsequent decisions.
- This approach has two main problems:
- It limits the context from which information can be extracted, and
- It is difficult for networks to learn systematic patterns arising from phenomena like constituency.
- Two deep learning architectures are designed to address these problems:
- **Recurrent neural networks**, and
- **Transformer networks**.
- These architectures have mechanisms to handle *variable-length inputs* without using fixed-sized windows, while *capturing and exploiting the temporal nature of language*.
## Language Models Revisited
- The two mentioned architectures will be examined primarily through the lens of **probabilistic language models**.
- Probabilistic language models predict the next word in a sequence given some preceding context (e.g., computing $P(\text{cute}|\text{That cat looks})$).
- These models give us the ability to assign a *conditional probability* to every possible next word, giving us a *distribution* over the entire vocabulary.
- We can also assign probabilities to entire sequences by using these conditional probabilities in combination with the *chain rule*.
- Both *N*-gram models and feedforward neural nets with sliding windows are constrained by a Markov assumption.
- **Assumption**: The prediction is based on a fixed preceding context of size $N$; any input that occurred earlier has no bearing on the outcome.
- The new methods will relax this assumption, allowing the models to make use of much larger contexts.
- We evaluate language models by examining how well they predict unseen data drawn from the same source as the training data.
- Intuitively, good models are those that *assign higher probabilities to unseen data*.
- For a concrete definition, we use **perplexity** as a measure of model quality.
- The perplexity $PP$ of a model $\theta$ with respect to an unseen test set is the inverse of the probability the model assigns to it, normalized by its length (the number of words $n$).
$$
\begin{gather}
PP_\theta(w_{1:n}) = P_\theta(w_{1:n})^{-1/n}
\end{gather}
$$
- The information-theoretic alternative is defined in terms of entropy (the value in the exponent is the *cross-entropy* of our current model $m$ w.r.t. the true distribution):
$$
\begin{gather}
PP(w_{1:n}) = 2^{H(w_{1:n})} = 2^{-\frac{1}{n}\sum_{i=1}^n\log_2 m(w_i)}
\end{gather}
$$
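As a minimal numeric illustration (not from the chapter), the NumPy sketch below computes perplexity from per-word probabilities that a hypothetical model assigns to a test sequence; the probability values are made up for demonstration, and both formulas above give the same result.

```python
import numpy as np

# Hypothetical per-word probabilities m(w_i) a model assigns to a 5-word test sequence.
word_probs = np.array([0.20, 0.05, 0.10, 0.30, 0.15])
n = len(word_probs)

# Direct definition: inverse sequence probability, normalized by length.
pp_direct = np.prod(word_probs) ** (-1.0 / n)

# Information-theoretic form: 2 raised to the cross-entropy (in bits).
cross_entropy = -np.mean(np.log2(word_probs))
pp_entropy = 2.0 ** cross_entropy

print(pp_direct, pp_entropy)  # identical up to floating-point error
```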
- Another way to assess a language model is to use it to *generate novel sequences*.
- The extent to which a generated sequence mirrors the training data is an indication of the quality of the model.
- We do this using an approach called **autoregressive generation**:
- First, randomly sample the first word of a sequence from the model's distribution over sequence-initial words (i.e., conditioned on a beginning-of-sequence marker).
- Then, having sampled the first word, we sample further words *conditioned on our previous choices*.
- Continue until we reach a pre-defined length, or an end of sequence token is generated.
## Recurrent Neural Networks
- A **recurrent neural network** (**RNN**) is any network that contains a cycle within its network connections.
- The value of a unit within an RNN is directly, or indirectly, dependent on its own earlier outputs as an input.
- A class of recurrent networks is **Elman Networks**, or **simple recurrent networks**.
- These networks are useful on their own and serve as the basis for more complex approaches like the *Long Short-Term Memory (LSTM)* network.

- The key difference of a RNN compared to a feedforward network lies in its recurrent link.
- This link augments the input to the computation at the hidden layer with the value of the hidden layer *from the preceding point in time*.
- The hidden layer from the previous time step *provides a form of memory*, or context, that encodes earlier processing and informs the decisions to be made at later points in time.
- Critically, this approach *does not impose a fixed-length limit* on the mentioned prior context.
- The context embodied in the previous hidden layer includes information extending back to the beginning of the sequence.
- Adding a temporal dimension makes RNNs appear to be more complex than non-recurrent architectures, while in reality they are not all that different.
- We are still performing the standard feedforward calculation, albeit with a new set of weights and inputs.
- The new set of weights, $U$, connects the hidden layer from the previous time step to the current hidden layer, determining how the network makes use of past context when calculating the output for the current input.
- These weights are also trained via *backpropagation*.

### Inference in RNNs
- **Forward inference** (mapping a sequence of inputs to a sequence of outputs) in an RNN is nearly identical to the procedure in feedforward networks.
- To compute an output $y_t$ for an input $x_t$, we need the activation value for the hidden layer $h_t$.
- In an RNN with recurrent weight matrix $U$, input weight matrix $W$, and output weight matrix $V$ we proceed with the usual output vector calculation:
$$
\begin{align}
h_t &= g(Uh_{t-1} + Wx_t) \\
y_t &= f(Vh_t)
\end{align}
$$
- Specifying the dimensions of the matrices is important. Let $d_{in}$, $d_h$, and $d_{out}$ be the input, hidden and output layer dimensions, respectively:
- Input weight matrix $W \in \mathbb{R}^{d_h \times d_{in}}$,
- Recurrent weight matrix $U \in \mathbb{R}^{d_h \times d_h}$, and
- Output weight matrix $V \in \mathbb{R}^{d_{out} \times d_h}$.
- In the commonly encountered case of soft classification, computing $y_t$ consists of a softmax calculation: $y_t = \text{softmax}(Vh_t)$
- Computation at time $t$ requires the value of the hidden layer at time $t-1$, mandating an incremental inference algorithm that processes from the start to the end of the sequence as illustrated in *Fig. 9.4*.
- The sequential nature of simple recurrent networks can also be seen by *unrolling* the network in time as shown in *Fig. 9.5* (the weight matrices are shared across time).
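A minimal NumPy sketch of forward inference in a simple RNN, following the equations above; the dimensions and random weights are placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 3, 2          # toy dimensions

W = rng.normal(size=(d_h, d_in))    # input weights
U = rng.normal(size=(d_h, d_h))     # recurrent weights
V = rng.normal(size=(d_out, d_h))   # output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs):
    """Map a sequence of input vectors to a sequence of output distributions."""
    h = np.zeros(d_h)               # h_0
    ys = []
    for x in xs:                    # process the sequence left to right
        h = np.tanh(U @ h + W @ x)  # h_t = g(U h_{t-1} + W x_t)
        ys.append(softmax(V @ h))   # y_t = softmax(V h_t)
    return ys

xs = [rng.normal(size=d_in) for _ in range(5)]
print(rnn_forward(xs))
```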


### Training
- As with feedforward networks, we'll use a *training set*, a *loss function*, and *backpropagation* to obtain the *gradients* needed to update the weights.
- We have 3 sets of weights to update: $W$, the input weights, $U$, the recurrent weights, and $V$, the output weights.
- Compared to feedforward networks, we have to take two considerations when performing backpropagation on RNNs:
- To compute the loss function for the output at time $t$ we need the hidden layer from time $t-1$, and
- The hidden layer at time $t$ influences both the output at time $t$ and the hidden layer at time $t+1$ (and hence the output and loss at $t+1$).
- Therefore, to assess the error accruing to $h_t$, we'll need to know its influence on both the current output *as well as the ones that follow*.
- A general training approach used with RNNs is called **Backpropagation Through Time**.
- It is a two-pass algorithm for training the weights.
- In the first pass, we perform forward inference,
- Computing $h_t$, $y_t$,
- Accumulating the loss at each step in time, and
- Saving the value of the hidden layer at each step for use at the next time step.
- In the second phase, we process the sequence in reverse, computing the gradients as we go, computing and saving the error term for use in the hidden layer *for each step backward in time*.
- Like in *Fig 9.5*, explicitly *unrolling* a recurrent network into a feedforward computational graph eliminates any explicit recurrences, *allowing the network weights to be trained directly*.
- This eliminates the need for a specialized approach to training RNNs.
- In this approach, we provide a template that specifies the basic structure of the network.
- This includes necessary parameters for I/O + hidden layers, weight matrices, the activation and output function.
- Then, when presented with a specific input sequence, we can generate an unrolled feedforward network specific to that input, and use that graph to perform forward inference or ++training via ordinary backpropagation++.
- For applications that involve much longer input sequences, unrolling an entire input sequence may not be feasible.
- We can unroll the input into manageable fixed-length segments and treat each segment as a distinct training item.
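A rough sketch of the segmenting idea (the segment length and the `corpus_tokens` name are assumptions for illustration): a long token sequence is split into fixed-length chunks, each of which is unrolled and trained as a separate item.

```python
def segment(tokens, seg_len):
    """Split a long sequence into fixed-length segments for unrolling."""
    return [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]

print(segment(list(range(10)), 4))   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Each segment is then treated as a distinct training item, e.g.:
# for seg in segment(corpus_tokens, 64):
#     unroll the RNN over `seg` and backpropagate through that graph only
```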
### RNNs as Language Models
- RNN-based language models process sequences a word at a time, attempting to predict the next word in a sequence by using the current word and the previous hidden state as inputs.
- The limited-context constraint of N-gram models is avoided, since the hidden state can embody information extending all the way back to the beginning of the sequence.
- *Forward inference* in a recurrent language model:
- Input sequence $x$ consists of words represented as one-hot vectors of size $|V| \times 1$,
- Output predictions $y$: vectors representing a probability distribution over the vocabulary.
- At each step, the model uses the word embedding matrix $E$ to retrieve the embedding for the current word, then combines it with the hidden layer from the previous time step to compute the current hidden layer.
- This hidden layer is then used to generate an output layer, passed through a softmax function to generate $y$. That is, at time $t$:
\begin{align}
e_t &= E^{\top} x_t \\
h_t &= g(Uh_{t-1} + We_t) \\
y_t &= \text{softmax}(Vh_t)
\end{align}
- Given $y_t$, the probability that word $i$ of the vocabulary is the next word is the corresponding component of $y_t$:
\begin{gather}
P(w_{t+1} = i|w_{1:t}) = y_t[i]
\end{gather}
- Therefore, the probability of an entire sequence is just the product of the probabilities of each item in the sequence:
\begin{align}
P(w_{1:n}) &= \prod_{i=1}^n P(w_i|w_{1:i-1}) \\
&= \prod_{i=1}^n y_i[w_i]
\end{align}
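A minimal sketch (illustrative only, with toy sizes and random weights) of how the per-step output distributions chain into a sequence probability; the embedding lookup is written as selecting a row of $E$, which is equivalent to the $E^{\top} x_t$ product for a one-hot $x_t$, and index 0 is assumed to be a start-of-sequence token.

```python
import numpy as np

rng = np.random.default_rng(1)
V_size, d_h = 6, 4                       # toy vocabulary and hidden sizes

E = rng.normal(size=(V_size, d_h))       # embedding matrix, |V| x d_h
U = rng.normal(size=(d_h, d_h))          # recurrent weights
W = rng.normal(size=(d_h, d_h))          # input weights over d_h-dim embeddings
Vout = rng.normal(size=(V_size, d_h))    # output projection, |V| x d_h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sequence_probability(word_ids):
    """P(w_1..w_n) as the product of per-step next-word probabilities."""
    h = np.zeros(d_h)
    prob = 1.0
    prev = 0                             # assumed start-of-sequence index
    for w in word_ids:
        e = E[prev]                      # e_t: row lookup == E^T x_t for one-hot x_t
        h = np.tanh(U @ h + W @ e)       # h_t
        y = softmax(Vout @ h)            # distribution over the vocabulary
        prob *= y[w]                     # pick out y_t[next word]
        prev = w
    return prob

print(sequence_probability([2, 5, 1]))
```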
- To train an RNN as a language model, we use a corpus of text as training material in combination with a training regimen called **teacher forcing**.
- The task is to minimize the error in predicting the next word in the training sequence, using *cross-entropy as the loss function*.
- The cross-entropy loss measures the difference between a predicted probability distribution and the correct distribution:
\begin{gather}
L_{CE} = -\sum_{w \in V} y_t[w] \log \hat{y}_t[w]
\end{gather}
- The correct distribution $y$ comes from knowing the next word, which is represented as a one-hot vector (correct word is 1, others are 0). Therefore, at time $t$ the CE loss is:
\begin{gather}
L_{CE}(\hat{y}_t, y_t) = -\log \hat{y}_t[w_{t+1}]
\end{gather}
- In practice, the weights in the network are adjusted to minimize the average CE loss over the training sequence via *gradient descent* (the training regimen is illustrated in *Fig. 9.6*).
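A small sketch of computing the teacher-forcing loss over one training sequence (illustrative; the weight matrices match the shapes assumed in the sketch above, and a beginning-of-sequence index of 0 is assumed): at each step the gold next word, not the model's own prediction, is fed as the next input, and the loss is the average negative log probability assigned to the gold words.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def teacher_forcing_loss(word_ids, E, U, W, Vout, bos=0):
    """Average cross-entropy over a training sequence under teacher forcing."""
    h = np.zeros(U.shape[0])
    losses = []
    prev = bos                            # assumed beginning-of-sequence index
    for gold in word_ids:
        e = E[prev]                       # embedding of the previous (gold) word
        h = np.tanh(U @ h + W @ e)
        y_hat = softmax(Vout @ h)
        losses.append(-np.log(y_hat[gold]))   # -log y_hat_t[w_{t+1}]
        prev = gold                       # teacher forcing: feed the gold word next
    return float(np.mean(losses))
```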

- **Weight Tying** is a method that dispenses with this redundancy and uses a single set of embeddings at the input and softmax layers.
- The redundancy is the resemblance of the final layer matrix $V$ to the initial embedding matrix $E$: both have shape $|V| \times d_h$.
- Weight tying sets $E = V$:
- The dimensionality of the final hidden layer is set to $d_h$, and the same matrix is simply used for both layers.
- Weight tying provides improved perplexity results, and significantly reduces the number of parameters.
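A tiny self-contained sketch of the tying idea (illustrative; toy sizes, random values): a single matrix of shape $|V| \times d_h$ serves both as the input embedding table and as the output projection.

```python
import numpy as np

rng = np.random.default_rng(2)
V_size, d_h = 6, 4
E = rng.normal(size=(V_size, d_h))     # single shared matrix: |V| x d_h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

w_t = 3                                # current word index (toy value)
h_t = rng.normal(size=d_h)             # final hidden layer of size d_h

e_t = E[w_t]                           # input side: embedding lookup from E
y_t = softmax(E @ h_t)                 # output side: reuse E instead of a separate V
print(e_t.shape, y_t.shape)
```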
#### Generation with RNN-based Language Models
- We generate novel sequences from an RNN language model using a technique called **autoregressive generation**.
- That is, the word generated at each time step is conditioned on the word selected by the network on the previous time step (*Fig. 9.7*).
- This architecture has inspired applications such as machine translation, summarization, and question answering. The key is to prime the generation component with an appropriate context.
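A minimal sketch of the autoregressive generation loop (illustrative; it assumes a `step(prev_word, h)` function that returns the next-word distribution and updated hidden state, plus start and end token indices, none of which come from the text).

```python
import numpy as np

def generate(step, start_id, end_id, max_len=20, seed=0):
    """Sample a sequence one word at a time, conditioning on previous choices."""
    rng = np.random.default_rng(seed)
    h = None
    word = start_id
    output = []
    for _ in range(max_len):              # stop at a pre-defined length...
        y, h = step(word, h)              # distribution over the vocabulary
        word = rng.choice(len(y), p=y)    # sample conditioned on the prefix so far
        if word == end_id:                # ...or when the end-of-sequence token appears
            break
        output.append(word)
    return output
```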

### Other Applications of RNNs
- RNNs are an effective approach to language modeling, to sequence labeling tasks (e.g., POS tagging), and to sequence classification tasks (e.g., sentiment analysis, topic classification).
#### Sequence Labeling
- In **sequence labeling**, the network's task is to assign a label chosen from a small fixed set of labels to each element of a sequence.
- ++Examples++: POS tagging, named entity recognition
- In an RNN approach to sequence labeling, inputs are word embeddings and the outputs are tag probabilities generated by a softmax layer over the given tagset (*Fig. 9.8*).
- Since we are using a softmax layer to generate a distribution over tags, *cross-entropy loss* is used during training of the unrolled RNN.

#### RNNs for Sequence Classification
- Another use of RNNs is to classify entire sequences rather than the tokens within them.
- ++Examples++: sentiment analysis, document-level topic classification, spam detection, etc.
- Sequences of text are classified as belonging to one of a small number of categories.
- To apply RNNs in this setting, the text to be classified is passed through the RNN a word at a time generating a new hidden layer at each time step.
- $h_n$, the hidden layer for the final element of the text, is taken to constitute a compressed representation of the entire sequence.
- In this simplest classification approach, $h_n$ serves as the input to a subsequent feedforward network that chooses a class via a softmax over all possible classes (*Fig. 9.9*).

- Since there are no intermediate outputs for the preceding elements of the sequence, the loss function used to train the weights is based entirely on the final classification task.
- The error signal is backpropagated all the way through the weights of the feedforward classifier, to its input, and then through the RNN's three sets of weights.
- This combination of a simple recurrent network with a feedforward classifier is the first example of a *deep neural network* (a small sketch follows below).
- **End-to-end Training**: The training regimen that uses the loss from a downstream application to adjust the weights all the way through the network.
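A minimal end-to-end sketch (illustrative; toy dimensions, random weights): the final hidden state of the RNN feeds a feedforward classifier, and a single softmax over classes produces the prediction.

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_h, n_classes = 4, 3, 2

U = rng.normal(size=(d_h, d_h))          # recurrent weights
W = rng.normal(size=(d_h, d_in))         # input weights
Wc = rng.normal(size=(n_classes, d_h))   # feedforward classifier weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(xs):
    h = np.zeros(d_h)
    for x in xs:                         # pass the text through the RNN word by word
        h = np.tanh(U @ h + W @ x)
    return softmax(Wc @ h)               # h_n is the compressed sequence representation

xs = [rng.normal(size=d_in) for _ in range(6)]
print(classify(xs))
```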
### Stacked and Bidirectional RNNs
- RNNs are flexible.
- By combining the feedforward nature of unrolled computational graphs with vectors as common inputs and outputs, complex networks can be treated as modules that can be combined in creative ways.
#### Stacked RNNs
- **Stacked RNNs** consist of multiple networks where the output of one layer serves as the input to a subsequent layer (*Fig. 9.10*).

- Stacked RNNs can ++outperform++ single-layer networks.
- One reason is that the network is capable of *inducing representations at differing levels of abstraction across layers*.
- Initial layers of stacked networks can induce representations that are useful abstractions for further layers.
- These representations are difficult to induce in a single RNN.
- The optimal number of stacked RNNs is specific to each application and to each training set.
- As the number of stacked layers increases, training costs rise quickly.
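A short sketch of stacking (illustrative; layer weight shapes are chosen so that each layer's hidden size matches the next layer's input size): each layer's full sequence of hidden states becomes the input sequence of the next layer.

```python
import numpy as np

def run_layer(xs, U, W):
    """Return the full sequence of hidden states for one simple RNN layer."""
    h = np.zeros(U.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(U @ h + W @ x)
        hs.append(h)
    return hs

def stacked_rnn(xs, layers):
    """`layers` is a list of (U, W) pairs; outputs of one layer feed the next."""
    for U, W in layers:
        xs = run_layer(xs, U, W)
    return xs

rng = np.random.default_rng(5)
d_in, d1, d2 = 4, 3, 3
layers = [(rng.normal(size=(d1, d1)), rng.normal(size=(d1, d_in))),
          (rng.normal(size=(d2, d2)), rng.normal(size=(d2, d1)))]
xs = [rng.normal(size=d_in) for _ in range(5)]
print(len(stacked_rnn(xs, layers)))   # one hidden vector per time step
```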
#### Bidirectional RNNs
- In a simple recurrent network, the hidden state at a given time $t$ represents everything the network knows about the sequence up to that point in the sequence.
- The hidden state at time $t$ is the result of a function of the inputs from the start up through time $t$.
- This can be viewed as the context of the network *to the left* of the current time; let $h_t^f$ be the normal hidden state at time $t$:
\begin{gather}
h_t^f = RNN_{forward}(x_{1:t})
\end{gather}
- In many applications, we have access to the entire input sequence all at once.
- This raises the question of whether it is helpful to take advantage of the context *to the right* of the current input as well.
- One way to recover such information is to train an RNN on an input sequence in reverse.
- The hidden state at time $t$ now represents information about the sequence to the right of the current input; let $h_t^b$ represent everything we have discerned about the sequence from $t$ to the end:
\begin{gather}
h_t^b = RNN_{backward}(x_{t:n})
\end{gather}
- Combining the forward and backward networks results in a **bidirectional RNN**.
- A Bi-RNN consists of two independent RNNs, one where the input is processed from the start to the end, and the other from the end to the start.
- We then combine the outputs of the two networks into a single representation that captures both the left and right contexts of an input at each point in time.
\begin{gather}
h_t = h_t^f \oplus h_t^b
\end{gather}
- Fig. 9.11 illustrates a bidirectional network where the outputs of the forward and backward pass are concatenated.
- Other simple combination methods include element-wise addition/multiplication.
- The output at each step in time captures information to the left and to the right of the current input.
- In sequence labeling applications, these concatenated outputs can serve as the basis for a local labeling decision.
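A small sketch of the bidirectional combination (illustrative; toy NumPy, with `fwd` and `bwd` as hypothetical weight pairs for the two independent RNNs): run one RNN left to right, the other right to left over the reversed input, and concatenate the states aligned at each time step.

```python
import numpy as np

def bidirectional(xs, fwd, bwd):
    """fwd and bwd are (U, W) weight pairs for two independent simple RNNs."""
    def run(seq, U, W):
        h, hs = np.zeros(U.shape[0]), []
        for x in seq:
            h = np.tanh(U @ h + W @ x)
            hs.append(h)
        return hs

    hf = run(xs, *fwd)                    # left-to-right states h_t^f
    hb = run(xs[::-1], *bwd)[::-1]        # right-to-left states, realigned to time t
    return [np.concatenate([f, b]) for f, b in zip(hf, hb)]   # h_t = h_t^f concat h_t^b
```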

- Bi-RNNs have also proven to be quite effective for *sequence classification*.
- Recall from *Fig. 9.9*, the final hidden state is the input to a feedforward classifier.
- A difficulty is that the final state naturally reflects more information about the end of the sentence than about its beginning.
- Bi-RNNs solve this by simply combining the final hidden states from the forward and backward passes and using that combination as input for follow-on processing (*Fig. 9.12*).
- Concatenation is a common combination strategy, but element-wise summation/multiplication or averaging are also used.

## Managing Context in RNNs: LSTMs and GRUs
- In practice, it is quite difficult to train RNNs for tasks that require a network to make use of *information distant from the current point of processing*.
- Despite having access to the entire preceding sequence, the information encoded in hidden states tends to be fairly local, more relevant to the most recent parts of the input sequence and recent decisions.
- However, distant information is usually critical to many language applications.
- There are two reasons why RNNs struggle to carry forward critical information:
- The hidden layers and their associated weights are being asked to perform two tasks simultaneously:
- Providing information useful for the current decision, and
- Updating and carrying forward information required for future decisions.
- To train RNNs, the error signal needs to be *backpropagated through time*.
- During the backward part of training, the hidden layers are subjected to repeated multiplications, as determined by the length of the sequence.
- A frequent result of this process is that the gradients are eventually driven to zero - the **vanishing gradients** problem.
- To address these issues, more complex architectures have been designed to explicitly manage the task of *maintaining relevant context over time*.
- The network needs to learn to forget information that is no longer needed and to remember information required for decisions to come.
### Long Short-Term Memory
*External Reading:* [**Understanding LSTM Networks**](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- **Long short-term memory** networks divide the context management problem into two sub-problems:
- Removing information no longer needed from the context, and
- Adding information likely to be needed for later decision making.
- The key to solving both problems is to learn how to manage the context rather than hard-coding a strategy into the architecture.
- LSTMs accomplish this by first adding an *explicit context layer* to the architecture (in addition to the usual recurrent hidden layer), and
- Use *gates* to control the flow of information into and out of the units that comprise the network layers.
- These gates are implemented through additional sets of weights that operate sequentially on the current input, the previous hidden layer, and the previous context layer.
- The **gates** in an LSTM share a common design pattern;
- Each consists of (applied in sequence):
- A feedforward layer,
- A sigmoid activation function (to push the outputs toward either 0 or 1), and
- A pointwise multiplication with the layer being gated.
- The combination of these parts creates a structure similar to a binary mask.
- Values in the layer being gated that align with values near 1 in the mask are passed through nearly unchanged, and
- Values corresponding to lower values are essentially erased.
- **Forget gate** deletes information from the context that is no longer needed.
- It computes a weighted sum of the previous state's hidden layer and the current input and passes that through a sigmoid, then multiplies this mask by the context vector to remove unneeded information:
\begin{align}
f_t &= \sigma(U_f h_{t-1} + W_f x_t) \\
k_t &= c_{t-1} \odot f_t
\end{align}
- The next task is to compute the actual information we need to extract from the previous hidden state and the current inputs:
\begin{gather}
g_t = \tanh(U_g h_{t-1} + W_g x_t)
\end{gather}
- Next, we generate the mask for the **add gate** to select the information to add to the current context:
\begin{align}
i_t &= \sigma(U_i h_{t-1} + W_i x_t) \\
j_t &= g_t \odot i_t
\end{align}
- Next, we add this to the modified context vector to get our new context vector:
\begin{gather}
c_t = j_t + k_t
\end{gather}
- The final gate we'll use is the **output gate** which is used to decide what information is required for the current hidden state (as opposed to what information needs to be preserved for future decisions):
\begin{align}
o_t &= \sigma(U_o h_{t-1} + W_o x_t) \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
- *Fig. 9.13* illustrates the complete computation for a single LSTM unit.
- Given the appropriate weights for the various gates, an LSTM accepts as input:
- The *context vector* from the previous time step,
- The hidden layer from the previous time step, and
- The current input vector.
- It then generates *updated context* and hidden vectors as output.
- The hidden layer, $h_t$, can be used as input to subsequent layers in a stacked RNN, or to generate an output for the final layer of a network.
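A single-step NumPy sketch of the LSTM equations above (illustrative; the parameter ordering, toy dimensions, and random weights are assumptions for demonstration).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: returns the updated hidden and context vectors."""
    Uf, Wf, Ug, Wg, Ui, Wi, Uo, Wo = params

    f_t = sigmoid(Uf @ h_prev + Wf @ x_t)   # forget gate
    k_t = c_prev * f_t                      # erase stale context

    g_t = np.tanh(Ug @ h_prev + Wg @ x_t)   # candidate information
    i_t = sigmoid(Ui @ h_prev + Wi @ x_t)   # add (input) gate
    j_t = g_t * i_t                         # select what to add

    c_t = j_t + k_t                         # new context vector
    o_t = sigmoid(Uo @ h_prev + Wo @ x_t)   # output gate
    h_t = o_t * np.tanh(c_t)                # new hidden state
    return h_t, c_t

rng = np.random.default_rng(4)
d_in, d_h = 3, 2
params = [rng.normal(size=(d_h, d_h)) if i % 2 == 0 else rng.normal(size=(d_h, d_in))
          for i in range(8)]                # alternating U (d_h x d_h) and W (d_h x d_in)
print(lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), params))
```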

### Gated Recurrent Units
- LSTMs introduce a considerable number of additional parameters to our recurrent networks.
- We have 8 sets of weights to learn (i.e., the $U$ and $W$ for each of the 4 gates within each unit), whereas with simple recurrent units we only had 2.
- Training these additional parameters imposes a significantly higher training cost.
- **Gated Recurrent Units** (**GRUs**) ease this burden by dispensing with the use of a separate context vector, and by reducing the number of gates to 2 - a reset gate, $r$, and an update gate, $z$:
\begin{align}
r_t &= \sigma(U_r h_{t-1} + W_r x_t) \\
z_t &= \sigma(U_z h_{t-1} + W_z x_t)
\end{align}
- As with LSTMs, the use of the sigmoid in the design of gates results in a *binary-like mask*.
- Blocks information with values near zero, or allows information to pass through unchanged with values near one.
- The purpose of the **reset gate** is to decide which aspects of the previous hidden state are relevant to the current context and which can be ignored.
- This is done by performing an element-wise multiplication of $r$ with the value of the previous hidden state.
- We then use the masked value to compute an intermediate representation for the new hidden state at time $t$:
\begin{align}
\tilde{h_t} = \tanh(U(r_t \odot h_{t-1}) + W x_t)
\end{align}
- The job of the **update gate** $z$ is to determine which aspects of this new state will be used directly in the new hidden state and which aspects of the previous state need to be preserved for future use.
- This is done by using the values in $z$ to interpolate between the old hidden state and the new one:
\begin{align}
h_t = (1 - z_t) h_{t-1} + z_t \tilde{h_t}
\end{align}
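A matching single-step sketch for a GRU (illustrative; parameter ordering is an assumption), directly following the reset/update-gate equations above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU step: two gates, no separate context vector."""
    Ur, Wr, Uz, Wz, U, W = params

    r_t = sigmoid(Ur @ h_prev + Wr @ x_t)             # reset gate
    z_t = sigmoid(Uz @ h_prev + Wz @ x_t)             # update gate
    h_tilde = np.tanh(U @ (r_t * h_prev) + W @ x_t)   # intermediate new state
    return (1 - z_t) * h_prev + z_t * h_tilde         # interpolate old and new states
```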

### Gated Units, Layers and Networks
- The neural units used in LSTMs and GRUs are much more complex than those used in basic feedforward networks.
- This complexity is encapsulated within the basic processing units, allowing modularity and experimentation.
- *Fig. 9.14* illustrates the inputs and outputs associated with each kind of unit.

- This modularity is key to the power and widespread applicability of LSTM and GRU units.
- They can be stacked in the same configurations as simple recurrent units or combined in other ways.
- Multi-layered networks making use of gated units can be unrolled into deep feedforward networks and trained in the usual fashion with *backpropagation*.
## Self-Attention Networks: Transformers
Detailed summary note: [**Self-Attention Networks: Transformers**](https://hackmd.io/@bachnguyenTE/NLP_transformers)
## Potential Harms from Language Models
- Problems may occur whenever language models are used for text generation.
- Language models can generate toxic language.
- Many kinds of completely non-toxic prompts can nonetheless lead large language models to output hate speech and abuse.
- Large language models generate sentences displaying negative attitudes toward minority identities such as being Black or gay.
- Language models are biased in a number of ways by the distributions of their training data.
- There are also important privacy issues. Language models, like other machine learning models, can **leak** information about their training data.
- Mitigating all these harms is an important but unsolved research question in NLP.
## Summary
- In simple Recurrent Neural Networks, sequences are processed one element at a time.
- The output of a neural unit at a particular point in time is based both on the current input and on the value of the hidden layer from the previous time step.
- RNNs can be trained with a straightforward extension of the backpropagation algorithm, known as backpropagation through time (BPTT).
- Common language-based applications for RNNs include:
- Probabilistic language modeling, where the model assigns a probability to a sequence, or to the next element of a sequence given the preceding words.
- Auto-regressive generation using a trained language model.
- Sequence labeling, where each element of a sequence is assigned a label, as with part-of-speech tagging.
- Sequence classification, where an entire text is assigned to a category, as in spam detection, sentiment analysis, or topic classification.
- Simple recurrent networks often fail because they are extremely difficult to train successfully, due to problems maintaining useful gradients over time.
- More complex gated architectures such as LSTMs and GRUs are designed to overcome these issues by explicitly managing the task of deciding what to remember and forget in their hidden and context layers.