# Attention and Transformer Models
## Introduction
The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need," has revolutionized natural language processing (NLP) and various other domains. Its ability to process sequences in parallel and capture long-range dependencies has led to significant advancements in tasks like machine translation, text summarization, and question answering. Transformers have extended their reach beyond NLP, making valuable contributions in computer vision, speech recognition, and music generation.
### Key Advantages of Transformers
* **Parallel Processing:** Unlike RNNs, Transformers process the entire sequence simultaneously, resulting in faster training and inference.
* **Long-Range Dependencies:** The attention mechanism allows Transformers to effectively handle relationships between words that are far apart in the sequence, a challenge for RNNs.
* **Interpretability:** The attention weights provide insights into which parts of the input are most important for each output, enhancing model interpretability.
| Feature | RNN-based Models | Transformer |
|---------|----------------------------|----------------------------------|
| Input processing | Sequential (one word/item at a time) | Parallel (entire sequence at once) |
| Long-range dependencies | Challenging to capture accurately | Effectively handled through the attention mechanism |
| Parallelizability | Low | High |
|Input selectivity | Limited | High (via attention mechanism) |
| Computational efficiency | Less efficient, especially with long sequences | More efficient, particularly for long sequences, due to parallelization and attention |
### Overview
This tutorial covers the core components of Transformers:
* **Attention Mechanisms:** Concept, types, and functions of attention.
* **Self-Attention and Multi-Head Attention:** How these mechanisms enable focus on relevant input parts.
* **Positional Encoding:** How positional information is incorporated into the model.
* **Transformer Architecture:** Encoder and decoder components, and their internal layers.
* **Practical Example:** Building a Chinese-to-English translator using PyTorch's Transformer.
## Attention Mechanisms: The Building Blocks
Attention mechanisms are the heart and soul of Transformer architectures. They revolutionize how models process sequential data by enabling them to focus on the most relevant parts of the input, thus enhancing understanding and performance across diverse tasks.
### Core Components of Attention
Attention operates on three fundamental elements:
* **Query $Q$:** Represents a request or context for information. Think of it as the question you want to ask of your data.
* **Key $K$:** Represents a set of labels or identifiers associated with values in the data. Imagine these as the categories or topics under which your data is organized.
* **Value $V$:** Represents the actual information or content associated with each key. These are the answers or details corresponding to your query and the relevant keys.
The attention mechanism cleverly calculates a weighted sum of the values. The weights, known as attention scores, quantify how important each value is in the context of the given query.
### Attention Calculation
Formally, given a database of $(key, value)$ pairs, $D = \{(k_1,v_1),\dots(k_n,v_n) \}$, and a query vector $q$, the attention mechanism produces an output as follows:
$$
\text{Attention}(q,D)=\sum_i{\alpha'(q,k_i)}v_i
$$
Here, $\alpha'(q,k_i)$ signifies the attention weight assigned to the i-th key-value pair with respect to the query. The higher the attention weight, the more relevant that key-value pair is considered to be.
### Attention Scoring Functions: Gauging Relevance
Various scoring functions measure the similarity between a query and keys, leading to diverse attention weight distributions. Popular choices include:
* **Dot Product:** $\alpha(\mathbf{q},\mathbf{k}^i)=\mathbf{q}^{T}\mathbf{k}^i$. Simple and efficient, but susceptible to instability with high-dimensional inputs because the raw scores can grow large.
* **Scaled Dot Product:** $\alpha(\mathbf{q},\mathbf{k}^i)=\frac{\mathbf{q}^{T}\mathbf{k}^i}{\sqrt d}$. Addresses the scaling issue of the dot product by dividing by the square root of the key dimension $d$. This is the default choice in Transformers.
* **Additive:** $\alpha(\mathbf q, \mathbf k^i) = \mathbf w_v^\top \textrm{tanh}(\mathbf W_q\mathbf q + \mathbf W_k \mathbf k^i)$. Offers greater flexibility to learn complex similarity patterns, but is computationally more expensive.
| Function | Pros | Cons |
|---------|----------------------------|----------------------------------|
| Dot Product | Simple, efficient | Potential for large values in high dimensions |
| Scaled Dot Product | Balances computation and stability | Slightly more expensive than dot product |
| Additive | Captures complex patterns | Computationally demanding |
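To make these scoring functions concrete, here is a minimal PyTorch sketch (toy dimensions and random weights, purely illustrative) that computes all three scores for one query against a handful of keys and turns one set of scores into attention weights with a softmax:
```python=
import torch

d = 4                       # key/query dimension (toy value)
q = torch.randn(d)          # a single query vector
k = torch.randn(5, d)       # five key vectors

# Dot product: one raw score per key
dot = k @ q

# Scaled dot product: divide by sqrt(d) to keep scores in a stable range
scaled = (k @ q) / d ** 0.5

# Additive: learned projections followed by tanh (random weights here)
W_q, W_k = torch.randn(d, d), torch.randn(d, d)
w_v = torch.randn(d)
additive = torch.tanh(q @ W_q.T + k @ W_k.T) @ w_v

# A softmax turns any of these score vectors into attention weights
weights = torch.softmax(scaled, dim=-1)
print(weights.sum())  # tensor(1.)
```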
### Self-Attention: The Sequence Attends to Itself
Self-attention is a specialized type of attention that enables a sequence to focus on its own internal structure and relationships. In this mechanism, each element of a sequence becomes both the query and the key-value pair. This allows each element to assess its relevance to other elements within the same sequence, capturing dependencies regardless of their distance.
#### How Self-Attention Works
Let's break down how self-attention operates on an input sequence of length $l$: $\mathbf{a}^1, \mathbf{a}^2, \dots, \mathbf{a}^l$
1. **Query $Q$, Key $K$, and Value $V$ Generation:** For each element $\mathbf{a}^i$ in the input sequence, three vectors are generated through linear transformations using trainable weight matrices $W^q$, $W^k$, and $W^v$: \begin{split}\mathbf{q}^i &= W^q\mathbf{a}^i\\ \mathbf{k}^i &= W^k\mathbf{a}^i\\ \mathbf{v}^i &= W^v\mathbf{a}^i\end{split}
2. **Attention Score Calculation:** We compute the attention scores by applying a chosen scoring function (e.g., the scaled dot product) to each query-key pair $(\mathbf{q}^j, \mathbf{k}^i)$. These scores reflect how much each value $\mathbf{v}^i$ should contribute to the output for a given query $\mathbf{q}^j$.
3. **Weighted Sum for Output:** The values $\mathbf{v}^i$ are weighted by their corresponding normalized attention scores and then summed to produce the output element $\mathbf{b}^j$: \begin{split} \mathbf{b}^j& = \text{Attention}(\mathbf{q}^j,D)\\&=\sum_i\alpha_{j,i}'\mathbf{v}^i\\ &=\sum_i\alpha'(\mathbf{q}^j,\mathbf{k}^i)\mathbf{v}^i\\ &=\sum_i\frac{\exp\left(\alpha(\mathbf{q}^j,\mathbf{k}^i)\right)}{\sum_s \exp\left(\alpha(\mathbf{q}^j,\mathbf{k}^s)\right)}\mathbf{v}^i
\end{split} The term $\frac{\exp\left(\alpha(\mathbf{q}^j,\mathbf{k}^i)\right)}{\sum_s \exp\left(\alpha(\mathbf{q}^j,\mathbf{k}^s)\right)}$ is the normalized attention score (softmax), ensuring that all attention weights sum to 1.
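The three steps above fit in a few lines of PyTorch. The sketch below uses toy dimensions, random weights, and the scaled dot product; it is only meant to illustrate the data flow:
```python=
import torch

l, d_in, d_k, d_v = 6, 8, 4, 4           # toy sequence length and dimensions
A = torch.randn(l, d_in)                  # input sequence a^1 ... a^l (one row per element)

# Step 1: project the inputs into queries, keys, and values
W_q = torch.randn(d_k, d_in)
W_k = torch.randn(d_k, d_in)
W_v = torch.randn(d_v, d_in)
Q, K, V = A @ W_q.T, A @ W_k.T, A @ W_v.T  # row i is q^i, k^i, v^i

# Step 2: scaled dot-product scores between every query and every key
scores = Q @ K.T / d_k ** 0.5              # shape (l, l)

# Step 3: softmax over the keys, then a weighted sum of the values
alpha = torch.softmax(scores, dim=-1)      # normalized attention weights
B = alpha @ V                              # output sequence b^1 ... b^l
print(B.shape)                             # torch.Size([6, 4])
```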

#### Key Points
* **Representations Enhanced by Context:** The output of self-attention is a sequence of the same length as the input. However, each element in the output is now a weighted combination of all values in the input sequence, enriched with contextual information.
* **Long-Range Dependency Capture:** Self-attention excels at modeling relationships between elements regardless of their distance within the sequence. This is a significant advantage over traditional recurrent neural networks (RNNs), which struggle with long-range dependencies.
* **Why "Self"?** The term "self-attention" emphasizes that the queries, keys, and values all originate from the same input sequence. Each element attends to itself and all other elements.
### Multi-Head Attention: Enhancing Attention with Parallelism
While self-attention is powerful, it can benefit from considering multiple perspectives simultaneously. This is where multi-head attention comes in. It allows the model to attend to different parts of the input sequence in parallel, capturing a richer and more nuanced understanding of the relationships between elements.
#### Why Multi-Head Attention?
Consider the following example of an English-to-Chinese translation:
"The bank by the river is open for business."
→ "河邊的銀行正在營業"
If we were using single-head attention, when generating the word "銀行" (bank), the model might focus primarily on the word "bank" in the input. However, "bank" can have multiple meanings in Chinese, including "河岸" (riverbank). With a single focus, it would be difficult to capture all the relevant information – the fact that it's a financial institution ("bank"), that it's open ("is open"), and that it's for commercial purposes ("for business") – to accurately translate it as "銀行".
Multi-head attention addresses this by employing multiple attention mechanisms (called "heads") in parallel. Each head can focus on different parts or aspects of the input sequence, leading to a more comprehensive understanding:
* **Head 1:** Might focus on the word "bank" itself.
* **Head 2:** Might focus on the phrase "is open".
* **Head 3:** Might focus on the phrase "for business".
By aggregating the information from these different heads, the model can more effectively disambiguate the meaning of "bank" in this context and produce the correct translation.
#### Calculation Process
Assume a multi-head attention mechanism with $h$ heads:
1. **Project Q, K, V for Each Head:** The original query $\mathbf{q}$, key $\mathbf{k}$, and value $\mathbf{v}$ vectors are projected into separate query, key, and value spaces for each head using distinct weight matrices. \begin{split}\mathbf{q}^{i,j} &= W^{q,j}\mathbf{q}^i \quad \text{(for each head $j$)}\\ \mathbf{k}^{i,j} &= W^{k,j}\mathbf{k}^i \\ \mathbf{v}^{i,j} &= W^{v,j}\mathbf{v}^i \end{split}
2. **Perform Attention in Parallel:** Each head independently performs the attention mechanism using its projected queries, keys, and values. This results in $h$ separate attention outputs. $$\mathbf{b}^{i,j} = \text{Attention}(\mathbf{q}^{i,j},D^j) \quad \text{(for each head $j$)}$$ where $D^j$ is the database of key-value pairs for head $j$.
3. **Concatenate and Transform:** The outputs from each head are concatenated and then transformed using another weight matrix $W^o$ to produce the final output vector $b^i$: $$\mathbf{b}^i = \text{Concat}(\mathbf{b}^{i,1},\dots\mathbf{b}^{i,h})W^o$$
#### Code Implementation
```python=
import torch.nn as nn

# embed_dim must be divisible by num_heads; each head works on embed_dim // num_heads dimensions
multihead_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# query, key, value: tensors of shape (batch, seq_len, embed_dim) when batch_first=True
attn_output, attn_output_weights = multihead_attn(query, key, value)
```
**Important Note:** The total number of parameters in a multi-head attention layer remains approximately the same as a single-head attention layer with the same overall dimensionality. This is because the dimensionality of each head is reduced proportionally to the number of heads.
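A quick way to convince yourself of this is to compare parameter counts directly with PyTorch's built-in layer:
```python=
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

embed_dim = 512
single_head = nn.MultiheadAttention(embed_dim, num_heads=1)
eight_heads = nn.MultiheadAttention(embed_dim, num_heads=8)

# Both layers hold the same Q/K/V and output projections; only the internal
# split into heads differs, so the parameter counts are identical.
print(n_params(single_head), n_params(eight_heads))
```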
### Positional Encoding: Infusing Order into Sequences
Attention mechanisms are inherently permutation-invariant – they don't care about the order of elements in a sequence. However, in many natural language processing tasks, the order of words is crucial to understanding meaning. Positional encoding is the solution to this challenge. It injects information about the position of each element into the input representation, allowing the Transformer to consider word order when computing attention.
#### Formalizing Positional Encoding
Given an input sequence $\mathbf{X} \in \mathbb{R}^{n \times d}$ consisting of n tokens, each with dimension $d$, we create a positional embedding matrix $\mathbf{P} \in \mathbb{R}^{n \times d}$ of the same shape. The elements of $\mathbf{P}$ encode the positional information for each token.
The final input representation to the Transformer is then obtained by adding the token embeddings $\mathbf{X}$ and the positional embeddings $\mathbf{P}$:
$$
\mathbf{X}' = \mathbf{X} + \mathbf{P}
$$
#### Types of Positional Encoding
* **Absolute Positional Encoding:** Each position in the sequence is assigned a unique, fixed embedding vector that is independent of the other tokens. This provides the model with explicit information about the absolute position of each token.
* **Relative Positional Encoding:** Instead of absolute positions, this approach encodes the relative distances between tokens. This can be beneficial for tasks where the relationships between tokens are more important than their exact positions.
#### Sinusoidal Positional Encoding: A Popular Choice
A widely used absolute positional encoding technique is based on sine and cosine functions. The elements of the positional embedding matrix $\mathbf{P}$ are defined as follows:
\begin{split}
\mathbf{P}_{i, 2j} &= \sin\left(\frac{i}{10000^{2j/d}}\right) \\
\mathbf{P}_{i, 2j+1} &= \cos\left(\frac{i}{10000^{2j/d}}\right)
\end{split}
where $i$ is the position of the token, $j$ indexes the dimension pairs (columns $2j$ and $2j+1$, so $j = 0, 1, \dots, \lfloor d/2 \rfloor - 1$), and $d$ is the embedding dimension.

**Key Property of Sinusoidal Encoding:** This encoding has the remarkable property of incorporating both absolute and relative positional information. The sine and cosine functions create a pattern that allows the model to easily learn to attend to relative positions through linear transformations. Specifically, for a fixed offset $\delta$, the pair $(p_{i+\delta,2j},p_{i+\delta,2j+1})$ can be obtained by rotating $(p_{i,2j},p_{i,2j+1})$ by an angle of $\delta \omega_j$, where $\omega_j = 1/10000^{2j/d}$ is the frequency used for dimension pair $j$. This makes it easier for the model to generalize to sequences longer than those seen during training.
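The formula is straightforward to compute directly. Below is a compact sketch that builds $\mathbf{P}$ for a toy size; the translation example later in this tutorial wraps the same idea in an `nn.Module` (`PositionalEncoding`).
```python=
import torch

def sinusoidal_positional_encoding(n, d):
    """Build P (n positions x d dimensions) from the sine/cosine formula."""
    position = torch.arange(n, dtype=torch.float).unsqueeze(1)       # i
    omega = 1.0 / 10000 ** (torch.arange(0, d, 2).float() / d)       # 1 / 10000^{2j/d}
    P = torch.zeros(n, d)
    P[:, 0::2] = torch.sin(position * omega)   # even dimensions: sine
    P[:, 1::2] = torch.cos(position * omega)   # odd dimensions:  cosine
    return P

P = sinusoidal_positional_encoding(n=50, d=32)
print(P.shape)  # torch.Size([50, 32])
```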
## Sequence-to-Sequence (Seq2Seq) Models: The Foundation for Transformers
Sequence-to-Sequence (Seq2Seq) models, typically built with Recurrent Neural Networks (RNNs), are a key concept for understanding Transformers. They excel at transforming one sequence into another, making them valuable for tasks like machine translation, summarization, and dialogue systems.
### Core Components
* **Encoder:** This component sequentially processes the input sequence, compressing its information into a fixed-size "context vector." Think of it as distilling the essence of the input into a concentrated form.
* **Decoder:** Using the context vector as a starting point, the decoder generates the output sequence step-by-step. It predicts the next element based on both the context and its previous predictions.

### Encoder in Action
The encoder's job is to create a meaningful summary of the input. It achieves this by processing each element in order and updating its internal state. The final state, representing the condensed input information, is then handed off to the decoder.

### Decoder in Action
The decoder is responsible for producing the output sequence. It starts with a special `<SOS>` (Start of Sequence) token and uses the context vector as its initial guide. At each step:
1. It takes its previous prediction (or `<SOS>` at the start) as input.
2. It updates its internal state based on this input and its prior state.
3. It predicts the next element of the output sequence.
This process repeats until it generates an `<EOS>` (End of Sequence) token, signifying the output is complete.
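To make the encode-then-decode loop concrete, here is a minimal GRU-based sketch. The sizes, special-token indices, and greedy decoding loop are hypothetical and untrained; the point is only to illustrate the data flow just described.
```python=
import torch
import torch.nn as nn

vocab, emb, hid = 1000, 64, 128          # hypothetical sizes
SOS_IDX, EOS_IDX = 2, 3                  # hypothetical special-token indices

embed = nn.Embedding(vocab, emb)
encoder = nn.GRU(emb, hid, batch_first=True)
decoder = nn.GRU(emb, hid, batch_first=True)
out_proj = nn.Linear(hid, vocab)

src = torch.randint(4, vocab, (1, 7))            # one source sentence of 7 tokens
_, context = encoder(embed(src))                 # final state = fixed-size context vector

token = torch.tensor([[SOS_IDX]])                # decoding starts from <SOS>
state = context
for _ in range(20):                              # cap the output length
    output, state = decoder(embed(token), state) # update state from the previous prediction
    token = out_proj(output).argmax(dim=-1)      # greedily pick the next token
    if token.item() == EOS_IDX:                  # stop once <EOS> is produced
        break
```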

### The Bottleneck Challenge
Seq2Seq models handle shorter sequences well but struggle with longer ones due to the "information bottleneck." The encoder squeezes the entire input into a fixed-size context vector, potentially losing crucial details when the input is lengthy. This can impact the model's ability to generate accurate outputs, especially for tasks requiring a deep understanding of the entire input.
**Key Point:** While Seq2Seq models pioneered sequence-to-sequence tasks, their limitations paved the way for the Transformer, which overcomes the bottleneck through attention mechanisms and parallel processing, enabling it to handle longer sequences and more complex relationships effectively.
## The Transformer Architecture: Transcending Seq2Seq Limitations
The Transformer, a groundbreaking architecture introduced in the paper "Attention Is All You Need," builds upon the foundation of sequence-to-sequence (Seq2Seq) models but overcomes their key limitation: the information bottleneck. Transformers achieve this by replacing recurrent neural networks (RNNs) with attention mechanisms and parallel processing. This enables them to efficiently capture long-range dependencies and scale to handle larger datasets.
| Feature | Traditional RNN-based Seq2Seq model | Transformer |
|---------|----------------------------|----------------------------------|
| Input processing | Encoder$\rightarrow$Decoder sequentially | Parallel (entire sequence at once) |
| Long-range dependencies | Compressed into a context vector | Directly access via attention |
| Parallelizability | Low | High |
|Input selectivity | Learns to select, but with loss | High (via attention) |
| Computational efficiency | Lower, especially for long sequences | Higher, especially for long sequences |
### Encoder: Unveiling Meaningful Representations
The encoder's primary role is to transform the input sequence into a rich set of representations that capture the semantic meaning and relationships within the text. Here's a breakdown of the encoding process:
1. **Embedding and Positional Encoding:**
* **Embedding:** The input sequence (e.g., words or subword units) is first converted into dense, continuous vector representations known as embeddings.
* **Positional Encoding:** Since the Transformer processes the entire sequence simultaneously, positional information is infused into the embeddings. This is crucial for the model to understand the order of words, which is often essential in language tasks.
2. **Multi-Head Attention:**
* The embedded sequence is then processed by multiple self-attention heads working in parallel. Each head focuses on different aspects of the input, allowing the model to capture a diverse range of relationships between words.
3. **Add & Norm:**
* A residual connection is used to add the original input to the output of the attention layer. This helps with gradient flow during training, preventing vanishing gradients.
* Subsequently, layer normalization is applied to stabilize the learning process and ensure that values remain within a reasonable range.
4. **Position-Wise Feed-Forward Network:**
* A feed-forward network is applied independently to each position in the sequence. This network consists of two linear transformations with a ReLU activation function in between. Its purpose is to further refine the representations generated by the attention layer.

#### Code Implementation
##### PositionWiseFFN
```python=
class PositionWiseFFN(nn.Module):
"""The position-wise feed-forward network."""
    def __init__(self, d_model, d_ff):
super().__init__()
self.linear1 = nn.Linear(d_model, d_ff)
self.relu = nn.ReLU()
self.linear2 = nn.Linear(d_ff, d_model)
def forward(self, x):
return self.linear2(self.relu(self.linear1(x)))
```
##### AddNorm
```python=
class AddNorm(nn.Module):
"""Residual connection followed by layer normalization."""
def __init__(self, d_model, dropout):
super().__init__()
self.dropout = nn.Dropout(dropout)
self.ln = nn.LayerNorm(d_model)
    def forward(self, x, sublayer):
        # Apply dropout to the sublayer output, add the residual input, then normalize
        return self.ln(self.dropout(sublayer) + x)
```
##### Encoder
```python=
class Encoder(nn.Module):
def __init__(self, n_src_vocab, n_layers=6, n_heads=8, d_model=512, d_ff=2048,
dropout=0.1, n_position=200, pad_idx=0):
super(Encoder, self).__init__()
self.src_emb = nn.Embedding(n_src_vocab, d_model, padding_idx=pad_idx)
self.pos_emb = PositionalEncoding(d_model, n_position=n_position)
self.dropout = nn.Dropout(p=dropout)
self.layer_stack = nn.ModuleList([
EncoderLayer(d_model, d_ff, n_heads, dropout=dropout)
for _ in range(n_layers)])
self.layer_norm = nn.LayerNorm(d_model)
def forward(self, src_seq, src_mask):
# -- Forward
enc_output = self.dropout(self.pos_emb(self.src_emb(src_seq)))
enc_output = self.layer_norm(enc_output)
for enc_layer in self.layer_stack:
enc_output = enc_layer(enc_output, mask=src_mask)
return enc_output
```
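The `Encoder` above assumes an `EncoderLayer` class that is not shown. Below is one plausible sketch built from the `PositionWiseFFN` and `AddNorm` blocks defined earlier; treating `mask` as a key-padding mask is a simplifying assumption and should be adapted to however `src_mask` is actually constructed.
```python=
class EncoderLayer(nn.Module):
    """One encoder block: self-attention followed by a position-wise FFN,
    each wrapped in a residual connection and layer normalization."""
    def __init__(self, d_model, d_ff, n_heads, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.addnorm1 = AddNorm(d_model, dropout)
        self.ffn = PositionWiseFFN(d_model, d_ff)
        self.addnorm2 = AddNorm(d_model, dropout)
    def forward(self, x, mask=None):
        # Self-attention: queries, keys, and values all come from x;
        # mask is assumed to mark padded source positions (key-padding mask)
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=mask)
        x = self.addnorm1(x, attn_out)
        return self.addnorm2(x, self.ffn(x))
```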
### Decoder: Generating the Target Sequence
The decoder's task is to generate the desired output sequence, one token at a time. Its operation is similar to the encoder, but with a few crucial distinctions:
1. **Masked Multi-Head Attention:**
* The decoder employs a masked version of multi-head attention. This mask prevents the model from attending to future positions in the output sequence during training. This ensures that predictions are made solely based on the context of past tokens, maintaining the autoregressive property of the decoder.
2. **Encoder-Decoder Attention (Cross-Attention):**
* The decoder incorporates a cross-attention layer. Here, the queries originate from the decoder's previous outputs, while the keys and values are derived from the encoder's final outputs. This mechanism allows the decoder to focus on the most relevant parts of the input sequence while generating each output token.
3. **Position-Wise Feed-Forward Network:**
* Akin to the encoder, the decoder also features a position-wise feed-forward network to further refine the representations.

#### Code Implementation
```python=
class MaskedMultiHeadAttention(nn.Module):
def __init__(self, d_model, num_heads):
super().__init__()
self.multihead_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)
self.register_buffer("attn_mask", torch.triu(torch.ones(10000, 10000), diagonal=1).bool()) # pre-create a large mask
def forward(self, query, key, value):
batch_size, seq_len, _ = query.size()
attn_mask = self.attn_mask[:seq_len, :seq_len].to(query.device)
output, attn_output_weights = self.multihead_attn(query, key, value, attn_mask=attn_mask)
return output, attn_output_weights
```
```python=
class Decoder(nn.Module):
def __init__(self, n_tgt_vocab, n_layers=6, n_heads=8, d_model=512, d_ff=2048,
dropout=0.1, n_position=200, pad_idx=0):
super(Decoder, self).__init__()
self.tgt_emb = nn.Embedding(n_tgt_vocab, d_model, padding_idx=pad_idx)
self.pos_emb = PositionalEncoding(d_model, n_position=n_position)
self.dropout = nn.Dropout(p=dropout)
self.layer_stack = nn.ModuleList([
DecoderLayer(d_model, d_ff, n_heads, dropout=dropout)
for _ in range(n_layers)])
self.layer_norm = nn.LayerNorm(d_model)
def forward(self, tgt_seq, tgt_mask, enc_output, src_mask):
# -- Forward
dec_output = self.dropout(self.pos_emb(self.tgt_emb(tgt_seq)))
dec_output = self.layer_norm(dec_output)
for dec_layer in self.layer_stack:
dec_output = dec_layer(dec_output, enc_output, tgt_mask, src_mask)
return dec_output
```
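Like `EncoderLayer`, the `DecoderLayer` used above is not shown. A matching sketch with masked self-attention, cross-attention, and a feed-forward block follows; the mask handling (causal `tgt_mask` as an attention mask, `src_mask` as a key-padding mask) is an assumption about how the masks are built.
```python=
class DecoderLayer(nn.Module):
    """One decoder block: masked self-attention, cross-attention over the
    encoder output, then a position-wise FFN, each with Add & Norm."""
    def __init__(self, d_model, d_ff, n_heads, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                dropout=dropout, batch_first=True)
        self.addnorm1 = AddNorm(d_model, dropout)
        self.addnorm2 = AddNorm(d_model, dropout)
        self.ffn = PositionWiseFFN(d_model, d_ff)
        self.addnorm3 = AddNorm(d_model, dropout)
    def forward(self, x, enc_output, tgt_mask=None, src_mask=None):
        # Masked self-attention over the (shifted) target sequence
        self_out, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.addnorm1(x, self_out)
        # Cross-attention: queries from the decoder, keys/values from the encoder
        cross_out, _ = self.cross_attn(x, enc_output, enc_output,
                                       key_padding_mask=src_mask)
        x = self.addnorm2(x, cross_out)
        return self.addnorm3(x, self.ffn(x))
```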
### The Complete Transformer Architecture
The full Transformer architecture is a stack of multiple encoder and decoder layers. Each layer processes the input sequence in parallel and refines the representations to capture complex dependencies and relationships. The final output of the decoder is passed through a linear layer and a softmax function to produce a probability distribution over the vocabulary, allowing the model to predict the next token in the sequence.

## Example: Building a Chinese-English Translation Model with PyTorch's Transformer
This hands-on example demonstrates how to construct a Chinese-to-English translation model using PyTorch's Transformer implementation. The code is adapted and improved from Sovit Ranjan Rath's [Language Translation using PyTorch Transformer](https://debuggercafe.com/language-translation-using-pytorch-transformer/).
### Dataset Selection
We'll use the [Tatoeba](https://tatoeba.org/zh-cn/downloads) dataset, which offers a good balance of size and quality for this task.
### Preprocessing
We'll use `torchtext` and `spacy` for tokenization and vocabulary management.
### Installation
```python=
!pip install -U portalocker
!python -m spacy download en_core_web_sm
!python -m spacy download zh_core_web_sm
```
### Imports
```python=
# Tokenizer and vocabulary tools
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from typing import Iterable, List
# Padding and data handling
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset
# Core model and utilities
from torch.nn import Transformer
from torch import Tensor
from timeit import default_timer as timer
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm
# Other essentials
import torch.nn as nn
import torch
import torch.nn.functional as F
import numpy as np
import math
import os
import pandas as pd
import matplotlib.pyplot as plt
import pickle
import csv
import time
```
### Data Preparation and Loading
#### Loading and Exploring the Dataset
The Tatoeba dataset provides pairs of translated sentences in Chinese and English, stored in a tab-separated values (TSV) format.
##### Example Tatoeba Data (Raw)
```
1277 I have to go to sleep. 2 我该去睡觉了。
1280 Today is June 18th and it is Muiriel's birthday! 5 今天是6月18号,也是Muiriel的生日!
```
We'll use Pandas to efficiently load and process this data.
1. **Load TSV Data with Pandas:** Directly read the TSV file, specifying column names for clarity.
2. **Preview and Filter:** Display the first few rows of relevant data (English and Chinese translations). We can also filter out any unnecessary columns.
3. **Save Processed Data:** Convert the filtered DataFrame to CSV for easier manipulation in the later stages.
```python=
# Load TSV Data
data_path = 'data/Tatoeba.tsv'
df = pd.read_csv(data_path, sep='\t', header=None, names=['index', 'English', 'unknown', 'Chinese'])
# Preview data
print("Example Tatoeba Data (Raw):")
print(df[['English', 'Chinese']].head().to_markdown(index=False, numalign="left", stralign="left"))
# Save processed data
df[['English', 'Chinese']].to_csv('input/Tatoeba_processed.csv', index=False)
```
```
Example Tatoeba Data (Raw):
| English | Chinese |
|:-------------------------------------------------|:--------------------------------------|
| I have to go to sleep. | 我该去睡觉了。 |
| Today is June 18th and it is Muiriel's birthday! | 今天是6月18号,也是Muiriel的生日! |
| Muiriel is 20 now. | Muiriel现在20岁了。 |
| The password is "Muiriel". | 密码是"Muiriel"。 |
| I will be back soon. | 我很快就會回來。 |
```
#### Splitting into Training and Validation Sets
We'll use `train_test_split` from the `sklearn.model_selection` module to create training and validation sets.
```python=
# Define test size and random seed
TEST_SIZE = 0.1
RANDOM_SEED = 42
# Split the data
train_csv, valid_csv = train_test_split(df, test_size=TEST_SIZE, random_state=RANDOM_SEED)
```
#### Tokenizer Setup
```python=
SRC_LANGUAGE = 'zh' # Source language (Chinese)
TGT_LANGUAGE = 'en' # Target language (English)
# Placeholders for tokenizer and vocabulary objects
token_transform = {}
vocab_transform = {}
# Initialize tokenizers for both languages
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='zh_core_web_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')
```
#### Creating Custom Dataset Classes
We define custom PyTorch `Dataset` classes for convenient data loading and batching during training and validation:
```python=
class TranslationDataset(Dataset):
def __init__(self, csv):
self.csv = csv
def __len__(self):
return len(self.csv)
def __getitem__(self, idx):
return (
self.csv['Chinese'].iloc[idx], # Source sentence (Chinese)
self.csv['English'].iloc[idx] # Target sentence (English)
)
train_dataset = TranslationDataset(train_csv)
valid_dataset = TranslationDataset(valid_csv)
```
#### Building Vocabularies
The `build_vocab_from_iterator` function from `torchtext` is used to create vocabulary objects for both the source and target languages. These vocabularies map each unique word or token to a numerical index.
```python=
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']
def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}
for data_sample in tqdm(data_iter,total=len(data_iter),leave=False):
yield token_transform[language](data_sample[language_index[language]])
vocab_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
vocab_transform[ln] = build_vocab_from_iterator(
yield_tokens(train_dataset, ln),
min_freq=2,
specials=special_symbols,
special_first=True
)
vocab_transform[ln].set_default_index(UNK_IDX)
print(f'Chinese Vocabulary size: {len(vocab_transform[SRC_LANGUAGE])}')
print(f'English Vocabulary size: {len(vocab_transform[TGT_LANGUAGE])}')
```
```
Chinese Vocabulary size: 14116
English Vocabulary size: 9027
```
### Essential Data Processing Functions
To bridge the gap between raw text data and the Transformer model's numerical input, we define a set of essential functions. These functions create a streamlined pipeline that converts sentences into a format suitable for model training and inference.
#### Text Transformation Pipeline
The `text_transform` function is a crucial component. It encapsulates three key steps to convert raw text into tensor indices:
1. **Tokenization (`token_transform`):** Breaks down sentences into individual words or subwords (tokens). We utilize spaCy's tokenizers for this, tailored for both Chinese and English.
2. **Numericalization (`vocab_transform`):** Maps each token to its corresponding numerical index in the vocabulary we built earlier. This step is essential for the model to understand the input data.
3. **Tensor Conversion (`tensor_transform`):** Transforms the list of token indices into a PyTorch tensor. Additionally, it adds special start-of-sequence (`<bos>`) and end-of-sequence (`<eos>`) tokens to mark the boundaries of each sentence. These are important for training the Transformer.
```python=
from functools import reduce

def sequential_transforms(*transforms):
    """Compose the given transforms so they are applied left to right."""
    return lambda x: reduce(lambda acc, f: f(acc), transforms, x)
def tensor_transform(token_ids: List[int]):
return torch.cat((torch.tensor([BOS_IDX]),
torch.tensor(token_ids),
torch.tensor([EOS_IDX])))
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
text_transform[ln] = sequential_transforms(token_transform[ln],
vocab_transform[ln],
tensor_transform)
```
#### Testing Tokenization and Numericalization
Let's quickly verify that our text transformation pipeline works as intended:
```python=
def translate_tensor(src_tensor: torch.Tensor, ln):
tokens = src_tensor[1:-1].numpy() # Remove <bos> and <eos>
return " ".join(vocab_transform[ln].lookup_tokens(list(tokens)))
def test_tokenize(sentence: str, ln):
token_ids = text_transform[ln](sentence).view(-1)
print(f'Original: {sentence}')
print(f'Tokenized: {token_ids}')
print(f'Translated back: {translate_tensor(token_ids, ln)}\n')
test_tokenize("It's never too late to study machine learning.", TGT_LANGUAGE)
test_tokenize("活到老,学到老。", SRC_LANGUAGE)
```
```
Original: It's never too late to study machine learning.
Tokenized: tensor([ 2, 38, 16, 136, 108, 223, 7, 263, 906, 657, 3])
Translated back: It 's never too late to study machine learning
Original: 活到老,学到老。
Tokenized: tensor([ 2, 587, 50, 365, 11, 2473, 365, 4, 3])
Translated back: 活 到 老 , 学到 老 。
```
#### Batch Preparation (Data Collation)
To feed data to the Transformer model efficiently, we need to group sentences into batches. Since sentences vary in length, we use padding to ensure all examples within a batch have the same size.
We define a custom function called `collate_fn` to handle this batch preparation process:
```python=
# Function to collate data samples into batch tensors
def collate_fn(batch):
src_batch, tgt_batch = [], []
for src_sample, tgt_sample in batch:
# Remove trailing newline characters (common in text files)
src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))
# Pad sequences for consistent batch length (and ensure batch_first=True)
src_batch = pad_sequence(src_batch, padding_value=PAD_IDX, batch_first=True)
tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX, batch_first=True)
return src_batch, tgt_batch
```
##### Explanation of collate_fn
1. **Create Empty Lists:** Initializes empty lists (`src_batch`, `tgt_batch`) to store the tokenized and numericalized source and target sentences.
2. **Iterate and Transform:** Loops through each sentence pair in the batch. Applies the `text_transform` to both the source and target sentences, converting them into numerical representations (tensors). It also removes any trailing newline characters (common in text files) before applying the transformations.
3. **Pad to Maximum Length:** Determines the maximum length of the sentences in the batch and uses the `pad_sequence` function to pad shorter sentences with the `<pad>` token (`PAD_IDX`). This creates tensors of uniform length, which is necessary for batch processing.
4. **Return Batch Tensors:** Returns a tuple containing the padded source batch and target batch tensors, ready to be fed into the Transformer model.
#### DataLoader
We create `DataLoader` instances to efficiently manage batching and data loading during training and validation:
```python=
# BATCH_SIZE is set in the Hyperparameters section below
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn,
                              shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size=BATCH_SIZE,
                              collate_fn=collate_fn)
```
### Hyperparameters
Hyperparameters significantly influence the model's performance, training speed, and memory usage. Finding the right balance is crucial.
#### Key Hyperparameters
* **Model Size and Architecture:**
* **`EMB_SIZE` (Embedding Size):** Determines the dimensionality of the word embeddings. Larger embeddings can capture more nuanced information but require more memory.
* **`NHEAD` (Number of Attention Heads):** Controls the number of attention mechanisms used in the Transformer. More heads can potentially learn diverse patterns, but it increases the computational cost.
* **`FFN_HID_DIM` (Feedforward Network Hidden Dimension):** The size of the hidden layers in the feedforward networks within each Transformer layer. Larger dimensions offer more capacity but can lead to longer training times.
* **`NUM_ENCODER_LAYERS` & `NUM_DECODER_LAYERS`:** The number of encoder and decoder layers in the Transformer. Deeper models are generally more powerful, but they also demand more resources.
* **Training Parameters:**
* **`BATCH_SIZE`:** The number of sentence pairs processed in each training step. Larger batch sizes can leverage parallel processing but may exceed GPU memory.
* **`NUM_EPOCHS`:** The number of times the entire dataset is passed through the model during training. More epochs may improve accuracy, but excessive epochs can lead to overfitting.
* **Optimization:**
* **Learning Rate** (not listed above, but crucial): Controls the step size when updating model parameters during optimization. It significantly impacts training speed and convergence.
#### Parameter Selection Strategy
1. **GPU Constraints:** Consider your GPU's VRAM. In this example, with a GTX 1650 Ti (4GB VRAM), models with fewer than 10 million parameters are advisable to avoid out-of-memory errors. The provided table can help you choose suitable values for `EMB_SIZE`, `NHEAD`, and the number of layers.
2. **Start Small:** Begin with a smaller model (fewer layers, smaller embedding size) and gradually increase complexity while monitoring performance and resource usage.
3. **Balance Batch Size:** Aim for the largest `BATCH_SIZE` that your GPU can handle without running out of memory. This can speed up training.
4. **Tune the Learning Rate:** Experiment with different learning rates. A common approach is to start with a relatively large learning rate (e.g., 0.001) and gradually decrease it during training.
#### Hyperparameters in this Example
```python=
SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
EMB_SIZE = 192 # Embedding size (divisible by NHEAD)
NHEAD = 6 # Number of attention heads
FFN_HID_DIM = 192 # Feedforward network hidden dimension
BATCH_SIZE = 192
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3
DEVICE = 'cuda'
NUM_EPOCHS = 120
```
### Model Implementation
#### Masking for Transformer Attention
Transformers leverage attention mechanisms to weigh the importance of different words in a sentence. However, not all words should be attended to equally in every context. We use masks to control the attention patterns:
1. **Source Mask (`src_mask`):**
* Shape: `(S, S)` (S: source sequence length)
* Purpose: Informs the encoder which positions in the source sequence are valid for attention. Since all positions are relevant, this mask is typically filled with `False`.
2. **Target Mask (`tgt_mask`):**
* Shape:` (T, T)` (T: target sequence length)
* Purpose: Prevents the decoder from attending to future positions when generating output. This ensures the model generates translations autoregressively (word by word, from left to right). The mask is a square matrix whose entries above the diagonal are $-\infty$ and whose entries on and below the diagonal are 0.
3. **Source Padding Mask (`src_padding_mask`):**
* Shape: `(N, S)` (N: batch size)
* Purpose: Identifies padding tokens (`PAD_IDX`) in the source sequence so the model can ignore them during attention calculations.
4. **Target Padding Mask (`tgt_padding_mask`):**
* Shape: `(N, T)`
* Purpose: Similar to the source padding mask, but for the target sequence.
```python=
def generate_square_subsequent_mask(sz):
"""Generates an upper-triangular mask for the target sequence."""
mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
return mask
def create_mask(src, tgt):
"""Creates all four masks needed for the Transformer."""
src_seq_len = src.shape[1]
tgt_seq_len = tgt.shape[1]
tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)
src_padding_mask = (src == PAD_IDX)
tgt_padding_mask = (tgt == PAD_IDX)
return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
```
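As a quick sanity check (assuming `DEVICE` has already been set, as in the Hyperparameters section), the causal mask for a length-4 target allows each position to attend only to itself and earlier positions:
```python=
# 0 on and below the diagonal, -inf above it
print(generate_square_subsequent_mask(4))
# [[0., -inf, -inf, -inf],
#  [0.,   0., -inf, -inf],
#  [0.,   0.,   0., -inf],
#  [0.,   0.,   0.,   0.]]
```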
#### Positional Encoding
The Transformer architecture doesn't inherently have a way to understand the order of words in a sentence. Positional encoding is a technique to inject information about word positions into the model. It adds a unique vector to each word embedding based on its position in the sentence.
```python=
class PositionalEncoding(nn.Module):
def __init__(self, d_model, dropout, max_len=5000):
"""
        :param max_len: Maximum sequence length supported by the precomputed encodings.
:param d_model: Embedding dimension.
:param dropout: Dropout value (default=0.1)
"""
super(PositionalEncoding, self).__init__()
self.dropout = nn.Dropout(p=dropout)
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0)
self.register_buffer('pe', pe)
def forward(self, x):
"""
Inputs of forward function
:param x: the sequence fed to the positional encoder model (required).
Shape:
x: [sequence length, batch size, embed dim]
output: [sequence length, batch size, embed dim]
"""
x = x + self.pe[:, :x.size(1)]
return self.dropout(x)
```
#### Token Embedding
Token embedding is the process of converting each word in a sentence into a dense vector representation. These embeddings are usually learned during training and capture semantic information about the words.
```python=
class TokenEmbedding(nn.Module):
def __init__(self, vocab_size: int, emb_size):
super(TokenEmbedding, self).__init__()
self.embedding = nn.Embedding(vocab_size, emb_size)
self.emb_size = emb_size
def forward(self, tokens: Tensor):
return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
```
#### Sequence-to-Sequence Transformer Model (`Seq2SeqTransformer`)
This is the core architecture for our translation task. It consists of:
* **Encoder:** Processes the source (Chinese) sentence, producing a contextualized representation of each word.
* **Decoder:** Generates the target (English) sentence one word at a time, conditioned on the encoder's output and the previously generated words.
* **Token Embeddings and Positional Encoding:** These layers are applied to both the source and target sequences.
* **Generator:** The final linear layer that projects the decoder's output into the target vocabulary space for word prediction.
```python=
class Seq2SeqTransformer(nn.Module):
def __init__(
self,
num_encoder_layers: int,
num_decoder_layers: int,
emb_size: int,
nhead: int,
src_vocab_size: int,
tgt_vocab_size: int,
dim_feedforward: int = 512,
dropout: float = 0.1
):
super(Seq2SeqTransformer, self).__init__()
self.transformer = Transformer(
d_model=emb_size,
nhead=nhead,
num_encoder_layers=num_encoder_layers,
num_decoder_layers=num_decoder_layers,
dim_feedforward=dim_feedforward,
dropout=dropout,
batch_first=True
)
self.generator = nn.Linear(emb_size, tgt_vocab_size)
self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
self.positional_encoding = PositionalEncoding(
emb_size, dropout=dropout)
def forward(self,
src: Tensor,
trg: Tensor,
src_mask: Tensor,
tgt_mask: Tensor,
src_padding_mask: Tensor,
tgt_padding_mask: Tensor,
memory_key_padding_mask: Tensor):
src_emb = self.positional_encoding(self.src_tok_emb(src))
tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
return self.generator(outs)
def encode(self, src: Tensor, src_mask: Tensor):
return self.transformer.encoder(self.positional_encoding(
self.src_tok_emb(src)), src_mask)
def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
return self.transformer.decoder(self.positional_encoding(
self.tgt_tok_emb(tgt)), memory,
tgt_mask)
```
```python=
# Model instantiation and summary
model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM).to(DEVICE)
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")
```
```
8,928,999 total parameters.
8,928,999 training parameters.
```
### Training and Validation
With the data pipeline and model in place, we can now train the network, track training and validation loss, and keep the best checkpoint.
#### Training and Evaluation Loop
The core training and evaluation procedures are encapsulated in the following functions.
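The loops below reference a `loss_fn` and an `optimizer` that are not defined elsewhere in this post. A reasonable setup (an assumption, not necessarily the exact configuration used to produce the results shown later) is cross-entropy that ignores padded positions, plus Adam with the hyperparameters recommended in "Attention Is All You Need":
```python=
# Assumed definitions for the objects used in train_epoch() and evaluate()
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
```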
```python=
def train_epoch(model, optimizer):
"""Performs one epoch of training."""
model.train() # Set the model to training mode
total_loss = 0 # Initialize total loss for the epoch
for src, tgt in tqdm(train_dataloader, desc="Training", leave=False, total=len(train_dataloader)):
src = src.to(DEVICE)
tgt = tgt.to(DEVICE)
tgt_input = tgt[:, :-1] # Exclude the last token for input
# Create masks
src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
# Forward pass
logits = model(src, tgt_input, src_mask, tgt_mask,
src_padding_mask, tgt_padding_mask, src_padding_mask)
# Compute loss
tgt_out = tgt[:, 1:] # Exclude the first token for output
loss = loss_fn(logits.reshape(-1, TGT_VOCAB_SIZE), tgt_out.reshape(-1))
# Backpropagation and optimization
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item() # Accumulate loss
return total_loss / len(train_dataloader) # Return average loss
```
#### Validation Step
```python=
def evaluate(model):
"""Evaluates the model on the validation set."""
model.eval() # Set the model to evaluation mode
total_loss = 0
with torch.no_grad():
        for src, tgt in tqdm(valid_dataloader, desc="Validating", leave=False, total=len(valid_dataloader)):
# Similar logic as in training, but without backpropagation
src = src.to(DEVICE)
tgt = tgt.to(DEVICE)
tgt_input = tgt[:, :-1]
src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
logits = model(src, tgt_input, src_mask, tgt_mask,
src_padding_mask, tgt_padding_mask, src_padding_mask)
tgt_out = tgt[:, 1:]
loss = loss_fn(logits.reshape(-1, TGT_VOCAB_SIZE), tgt_out.reshape(-1))
total_loss += loss.item()
    return total_loss / len(valid_dataloader) # Return average loss
```
#### Model Training and Evaluation
We will now execute the training process, iterating over epochs, and tracking both training and validation loss. We will save the model checkpoint exhibiting the lowest validation loss.
```python=
# Lists to store loss values
train_losses, valid_losses = [], []
best_epoch = -1
best_val_loss = float('inf')
for epoch in range(1, NUM_EPOCHS + 1):
epoch_start_time = time.time()
train_loss = train_epoch(model, optimizer)
valid_loss = evaluate(model)
epoch_end_time = time.time()
train_losses.append(train_loss)
valid_losses.append(valid_loss)
if epoch%10==0:
print(f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {valid_loss:.3f}, "
f"Epoch time = {(epoch_end_time - epoch_start_time):.3f}s")
# Save the best model
if valid_loss < best_val_loss:
best_epoch = epoch
best_val_loss = valid_loss
torch.save(model.state_dict(), 'outputs/model-best.pth')
# Save model checkpoints at each epoch (optional)
# torch.save(model.state_dict(), f'outputs/model-epoch{epoch}-loss{valid_loss:.5f}.pth')
print(f'Best epoch: {best_epoch}')
```
```
Epoch: 10, Train loss: 3.918, Val loss: 3.773, Epoch time = 931.681s
Epoch: 20, Train loss: 3.287, Val loss: 3.269, Epoch time = 888.858s
Epoch: 30, Train loss: 2.827, Val loss: 2.954, Epoch time = 529.746s
Epoch: 40, Train loss: 2.474, Val loss: 2.749, Epoch time = 974.280s
Epoch: 50, Train loss: 2.192, Val loss: 2.621, Epoch time = 1061.845s
Epoch: 60, Train loss: 1.973, Val loss: 2.538, Epoch time = 458.240s
Epoch: 70, Train loss: 1.794, Val loss: 2.487, Epoch time = 459.281s
Epoch: 80, Train loss: 1.653, Val loss: 2.455, Epoch time = 459.611s
Epoch: 90, Train loss: 1.532, Val loss: 2.439, Epoch time = 458.892s
Epoch: 100, Train loss: 1.435, Val loss: 2.434, Epoch time = 459.066s
Epoch: 110, Train loss: 1.440, Val loss: 2.434, Epoch time = 1194.564s
Best epoch: 97
```
##### Plot Loss Curve
```python=
# Plotting the loss curves
def save_plots(train_loss, valid_loss):
"""
Function to save the loss plots to disk.
"""
# Loss plots.
plt.figure(figsize=(10, 7))
plt.plot(
train_loss, color='blue', linestyle='-',
label='train loss'
)
plt.plot(
valid_loss, color='red', linestyle='-',
        label='validation loss'
)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.savefig(os.path.join('outputs', 'loss.png'))
plt.show()
save_plots(train_losses, valid_losses)
```

### Translation Examples
#### Translation Functions
Now that we have a trained model, let's define the functions to translate sentences from Chinese to English.
```python=
def greedy_decode(model, src, src_mask, max_len, start_symbol):
"""
Generates the target sequence using a greedy decoding approach.
Args:
model: The trained Transformer model.
src: The source sentence tensor (Chinese).
src_mask: The source mask.
max_len: The maximum length for the generated target sequence.
start_symbol: The index of the start-of-sequence token (`<bos>`).
Returns:
The generated target sequence tensor (English).
"""
src = src.to(DEVICE)
src_mask = src_mask.to(DEVICE)
memory = model.encode(src, src_mask) # Encoder's output
ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE) # Initialize target sequence with <bos>
for i in range(max_len - 1):
memory = memory.to(DEVICE) # Ensure memory is on the correct device
tgt_mask = (generate_square_subsequent_mask(ys.size(1)).type(torch.bool)).to(DEVICE)
out = model.decode(ys, memory, tgt_mask)
prob = model.generator(out[:, -1]) # Get probabilities for the next word
_, next_word = torch.max(prob, dim=1)
next_word = next_word.item()
ys = torch.cat([ys,
torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=1)
if next_word == EOS_IDX: # Stop if <eos> is generated
break
return ys
def translate(model: torch.nn.Module, src_sentence: str):
"""Translates a given source sentence into the target language."""
model.eval() # Set the model to evaluation mode
# Preprocess the source sentence
    src = text_transform[SRC_LANGUAGE](src_sentence).view(1, -1)  # Shape: (1, src_len), since the model is batch_first
    num_tokens = src.shape[1]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool).to(DEVICE)
# Generate the target sequence using greedy decoding
tgt_tokens = greedy_decode(
model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX
).flatten()
# Convert the generated tokens back to text and remove special tokens
return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")
def translate_tensor(src_tensor: torch.Tensor, ln):
"""Helper function to translate a tensor of token IDs back to text."""
ys = src_tensor[1:-1] # Remove <bos> and <eos>
return " ".join(vocab_transform[ln].lookup_tokens(list(ys.numpy()))).replace("<bos>", "").replace("<eos>", "")
```
#### Evaluating on Sample Sentences
Let's see how our model performs on some example sentences from the validation set:
```python=
# Define some sample sentences and their expected translations
sample_sentences = [
("我不怕死。", "I'm not scared to die."),
("你最好確認那是對的。", "You'd better make sure that it is true."),
("The clock has stopped.", "時鐘已經停了"),
("Please check the menu.", "請確認菜單."),
("I am a student.", "我是個學生."),
("Why is the training so slow?","為什麼訓練這麼慢"),
("We are ready for the test.","我們準備好考試了"),
("The dog haven't eat anything for 3 days.","狗三天沒吃任何東西"),
("Yesterday, I went to a park.","昨天我去了公園"),
("I love eating spicy food.","我愛吃辣的食物"),
("Knowledge is power.","知識就是力量"],
("Don't judge a book by its cover.","人不可貌相")
]
# Iterate through the sample sentences and print translations
for src_sentence, expected_translation in sample_sentences:
predicted_translation = translate(model, src_sentence)
print(f"SRC: {src_sentence}")
print(f"Expected: {expected_translation}")
print(f"Predicted: {predicted_translation}\n")
```
```
SRC: I'm not scared to die
GT: 我不怕死.
PRED: 我 不怕 死 。
SRC: You'd better make sure that it is true.
GT: 你最好確認那是對的.
PRED: 你 最好 做 的 事情 是 真的 。
SRC: The clock has stopped.
GT: 時鐘已經停了
PRED: 時鐘 已 經 停止 了 。
SRC: Please check the menu.
GT: 請確認菜單.
PRED: 请 把 菜单 。
SRC: I am a student.
GT: 我是個學生.
PRED: 我 是 學生 。
SRC: Why is the training so slow?
GT: 為什麼訓練這麼慢
PRED: 为什么 这么 慢 ?
SRC: We are ready for the test.
GT: 我們準備好考試了
PRED: 我 們 準備 好 考試 。
SRC: The dog haven't eat anything for 3 days.
GT: 狗三天沒吃任何東西
PRED: 狗 的 三 天 没 吃 东西 。
SRC: Yesterday, I went to a park.
GT: 昨天我去了公園
PRED: 昨天 我 去 公園 了 。
SRC: I love eating spicy food.
GT: 我愛吃辣的食物
PRED: 我 爱 吃 辣 。
SRC: Knowledge is power.
GT: 知識就是力量
PRED: 知識 就是 力量 。
SRC: Don't judge a book by its cover.
GT: 人不可貌相
PRED: 不要 根据 封面 一 本书 。
```
### Challenges and Solutions
#### Memory Exceedance (Out-of-Memory Errors)
* **Model Size:** If your model is too complex (large `EMB_SIZE`, many layers), it consumes more memory. Try reducing the embedding size or the number of layers. Raising the `min_freq` threshold when building the vocabularies can also help by shrinking the vocabulary (and thus the embedding and output layers).
* **Long Sequences:** If your dataset contains extremely long sentences, padding can lead to large batches that exceed your GPU memory. Consider setting a maximum sequence length and truncating longer sentences.
* **Batch Size:** Reducing the batch size directly decreases memory usage per step. However, if your model is still too large, you might encounter issues even with small batch sizes.
#### Long Training Time
* **Progress Monitoring (`tqdm`):** Always use the `tqdm` library for loops that might take a while. This provides a progress bar and time estimate, allowing you to better gauge training duration.
* **Use GPU:** Training on a CPU is significantly slower than on a GPU. Ensure your code is running on a GPU if available.
* **Batch Size Optimization:** While smaller batch sizes reduce memory usage, they can also lead to slower training. Experiment to find a balance between memory constraints and training speed. Increasing the batch size too much can lead to out-of-memory errors.
#### Performance Issues
* **Data Quantity:** Language models, particularly larger ones, thrive on massive amounts of data. Ensure your dataset contains enough examples (at least 10,000 sentence pairs, ideally more) to enable the model to generalize well.
* **Data Quality:** Always inspect your dataset for potential issues. Web-crawled data can sometimes have misaligned translations. If you can't filter out bad examples, consider using a different, cleaner dataset.
#### Additional Tips
* **Mixed Precision Training:** Using mixed precision (e.g., `torch.cuda.amp`) can significantly reduce memory usage and speed up training on compatible GPUs.
* **Gradient Accumulation:** If you need larger effective batch sizes but are limited by memory, consider gradient accumulation: accumulate gradients over several smaller batches before each optimizer step. A sketch combining this with mixed precision appears after this list.
* **Profiling:** Use PyTorch's profiling tools to identify performance bottlenecks in your code. This can guide you in optimizing specific parts of your training loop.
* **Experimentation:** Don't be afraid to experiment with different hyperparameters and architectures. The optimal configuration will depend on your specific dataset and goals.
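As a rough illustration of the mixed-precision and gradient-accumulation tips, here is a hedged sketch (not part of the original training code; it reuses the dataloaders, masks, and the assumed `loss_fn`/`optimizer` from above) of how `train_epoch` could adopt both techniques:
```python=
scaler = torch.cuda.amp.GradScaler()
ACCUM_STEPS = 4  # effective batch size becomes BATCH_SIZE * ACCUM_STEPS

def train_epoch_amp(model, optimizer):
    model.train()
    total_loss = 0
    optimizer.zero_grad()
    for step, (src, tgt) in enumerate(train_dataloader):
        src, tgt = src.to(DEVICE), tgt.to(DEVICE)
        tgt_input = tgt[:, :-1]
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
        with torch.cuda.amp.autocast():              # forward pass in mixed precision
            logits = model(src, tgt_input, src_mask, tgt_mask,
                           src_padding_mask, tgt_padding_mask, src_padding_mask)
            loss = loss_fn(logits.reshape(-1, TGT_VOCAB_SIZE), tgt[:, 1:].reshape(-1))
        scaler.scale(loss / ACCUM_STEPS).backward()  # accumulate scaled gradients
        if (step + 1) % ACCUM_STEPS == 0:
            scaler.step(optimizer)                   # unscale gradients and apply the update
            scaler.update()
            optimizer.zero_grad()
        total_loss += loss.item()
    # Note: any leftover batches (when the count is not a multiple of ACCUM_STEPS)
    # are ignored here for simplicity.
    return total_loss / len(train_dataloader)
```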
## Reference
* Andreas Geiger. "L08 - Sequence Models." Deep Learning Course, University of Tübingen. Accessed August 3, 2024. [link](https://drive.google.com/file/d/1-riYFFIfJCCP5K002n5LepGL-AEhPBwK/view)
* Aston Zhang, Zack C. Lipton, Mu Li, and Alex J. Smola. "Chapter 11 - Attention Mechanisms." Dive into Deep Learning. Accessed August 3, 2024. [link](https://d2l.ai/)
* Ashish Vaswani, et al. "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (NIPS 2017). [link](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)
* Zake. "從零開始的 Sequence to Sequence." September 28, 2017. Accessed August 3, 2024. [link](https://zake7749.github.io/2017/09/28/Sequence-to-Sequence-tutorial/)
* Jay Alammar. "The Illustrated Transformer." Accessed August 3, 2024. [link](https://jalammar.github.io/illustrated-transformer/)
* Foam Liu. "英中机器文本翻译." GitHub repository. Accessed August 3, 2024. [link](https://github.com/foamliu/Transformer/tree/master)
* Sovit Ranjan Rath. "Language Translation using PyTorch Transformer." DebuggerCafe. Accessed August 3, 2024. [link](https://debuggercafe.com/language-translation-using-pytorch-transformer/)