# Attention and Transformer Models
## Introduction
The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need," has revolutionized natural language processing (NLP) and various other domains. Its ability to process sequences in parallel and capture long-range dependencies has led to significant advancements in tasks like machine translation, text summarization, and question answering. Transformers have extended their reach beyond NLP, making valuable contributions in computer vision, speech recognition, and music generation.
### Key Advantages of Transformers
* **Parallel Processing:** Unlike RNNs, Transformers process the entire sequence simultaneously, resulting in faster training and inference.
* **Long-Range Dependencies:** The attention mechanism allows Transformers to effectively handle relationships between words that are far apart in the sequence, a challenge for RNNs.
* **Interpretability:** The attention weights provide insights into which parts of the input are most important for each output, enhancing model interpretability.
| Feature | RNN-based Models | Transformer |
|---------|----------------------------|----------------------------------|
| Input processing | Sequential (one word/item at a time) | Parallel (entire sequence at once) |
| Long-range dependencies | Challenging to capture accurately | Effectively handled through the attention mechanism |
| Parallelizability | Low | High |
|Input selectivity | Limited | High (via attention mechanism) |
| Computational efficiency | Less efficient, especially with long sequences | More efficient, particularly for long sequences, due to parallelization and attention |
### Overview
This tutorial covers the core components of Transformers:
* **Attention Mechanisms:** Concept, types, and functions of attention.
* **Self-Attention and Multi-Head Attention:** How these mechanisms enable focus on relevant input parts.
* **Positional Encoding:** How positional information is incorporated into the model.
* **Transformer Architecture:** Encoder and decoder components, and their internal layers.
* **Practical Example:** Building a Chinese-to-English translator using PyTorch's Transformer.
## Attention Mechanisms: The Building Blocks
Attention mechanisms are the heart and soul of Transformer architectures. They revolutionize how models process sequential data by enabling them to focus on the most relevant parts of the input, thus enhancing understanding and performance across diverse tasks.
### Core Components of Attention
Attention operates on three fundamental elements:
* **Query $Q$:** Represents a request or context for information. Think of it as the question you want to ask of your data.
* **Key $K$:** Represents a set of labels or identifiers associated with values in the data. Imagine these as the categories or topics under which your data is organized.
* **Value $V$:** Represents the actual information or content associated with each key. These are the answers or details corresponding to your query and the relevant keys.
The attention mechanism cleverly calculates a weighted sum of the values. The weights, known as attention scores, quantify how important each value is in the context of the given query.
### Attention Calculation
Formally, given a database of $(key, value)$ pairs, $D = \{(k_1,v_1),\dots(k_n,v_n) \}$, and a query vector $q$, the attention mechanism produces an output as follows:
$$
\text{Attention}(q,D)=\sum_i{\alpha'(q,k_i)}v_i
$$
Here, $\alpha'(q,k_i)$ signifies the attention weight assigned to the i-th key-value pair with respect to the query. The higher the attention weight, the more relevant that key-value pair is considered to be.
### Attention Scoring Functions: Gauging Relevance
Various scoring functions measure the similarity between a query and keys, leading to diverse attention weight distributions. Popular choices include:
* **Dot Product:** $\alpha(\mathbf{q},\mathbf{k}^i)=\mathbf{q}^{T}\mathbf{k}^i$. Simple and efficient, but susceptible to instability with high-dimensional inputs because the raw scores can grow large.
* **Scaled Dot Product:** $\alpha(\mathbf{q},\mathbf{k}^i)=\frac{\mathbf{q}^{T}\mathbf{k}^i}{\sqrt d}$. Addresses the scaling issue of the dot product by dividing by the square root of the key dimension $d$. This is the default choice in Transformers.
* **Additive:** $\alpha(\mathbf q, \mathbf k^i) = \mathbf w_v^\top \textrm{tanh}(\mathbf W_q\mathbf q + \mathbf W_k \mathbf k^i)$. Offers greater flexibility to learn complex similarity patterns, but is computationally more expensive.
| Function | Pros | Cons |
|---------|----------------------------|----------------------------------|
| Dot Product | Simple, efficient | Potential for large values in high dimensions |
| Scaled Dot Product | Balances computation and stability | Slightly more expensive than dot product |
| Additive | Captures complex patterns | Computationally demanding |
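To make these scoring functions concrete, here is a minimal PyTorch sketch (toy dimensions and random weights, purely illustrative) that computes all three scores for one query against a handful of keys and turns one set of scores into attention weights with a softmax:
```python=
import torch

d = 4                       # key/query dimension (toy value)
q = torch.randn(d)          # a single query vector
k = torch.randn(5, d)       # five key vectors

# Dot product: one raw score per key
dot = k @ q

# Scaled dot product: divide by sqrt(d) to keep scores in a stable range
scaled = (k @ q) / d ** 0.5

# Additive: learned projections followed by tanh (random weights here)
W_q, W_k = torch.randn(d, d), torch.randn(d, d)
w_v = torch.randn(d)
additive = torch.tanh(q @ W_q.T + k @ W_k.T) @ w_v

# A softmax turns any of these score vectors into attention weights
weights = torch.softmax(scaled, dim=-1)
print(weights.sum())  # tensor(1.)
```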
### Self-Attention: The Sequence Attends to Itself
Self-attention is a specialized type of attention that enables a sequence to focus on its own internal structure and relationships. In this mechanism, each element of a sequence becomes both the query and the key-value pair. This allows each element to assess its relevance to other elements within the same sequence, capturing dependencies regardless of their distance.
#### How Self-Attention Works
Let's break down how self-attention operates on an input sequence of length $l$: $\mathbf{a}^1, \mathbf{a}^2, \dots, \mathbf{a}^l$
1. **Query $Q$, Key $K$, and Value $V$ Generation:** For each element $\mathbf{a}^i$ in the input sequence, three vectors are generated through linear transformations using trainable weight matrices $W^q$, $W^k$, and $W^v$: \begin{split}\mathbf{q}^i &= W^q\mathbf{a}^i\\ \mathbf{k}^i &= W^k\mathbf{a}^i\\ \mathbf{v}^i &= W^v\mathbf{a}^i\end{split}
2. **Attention Score Calculation:** We compute the attention scores by applying a chosen scoring function (e.g., the scaled dot product) to each query-key pair $(\mathbf{q}^j, \mathbf{k}^i)$. These scores reflect how much each value $\mathbf{v}^i$ should contribute to the output for a given query $\mathbf{q}^j$.
3. **Weighted Sum for Output:** The values $\mathbf{v}^i$ are weighted by their corresponding normalized attention scores and then summed to produce the output element $\mathbf{b}^j$: \begin{split} \mathbf{b}^j& = \text{Attention}(\mathbf{q}^j,D)\\&=\sum_i\alpha_{j,i}'\mathbf{v}^i\\ &=\sum_i\alpha'(\mathbf{q}^j,\mathbf{k}^i)\mathbf{v}^i\\ &=\sum_i\frac{\exp\left(\alpha(\mathbf{q}^j,\mathbf{k}^i)\right)}{\sum_s \exp\left(\alpha(\mathbf{q}^j,\mathbf{k}^s)\right)}\mathbf{v}^i
\end{split} The term $\frac{\exp\left(\alpha(\mathbf{q}^j,\mathbf{k}^i)\right)}{\sum_s \exp\left(\alpha(\mathbf{q}^j,\mathbf{k}^s)\right)}$ is the normalized attention score (softmax), ensuring that all attention weights sum to 1.
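The three steps above fit in a few lines of PyTorch. The sketch below uses toy dimensions, random weights, and the scaled dot product; it is only meant to illustrate the data flow:
```python=
import torch

l, d_in, d_k, d_v = 6, 8, 4, 4           # toy sequence length and dimensions
A = torch.randn(l, d_in)                  # input sequence a^1 ... a^l (one row per element)

# Step 1: project the inputs into queries, keys, and values
W_q = torch.randn(d_k, d_in)
W_k = torch.randn(d_k, d_in)
W_v = torch.randn(d_v, d_in)
Q, K, V = A @ W_q.T, A @ W_k.T, A @ W_v.T  # row i is q^i, k^i, v^i

# Step 2: scaled dot-product scores between every query and every key
scores = Q @ K.T / d_k ** 0.5              # shape (l, l)

# Step 3: softmax over the keys, then a weighted sum of the values
alpha = torch.softmax(scores, dim=-1)      # normalized attention weights
B = alpha @ V                              # output sequence b^1 ... b^l
print(B.shape)                             # torch.Size([6, 4])
```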

#### Key Points
* **Representations Enhanced by Context:** The output of self-attention is a sequence of the same length as the input. However, each element in the output is now a weighted combination of all values in the input sequence, enriched with contextual information.
* **Long-Range Dependency Capture:** Self-attention excels at modeling relationships between elements regardless of their distance within the sequence. This is a significant advantage over traditional recurrent neural networks (RNNs), which struggle with long-range dependencies.
* **Why "Self"?** The term "self-attention" emphasizes that the queries, keys, and values all originate from the same input sequence. Each element attends to itself and all other elements.
### Multi-Head Attention: Enhancing Attention with Parallelism
While self-attention is powerful, it can benefit from considering multiple perspectives simultaneously. This is where multi-head attention comes in. It allows the model to attend to different parts of the input sequence in parallel, capturing a richer and more nuanced understanding of the relationships between elements.
#### Why Multi-Head Attention?
Consider the following example of an English-to-Chinese translation:
"The bank by the river is open for business."
→ "河邊的銀行正在營業"
If we were using single-head attention, when generating the word "銀行" (bank), the model might focus primarily on the word "bank" in the input. However, "bank" can have multiple meanings in Chinese, including "河岸" (riverbank). With a single focus, it would be difficult to capture all the relevant information – the fact that it's a financial institution ("bank"), that it's open ("is open"), and that it's for commercial purposes ("for business") – to accurately translate it as "銀行".
Multi-head attention addresses this by employing multiple attention mechanisms (called "heads") in parallel. Each head can focus on different parts or aspects of the input sequence, leading to a more comprehensive understanding:
* **Head 1:** Might focus on the word "bank" itself.
* **Head 2:** Might focus on the phrase "is open".
* **Head 3:** Might focus on the phrase "for business".
By aggregating the information from these different heads, the model can more effectively disambiguate the meaning of "bank" in this context and produce the correct translation.
#### Calculation Process
Assume a multi-head attention mechanism with $h$ heads:
1. **Project Q, K, V for Each Head:** The original query $\mathbf{q}$, key $\mathbf{k}$, and value $\mathbf{v}$ vectors are projected into separate query, key, and value spaces for each head using distinct weight matrices. \begin{split}\mathbf{q}^{i,j} &= W^{q,j}\mathbf{q}^i \quad \text{(for each head $j$)}\\ \mathbf{k}^{i,j} &= W^{k,j}\mathbf{k}^i \\ \mathbf{v}^{i,j} &= W^{v,j}\mathbf{v}^i \end{split}
2. **Perform Attention in Parallel:** Each head independently performs the attention mechanism using its projected queries, keys, and values. This results in $h$ separate attention outputs. $$\mathbf{b}^{i,j} = \text{Attention}(\mathbf{q}^{i,j},D^j) \quad \text{(for each head $j$)}$$ where $D^j$ is the database of key-value pairs for head $j$.
3. **Concatenate and Transform:** The outputs from each head are concatenated and then transformed using another weight matrix $W^o$ to produce the final output vector $b^i$: $$\mathbf{b}^i = \text{Concat}(\mathbf{b}^{i,1},\dots\mathbf{b}^{i,h})W^o$$
#### Code Implementation
```python=
import torch.nn as nn

# embed_dim must be divisible by num_heads; each head works on embed_dim // num_heads dimensions
multihead_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# query, key, value: tensors of shape (batch, seq_len, embed_dim) when batch_first=True
attn_output, attn_output_weights = multihead_attn(query, key, value)
```
**Important Note:** The total number of parameters in a multi-head attention layer remains approximately the same as a single-head attention layer with the same overall dimensionality. This is because the dimensionality of each head is reduced proportionally to the number of heads.
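A quick way to convince yourself of this is to compare parameter counts directly with PyTorch's built-in layer:
```python=
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

embed_dim = 512
single_head = nn.MultiheadAttention(embed_dim, num_heads=1)
eight_heads = nn.MultiheadAttention(embed_dim, num_heads=8)

# Both layers hold the same Q/K/V and output projections; only the internal
# split into heads differs, so the parameter counts are identical.
print(n_params(single_head), n_params(eight_heads))
```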
### Positional Encoding: Infusing Order into Sequences
Attention mechanisms are inherently permutation-invariant – they don't care about the order of elements in a sequence. However, in many natural language processing tasks, the order of words is crucial to understanding meaning. Positional encoding is the solution to this challenge. It injects information about the position of each element into the input representation, allowing the Transformer to consider word order when computing attention.
#### Formalizing Positional Encoding
Given an input sequence $\mathbf{X} \in \mathbb{R}^{n \times d}$ consisting of n tokens, each with dimension $d$, we create a positional embedding matrix $\mathbf{P} \in \mathbb{R}^{n \times d}$ of the same shape. The elements of $\mathbf{P}$ encode the positional information for each token.
The final input representation to the Transformer is then obtained by adding the token embeddings $\mathbf{X}$ and the positional embeddings $\mathbf{P}$:
$$
\mathbf{X}' = \mathbf{X} + \mathbf{P}
$$
#### Types of Positional Encoding
* **Absolute Positional Encoding:** Each position in the sequence is assigned a unique, fixed embedding vector that is independent of the other tokens. This provides the model with explicit information about the absolute position of each token.
* **Relative Positional Encoding:** Instead of absolute positions, this approach encodes the relative distances between tokens. This can be beneficial for tasks where the relationships between tokens are more important than their exact positions.
#### Sinusoidal Positional Encoding: A Popular Choice
A widely used absolute positional encoding technique is based on sine and cosine functions. The elements of the positional embedding matrix $\mathbf{P}$ are defined as follows:
\begin{split}
\mathbf{P}_{i, 2j} &= \sin\left(\frac{i}{10000^{2j/d}}\right) \\
\mathbf{P}_{i, 2j+1} &= \cos\left(\frac{i}{10000^{2j/d}}\right)
\end{split}
where $i$ is the position of the token, $j$ indexes the dimension pairs (columns $2j$ and $2j+1$, so $j = 0, 1, \dots, \lfloor d/2 \rfloor - 1$), and $d$ is the embedding dimension.

**Key Property of Sinusoidal Encoding:** This encoding has the remarkable property of incorporating both absolute and relative positional information. The sine and cosine functions create a pattern that allows the model to easily learn to attend to relative positions through linear transformations. Specifically, for a fixed offset $\delta$, the pair $(p_{i+\delta,2j},p_{i+\delta,2j+1})$ can be obtained by rotating $(p_{i,2j},p_{i,2j+1})$ by an angle of $\delta \omega_j$, where $\omega_j = 1/10000^{2j/d}$ is the frequency used for dimension pair $j$. This makes it easier for the model to generalize to sequences longer than those seen during training.
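The formula is straightforward to compute directly. Below is a compact sketch that builds $\mathbf{P}$ for a toy size; the translation example later in this tutorial wraps the same idea in an `nn.Module` (`PositionalEncoding`).
```python=
import torch

def sinusoidal_positional_encoding(n, d):
    """Build P (n positions x d dimensions) from the sine/cosine formula."""
    position = torch.arange(n, dtype=torch.float).unsqueeze(1)       # i
    omega = 1.0 / 10000 ** (torch.arange(0, d, 2).float() / d)       # 1 / 10000^{2j/d}
    P = torch.zeros(n, d)
    P[:, 0::2] = torch.sin(position * omega)   # even dimensions: sine
    P[:, 1::2] = torch.cos(position * omega)   # odd dimensions:  cosine
    return P

P = sinusoidal_positional_encoding(n=50, d=32)
print(P.shape)  # torch.Size([50, 32])
```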
## Sequence-to-Sequence (Seq2Seq) Models: The Foundation for Transformers
Sequence-to-Sequence (Seq2Seq) models, typically built with Recurrent Neural Networks (RNNs), are a key concept for understanding Transformers. They excel at transforming one sequence into another, making them valuable for tasks like machine translation, summarization, and dialogue systems.
### Core Components
* **Encoder:** This component sequentially processes the input sequence, compressing its information into a fixed-size "context vector." Think of it as distilling the essence of the input into a concentrated form.
* **Decoder:** Using the context vector as a starting point, the decoder generates the output sequence step-by-step. It predicts the next element based on both the context and its previous predictions.

### Encoder in Action
The encoder's job is to create a meaningful summary of the input. It achieves this by processing each element in order and updating its internal state. The final state, representing the condensed input information, is then handed off to the decoder.

### Decoder in Action
The decoder is responsible for producing the output sequence. It starts with a special `<SOS>` (Start of Sequence) token and uses the context vector as its initial guide. At each step:
1. It takes its previous prediction (or `<SOS>` at the start) as input.
2. It updates its internal state based on this input and its prior state.
3. It predicts the next element of the output sequence.
This process repeats until it generates an `<EOS>` (End of Sequence) token, signifying the output is complete.
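To make the encode-then-decode loop concrete, here is a minimal GRU-based sketch. The sizes, special-token indices, and greedy decoding loop are hypothetical and untrained; the point is only to illustrate the data flow just described.
```python=
import torch
import torch.nn as nn

vocab, emb, hid = 1000, 64, 128          # hypothetical sizes
SOS_IDX, EOS_IDX = 2, 3                  # hypothetical special-token indices

embed = nn.Embedding(vocab, emb)
encoder = nn.GRU(emb, hid, batch_first=True)
decoder = nn.GRU(emb, hid, batch_first=True)
out_proj = nn.Linear(hid, vocab)

src = torch.randint(4, vocab, (1, 7))            # one source sentence of 7 tokens
_, context = encoder(embed(src))                 # final state = fixed-size context vector

token = torch.tensor([[SOS_IDX]])                # decoding starts from <SOS>
state = context
for _ in range(20):                              # cap the output length
    output, state = decoder(embed(token), state) # update state from the previous prediction
    token = out_proj(output).argmax(dim=-1)      # greedily pick the next token
    if token.item() == EOS_IDX:                  # stop once <EOS> is produced
        break
```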

### The Bottleneck Challenge
Seq2Seq models handle shorter sequences well but struggle with longer ones due to the "information bottleneck." The encoder squeezes the entire input into a fixed-size context vector, potentially losing crucial details when the input is lengthy. This can impact the model's ability to generate accurate outputs, especially for tasks requiring a deep understanding of the entire input.
**Key Point:** While Seq2Seq models pioneered sequence-to-sequence tasks, their limitations paved the way for the Transformer, which overcomes the bottleneck through attention mechanisms and parallel processing, enabling it to handle longer sequences and more complex relationships effectively.
## The Transformer Architecture: Transcending Seq2Seq Limitations
The Transformer, a groundbreaking architecture introduced in the paper "Attention Is All You Need," builds upon the foundation of sequence-to-sequence (Seq2Seq) models but overcomes their key limitation: the information bottleneck. Transformers achieve this by replacing recurrent neural networks (RNNs) with attention mechanisms and parallel processing. This enables them to efficiently capture long-range dependencies and scale to handle larger datasets.
| Feature | Traditional RNN-based Seq2Seq model | Transformer |
|---------|----------------------------|----------------------------------|
| Input processing | Encoder$\rightarrow$Decoder sequentially | Parallel (entire sequence at once) |
| Long-range dependencies | Compressed into a context vector | Directly access via attention |
| Parallelizability | Low | High |
|Input selectivity | Learns to select, but with loss | High (via attention) |
| Computational efficiency | Lower, especially for long sequences | Higher, especially for long sequences |
### Encoder: Unveiling Meaningful Representations
The encoder's primary role is to transform the input sequence into a rich set of representations that capture the semantic meaning and relationships within the text. Here's a breakdown of the encoding process:
1. **Embedding and Positional Encoding:**
* **Embedding:** The input sequence (e.g., words or subword units) is first converted into dense, continuous vector representations known as embeddings.
* **Positional Encoding:** Since the Transformer processes the entire sequence simultaneously, positional information is infused into the embeddings. This is crucial for the model to understand the order of words, which is often essential in language tasks.
2. **Multi-Head Attention:**
* The embedded sequence is then processed by multiple self-attention heads working in parallel. Each head focuses on different aspects of the input, allowing the model to capture a diverse range of relationships between words.
3. **Add & Norm:**
* A residual connection is used to add the original input to the output of the attention layer. This helps with gradient flow during training, preventing vanishing gradients.
* Subsequently, layer normalization is applied to stabilize the learning process and ensure that values remain within a reasonable range.
4. **Position-Wise Feed-Forward Network:**
* A feed-forward network is applied independently to each position in the sequence. This network consists of two linear transformations with a ReLU activation function in between. Its purpose is to further refine the representations generated by the attention layer.

#### Code Implementation
##### PositionWiseFFN
```python=
class PositionWiseFFN(nn.Module):
"""The position-wise feed-forward network."""
    def __init__(self, d_model, d_ff):
super().__init__()
self.linear1 = nn.Linear(d_model, d_ff)
self.relu = nn.ReLU()
self.linear2 = nn.Linear(d_ff, d_model)
def forward(self, x):
return self.linear2(self.relu(self.linear1(x)))
```
##### AddNorm
```python=
class AddNorm(nn.Module):
"""Residual connection followed by layer normalization."""
def __init__(self, d_model, dropout):
super().__init__()
self.dropout = nn.Dropout(dropout)
self.ln = nn.LayerNorm(d_model)
    def forward(self, x, sublayer):
        # Apply dropout to the sublayer output, add the residual input, then normalize
        return self.ln(self.dropout(sublayer) + x)
```
##### Encoder
```python=
class Encoder(nn.Module):
def __init__(self, n_src_vocab, n_layers=6, n_heads=8, d_model=512, d_ff=2048,
dropout=0.1, n_position=200, pad_idx=0):
super(Encoder, self).__init__()
self.src_emb = nn.Embedding(n_src_vocab, d_model, padding_idx=pad_idx)
self.pos_emb = PositionalEncoding(d_model, n_position=n_position)
self.dropout = nn.Dropout(p=dropout)
self.layer_stack = nn.ModuleList([
EncoderLayer(d_model, d_ff, n_heads, dropout=dropout)
for _ in range(n_layers)])
self.layer_norm = nn.LayerNorm(d_model)
def forward(self, src_seq, src_mask):
# -- Forward
enc_output = self.dropout(self.pos_emb(self.src_emb(src_seq)))
enc_output = self.layer_norm(enc_output)
for enc_layer in self.layer_stack:
enc_output = enc_layer(enc_output, mask=src_mask)
return enc_output
```
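The `Encoder` above assumes an `EncoderLayer` class that is not shown. Below is one plausible sketch built from the `PositionWiseFFN` and `AddNorm` blocks defined earlier; treating `mask` as a key-padding mask is a simplifying assumption and should be adapted to however `src_mask` is actually constructed.
```python=
class EncoderLayer(nn.Module):
    """One encoder block: self-attention followed by a position-wise FFN,
    each wrapped in a residual connection and layer normalization."""
    def __init__(self, d_model, d_ff, n_heads, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.addnorm1 = AddNorm(d_model, dropout)
        self.ffn = PositionWiseFFN(d_model, d_ff)
        self.addnorm2 = AddNorm(d_model, dropout)
    def forward(self, x, mask=None):
        # Self-attention: queries, keys, and values all come from x;
        # mask is assumed to mark padded source positions (key-padding mask)
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=mask)
        x = self.addnorm1(x, attn_out)
        return self.addnorm2(x, self.ffn(x))
```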
### Decoder: Generating the Target Sequence
The decoder's task is to generate the desired output sequence, one token at a time. Its operation is similar to the encoder, but with a few crucial distinctions:
1. **Masked Multi-Head Attention:**
* The decoder employs a masked version of multi-head attention. This mask prevents the model from attending to future positions in the output sequence during training. This ensures that predictions are made solely based on the context of past tokens, maintaining the autoregressive property of the decoder.
2. **Encoder-Decoder Attention (Cross-Attention):**
* The decoder incorporates a cross-attention layer. Here, the queries originate from the decoder's previous outputs, while the keys and values are derived from the encoder's final outputs. This mechanism allows the decoder to focus on the most relevant parts of the input sequence while generating each output token.
3. **Position-Wise Feed-Forward Network:**
* Akin to the encoder, the decoder also features a position-wise feed-forward network to further refine the representations.

#### Code Implementation
```python=
class MaskedMultiHeadAttention(nn.Module):
def __init__(self, d_model, num_heads):
super().__init__()
self.multihead_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)
self.register_buffer("attn_mask", torch.triu(torch.ones(10000, 10000), diagonal=1).bool()) # pre-create a large mask
def forward(self, query, key, value):
batch_size, seq_len, _ = query.size()
attn_mask = self.attn_mask[:seq_len, :seq_len].to(query.device)
output, attn_output_weights = self.multihead_attn(query, key, value, attn_mask=attn_mask)
return output, attn_output_weights
```
```python=
class Decoder(nn.Module):
def __init__(self, n_tgt_vocab, n_layers=6, n_heads=8, d_model=512, d_ff=2048,
dropout=0.1, n_position=200, pad_idx=0):
super(Decoder, self).__init__()
self.tgt_emb = nn.Embedding(n_tgt_vocab, d_model, padding_idx=pad_idx)
self.pos_emb = PositionalEncoding(d_model, n_position=n_position)
self.dropout = nn.Dropout(p=dropout)
self.layer_stack = nn.ModuleList([
DecoderLayer(d_model, d_ff, n_heads, dropout=dropout)
for _ in range(n_layers)])
self.layer_norm = nn.LayerNorm(d_model)
def forward(self, tgt_seq, tgt_mask, enc_output, src_mask):
# -- Forward
dec_output = self.dropout(self.pos_emb(self.tgt_emb(tgt_seq)))
dec_output = self.layer_norm(dec_output)
for dec_layer in self.layer_stack:
dec_output = dec_layer(dec_output, enc_output, tgt_mask, src_mask)
return dec_output
```
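Like `EncoderLayer`, the `DecoderLayer` used above is not shown. A matching sketch with masked self-attention, cross-attention, and a feed-forward block follows; the mask handling (causal `tgt_mask` as an attention mask, `src_mask` as a key-padding mask) is an assumption about how the masks are built.
```python=
class DecoderLayer(nn.Module):
    """One decoder block: masked self-attention, cross-attention over the
    encoder output, then a position-wise FFN, each with Add & Norm."""
    def __init__(self, d_model, d_ff, n_heads, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                dropout=dropout, batch_first=True)
        self.addnorm1 = AddNorm(d_model, dropout)
        self.addnorm2 = AddNorm(d_model, dropout)
        self.ffn = PositionWiseFFN(d_model, d_ff)
        self.addnorm3 = AddNorm(d_model, dropout)
    def forward(self, x, enc_output, tgt_mask=None, src_mask=None):
        # Masked self-attention over the (shifted) target sequence
        self_out, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.addnorm1(x, self_out)
        # Cross-attention: queries from the decoder, keys/values from the encoder
        cross_out, _ = self.cross_attn(x, enc_output, enc_output,
                                       key_padding_mask=src_mask)
        x = self.addnorm2(x, cross_out)
        return self.addnorm3(x, self.ffn(x))
```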
### The Complete Transformer Architecture
The full Transformer architecture is a stack of multiple encoder and decoder layers. Each layer processes the input sequence in parallel and refines the representations to capture complex dependencies and relationships. The final output of the decoder is passed through a linear layer and a softmax function to produce a probability distribution over the vocabulary, allowing the model to predict the next token in the sequence.

## Example: Building a Chinese-English Translation Model with PyTorch's Transformer
This hands-on example demonstrates how to construct a Chinese-to-English translation model using PyTorch's Transformer implementation. The code is adapted and improved from Sovit Ranjan Rath's [Language Translation using PyTorch Transformer](https://debuggercafe.com/language-translation-using-pytorch-transformer/).
### Dataset Selection
We'll use the [Tatoeba](https://tatoeba.org/zh-cn/downloads) dataset, which offers a good balance of size and quality for this task.
### Preprocessing
We'll use `torchtext` and `spacy` for tokenization and vocabulary management.
### Installation
```python=
!pip install -U portalocker
!python -m spacy download en_core_web_sm
!python -m spacy download zh_core_web_sm
```
### Imports
```python=
# Tokenizer and vocabulary tools
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from typing import Iterable, List
# Padding and data handling
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset
# Core model and utilities
from torch.nn import Transformer
from torch import Tensor
from timeit import default_timer as timer
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm
# Other essentials
import torch.nn as nn
import torch
import torch.nn.functional as F
import numpy as np
import math
import os
import pandas as pd
import matplotlib.pyplot as plt
import pickle
import csv
import time
```
### Data Preparation and Loading
#### Loading and Exploring the Dataset
The Tatoeba dataset provides pairs of translated sentences in Chinese and English, stored in a tab-separated values (TSV) format.
##### Example Tatoeba Data (Raw)
```
1277 I have to go to sleep. 2 我该去睡觉了。
1280 Today is June 18th and it is Muiriel's birthday! 5 今天是6月18号,也是Muiriel的生日!
```
We'll use Pandas to efficiently load and process this data.
1. **Load TSV Data with Pandas:** Directly read the TSV file, specifying column names for clarity.
2. **Preview and Filter:** Display the first few rows of relevant data (English and Chinese translations). We can also filter out any unnecessary columns.
3. **Save Processed Data:** Convert the filtered DataFrame to CSV for easier manipulation in the later stages.
```python=
# Load TSV Data
data_path = 'data/Tatoeba.tsv'
df = pd.read_csv(data_path, sep='\t', header=None, names=['index', 'English', 'unknown', 'Chinese'])
# Preview data
print("Example Tatoeba Data (Raw):")
print(df[['English', 'Chinese']].head().to_markdown(index=False, numalign="left", stralign="left"))
# Save processed data
df[['English', 'Chinese']].to_csv('input/Tatoeba_processed.csv', index=False)
```
```
Example Tatoeba Data (Raw):
| English | Chinese |
|:-------------------------------------------------|:--------------------------------------|
| I have to go to sleep. | 我该去睡觉了。 |
| Today is June 18th and it is Muiriel's birthday! | 今天是6月18号,也是Muiriel的生日! |
| Muiriel is 20 now. | Muiriel现在20岁了。 |
| The password is "Muiriel". | 密码是"Muiriel"。 |
| I will be back soon. | 我很快就會回來。 |
```
#### Splitting into Training and Validation Sets
We'll use `train_test_split` from the `sklearn.model_selection` module to create training and validation sets.
```python=
# Define test size and random seed
TEST_SIZE = 0.1
RANDOM_SEED = 42
# Split the data
train_csv, valid_csv = train_test_split(df, test_size=TEST_SIZE, random_state=RANDOM_SEED)
```
#### Tokenizer Setup
```python=
SRC_LANGUAGE = 'zh' # Source language (Chinese)
TGT_LANGUAGE = 'en' # Target language (English)
# Placeholders for tokenizer and vocabulary objects
token_transform = {}
vocab_transform = {}
# Initialize tokenizers for both languages
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='zh_core_web_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')
```
#### Creating Custom Dataset Classes
We define custom PyTorch `Dataset` classes for convenient data loading and batching during training and validation:
```python=
class TranslationDataset(Dataset):
def __init__(self, csv):
self.csv = csv
def __len__(self):
return len(self.csv)
def __getitem__(self, idx):
return (
self.csv['Chinese'].iloc[idx], # Source sentence (Chinese)
self.csv['English'].iloc[idx] # Target sentence (English)
)
train_dataset = TranslationDataset(train_csv)
valid_dataset = TranslationDataset(valid_csv)
```
#### Building Vocabularies
The `build_vocab_from_iterator` function from `torchtext` is used to create vocabulary objects for both the source and target languages. These vocabularies map each unique word or token to a numerical index.
```python=
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']
def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}
for data_sample in tqdm(data_iter,total=len(data_iter),leave=False):
yield token_transform[language](data_sample[language_index[language]])
vocab_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
vocab_transform[ln] = build_vocab_from_iterator(
yield_tokens(train_dataset, ln),
min_freq=2,
specials=special_symbols,
special_first=True
)
vocab_transform[ln].set_default_index(UNK_IDX)
print(f'Chinese Vocabulary size: {len(vocab_transform[SRC_LANGUAGE])}')
print(f'English Vocabulary size: {len(vocab_transform[TGT_LANGUAGE])}')
```
```
Chinese Vocabulary size: 14116
English Vocabulary size: 9027
```
### Essential Data Processing Functions
To bridge the gap between raw text data and the Transformer model's numerical input, we define a set of essential functions. These functions create a streamlined pipeline that converts sentences into a format suitable for model training and inference.
#### Text Transformation Pipeline
The `text_transform` function is a crucial component. It encapsulates three key steps to convert raw text into tensor indices:
1. **Tokenization (`token_transform`):** Breaks down sentences into individual words or subwords (tokens). We utilize spaCy's tokenizers for this, tailored for both Chinese and English.
2. **Numericalization (`vocab_transform`):** Maps each token to its corresponding numerical index in the vocabulary we built earlier. This step is essential for the model to understand the input data.
3. **Tensor Conversion (`tensor_transform`):** Transforms the list of token indices into a PyTorch tensor. Additionally, it adds special start-of-sequence (`<bos>`) and end-of-sequence (`<eos>`) tokens to mark the boundaries of each sentence. These are important for training the Transformer.
```python=
from functools import reduce

def sequential_transforms(*transforms):
    """Compose the given transforms so they are applied left to right."""
    return lambda x: reduce(lambda acc, f: f(acc), transforms, x)
def tensor_transform(token_ids: List[int]):
return torch.cat((torch.tensor([BOS_IDX]),
torch.tensor(token_ids),
torch.tensor([EOS_IDX])))
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
text_transform[ln] = sequential_transforms(token_transform[ln],
vocab_transform[ln],
tensor_transform)
```
#### Testing Tokenization and Numericalization
Let's quickly verify that our text transformation pipeline works as intended:
```python=
def translate_tensor(src_tensor: torch.Tensor, ln):
tokens = src_tensor[1:-1].numpy() # Remove <bos> and <eos>
return " ".join(vocab_transform[ln].lookup_tokens(list(tokens)))
def test_tokenize(sentence: str, ln):
token_ids = text_transform[ln](sentence).view(-1)
print(f'Original: {sentence}')
print(f'Tokenized: {token_ids}')
print(f'Translated back: {translate_tensor(token_ids, ln)}\n')
test_tokenize("It's never too late to study machine learning.", TGT_LANGUAGE)
test_tokenize("活到老,学到老。", SRC_LANGUAGE)
```
```
Original: It's never too late to study machine learning.
Tokenized: tensor([ 2, 38, 16, 136, 108, 223, 7, 263, 906, 657, 3])
Translated back: It 's never too late to study machine learning
Original: 活到老,学到老。
Tokenized: tensor([ 2, 587, 50, 365, 11, 2473, 365, 4, 3])
Translated back: 活 到 老 , 学到 老 。
```
#### Batch Preparation (Data Collation)
To feed data to the Transformer model efficiently, we need to group sentences into batches. Since sentences vary in length, we use padding to ensure all examples within a batch have the same size.
We define a custom function called `collate_fn` to handle this batch preparation process:
```python=
# Function to collate data samples into batch tensors
def collate_fn(batch):
src_batch, tgt_batch = [], []
for src_sample, tgt_sample in batch:
# Remove trailing newline characters (common in text files)
src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))
# Pad sequences for consistent batch length (and ensure batch_first=True)
src_batch = pad_sequence(src_batch, padding_value=PAD_IDX, batch_first=True)
tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX, batch_first=True)
return src_batch, tgt_batch
```
##### Explanation of collate_fn
1. **Create Empty Lists:** Initializes empty lists (`src_batch`, `tgt_batch`) to store the tokenized and numericalized source and target sentences.
2. **Iterate and Transform:** Loops through each sentence pair in the batch. Applies the `text_transform` to both the source and target sentences, converting them into numerical representations (tensors). It also removes any trailing newline characters (common in text files) before applying the transformations.
3. **Pad to Maximum Length:** Determines the maximum length of the sentences in the batch and uses the `pad_sequence` function to pad shorter sentences with the `<pad>` token (`PAD_IDX`). This creates tensors of uniform length, which is necessary for batch processing.
4. **Return Batch Tensors:** Returns a tuple containing the padded source batch and target batch tensors, ready to be fed into the Transformer model.
#### DataLoader
We create `DataLoader` instances to efficiently manage batching and data loading during training and validation:
```python=
# BATCH_SIZE is set in the Hyperparameters section below
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn,
                              shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size=BATCH_SIZE,
                              collate_fn=collate_fn)
```
### Hyperparameters
Hyperparameters significantly influence the model's performance, training speed, and memory usage. Finding the right balance is crucial.
#### Key Hyperparameters
* **Model Size and Architecture:**
* **`EMB_SIZE` (Embedding Size):** Determines the dimensionality of the word embeddings. Larger embeddings can capture more nuanced information but require more memory.
* **`NHEAD` (Number of Attention Heads):** Controls the number of attention mechanisms used in the Transformer. More heads can potentially learn diverse patterns, but it increases the computational cost.
* **`FFN_HID_DIM` (Feedforward Network Hidden Dimension):** The size of the hidden layers in the feedforward networks within each Transformer layer. Larger dimensions offer more capacity but can lead to longer training times.
* **`NUM_ENCODER_LAYERS` & `NUM_DECODER_LAYERS`:** The number of encoder and decoder layers in the Transformer. Deeper models are generally more powerful, but they also demand more resources.
* **Training Parameters:**
* **`BATCH_SIZE`:** The number of sentence pairs processed in each training step. Larger batch sizes can leverage parallel processing but may exceed GPU memory.
* **`NUM_EPOCHS`:** The number of times the entire dataset is passed through the model during training. More epochs may improve accuracy, but excessive epochs can lead to overfitting.
* **Optimization:**
* **Learning Rate** (not listed above, but crucial): Controls the step size when updating model parameters during optimization. It significantly impacts training speed and convergence.
#### Parameter Selection Strategy
1. **GPU Constraints:** Consider your GPU's VRAM. In this example, with a GTX 1650 Ti (4GB VRAM), models with fewer than 10 million parameters are advisable to avoid out-of-memory errors. The provided table can help you choose suitable values for `EMB_SIZE`, `NHEAD`, and the number of layers.
2. **Start Small:** Begin with a smaller model (fewer layers, smaller embedding size) and gradually increase complexity while monitoring performance and resource usage.
3. **Balance Batch Size:** Aim for the largest `BATCH_SIZE` that your GPU can handle without running out of memory. This can speed up training.
4. **Tune the Learning Rate:** Experiment with different learning rates. A common approach is to start with a relatively large learning rate (e.g., 0.001) and gradually decrease it during training.
#### Hyperparameters in this Example
```python=
SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
EMB_SIZE = 192 # Embedding size (divisible by NHEAD)
NHEAD = 6 # Number of attention heads
FFN_HID_DIM = 192 # Feedforward network hidden dimension
BATCH_SIZE = 192
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3
DEVICE = 'cuda'
NUM_EPOCHS = 120
```
### Model Implementation
#### Masking for Transformer Attention
Transformers leverage attention mechanisms to weigh the importance of different words in a sentence. However, not all words should be attended to equally in every context. We use masks to control the attention patterns:
1. **Source Mask (`src_mask`):**
* Shape: `(S, S)` (S: source sequence length)
* Purpose: Informs the encoder which positions in the source sequence are valid for attention. Since all positions are relevant, this mask is typically filled with `False`.
2. **Target Mask (`tgt_mask`):**
* Shape:` (T, T)` (T: target sequence length)
* Purpose: Prevents the decoder from attending to future positions when generating output. This ensures the model generates translations autoregressively (word by word, from left to right). The mask is a square matrix whose entries above the diagonal are $-\infty$ and whose entries on and below the diagonal are 0.
3. **Source Padding Mask (`src_padding_mask`):**
* Shape: `(N, S)` (N: batch size)
* Purpose: Identifies padding tokens (`PAD_IDX`) in the source sequence so the model can ignore them during attention calculations.
4. **Target Padding Mask (`tgt_padding_mask`):**
* Shape: `(N, T)`
* Purpose: Similar to the source padding mask, but for the target sequence.
```python=
def generate_square_subsequent_mask(sz):
"""Generates an upper-triangular mask for the target sequence."""
mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
return mask
def create_mask(src, tgt):
"""Creates all four masks needed for the Transformer."""
src_seq_len = src.shape[1]
tgt_seq_len = tgt.shape[1]
tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)
src_padding_mask = (src == PAD_IDX)
tgt_padding_mask = (tgt == PAD_IDX)
return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
```
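As a quick sanity check (assuming `DEVICE` has already been set, as in the Hyperparameters section), the causal mask for a length-4 target allows each position to attend only to itself and earlier positions:
```python=
# 0 on and below the diagonal, -inf above it
print(generate_square_subsequent_mask(4))
# [[0., -inf, -inf, -inf],
#  [0.,   0., -inf, -inf],
#  [0.,   0.,   0., -inf],
#  [0.,   0.,   0.,   0.]]
```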
#### Positional Encoding
The Transformer architecture doesn't inherently have a way to understand the order of words in a sentence. Positional encoding is a technique to inject information about word positions into the model. It adds a unique vector to each word embedding based on its position in the sentence.
```python=
class PositionalEncoding(nn.Module):
def __init__(self, d_model, dropout, max_len=5000):
"""
        :param max_len: Maximum sequence length supported by the precomputed encodings.
:param d_model: Embedding dimension.
:param dropout: Dropout value (default=0.1)
"""
super(PositionalEncoding, self).__init__()
self.dropout = nn.Dropout(p=dropout)
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0)
self.register_buffer('pe', pe)
def forward(self, x):
"""
Inputs of forward function
:param x: the sequence fed to the positional encoder model (required).
Shape:
x: [sequence length, batch size, embed dim]
output: [sequence length, batch size, embed dim]
"""
x = x + self.pe[:, :x.size(1)]
return self.dropout(x)
```
#### Token Embedding
Token embedding is the process of converting each word in a sentence into a dense vector representation. These embeddings are usually learned during training and capture semantic information about the words.
```python=
class TokenEmbedding(nn.Module):
def __init__(self, vocab_size: int, emb_size):
super(TokenEmbedding, self).__init__()
self.embedding = nn.Embedding(vocab_size, emb_size)
self.emb_size = emb_size
def forward(self, tokens: Tensor):
return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
```
#### Sequence-to-Sequence Transformer Model (`Seq2SeqTransformer`)
This is the core architecture for our translation task. It consists of:
* **Encoder:** Processes the source (Chinese) sentence, producing a contextualized representation of each word.
* **Decoder:** Generates the target (English) sentence one word at a time, conditioned on the encoder's output and the previously generated words.
* **Token Embeddings and Positional Encoding:** These layers are applied to both the source and target sequences.
* **Generator:** The final linear layer that projects the decoder's output into the target vocabulary space for word prediction.
```python=
class Seq2SeqTransformer(nn.Module):
def __init__(
self,
num_encoder_layers: int,
num_decoder_layers: int,
emb_size: int,
nhead: int,
src_vocab_size: int,
tgt_vocab_size: int,
dim_feedforward: int = 512,
dropout: float = 0.1
):
super(Seq2SeqTransformer, self).__init__()
self.transformer = Transformer(
d_model=emb_size,
nhead=nhead,
num_encoder_layers=num_encoder_layers,
num_decoder_layers=num_decoder_layers,
dim_feedforward=dim_feedforward,
dropout=dropout,
batch_first=True
)
self.generator = nn.Linear(emb_size, tgt_vocab_size)
self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
self.positional_encoding = PositionalEncoding(
emb_size, dropout=dropout)
def forward(self,
src: Tensor,
trg: Tensor,
src_mask: Tensor,
tgt_mask: Tensor,
src_padding_mask: Tensor,
tgt_padding_mask: Tensor,
memory_key_padding_mask: Tensor):
src_emb = self.positional_encoding(self.src_tok_emb(src))
tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
return self.generator(outs)
def encode(self, src: Tensor, src_mask: Tensor):
return self.transformer.encoder(self.positional_encoding(
self.src_tok_emb(src)), src_mask)
def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
return self.transformer.decoder(self.positional_encoding(
self.tgt_tok_emb(tgt)), memory,
tgt_mask)
```
```python=
# Model instantiation and summary
model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM).to(DEVICE)
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")
```
```
8,928,999 total parameters.
8,928,999 training parameters.
```
### Training and Validation
With the data pipeline and model in place, we can now train the network, track training and validation loss, and keep the best checkpoint.
#### Training and Evaluation Loop
The core training and evaluation procedures are encapsulated in the following functions.
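The loops below reference a `loss_fn` and an `optimizer` that are not defined elsewhere in this post. A reasonable setup (an assumption, not necessarily the exact configuration used to produce the results shown later) is cross-entropy that ignores padded positions, plus Adam with the hyperparameters recommended in "Attention Is All You Need":
```python=
# Assumed definitions for the objects used in train_epoch() and evaluate()
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
```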
```python=
def train_epoch(model, optimizer):
"""Performs one epoch of training."""
model.train() # Set the model to training mode
total_loss = 0 # Initialize total loss for the epoch
for src, tgt in tqdm(train_dataloader, desc="Training", leave=False, total=len(train_dataloader)):
src = src.to(DEVICE)
tgt = tgt.to(DEVICE)
tgt_input = tgt[:, :-1] # Exclude the last token for input
# Create masks
src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
# Forward pass
logits = model(src, tgt_input, src_mask, tgt_mask,
src_padding_mask, tgt_padding_mask, src_padding_mask)
# Compute loss
tgt_out = tgt[:, 1:] # Exclude the first token for output
loss = loss_fn(logits.reshape(-1, TGT_VOCAB_SIZE), tgt_out.reshape(-1))
# Backpropagation and optimization
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item() # Accumulate loss
return total_loss / len(train_dataloader) # Return average loss
```
#### Validation Step
```python=
def evaluate(model):
"""Evaluates the model on the validation set."""
model.eval() # Set the model to evaluation mode
total_loss = 0
with torch.no_grad():
        for src, tgt in tqdm(valid_dataloader, desc="Validating", leave=False, total=len(valid_dataloader)):
# Similar logic as in training, but without backpropagation
src = src.to(DEVICE)
tgt = tgt.to(DEVICE)
tgt_input = tgt[:, :-1]
src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
logits = model(src, tgt_input, src_mask, tgt_mask,
src_padding_mask, tgt_padding_mask, src_padding_mask)
tgt_out = tgt[:, 1:]
loss = loss_fn(logits.reshape(-1, TGT_VOCAB_SIZE), tgt_out.reshape(-1))
total_loss += loss.item()
    return total_loss / len(valid_dataloader) # Return average loss
```
#### Model Training and Evaluation
We will now execute the training process, iterating over epochs, and tracking both training and validation loss. We will save the model checkpoint exhibiting the lowest validation loss.
```python=
# Lists to store loss values
train_losses, valid_losses = [], []
best_epoch = -1
best_val_loss = float('inf')
for epoch in range(1, NUM_EPOCHS + 1):
epoch_start_time = time.time()
train_loss = train_epoch(model, optimizer)
valid_loss = evaluate(model)
epoch_end_time = time.time()
train_losses.append(train_loss)
valid_losses.append(valid_loss)
if epoch%10==0:
print(f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {valid_loss:.3f}, "
f"Epoch time = {(epoch_end_time - epoch_start_time):.3f}s")
# Save the best model
if valid_loss < best_val_loss:
best_epoch = epoch
best_val_loss = valid_loss
torch.save(model.state_dict(), 'outputs/model-best.pth')
# Save model checkpoints at each epoch (optional)
# torch.save(model.state_dict(), f'outputs/model-epoch{epoch}-loss{valid_loss:.5f}.pth')
print(f'Best epoch: {best_epoch}')
```
```
Epoch: 10, Train loss: 3.918, Val loss: 3.773, Epoch time = 931.681s
Epoch: 20, Train loss: 3.287, Val loss: 3.269, Epoch time = 888.858s
Epoch: 30, Train loss: 2.827, Val loss: 2.954, Epoch time = 529.746s
Epoch: 40, Train loss: 2.474, Val loss: 2.749, Epoch time = 974.280s
Epoch: 50, Train loss: 2.192, Val loss: 2.621, Epoch time = 1061.845s
Epoch: 60, Train loss: 1.973, Val loss: 2.538, Epoch time = 458.240s
Epoch: 70, Train loss: 1.794, Val loss: 2.487, Epoch time = 459.281s
Epoch: 80, Train loss: 1.653, Val loss: 2.455, Epoch time = 459.611s
Epoch: 90, Train loss: 1.532, Val loss: 2.439, Epoch time = 458.892s
Epoch: 100, Train loss: 1.435, Val loss: 2.434, Epoch time = 459.066s
Epoch: 110, Train loss: 1.440, Val loss: 2.434, Epoch time = 1194.564s
Best epoch: 97
```
##### Plot Loss Curve
```python=
# Plotting the loss curves
def save_plots(train_loss, valid_loss):
"""
Function to save the loss plots to disk.
"""
# Loss plots.
plt.figure(figsize=(10, 7))
plt.plot(
train_loss, color='blue', linestyle='-',
label='train loss'
)
plt.plot(
valid_loss, color='red', linestyle='-',
        label='validation loss'
)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.savefig(os.path.join('outputs', 'loss.png'))
plt.show()
save_plots(train_losses, valid_losses)
```

### Translation Examples
#### Translation Functions
Now that we have a trained model, let's define the functions to translate sentences from Chinese to English.
```python=
def greedy_decode(model, src, src_mask, max_len, start_symbol):
"""
Generates the target sequence using a greedy decoding approach.
Args:
model: The trained Transformer model.
src: The source sentence tensor (Chinese).
src_mask: The source mask.
max_len: The maximum length for the generated target sequence.
start_symbol: The index of the start-of-sequence token (`<bos>`).
Returns:
The generated target sequence tensor (English).
"""
src = src.to(DEVICE)
src_mask = src_mask.to(DEVICE)
memory = model.encode(src, src_mask) # Encoder's output
ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE) # Initialize target sequence with <bos>
for i in range(max_len - 1):
memory = memory.to(DEVICE) # Ensure memory is on the correct device
tgt_mask = (generate_square_subsequent_mask(ys.size(1)).type(torch.bool)).to(DEVICE)
out = model.decode(ys, memory, tgt_mask)
prob = model.generator(out[:, -1]) # Get probabilities for the next word
_, next_word = torch.max(prob, dim=1)
next_word = next_word.item()
ys = torch.cat([ys,
torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=1)
if next_word == EOS_IDX: # Stop if <eos> is generated
break
return ys
def translate(model: torch.nn.Module, src_sentence: str):
"""Translates a given source sentence into the target language."""
model.eval() # Set the model to evaluation mode
# Preprocess the source sentence
    src = text_transform[SRC_LANGUAGE](src_sentence).view(1, -1)  # Shape: (1, src_len), since the model is batch_first
    num_tokens = src.shape[1]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool).to(DEVICE)
# Generate the target sequence using greedy decoding
tgt_tokens = greedy_decode(
model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX
).flatten()
# Convert the generated tokens back to text and remove special tokens
return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")
def translate_tensor(src_tensor: torch.Tensor, ln):
"""Helper function to translate a tensor of token IDs back to text."""
ys = src_tensor[1:-1] # Remove <bos> and <eos>
return " ".join(vocab_transform[ln].lookup_tokens(list(ys.numpy()))).replace("<bos>", "").replace("<eos>", "")
```
#### Evaluating on Sample Sentences
Let's see how our model performs on some example sentences from the validation set:
```python=
# Define some sample sentences and their expected translations
sample_sentences = [
("我不怕死。", "I'm not scared to die."),
("你最好確認那是對的。", "You'd better make sure that it is true."),
("The clock has stopped.", "時鐘已經停了"),
("Please check the menu.", "請確認菜單."),
("I am a student.", "我是個學生."),
("Why is the training so slow?","為什麼訓練這麼慢"),
("We are ready for the test.","我們準備好考試了"),
("The dog haven't eat anything for 3 days.","狗三天沒吃任何東西"),
("Yesterday, I went to a park.","昨天我去了公園"),
("I love eating spicy food.","我愛吃辣的食物"),
("Knowledge is power.","知識就是力量"],
("Don't judge a book by its cover.","人不可貌相")
]
# Iterate through the sample sentences and print translations
for src_sentence, expected_translation in sample_sentences:
predicted_translation = translate(model, src_sentence)
print(f"SRC: {src_sentence}")
print(f"Expected: {expected_translation}")
print(f"Predicted: {predicted_translation}\n")
```
```
SRC: I'm not scared to die
GT: 我不怕死.
PRED: 我 不怕 死 。
SRC: You'd better make sure that it is true.
GT: 你最好確認那是對的.
PRED: 你 最好 做 的 事情 是 真的 。
SRC: The clock has stopped.
GT: 時鐘已經停了
PRED: 時鐘 已 經 停止 了 。
SRC: Please check the menu.
GT: 請確認菜單.
PRED: 请 把 菜单 。
SRC: I am a student.
GT: 我是個學生.
PRED: 我 是 學生 。
SRC: Why is the training so slow?
GT: 為什麼訓練這麼慢
PRED: 为什么 这么 慢 ?
SRC: We are ready for the test.
GT: 我們準備好考試了
PRED: 我 們 準備 好 考試 。
SRC: The dog haven't eat anything for 3 days.
GT: 狗三天沒吃任何東西
PRED: 狗 的 三 天 没 吃 东西 。
SRC: Yesterday, I went to a park.
GT: 昨天我去了公園
PRED: 昨天 我 去 公園 了 。
SRC: I love eating spicy food.
GT: 我愛吃辣的食物
PRED: 我 爱 吃 辣 。
SRC: Knowledge is power.
GT: 知識就是力量
PRED: 知識 就是 力量 。
SRC: Don't judge a book by its cover.
GT: 人不可貌相
PRED: 不要 根据 封面 一 本书 。
```
### Challenges and Solutions
#### Memory Exceedance (Out-of-Memory Errors)
* **Model Size:** If your model is too complex (large `EMB_SIZE`, many layers), it consumes more memory. Try reducing the embedding size or the number of layers. Raising the `min_freq` threshold when building the vocabularies can also help by shrinking the vocabulary (and thus the embedding and output layers).
* **Long Sequences:** If your dataset contains extremely long sentences, padding can lead to large batches that exceed your GPU memory. Consider setting a maximum sequence length and truncating longer sentences.
* **Batch Size:** Reducing the batch size directly decreases memory usage per step. However, if your model is still too large, you might encounter issues even with small batch sizes.
#### Long Training Time
* **Progress Monitoring (`tqdm`):** Always use the `tqdm` library for loops that might take a while. This provides a progress bar and time estimate, allowing you to better gauge training duration.
* **Use GPU:** Training on a CPU is significantly slower than on a GPU. Ensure your code is running on a GPU if available.
* **Batch Size Optimization:** While smaller batch sizes reduce memory usage, they can also lead to slower training. Experiment to find a balance between memory constraints and training speed. Increasing the batch size too much can lead to out-of-memory errors.
#### Performance Issues
* **Data Quantity:** Language models, particularly larger ones, thrive on massive amounts of data. Ensure your dataset contains enough examples (at least 10,000 sentence pairs, ideally more) to enable the model to generalize well.
* **Data Quality:** Always inspect your dataset for potential issues. Web-crawled data can sometimes have misaligned translations. If you can't filter out bad examples, consider using a different, cleaner dataset.
#### Additional Tips
* **Mixed Precision Training:** Using mixed precision (e.g., `torch.cuda.amp`) can significantly reduce memory usage and speed up training on compatible GPUs.
* **Gradient Accumulation:** If you need larger effective batch sizes but are limited by memory, consider gradient accumulation: accumulate gradients over several smaller batches before each optimizer step. A sketch combining this with mixed precision appears after this list.
* **Profiling:** Use PyTorch's profiling tools to identify performance bottlenecks in your code. This can guide you in optimizing specific parts of your training loop.
* **Experimentation:** Don't be afraid to experiment with different hyperparameters and architectures. The optimal configuration will depend on your specific dataset and goals.
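As a rough illustration of the mixed-precision and gradient-accumulation tips, here is a hedged sketch (not part of the original training code; it reuses the dataloaders, masks, and the assumed `loss_fn`/`optimizer` from above) of how `train_epoch` could adopt both techniques:
```python=
scaler = torch.cuda.amp.GradScaler()
ACCUM_STEPS = 4  # effective batch size becomes BATCH_SIZE * ACCUM_STEPS

def train_epoch_amp(model, optimizer):
    model.train()
    total_loss = 0
    optimizer.zero_grad()
    for step, (src, tgt) in enumerate(train_dataloader):
        src, tgt = src.to(DEVICE), tgt.to(DEVICE)
        tgt_input = tgt[:, :-1]
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
        with torch.cuda.amp.autocast():              # forward pass in mixed precision
            logits = model(src, tgt_input, src_mask, tgt_mask,
                           src_padding_mask, tgt_padding_mask, src_padding_mask)
            loss = loss_fn(logits.reshape(-1, TGT_VOCAB_SIZE), tgt[:, 1:].reshape(-1))
        scaler.scale(loss / ACCUM_STEPS).backward()  # accumulate scaled gradients
        if (step + 1) % ACCUM_STEPS == 0:
            scaler.step(optimizer)                   # unscale gradients and apply the update
            scaler.update()
            optimizer.zero_grad()
        total_loss += loss.item()
    # Note: any leftover batches (when the count is not a multiple of ACCUM_STEPS)
    # are ignored here for simplicity.
    return total_loss / len(train_dataloader)
```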
## Reference
* Andreas Geiger. "L08 - Sequence Models." Deep Learning Course, University of Tübingen. Accessed August 3, 2024. [link](https://drive.google.com/file/d/1-riYFFIfJCCP5K002n5LepGL-AEhPBwK/view)
* Aston Zhang, Zack C. Lipton, Mu Li, and Alex J. Smola. "Chapter 11 - Attention Mechanisms." Dive into Deep Learning. Accessed August 3, 2024. [link](https://d2l.ai/)
* Ashish Vaswani, et al. "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (NIPS 2017). [link](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)
* Zake. "從零開始的 Sequence to Sequence." September 28, 2017. Accessed August 3, 2024. [link](https://zake7749.github.io/2017/09/28/Sequence-to-Sequence-tutorial/)
* Jay Alammar. "The Illustrated Transformer." Accessed August 3, 2024. [link](https://jalammar.github.io/illustrated-transformer/)
* Foam Liu. "英中机器文本翻译." GitHub repository. Accessed August 3, 2024. [link](https://github.com/foamliu/Transformer/tree/master)
* Sovit Ranjan Rath. "Language Translation using PyTorch Transformer." DebuggerCafe. Accessed August 3, 2024. [link](https://debuggercafe.com/language-translation-using-pytorch-transformer/)