---
title: Final-term Project
---

# Final-term Project

20337073 林均 (Lamkuan)

## Topic
### Final-term Project: Chinese-to-English Machine Translation

* Dataset description: the archive contains three folders, corresponding to the training, validation, and test sets, with 10000, 1000, and 1000 sentence pairs respectively.
* Each folder contains src data (Chinese) and target data (English).

## Content

### 0. The experimental method

**1. Data cleaning**
The first step is to clean the data: remove illegal characters, rare words, and excessively long sentences. This can be done with regular expressions or natural language processing tools.

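A minimal cleaning pass might look like the following sketch; the character whitelist and length cutoff here are illustrative choices, not the exact rules used in this project:

```python
import re

def clean_sentence(text, max_tokens=50):
    """Strip unwanted characters, collapse whitespace, and reject
    over-long sentences (whitelist and max_tokens are illustrative)."""
    # Keep CJK characters, word characters, whitespace, and common punctuation
    text = re.sub(r"[^\u4e00-\u9fff\w\s.,!?;:'\"()\-]", "", text)
    # Collapse runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    # Signal the caller to discard sentences that are too long
    return None if len(text.split()) > max_tokens else text
```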
**2. Word segmentation**
The next step is to segment the sentences into words. This is necessary because machine translation models typically work with words or subwords rather than characters. For English, tokenization can be done with tools such as NLTK, or with subword methods such as BPE or WordPiece. For Chinese, word segmentation tools such as Jieba or HanLP can be used.

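To make the subword idea concrete, here is a toy sketch of a single BPE merge step; real BPE repeats this until a target vocabulary size is reached (this is an illustration, not the tokenizer used in the project):

```python
from collections import Counter

def bpe_merge_step(corpus):
    """One BPE merge: find the most frequent adjacent symbol pair in the
    corpus (a list of symbol lists) and merge it everywhere."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        return corpus, None
    best = max(pairs, key=pairs.get)
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged, best

corpus, pair = bpe_merge_step([list("lower"), list("lowest")])
# pair == ('l', 'o'); corpus[0] == ['lo', 'w', 'e', 'r']
```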
**3. Dictionary construction**
A statistical dictionary can be constructed from the results of the word segmentation step. This dictionary can be used to filter out words that occur less frequently, and to prevent the size of the vocabulary from becoming too large. It is recommended to initialize the word vectors in the machine translation model with pre-trained word vectors, which can be obtained from a variety of sources.

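The dictionary step described above can be sketched in a few lines; the size cap, frequency cutoff, and special-token ids mirror the description, but the exact values are illustrative:

```python
from collections import Counter

def build_vocab(token_lists, max_size=10000, min_freq=2):
    """Count tokens, keep only frequent ones, cap the vocabulary size,
    and reserve the first ids for special tokens (values illustrative)."""
    freq = Counter(tok for tokens in token_lists for tok in tokens)
    specials = ['<pad>', '<start>', '<end>', '<unk>']
    kept = [w for w, c in freq.most_common(max_size - len(specials)) if c >= min_freq]
    return {w: i for i, w in enumerate(specials + kept)}
```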
**4. Seq2Seq model**
The next step is to build a Seq2Seq model, a type of neural network commonly used for machine translation. It can be implemented with a variety of frameworks, such as PyTorch or TensorFlow. The model should have 3-4 layers each for the encoder and decoder, and each can be unidirectional or bidirectional.

**5. Attention mechanism**
The attention mechanism is a technique that can be used to improve the performance of Seq2Seq models. The attention mechanism allows the model to focus on specific parts of the input sentence when generating the output sentence. The attention mechanism can be implemented in a variety of ways, such as using dot product, multiplicative, or additive alignment functions.

**6. Evaluation**
The final step is to evaluate the performance of the machine translation model. This can be done by using a variety of metrics, such as BLEU score, ROUGE score, and METEOR score. The model can be evaluated on a held-out dataset, or on a test set.

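To make the BLEU metric concrete, here is a simplified self-contained sketch (clipped n-gram precision combined with a brevity penalty); real evaluation should use a library implementation such as nltk.translate.bleu_score:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=4):
    """A teaching sketch of BLEU, not a replacement for NLTK's version."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0  # any empty n-gram overlap zeroes the geometric mean
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)
```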
### 1. Principle

#### 1. Seq2Seq model based on GRU

(1) LuongAttnDecoderRNN: a class that implements a Luong attention decoder for sequence-to-sequence (Seq2Seq) models. Its constructor takes five arguments:

* attn_model: The type of attention mechanism to use. The following methods are supported:
  * dot: The dot-product attention mechanism.
  * general: The general (multiplicative) attention mechanism.
  * concat: The concatenation (additive) attention mechanism.
* hidden_size: The size of the hidden state.
* output_size: The size of the output vocabulary.
* n_layers: The number of layers in the GRU.
* dropout: The dropout rate.


```python3
class LuongAttnDecoderRNN(nn.Module):
    def __init__(self, attn_model, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(LuongAttnDecoderRNN, self).__init__()

        # Keep for reference
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout

        # Define layers
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.embedding_dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

        self.attn = Attn(attn_model, hidden_size)

    def forward(self, input_step, last_hidden, encoder_outputs):
        # Note: we run this one step (word) at a time
        # Get embedding of current input word
        embedded = self.embedding(input_step)
        embedded = self.embedding_dropout(embedded)
        # Forward through unidirectional GRU
        rnn_output, hidden = self.gru(embedded, last_hidden)
        # Calculate attention weights from the current GRU output
        attn_weights = self.attn(rnn_output, encoder_outputs)
        # Multiply attention weights to encoder outputs to get new "weighted sum" context vector
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
        # Concatenate weighted context vector and GRU output using Luong eq. 5
        rnn_output = rnn_output.squeeze(0)
        context = context.squeeze(1)
        concat_input = torch.cat((rnn_output, context), 1)
        concat_output = torch.tanh(self.concat(concat_input))
        # Predict next word using Luong eq. 6
        output = self.out(concat_output)
        output = F.softmax(output, dim=1)
        # Return output and final hidden state
        return output, hidden
```

The class defines a forward method that takes three arguments:

* input_step: The current input word.
* last_hidden: The previous hidden state.
* encoder_outputs: The encoder's outputs.

The forward method first converts the current input word to a word embedding and applies dropout with probability dropout. It then passes the embedding through the GRU to obtain the GRU output and the new hidden state, and computes attention weights over the encoder outputs with the chosen attention mechanism. The attention weights are multiplied with the encoder outputs to produce a "weighted sum" context vector, which is concatenated with the GRU output (Luong's equation 5) and passed through a tanh. Finally, the next word is predicted from this vector (Luong's equation 6), and the method returns the output distribution and the final hidden state.

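For reference, the two Luong equations mentioned in the code comments are, with $h_t$ the GRU output at step $t$ and $c_t$ the attention context vector:

$$\tilde{h}_t = \tanh\big(W_c\,[c_t; h_t]\big) \qquad \text{(eq. 5)}$$

$$p(y_t \mid y_{<t}, x) = \operatorname{softmax}(W_s\,\tilde{h}_t) \qquad \text{(eq. 6)}$$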
The LuongAttnDecoderRNN class can be used to implement Luong attention decoders for Seq2Seq models. The following are some of the advantages of using Luong attention decoders:

* They can generate fluent and accurate outputs.
* They are relatively simple to implement.

The following are some of the disadvantages of using Luong attention decoders:

* They can be less efficient than other types of attention decoders.
* They can be less effective for long input sequences.

Here are some additional tips that may be helpful:

* Use a large dataset. Luong attention decoders are data-hungry, so it is important to use a large dataset to train them. The dataset should contain a variety of input sequences, so that the model can learn to generate different types of outputs for different types of inputs.
* Use a good optimizer. The optimizer is responsible for updating the model's weights during training. A good optimizer will help the model converge to a good solution more quickly.
* Use a good loss function. The loss function measures the error between the model's predictions and the ground truth. A good loss function will help the model learn to minimize the error.
* Use a regularization technique. Regularization techniques can help to prevent the model from overfitting the training data. Overfitting occurs when the model learns the training data too well, and as a result, it is not able to generalize to new data.
* Use a validation set. The validation set is used to evaluate the model's performance during training. The model should not be trained on the validation set, as this can lead to overfitting.
* Use early stopping. Early stopping is a technique that can help to prevent overfitting. Early stopping stops training the model when the model's performance on the validation set stops improving.


(2) EncoderRNN: a class that implements a bidirectional GRU encoder for sequence-to-sequence (Seq2Seq) models. Its constructor takes four arguments:

* input_size: The size of the input vocabulary.
* hidden_size: The size of the hidden state.
* n_layers: The number of layers in the GRU.
* dropout: The dropout rate.

```python3
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)

        # Initialize GRU; the input_size and hidden_size params are both set to 'hidden_size'
        #   because our input size is a word embedding with number of features == hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        # Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        # Pack padded batch of sequences for RNN module
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        # Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = torch.nn.utils.rnn.pad_packed_sequence(outputs)
        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, :, self.hidden_size:]
        # Return output and final hidden state
        return outputs, hidden
```

The class defines a forward method that takes three arguments:

* input_seq: The input sequence.
* input_lengths: The lengths of the input sequences.
* hidden: The initial hidden state.

The forward method first converts the input sequence to word embeddings using the embedding layer. Then, it packs the padded batch of sequences for the GRU module using the pack_padded_sequence function. Next, it forwards the packed batch of sequences through the GRU module and obtains the outputs and the hidden state. Finally, it unpacks the padding from the outputs using the pad_packed_sequence function, sums the bidirectional GRU outputs, and returns the outputs and the final hidden state.

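The shape bookkeeping above can be traced with a tiny example (illustrative sizes; assumes PyTorch's default seq-first layout):

```python
import torch
import torch.nn as nn

# Batch of 2 padded sequences (lengths 3 and 2), hidden_size = 4,
# one bidirectional GRU layer; index 0 plays the role of <pad>.
emb = nn.Embedding(10, 4)
gru = nn.GRU(4, 4, 1, bidirectional=True)

seqs = torch.tensor([[1, 2], [3, 4], [5, 0]])   # (max_len=3, batch=2)
lengths = torch.tensor([3, 2])

packed = nn.utils.rnn.pack_padded_sequence(emb(seqs), lengths)
outputs, hidden = gru(packed)
outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)
# Sum forward and backward directions into one hidden_size-wide output
outputs = outputs[:, :, :4] + outputs[:, :, 4:]

print(outputs.shape)  # torch.Size([3, 2, 4])
print(hidden.shape)   # torch.Size([2, 2, 4]) -> (num_layers * 2, batch, hidden)
```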
The EncoderRNN class can be used to implement bidirectional GRU encoders for Seq2Seq models. The following are some of the advantages of using bidirectional GRU encoders:

* They can capture long-range dependencies between input sequences.
* They can be used to train Seq2Seq models that can generate fluent and accurate outputs.

The following are some of the disadvantages of using bidirectional GRU encoders:

* They can be more complex to train than unidirectional GRU encoders.
* They can be more computationally expensive to train and run.



(3) Attn: a class that implements several attention mechanisms. Its constructor takes two arguments:

* method: The type of attention mechanism to use. The following methods are supported:
  * dot: The dot-product attention mechanism.
  * general: The general (multiplicative) attention mechanism.
  * concat: The concatenation (additive) attention mechanism.
* hidden_size: The size of the hidden state.

```python3
class Attn(torch.nn.Module):
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        self.method = method
        if self.method not in ['dot', 'general', 'concat']:
            raise ValueError(self.method, "is not an appropriate attention method.")
        self.hidden_size = hidden_size
        if self.method == 'general':
            self.attn = torch.nn.Linear(self.hidden_size, hidden_size)
        elif self.method == 'concat':
            self.attn = torch.nn.Linear(self.hidden_size * 2, hidden_size)
            self.v = torch.nn.Parameter(torch.FloatTensor(hidden_size))

    def dot_score(self, hidden, encoder_output):
        return torch.sum(hidden * encoder_output, dim=2)

    def general_score(self, hidden, encoder_output):
        energy = self.attn(encoder_output)
        return torch.sum(hidden * energy, dim=2)

    def concat_score(self, hidden, encoder_output):
        energy = self.attn(torch.cat((hidden.expand(encoder_output.size(0), -1, -1), encoder_output), 2)).tanh()
        return torch.sum(self.v * energy, dim=2)

    def forward(self, hidden, encoder_outputs):
        # Calculate the attention weights (energies) based on the given method
        if self.method == 'general':
            attn_energies = self.general_score(hidden, encoder_outputs)
        elif self.method == 'concat':
            attn_energies = self.concat_score(hidden, encoder_outputs)
        elif self.method == 'dot':
            attn_energies = self.dot_score(hidden, encoder_outputs)

        # Transpose max_length and batch_size dimensions
        attn_energies = attn_energies.t()

        # Return the softmax normalized probability scores (with added dimension)
        return F.softmax(attn_energies, dim=1).unsqueeze(1)
```

The class defines three methods for calculating the attention weights (energies):

* dot_score: Calculates the attention weights using the dot product attention mechanism.
* general_score: Calculates the attention weights using the general attention mechanism.
* concat_score: Calculates the attention weights using the concatenation attention mechanism.

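A shape walk-through of the dot-product path, mirroring dot_score and forward above (the sizes are illustrative):

```python
import torch
import torch.nn.functional as F

src_len, batch, hidden_size = 5, 2, 4
hidden = torch.randn(1, batch, hidden_size)               # current decoder output
encoder_outputs = torch.randn(src_len, batch, hidden_size)

energies = torch.sum(hidden * encoder_outputs, dim=2)     # broadcast -> (src_len, batch)
weights = F.softmax(energies.t(), dim=1).unsqueeze(1)     # (batch, 1, src_len)
context = weights.bmm(encoder_outputs.transpose(0, 1))    # (batch, 1, hidden_size)

# The weights along src_len sum to 1 for each batch element
assert torch.allclose(weights.sum(dim=2), torch.ones(batch, 1))
```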
#### 2. Train the Seq2Seq model

```python3
def train(input_variable, lengths, target_variable, mask, max_target_len, encoder, decoder, encoder_optimizer,
          decoder_optimizer):
    # Zero gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Set device options
    input_variable = input_variable.to(device)
    lengths = lengths.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)

    # Initialize variables
    loss = 0
    print_losses = []
    n_totals = 0

    # Forward pass through encoder
    encoder_outputs, encoder_hidden = encoder(input_variable, lengths.to("cpu"))
    # print('encoder_outputs.size(): ' + str(encoder_outputs.size()))
    # print('encoder_hidden.size(): ' + str(encoder_hidden.size()))

    # Create initial decoder input (start with SOS tokens for each sentence)
    decoder_input = torch.LongTensor([[SOS_token for _ in range(chunk_size)]])
    decoder_input = decoder_input.to(device)
    # print('decoder_input.size(): ' + str(decoder_input.size()))

    # Set initial decoder hidden state to the encoder's final hidden state
    decoder_hidden = encoder_hidden[:decoder.n_layers]
    # print('decoder_hidden.size(): ' + str(decoder_hidden.size()))

    # Determine if we are using teacher forcing this iteration
    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False
    # print('use_teacher_forcing: ' + str(use_teacher_forcing))

    # Forward batch of sequences through decoder one time step at a time
    if use_teacher_forcing:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # Teacher forcing: next input is current target
            decoder_input = target_variable[t].view(1, -1)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    else:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # No teacher forcing: next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0] for i in range(chunk_size)]])
            decoder_input = decoder_input.to(device)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal

    # Perform backpropagation
    loss.backward()

    # Clip gradients: gradients are modified in place
    _ = torch.nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = torch.nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    # Adjust model weights
    encoder_optimizer.step()
    decoder_optimizer.step()

    return sum(print_losses) / n_totals
```

#### Arguments: 

* input_variable: A tensor of shape [max_input_len, batch_size] containing the input sequences (time-major, matching the encoder's pack_padded_sequence call).
* lengths: A tensor of shape [batch_size] containing the lengths of the input sequences.
* target_variable: A tensor of shape [max_target_len, batch_size] containing the target sequences.
* mask: A boolean tensor of shape [max_target_len, batch_size] indicating which elements of the target sequences are valid (non-padding).
* max_target_len: The maximum length of the target sequences.
* encoder: An encoder RNN.
* decoder: A decoder RNN.
* encoder_optimizer: An optimizer for the encoder RNN.
* decoder_optimizer: An optimizer for the decoder RNN.

#### The function works as follows:

1. It zeroes the gradients of the encoder and decoder RNNs.
1. It sets the device options for the input, target, and mask tensors.
1. It initializes the loss to 0 and a list of print losses to [].
1. It forwards the input sequences through the encoder RNN to obtain the encoder outputs and hidden state.
1. It creates the initial decoder input: a tensor of shape [1, batch_size] filled with SOS tokens.
1. It sets the initial decoder hidden state to the encoder's final hidden state.
1. It determines whether to use teacher forcing or not.
1. If using teacher forcing, it forwards the decoder one time step at a time, using the target sequence as the next input.
1. If not using teacher forcing, it feeds the decoder's own highest-probability word (greedy decoding) back as the next input.
1. For each time step, it calculates the loss using the masked cross-entropy loss function.
1. It accumulates the loss and the number of valid elements in the target sequence.
1. It performs backpropagation to calculate the gradients of the encoder and decoder RNNs with respect to the loss.
1. It clips the gradients to prevent them from becoming too large.
1. It updates the weights of the encoder and decoder RNNs using the gradients.
1. It returns the average loss over the batch.

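train() calls a maskNLLLoss helper that is not listed in this report. A plausible reconstruction, matching how it is used above (softmax output, target word ids, and padding mask in; loss and number of valid tokens out):

```python
import torch

def maskNLLLoss(inp, target, mask):
    """Masked negative log-likelihood: average -log p(target) over
    non-padded positions only (reconstruction of the unlisted helper)."""
    nTotal = mask.sum()
    # Pick the predicted probability of each gold target word
    gathered = torch.gather(inp, 1, target.view(-1, 1)).squeeze(1)
    crossEntropy = -torch.log(gathered)
    loss = crossEntropy.masked_select(mask).mean()
    return loss, nTotal.item()
```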
#### 3. Pre-process:

(1) build_wordmap_zh: a function that builds a word map for Chinese and writes it to data/WORDMAP_zh.json.

I use the Jieba segmentation tool and add four special tokens to the word map: `<pad>`, `<start>`, `<end>`, and `<unk>`.


```python3
def build_wordmap_zh():
    translation_path = os.path.join(train_translation_folder, train_translation_zh_filename)

    with open(translation_path, 'r') as f:
        sentences = f.readlines()

    word_freq = Counter()

    for sentence in tqdm(sentences):
        seg_list = jieba.cut(sentence.strip())
        # Update word frequency
        word_freq.update(list(seg_list))

    # Create word map
    # words = [w for w in word_freq.keys() if word_freq[w] > min_word_freq]
    words = word_freq.most_common(output_lang_vocab_size - 4)
    word_map = {k[0]: v + 4 for v, k in enumerate(words)}
    word_map['<pad>'] = 0
    word_map['<start>'] = 1
    word_map['<end>'] = 2
    word_map['<unk>'] = 3
    print(len(word_map))
    print(words[:10])

    with open('data/WORDMAP_zh.json', 'w') as file:
        json.dump(word_map, file, indent=4)
```

(2) build_wordmap_en: a function that builds a word map for English and writes it to data/WORDMAP_en.json. The function takes no arguments.

I use the NLTK tokenizer and add four special tokens to the word map: `<pad>`, `<start>`, `<end>`, and `<unk>`.

```python3
def build_wordmap_en():
    translation_path = os.path.join(train_translation_folder, train_translation_en_filename)

    with open(translation_path, 'r') as f:
        sentences = f.readlines()

    word_freq = Counter()

    for sentence in tqdm(sentences):
        sentence_en = sentence.strip().lower()
        tokens = [normalizeString(s) for s in nltk.word_tokenize(sentence_en) if len(normalizeString(s)) > 0]
        # Update word frequency
        word_freq.update(tokens)

    # Create word map
    # words = [w for w in word_freq.keys() if word_freq[w] > min_word_freq]
    words = word_freq.most_common(input_lang_vocab_size - 4)
    word_map = {k[0]: v + 4 for v, k in enumerate(words)}
    word_map['<pad>'] = 0
    word_map['<start>'] = 1
    word_map['<end>'] = 2
    word_map['<unk>'] = 3
    print(len(word_map))
    print(words[:10])

    with open('data/WORDMAP_en.json', 'w') as file:
        json.dump(word_map, file, indent=4)

```

    
(3) build_samples: a function that builds the dataset of translation samples and saves it to disk:

1. It loads the word maps for the Chinese and English languages.
2. For each usage (train or valid):
   * It loads the Chinese and English translation files.
   * For each sentence pair, it segments the Chinese sentence with the jieba library and encodes it with the Chinese word map; it lower-cases the English sentence, tokenizes it with the nltk library, removes empty tokens, and encodes it with the English word map.
   * If both the input and the output sentence are no longer than the maximum length, and neither contains the UNK token, the pair is added to the list of samples.
3. It saves the list of samples to data/samples_train.json or data/samples_valid.json.

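The per-sentence encoding and filtering rule described above can be sketched as follows (pure-Python illustration; the real function uses jieba/nltk tokens and the JSON word maps built earlier):

```python
def encode(tokens, word_map):
    """Map tokens to ids, falling back to <unk> for out-of-vocabulary
    words, and append the <end> token."""
    return [word_map.get(t, word_map['<unk>']) for t in tokens] + [word_map['<end>']]

def keep_pair(src_ids, tgt_ids, word_map, max_len=50):
    """Keep a sample only if both sides fit in max_len and contain no <unk>."""
    unk = word_map['<unk>']
    return (len(src_ids) <= max_len and len(tgt_ids) <= max_len
            and unk not in src_ids and unk not in tgt_ids)
```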

## Results of the experiment
    
#### test.py
```python
# import the necessary packages

from models import EncoderRNN, LuongAttnDecoderRNN
from utils import *
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu

if __name__ == '__main__':
    input_lang = Lang('data/WORDMAP_zh.json')
    output_lang = Lang('data/WORDMAP_en.json')
    print("input_lang.n_words: " + str(input_lang.n_words))
    print("output_lang.n_words: " + str(output_lang.n_words))

    checkpoint = '{}/BEST_checkpoint.tar'.format(save_dir)  # model checkpoint
    print('checkpoint: ' + str(checkpoint))
    # Load model
    checkpoint = torch.load(checkpoint)
    encoder_sd = checkpoint['en']
    decoder_sd = checkpoint['de']

    print('Building encoder and decoder ...')
    # Initialize encoder & decoder models
    encoder = EncoderRNN(input_lang.n_words, hidden_size, encoder_n_layers, dropout)
    decoder = LuongAttnDecoderRNN(attn_model, hidden_size, output_lang.n_words, decoder_n_layers, dropout)

    encoder.load_state_dict(encoder_sd)
    decoder.load_state_dict(decoder_sd)

    # Use appropriate device
    encoder = encoder.to(device)
    decoder = decoder.to(device)
    print('Models built and ready to go!')

    # Set dropout layers to eval mode
    encoder.eval()
    decoder.eval()

    references = []
    hypothesis = []

    # Initialize search module
    searcher = GreedySearchDecoder(encoder, decoder)
    for input_sentence, target_sentence in pick_n_valid_sentences(input_lang, output_lang, 10):
        decoded_words = evaluate(searcher, input_sentence, input_lang, output_lang)
        print('> {}'.format(input_sentence))
        print('= {}'.format(target_sentence))
        print('< {}'.format(' '.join(decoded_words)))

        reference = []
        reference.append(target_sentence.split())
        references.append(target_sentence.split())

        candidate = list(decoded_words)
        hypothesis.append(candidate)

        # print(reference)

        print('Cumulative 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
        print('Cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))
        print('Cumulative 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0)))
        print('Cumulative 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))

    bleu_score = corpus_bleu([[ref] for ref in references], [hyp for hyp in hypothesis])

    print('bleu_score: ', bleu_score)
```
The code above evaluates the model using the BLEU metric. BLEU measures the similarity between a machine translation and a human-translated reference by comparing the n-gram (sequences of n words) overlap between the two.

The code first loads the model checkpoint and then initializes the encoder and decoder models. The encoder and decoder models are then loaded with the state dicts from the checkpoint. The encoder and decoder models are then set to eval mode.

The code then iterates through a list of input and target sentences. For each sentence, the searcher module generates a hypothesis, which is compared against the target sentence; the cumulative 1- to 4-gram BLEU scores for each sentence are printed.

Finally, the code calculates and prints the corpus-level BLEU score over all of the sentences.

#### Screenshot of experiment:
            

![](https://hackmd.io/_uploads/H1aHQObO2.png)
![](https://hackmd.io/_uploads/BJW_7dWOn.png)


The results of the experiment can be interpreted as follows:

* The cumulative BLEU-4 scores for each sentence indicate how similar the hypothesis is to the target sentence. A higher score indicates a more similar hypothesis.
* The corpus BLEU-4 score indicates the overall similarity between the hypotheses and the target sentences. A higher score indicates more similar hypotheses.


## Conclusion


From this experiment, I learned about the following:
            
* Sequence-to-sequence models with attention: I learned about the basic architecture of a sequence-to-sequence model with attention, and how it can be used for machine translation.
* The PyTorch deep learning library: I learned about the PyTorch deep learning library, and how it can be used to implement sequence-to-sequence models with attention.
* Evaluation metrics for machine translation: I learned about the BLEU score, which is a common metric for evaluating machine translation models.
            
Overall, this experiment was a good learning experience. I learned about the challenges of machine translation, the basics of sequence-to-sequence models with attention, and some of the factors that affect the performance of these models. I also had the opportunity to experiment with different settings and parameters to see how they affected the model. This experience will help me better understand machine translation and build better machine translation models in the future.