# 【Hung-yi Lee Machine Learning - L5 : Transformer Sequence-to-sequence】
:::info
- References: [2021 Spring](https://speech.ee.ntu.edu.tw/~hylee/ml/2021-spring.php), [2022 Spring](https://speech.ee.ntu.edu.tw/~hylee/ml/2022-spring.php), [2023 Spring](https://speech.ee.ntu.edu.tw/~hylee/ml/2023-spring.php), [2024 Spring](https://speech.ee.ntu.edu.tw/~hylee/genai/2024-spring.php), [2025](https://speech.ee.ntu.edu.tw/~hylee/ml/2025-spring.php)
- Knowledge Distillation
- Lecture 5 : Transformer Sequence-to-sequence
- HW 5 : Transformer
:::
<br/>
## Knowledge Distillation
Knowledge distillation is a model-compression technique: a trained "teacher" model teaches a smaller "student" model to imitate its outputs, so the model becomes smaller while keeping most of its performance.

>Recall: softmax produces probabilities that sum to 1.
>
Dividing every logit by a temperature T before the softmax makes the teacher's output distribution smoother, which helps the student model learn.
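
A minimal sketch of this idea, assuming a generic classification setting (the `distillation_loss` helper below is hypothetical and not part of the course code):
```=
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # soften both distributions with temperature T before the softmax
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the softened distributions, rescaled by T^2
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # ordinary cross-entropy against the hard labels
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```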

The model can also be compressed by using fewer bits (quantization), weight clustering, or a Binary Network.
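
A toy illustration of the "fewer bits" idea, assuming simple uniform post-training quantization (the helper below is hypothetical, not how the course implements it):
```=
import torch

def quantize_uniform(w, num_bits=8):
    # map weights onto 2**num_bits evenly spaced levels, then map back
    qmax = 2 ** num_bits - 1
    scale = (w.max() - w.min()) / qmax
    codes = torch.round((w - w.min()) / scale)   # integer codes in [0, qmax]
    return codes * scale + w.min()               # de-quantized approximation

w = torch.randn(4, 4)
print((w - quantize_uniform(w, num_bits=4)).abs().max())  # small reconstruction error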



Depthwise Separable Convolution
An efficient convolution that reduces computation and parameter count without significantly sacrificing accuracy.
(Review) In a standard CNN layer with 2 input channels, every filter spans both channels (filter depth = 2) and the number of filters is unrestricted,
so the number of output channels does not have to equal the number of input channels.

In the depthwise step of a Depthwise Separable Convolution, the number of filters equals the number of channels and each filter handles exactly one channel,
so the number of output channels must equal the number of input channels.



However, a depthwise convolution cannot capture relationships across channels,
so it is followed by a pointwise convolution (filters forced to be 1x1) that mixes the channels.
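
A minimal PyTorch sketch of the two steps (a hypothetical module for illustration, not the homework code): `groups=in_channels` makes the convolution depthwise, and the 1x1 pointwise convolution then mixes channels.
```=
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # depthwise: groups=in_channels, so each filter sees exactly one channel
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        # pointwise: 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 16, 16)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 16, 16])
```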







<br/>
## Lecture 5 : Transformer Sequence-to-sequence
The Transformer is a model architecture based entirely on attention (self-attention). It excels at sequence data and has largely replaced traditional RNNs and LSTMs.
- Self-Attention: lets the model "attend" to the key parts of the input sequence on its own. For example, when translating a sentence, the model can tell that "he" refers to "John" rather than someone else. It is computed from three vectors, Query (Q), Key (K), and Value (V); the output is a weighted average of the Values, where the weights come from the similarity between Query and Key (see the sketch after this list).
- Multi-Head Attention: run self-attention several times in parallel (multiple heads), so each head can focus on a different aspect of the sequence.
- Positional Encoding: the Transformer has no built-in notion of order the way an RNN does, so positional information is added with a set of sin/cos encodings that tell the model where each token sits in the sequence.
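
A minimal sketch of the Q/K/V computation described above, written directly from the standard scaled dot-product attention formula (not taken from the homework code):
```=
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (batch, seq_len, d_k); scores: (batch, seq_len, seq_len)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # Query-Key similarity
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)             # attention weights
    return weights @ V                              # weighted average of the Values

x = torch.randn(1, 5, 64)                           # self-attention: Q, K, V all come from x
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([1, 5, 64])
```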



Encoder-Decoder
The encoder is responsible for understanding the input; the decoder is responsible for generating the output.
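
As a rough illustration of this split, a dummy forward pass through PyTorch's built-in `nn.Transformer` encoder-decoder (not the fairseq model used in the homework below):
```=
import torch
import torch.nn as nn

model = nn.Transformer(d_model=256, nhead=4, num_encoder_layers=2, num_decoder_layers=2)
src = torch.randn(10, 8, 256)   # (source_len, batch, d_model): the encoder reads this
tgt = torch.randn(7, 8, 256)    # (target_len, batch, d_model): the decoder generates from this
out = model(src, tgt)           # decoder output conditioned on the encoded source
print(out.shape)                # torch.Size([7, 8, 256])
```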








CTC (Connectionist Temporal Classification)
A loss function for training a model to predict the output sequence correctly even when the alignment between input and output is unknown.
For example, in speech recognition the input might be 1200 time steps of acoustic features and the target output "hello" (5 letters); without knowing which frames correspond to which letters, CTC lets the model learn by itself how to map the long input sequence to the short output sequence.
It does this by considering all paths that collapse to the correct output (e.g. "h∅e∅l∅l∅o" and "hheelloo" both collapse to "hello").
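
A minimal sketch with PyTorch's `nn.CTCLoss`, using toy shapes and a hypothetical letter-to-index mapping; note that the loss only needs the target sequence and the lengths, never an explicit alignment:
```=
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)                               # index 0 plays the role of "∅"
T, N, C, S = 50, 1, 28, 5                               # input frames, batch, classes, target length
log_probs = torch.randn(T, N, C).log_softmax(-1)        # per-frame log-probabilities
targets = torch.tensor([[8, 5, 12, 12, 15]])            # "hello" as label indices (hypothetical)
loss = ctc(log_probs, targets,
           input_lengths=torch.tensor([T]),
           target_lengths=torch.tensor([S]))
print(loss.item())
```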

<br/>
## HW 5 : Transformer

- Paired data : TED2020, English sentences paired with their Chinese translations
- Monolingual data : Chinese-only text

Train an RNN or Transformer translation model.
Python needs to be downgraded:
```=
!python3 --version
```

```=
!pip install 'torch>=1.6.0' editdistance matplotlib sacrebleu sacremoses sentencepiece tqdm wandb
```
Download and install fairseq
```=
!git clone https://github.com/pytorch/fairseq.git
```
```=
!pip install pip==24.0
!cd fairseq && git checkout 3f6ba43
!pip install --upgrade /workspace/fairseq
```
```=
import sys
import pdb
import pprint
import logging
import os
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils import data
import numpy as np
import tqdm.auto as tqdm
from pathlib import Path
from argparse import Namespace
from fairseq import utils
import matplotlib.pyplot as plt
```
```=
seed = 33
random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
```
Download and extract the dataset
```=
from pathlib import Path
data_dir = '/workspace/DATA/rawdata'
dataset_name = 'ted2020'
urls = (
"https://github.com/figisiwirf/ml2023-hw5-dataset/releases/download/v1.0.1/ml2023.hw5.data.tgz",
"https://github.com/figisiwirf/ml2023-hw5-dataset/releases/download/v1.0.1/ml2023.hw5.test.tgz"
)
file_names = (
'ted2020.tgz', # train & dev
'test.tgz', # test
)
prefix = Path(data_dir).absolute() / dataset_name
print(prefix)
# create the destination folder
prefix.mkdir(parents=True, exist_ok=True)
for u, f in zip(urls, file_names):
path = prefix / f
if not path.exists():
# 使用 curl 代替 wget
!curl -L {u} -o {path}
if path.suffix == ".tgz":
!tar -xvf {path} -C {prefix}
elif path.suffix == ".zip":
!unzip -o {path} -d {prefix}
# 檢查解壓的文件是否存在,並進行移動
if (prefix / 'raw.en').exists() and (prefix / 'raw.zh').exists():
!mv {prefix/'raw.en'} {prefix/'train_dev.raw.en'}
!mv {prefix/'raw.zh'} {prefix/'train_dev.raw.zh'}
print("raw已成功移動")
else:
print("解壓後的 raw.en 或 raw.zh 文件不存在!")
if (prefix / 'test.en').exists() and (prefix / 'test.zh').exists():
!mv {prefix/'test.en'} {prefix/'test.raw.en'}
!mv {prefix/'test.zh'} {prefix/'test.raw.zh'}
print("test已成功移動")
else:
print("解壓後的 test.en 或 test.zh 文件不存在!")
```



Language
```=
# prefixes for the raw parallel data (assumed to match the renamed files above)
data_prefix = f'{prefix}/train_dev.raw'
test_prefix = f'{prefix}/test.raw'
src_lang = 'en'
tgt_lang = 'zh'
with open(f"{data_prefix}.{src_lang}", "r", encoding="utf-8") as file:
lines = file.readlines()
for line in lines[:30]:
print(line.strip())
with open(f"{data_prefix}.{tgt_lang}", "r", encoding="utf-8") as file:
lines = file.readlines()
for line in lines[:30]:
print(line.strip())
```


Preprocess files
```=
# Clean the paired English/Chinese sentences in the corpus
import re
def strQ2B(ustring):
"""Full width -> half width"""
ss = []
for s in ustring:
rstring = ""
for uchar in s:
inside_code = ord(uchar) # 遍歷 s 中的每一個字元 uchar,取得其 Unicode 碼點
if inside_code == 12288: # 全形空白(Unicode 12288),轉為半形空白(Unicode 32)
inside_code = 32
elif (inside_code >= 65281 and inside_code <= 65374): # 如果是其他全形符號(範圍在 65281~65374),則減去 65248 變為對應的半形符號
inside_code -= 65248
rstring += chr(inside_code) # 轉換好的字符拼成一個新字串,加入 ss
ss.append(rstring)
return ''.join(ss)
def clean_s(s, lang):
if lang == 'en':
s = re.sub(r"\([^()]*\)", "", s) # 移除小括號內的文字
s = s.replace('-', '') # 移除 '-'
s = re.sub('([.,;!?()\"])', r' \1 ', s) # 標點符號前後加上空格
elif lang == 'zh':
s = strQ2B(s) # Q2B 全形轉半形
s = re.sub(r"\([^()]*\)", "", s)
s = s.replace(' ', '')
s = s.replace('—', '')
s = s.replace('“', '"')
s = s.replace('”', '"')
s = s.replace('_', '')
s = re.sub('([。,;!?()\"~「」])', r' \1 ', s)
s = ' '.join(s.strip().split())
return s
def len_s(s, lang):
if lang == 'zh':
return len(s) # 中文的長度直接用 len(s)(因為每個字就是一個 token)
return len(s.split()) # 英文用 split() 分開單字後數量當作長度
def clean_corpus(prefix, l1, l2, ratio=9, max_len=1000, min_len=1): # ratio 長度比例過高就剃除(避免配對錯誤)
if Path(f'{prefix}.clean.{l1}').exists() and Path(f'{prefix}.clean.{l2}').exists():
print(f'{prefix}.clean.{l1} & {l2} exists. skipping clean.')
return
with open(f'{prefix}.{l1}', 'r') as l1_in_f:
with open(f'{prefix}.{l2}', 'r') as l2_in_f:
with open(f'{prefix}.clean.{l1}', 'w') as l1_out_f:
with open(f'{prefix}.clean.{l2}', 'w') as l2_out_f:
for s1 in l1_in_f:
s1 = s1.strip()
s2 = l2_in_f.readline().strip() # 逐行讀取
s1 = clean_s(s1, l1)
s2 = clean_s(s2, l2)
s1_len = len_s(s1, l1)
s2_len = len_s(s2, l2)
if min_len > 0: # 移除太短的句子
if s1_len < min_len or s2_len < min_len:
continue
if max_len > 0: # 移除太長的句子
if s1_len > max_len or s2_len > max_len:
continue
if ratio > 0: # 一邊句子比另一邊長太多,判定為配對錯誤
if s1_len/s2_len > ratio or s2_len/s1_len > ratio:
continue
print(s1, file=l1_out_f)
print(s2, file=l2_out_f)
```
```=
clean_corpus(data_prefix, src_lang, tgt_lang)
clean_corpus(test_prefix, src_lang, tgt_lang, ratio=-1, min_len=-1, max_len=-1) # clean the test data without applying any length filtering
```
```=
print(f"{data_prefix}.clean.{src_lang}")
print(f"{data_prefix}.clean.{tgt_lang}")
print(f"{test_prefix}.clean.{src_lang}")
print(f"{test_prefix}.clean.{tgt_lang}")
```

Split into training / validation / test sets
```=
valid_ratio = 0.01 # 3000~4000 would suffice
test_ratio = 0.005
train_ratio = 1 - valid_ratio - test_ratio
```
```=
print(f'{prefix}/train.raw.en') # original, removed after extraction
print(f'{prefix}/train_dev.raw.en') # extracted file
print(f'{data_prefix}.clean.{src_lang}') # cleaned file
print(prefix/f'new_train.clean.{src_lang}') # file to be produced by the re-split below
```

```=
if (prefix/f'new_train.clean.{src_lang}').exists() \
and (prefix/f'new_train.clean.{tgt_lang}').exists() \
and (prefix/f'new_valid.clean.{src_lang}').exists() \
and (prefix/f'new_valid.clean.{tgt_lang}').exists() \
and (prefix/f'new_test.clean.{src_lang}').exists() \
and (prefix/f'new_test.clean.{tgt_lang}').exists():
print(f'train/valid/test splits exists. skipping split.')
else:
line_num = sum(1 for line in open(f'{data_prefix}.clean.{src_lang}')) # 用清洗過後的檔切新檔案
labels = list(range(line_num)) # 建立隨機標籤序列(打亂順序)
random.shuffle(labels)
for lang in [src_lang, tgt_lang]:
train_f = open(os.path.join(data_dir, dataset_name, f'new_train.clean.{lang}'), 'w')
valid_f = open(os.path.join(data_dir, dataset_name, f'new_valid.clean.{lang}'), 'w')
test_f = open(os.path.join(data_dir, dataset_name, f'new_test.clean.{lang}'), 'w')
count = 0
for line in open(f'{data_prefix}.clean.{lang}', 'r'):
p = labels[count] / line_num
if p < train_ratio:
train_f.write(line)
elif p < train_ratio + valid_ratio:
valid_f.write(line)
else:
test_f.write(line)
count += 1
train_f.close()
valid_f.close()
test_f.close()
```
```=
if (prefix/f'new_train.clean.{src_lang}').exists() \
and (prefix/f'new_train.clean.{tgt_lang}').exists() \
and (prefix/f'new_valid.clean.{src_lang}').exists() \
and (prefix/f'new_valid.clean.{tgt_lang}').exists() \
and (prefix/f'new_test.clean.{src_lang}').exists() \
and (prefix/f'new_test.clean.{tgt_lang}').exists():
print(f'train/valid/test splits exists. skipping split.')
for split in ['new_train', 'new_valid', 'new_test']:
for lang in [src_lang, tgt_lang]:
file_path = prefix / f'{split}.clean.{lang}'
with open(file_path, 'r') as f:
line_count = sum(1 for _ in f)
print(f'{split}.{lang}: {line_count} lines')
```

```=
# Train a SentencePiece subword model
# to mitigate the OOV (out-of-vocabulary) problem in machine translation:
# unseen words (new terms, typos, etc.) are common, and splitting words into
# subword units (roots, affixes, individual characters) alleviates this.
# The sentencepiece package (developed by Google) is used to train the subword model.
# It supports the common algorithms, e.g. unigram (a probabilistic model)
# and BPE (Byte Pair Encoding, frequency-based merging).
import sentencepiece as spm
vocab_size = 8000
if (prefix/f'spm{vocab_size}.model').exists():
print(f'{prefix}/spm{vocab_size}.model exists. skipping spm_train.')
else:
spm.SentencePieceTrainer.train(
input=','.join([f'{prefix}/new_train.clean.{src_lang}',
f'{prefix}/new_valid.clean.{src_lang}',
f'{prefix}/new_train.clean.{tgt_lang}',
f'{prefix}/new_valid.clean.{tgt_lang}']),
model_prefix=prefix/f'spm{vocab_size}', # 最後會生成兩個檔案:spm8000.model(模型)、spm8000.vocab(詞表)
vocab_size=vocab_size,
character_coverage=1,
model_type='unigram', # 概率模型 or 'bpe'
input_sentence_size=1e6, # 只隨機取一百萬句來訓練,足夠且加快速度
shuffle_input_sentence=True,
normalization_rule_name='nmt_nfkc_cf', # 特別為機器翻譯設計的正規化規則,會處理全形轉半形、大小寫統一等
)
```
```=
spm_model = spm.SentencePieceProcessor(model_file=str(prefix/f'spm{vocab_size}.model'))
in_tag = {
'train': 'new_train.clean',
'valid': 'new_valid.clean',
'test': 'new_test.clean',
'test_ori': 'test.raw.clean',
}
for split in ['train', 'valid', 'test', 'test_ori']:
for lang in [src_lang, tgt_lang]:
out_path = prefix/f'{split}.{lang}'
if out_path.exists():
print(f"{out_path} exists. skipping spm_encode.")
else:
with open(prefix/f'{split}.{lang}', 'w') as out_f:
with open(prefix/f'{in_tag[split]}.{lang}', 'r') as in_f:
for line in in_f:
line = line.strip()
tok = spm_model.encode(line, out_type=str)
print(' '.join(tok), file=out_f)
```

Binarize the data with fairseq
```=
# Convert the data into the binary format required by Fairseq
# (preprocess reads the spm-encoded files above as its data source)
from pathlib import Path
binpath = Path('/workspace/DATA/data-bin', dataset_name)
if binpath.exists():
print(binpath, "exists, will not overwrite!")
else:
cmd = f"""
python -m fairseq_cli.preprocess \\
--source-lang {src_lang} \\
--target-lang {tgt_lang} \\
--trainpref {prefix}/train \\
--validpref {prefix}/valid \\
--testpref {prefix}/test \\
--destdir {binpath} \\
--joined-dictionary \\
--workers 2
"""
print("Running command:\n", cmd)
os.system(cmd) # 指令在 Python 內直接執行
```
Configuration for Experiments
```=
# Configuration for training the translation model with Fairseq
config = Namespace(
    datadir = "/workspace/DATA/data-bin/ted2020",
    savedir = "/workspace/checkpoints/rnn",
    source_lang = "en",
    target_lang = "zh",
    num_workers=2,
    max_tokens=8192,
    accum_steps=2, # number of gradient accumulation steps
    # learning rate and optimizer
    lr_factor=2.,
    lr_warmup=4000,
    # gradient clipping
    clip_norm=1.0,
    # number of training epochs
    max_epoch=30,
    start_epoch=1,
    # beam size: larger values may improve translation quality but slow decoding down
    beam=5,
    # maximum output length = 1.2 * source length + 10
    max_len_a=1.2,
    max_len_b=10,
    # automatically strips symbols such as ▁ and </s>
    post_process = "sentencepiece",
    # keep the checkpoints of the last 5 epochs
    keep_last_epochs=5,
    resume=None, # train from scratch
    # logging: if set to True, first pip install wandb and log in
    use_wandb=False,
)
```
Logging
```=
# logging output format
logging.basicConfig(
format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
level="INFO", # "DEBUG" "WARNING" "ERROR"
stream=sys.stdout,
)
proj = "hw5.seq2seq"
logger = logging.getLogger(proj)
if config.use_wandb:
import wandb
wandb.init(project=proj, name=Path(config.savedir).stem, config=config)
```
```=
logger.info("Start training...")
logger.warning("Something may go wrong.")
```
CUDA Environment
```=
cuda_env = utils.CudaEnvironment()
utils.CudaEnvironment.pretty_print_cuda_env_list([cuda_env])
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
```
Borrow the TranslationTask from fairseq
```=
# Set up fairseq's TranslationTask to read the previously binarized data (data-bin)
from fairseq.tasks.translation import TranslationConfig, TranslationTask

task_cfg = TranslationConfig(
    data=config.datadir, # data directory (the data-bin files)
    source_lang=config.source_lang, # source language (e.g. "en")
    target_lang=config.target_lang, # target language (e.g. "zh")
    train_subset="train", # name of the training split
    required_seq_len_multiple=8, # pad lengths to a multiple of 8 to speed up training
    dataset_impl="mmap", # use memory mapping for faster reading
    upsample_primary=1, # upsampling ratio for the primary data (usually 1)
)
# Build the translation task from task_cfg; it provides:
# - the data loading logic
# - the task.source_dictionary and task.target_dictionary objects
# - a beam search decoder
task = TranslationTask.setup_task(task_cfg)
```
```=
# Read the dataset formed by train.en-zh.en and train.en-zh.zh from the data-bin/ted2020 folder
logger.info("loading data for epoch 1")
task.load_dataset(split="train", epoch=1, combine=True)
task.load_dataset(split="valid", epoch=1)
```
```=
sample = task.dataset("valid")[1]
pprint.pprint(sample)
pprint.pprint(
"Source: " + \
task.source_dictionary.string(
sample['source'],
config.post_process,
)
)
pprint.pprint(
"Target: " + \
task.target_dictionary.string(
sample['target'],
config.post_process,
)
)
```

Dataset Iterator
```=
def load_data_iterator(task, split, epoch=1, max_tokens=4000, num_workers=1, cached=True):
batch_iterator = task.get_batch_iterator(
dataset=task.dataset(split),
max_tokens=max_tokens,
max_sentences=None,
max_positions=utils.resolve_max_positions(
task.max_positions(),
max_tokens,
),
ignore_invalid_inputs=True,
seed=seed,
num_workers=num_workers,
epoch=epoch,
disable_iterator_cache=not cached,
# Set this to False to speed up. However, if set to False, changing max_tokens beyond
# first call of this method has no effect.
)
return batch_iterator
demo_epoch_obj = load_data_iterator(task, "valid", epoch=1, max_tokens=20, num_workers=1, cached=False)
demo_iter = demo_epoch_obj.next_epoch_itr(shuffle=True)
sample = next(demo_iter)
sample
```

```=
batch = {
"id": id, # id for each example
"nsentences": len(samples), # batch size (sentences)
"ntokens": ntokens, # batch size (tokens)
"net_input": {
"src_tokens": src_tokens, # sequence in source language
"src_lengths": src_lengths, # sequence length of each example before padding
"prev_output_tokens": prev_output_tokens, # right shifted target, as mentioned above.
},
"target": target, # target sequence
}
```
Model Architecture
```=
from fairseq.models import (
FairseqEncoder,
FairseqIncrementalDecoder,
FairseqEncoderDecoderModel
)
```
Encoder
```=
class RNNEncoder(FairseqEncoder):
def __init__(self, args, dictionary, embed_tokens):
super().__init__(dictionary)
self.embed_tokens = embed_tokens # 接收已建立好的 nn.Embedding,把 token ID 轉成向量
# 儲存嵌入層(nn.Embedding),負責把 token ID 轉成向量
self.embed_dim = args.encoder_embed_dim
self.hidden_dim = args.encoder_ffn_embed_dim
self.num_layers = args.encoder_layers
self.dropout_in_module = nn.Dropout(args.dropout)
# 把參數存在實例中
self.rnn = nn.GRU(
self.embed_dim, # embedding 輸出維度
self.hidden_dim, # GRU 隱藏層的維度
self.num_layers, # GRU 疊了幾層
dropout=args.dropout, # 正則,隨機把一些神經元的輸出變成 0(即暫時不使用它們)
batch_first=False, # 輸入格式是 [seq_len, batch, dim]
bidirectional=True # 雙向 RNN,會輸出 forward 和 backward 的結果
)
self.dropout_out_module = nn.Dropout(args.dropout)
# 到 padding 的 token index,之後做 masking 會用
self.padding_idx = dictionary.pad()
# 將 bidirectional RNN 的 final_hiddens 合併起來(因為有 forward 和 backward 兩個方向)
def combine_bidir(self, outs, bsz: int):
out = outs.view(self.num_layers, 2, bsz, -1).transpose(1, 2).contiguous() # 方向和 batch 維度對調
return out.view(self.num_layers, bsz, -1) # 最後合併兩個方向 → hidden * 2
# 取得 batch size 和句子長度
def forward(self, src_tokens, **unused):
bsz, seqlen = src_tokens.size()
# 將每個 token 轉為向量並加上 dropout。
x = self.embed_tokens(src_tokens)
x = self.dropout_in_module(x)
# B x T x C -> T x B x C
# 改成 RNN 所需的格式 [seq_len, batch_size, embed_dim]
x = x.transpose(0, 1)
# pass thru bidirectional RNN
# 雙向 → 要初始化 2 × num_layers 個 hidden state
# 把嵌入後的輸入與初始 hidden state 丟進 GRU
# GRU 輸出後加 dropout
h0 = x.new_zeros(2 * self.num_layers, bsz, self.hidden_dim)
x, final_hiddens = self.rnn(x, h0)
outputs = self.dropout_out_module(x)
# outputs = [sequence len, batch size, hid dim * directions]
# hidden = [num_layers * directions, batch size , hid dim]
# 編碼器是雙向的,我們需要連接兩個方向的隱藏狀態
# forward + backward 的 hidden state 合併
final_hiddens = self.combine_bidir(final_hiddens, bsz)
# hidden = [num_layers x batch x num_directions*hidden]
# 建立 padding mask
encoder_padding_mask = src_tokens.eq(self.padding_idx).t()
return tuple(
(
outputs, # seq_len x batch x hidden
final_hiddens, # num_layers x batch x num_directions*hidden
encoder_padding_mask, # seq_len x batch
)
)
# 為了配合 beam search,在 decoder 要根據 beam 的排序重新安排 encoder 輸出
# 對 outputs, final_hiddens, mask 都按照 new_order 重新排列
def reorder_encoder_out(self, encoder_out, new_order):
return tuple(
(
encoder_out[0].index_select(1, new_order),
encoder_out[1].index_select(1, new_order),
encoder_out[2].index_select(1, new_order),
)
)
```
Attention
```=
# bias: whether the linear layers include a bias term
class AttentionLayer(nn.Module):
def __init__(self, input_embed_dim, source_embed_dim, output_embed_dim, bias=False):
super().__init__()
self.input_proj = nn.Linear(input_embed_dim, source_embed_dim, bias=bias)
self.output_proj = nn.Linear(
input_embed_dim + source_embed_dim, output_embed_dim, bias=bias
)
# 前向傳播(Forward 方法)
def forward(self, inputs, encoder_outputs, encoder_padding_mask):
# inputs: T, B, dim
# encoder_outputs: S x B x dim,(S, B, dim),其中 S 是源序列長度,dim 是 source_embed_dim
# padding mask: S x B,形狀為 (S, B),用於標記源序列中的填充位置(True 表示填充,False 表示有效)
# convert all to batch first
inputs = inputs.transpose(1,0) # 從 (T, B, dim)(序列優先)轉換為 (B, T, dim)(批次優先)
encoder_outputs = encoder_outputs.transpose(1,0) # B, S, dim
encoder_padding_mask = encoder_padding_mask.transpose(1,0) # B, S
# 將解碼器輸入(inputs)通過 input_proj 線性層投影到與編碼器輸出相同的維度
x = self.input_proj(inputs)
# 計算注意力分數
# (B, T, dim) x (B, dim, S) = (B, T, S)
attn_scores = torch.bmm(x, encoder_outputs.transpose(1,2))
# 如果提供了 encoder_padding_mask,則處理填充位置的注意力分數
if encoder_padding_mask is not None:
# leveraging broadcast B, S -> (B, 1, S)
# 將填充位置的注意力分數設置為負無窮(float("-inf")),以確保後續 softmax 時這些位置的權重為 0
encoder_padding_mask = encoder_padding_mask.unsqueeze(1)
attn_scores = (
attn_scores.float()
.masked_fill_(encoder_padding_mask, float("-inf"))
.type_as(attn_scores)
) # FP16 support: cast to float and back
# 對注意力分數應用 softmax,沿著最後一維(源序列維度 S)進行歸一化,得到注意力權重
attn_scores = F.softmax(attn_scores, dim=-1)
# 使用注意力權重對編碼器輸出進行加權求和,計算上下文向量
x = torch.bmm(attn_scores, encoder_outputs)
# (B, T, dim)
# 將上下文向量(x)與原始解碼器輸入(inputs)沿著最後一維(特徵維度)拼接
x = torch.cat((x, inputs), dim=-1)
x = torch.tanh(self.output_proj(x)) # concat + linear + tanh
# restore shape (B, T, dim) -> (T, B, dim)
# 將拼接後的向量通過 output_proj 線性層投影到 output_embed_dim,並應用 tanh 激活函數
return x.transpose(1,0), attn_scores
```
Decoder
```=
class RNNDecoder(FairseqIncrementalDecoder):
def __init__(self, args, dictionary, embed_tokens):
super().__init__(dictionary)
self.embed_tokens = embed_tokens
# 斷言檢查,確保編碼器和解碼器的 RNN 層數相同
assert args.decoder_layers == args.encoder_layers, f"""seq2seq rnn requires that encoder
and decoder have same layers of rnn. got: {args.encoder_layers, args.decoder_layers}"""
# 斷言檢查,確保解碼器的隱藏層維度(decoder_ffn_embed_dim)是編碼器隱藏層維度的兩倍
assert args.decoder_ffn_embed_dim == args.encoder_ffn_embed_dim*2, f"""seq2seq-rnn requires
that decoder hidden to be 2*encoder hidden dim. got: {args.decoder_ffn_embed_dim, args.encoder_ffn_embed_dim*2}"""
# 從 args 中提取並保存解碼器的超參數
self.embed_dim = args.decoder_embed_dim
self.hidden_dim = args.decoder_ffn_embed_dim
self.num_layers = args.decoder_layers
# 創建一個單向 GRU 層
self.dropout_in_module = nn.Dropout(args.dropout)
self.rnn = nn.GRU(
self.embed_dim,
self.hidden_dim,
self.num_layers,
dropout=args.dropout,
batch_first=False,
bidirectional=False
)
# 創建一個 AttentionLayer 實例(參見你之前提供的注意力層代碼
self.attention = AttentionLayer(
self.embed_dim, self.hidden_dim, self.embed_dim, bias=False
)
self.dropout_out_module = nn.Dropout(args.dropout)
# 如果 GRU 隱藏維度(hidden_dim)與嵌入維度(embed_dim)不同,創建一個線性層將隱藏狀態投影到嵌入維度;否則設置為 None
if self.hidden_dim != self.embed_dim:
self.project_out_dim = nn.Linear(self.hidden_dim, self.embed_dim)
else:
self.project_out_dim = None
# 如果 args.share_decoder_input_output_embed 為 True,則共享輸入嵌入層(embed_tokens)和輸出投影層的權重
# 如果不共享權重,則創建一個獨立的線性層,將輸出維度從 output_embed_dim(等於 embed_dim)投影到詞彙表大小
if args.share_decoder_input_output_embed:
self.output_projection = nn.Linear(
self.embed_tokens.weight.shape[1],
self.embed_tokens.weight.shape[0],
bias=False,
)
self.output_projection.weight = self.embed_tokens.weight
else:
self.output_projection = nn.Linear(
self.output_embed_dim, len(dictionary), bias=False
)
nn.init.normal_(
self.output_projection.weight, mean=0, std=self.output_embed_dim ** -0.5
)
# 定義前向傳播函數
def forward(self, prev_output_tokens, encoder_out, incremental_state=None, **unused):
# extract the outputs from encoder
encoder_outputs, encoder_hiddens, encoder_padding_mask = encoder_out
# outputs: seq_len x batch x num_directions*hidden
# encoder_hiddens: num_layers x batch x num_directions*encoder_hidden
# padding_mask: seq_len x batch
# 如果 incremental_state 存在(推理時),則只使用最後一個 token(prev_output_tokens[:, -1:]),
# 並從 incremental_state 中恢復上一步的隱藏狀態(prev_hiddens)。
# 否則(訓練或推理的第一步),使用編碼器的最終隱藏狀態(encoder_hiddens)作為初始隱藏狀態
if incremental_state is not None and len(incremental_state) > 0:
# if the information from last timestep is retained, we can continue from there instead of starting from bos
prev_output_tokens = prev_output_tokens[:, -1:]
cache_state = self.get_incremental_state(incremental_state, "cached_state")
prev_hiddens = cache_state["prev_hiddens"]
else:
# incremental state does not exist, either this is training time, or the first timestep of test time
# prepare for seq2seq: pass the encoder_hidden to the decoder hidden states
# 獲取批次大小(bsz)和序列長度(seqlen)
prev_hiddens = encoder_hiddens
bsz, seqlen = prev_output_tokens.size()
# embed tokens
x = self.embed_tokens(prev_output_tokens)
x = self.dropout_in_module(x)
# B x T x C -> T x B x C
# 將嵌入向量的維度從 (batch_size, seq_len, embed_dim) 轉為 (seq_len, batch_size, embed_dim)
x = x.transpose(0, 1)
# decoder-to-encoder attention
# 如果使用了注意力層,則調用 AttentionLayer 計算上下文向量和注意力分數
if self.attention is not None:
x, attn = self.attention(x, encoder_outputs, encoder_padding_mask)
# pass thru unidirectional RNN
# 將注意力輸出(或嵌入)輸入單向 GRU,生成新的隱藏狀態
x, final_hiddens = self.rnn(x, prev_hiddens)
# outputs = [sequence len, batch size, hid dim]
# hidden = [num_layers * directions, batch size , hid dim]
x = self.dropout_out_module(x)
# 如果 hidden_dim 與 embed_dim 不同,則通過 project_out_dim 線性層將 GRU 輸出投影到 embed_dim
if self.project_out_dim != None:
x = self.project_out_dim(x)
# 通過 output_projection 線性層將輸出投影到詞彙表大小
x = self.output_projection(x)
# T x B x C -> B x T x C
# 將輸出從序列優先 (seq_len, batch_size, vocab_size) 轉為批次優先 (batch_size, seq_len, vocab_size)
x = x.transpose(1, 0)
# 如果是增量解碼,保存當前時間步的隱藏狀態(final_hiddens)到 incremental_state 中
cache_state = {
"prev_hiddens": final_hiddens,
}
self.set_incremental_state(incremental_state, "cached_state", cache_state)
return x, None
# 定義一個方法,用於在束搜索(beam search)中重排序增量狀態
def reorder_incremental_state(
self,
incremental_state,
new_order,
):
# 定義一個方法,用於在束搜索(beam search)中重排序增量狀態
cache_state = self.get_incremental_state(incremental_state, "cached_state")
prev_hiddens = cache_state["prev_hiddens"]
prev_hiddens = [p.index_select(0, new_order) for p in prev_hiddens]
# 更新緩存狀態並保存到 incremental_state
cache_state = {
"prev_hiddens": torch.stack(prev_hiddens),
}
self.set_incremental_state(incremental_state, "cached_state", cache_state)
return
```
Seq2Seq
```=
# Sequence-to-sequence (Seq2Seq) model
# based on Fairseq's FairseqEncoderDecoderModel: an Encoder followed by a Decoder
class Seq2Seq(FairseqEncoderDecoderModel):
def __init__(self, args, encoder, decoder):
super().__init__(encoder, decoder)
self.args = args
def forward(
self,
src_tokens,
src_lengths,
prev_output_tokens,
return_all_hiddens: bool = True,
):
"""
Run the forward pass for an encoder-decoder model.
"""
# Encoder 編碼輸入
encoder_out = self.encoder(
src_tokens, src_lengths=src_lengths, return_all_hiddens=return_all_hiddens
)
# Decoder 根據 encoder 輸出產生 logits(分類分數)
logits, extra = self.decoder(
prev_output_tokens,
encoder_out=encoder_out,
src_lengths=src_lengths,
return_all_hiddens=return_all_hiddens,
)
return logits, extra
```
Model Initialization
```=
# # HINT: transformer architecture
# from fairseq.models.transformer import (
# TransformerEncoder,
# TransformerDecoder,
# )
def build_model(args, task):
""" build a model instance based on hyperparameters """
src_dict, tgt_dict = task.source_dictionary, task.target_dictionary
# token embeddings
encoder_embed_tokens = nn.Embedding(len(src_dict), args.encoder_embed_dim, src_dict.pad())
decoder_embed_tokens = nn.Embedding(len(tgt_dict), args.decoder_embed_dim, tgt_dict.pad())
# encoder decoder
# HINT: TODO: switch to TransformerEncoder & TransformerDecoder
encoder = RNNEncoder(args, src_dict, encoder_embed_tokens)
decoder = RNNDecoder(args, tgt_dict, decoder_embed_tokens)
# sequence to sequence model
model = Seq2Seq(args, encoder, decoder)
# initialization for seq2seq model is important, requires extra handling
def init_params(module):
from fairseq.modules import MultiheadAttention
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=0.02)
if module.bias is not None:
module.bias.data.zero_()
if isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=0.02)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
if isinstance(module, MultiheadAttention):
module.q_proj.weight.data.normal_(mean=0.0, std=0.02)
module.k_proj.weight.data.normal_(mean=0.0, std=0.02)
module.v_proj.weight.data.normal_(mean=0.0, std=0.02)
if isinstance(module, nn.RNNBase):
for name, param in module.named_parameters():
if "weight" in name or "bias" in name:
param.data.uniform_(-0.1, 0.1)
# weight initialization
model.apply(init_params)
return model
```
Architecture Related Configuration
```=
arch_args = Namespace(
encoder_embed_dim=256,
encoder_ffn_embed_dim=512,
encoder_layers=1,
decoder_embed_dim=256,
decoder_ffn_embed_dim=1024,
decoder_layers=1,
share_decoder_input_output_embed=True,
dropout=0.3,
)
# # HINT: these patches on parameters for Transformer
# def add_transformer_args(args):
# args.encoder_attention_heads=4
# args.encoder_normalize_before=True
# args.decoder_attention_heads=4
# args.decoder_normalize_before=True
# args.activation_fn="relu"
# args.max_source_positions=1024
# args.max_target_positions=1024
# # patches on default parameters for Transformer (those not set above)
# from fairseq.models.transformer import base_architecture
# base_architecture(arch_args)
# add_transformer_args(arch_args)
```
```=
if config.use_wandb:
wandb.config.update(vars(arch_args))
```
```=
model = build_model(arch_args, task)
logger.info(model)
```
Loss: Label Smoothing Regularization
```=
class LabelSmoothedCrossEntropyCriterion(nn.Module):
def __init__(self, smoothing, ignore_index=None, reduce=True):
super().__init__()
self.smoothing = smoothing
self.ignore_index = ignore_index
self.reduce = reduce
def forward(self, lprobs, target):
if target.dim() == lprobs.dim() - 1:
target = target.unsqueeze(-1)
# nll: Negative log likelihood,the cross-entropy when target is one-hot. following line is same as F.nll_loss
nll_loss = -lprobs.gather(dim=-1, index=target)
# reserve some probability for other labels. thus when calculating cross-entropy,
# equivalent to summing the log probs of all labels
smooth_loss = -lprobs.sum(dim=-1, keepdim=True)
if self.ignore_index is not None:
pad_mask = target.eq(self.ignore_index)
nll_loss.masked_fill_(pad_mask, 0.0)
smooth_loss.masked_fill_(pad_mask, 0.0)
else:
nll_loss = nll_loss.squeeze(-1)
smooth_loss = smooth_loss.squeeze(-1)
if self.reduce:
nll_loss = nll_loss.sum()
smooth_loss = smooth_loss.sum()
# when calculating cross-entropy, add the loss of other labels
eps_i = self.smoothing / lprobs.size(-1)
loss = (1.0 - self.smoothing) * nll_loss + eps_i * smooth_loss
return loss
# generally, 0.1 is good enough
criterion = LabelSmoothedCrossEntropyCriterion(
smoothing=0.1,
ignore_index=task.target_dictionary.pad(),
)
```
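A quick sanity check of the criterion on dummy data (toy shapes, assuming the import and criterion cells above have been run):
```=
# 6 dummy target tokens over a vocabulary of 8000 subwords
dummy_lprobs = F.log_softmax(torch.randn(6, 8000), dim=-1)
dummy_target = torch.randint(0, 8000, (6,))
print(criterion(dummy_lprobs, dummy_target))  # summed label-smoothed loss over the 6 tokens
```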
Optimizer: Adam + lr scheduling
```=
# A wrapper around an optimizer (e.g. Adam) that changes the learning rate dynamically based on the step count instead of using a fixed learning rate
class NoamOpt:
"Optim wrapper that implements rate."
def __init__(self, model_size, factor, warmup, optimizer):
self.optimizer = optimizer
self._step = 0
self.warmup = warmup
self.factor = factor
self.model_size = model_size
self._rate = 0
@property
def param_groups(self):
return self.optimizer.param_groups
def multiply_grads(self, c):
"""Multiplies grads by a constant *c*."""
for group in self.param_groups:
for p in group['params']:
if p.grad is not None:
p.grad.data.mul_(c)
# 更新學習率並前進一步
def step(self):
"Update parameters and rate"
self._step += 1
rate = self.rate()
for p in self.param_groups:
p['lr'] = rate
self._rate = rate
self.optimizer.step()
# 計算當前步數的學習率
def rate(self, step = None):
"Implement `lrate` above"
if step is None:
step = self._step
return 0 if not step else self.factor * \
(self.model_size ** (-0.5) *
min(step ** (-0.5), step * self.warmup ** (-1.5)))
```
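For reference, the `rate()` method above implements the learning-rate schedule from "Attention Is All You Need":

$$lrate = factor \cdot d_{model}^{-0.5} \cdot \min\big(step^{-0.5},\ step \cdot warmup^{-1.5}\big)$$

so the learning rate grows linearly during the first `warmup` steps and then decays proportionally to $1/\sqrt{step}$.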
Scheduling Visualized
```=
# This schedule is widely used for Transformer models because it helps stabilize training
optimizer = NoamOpt(
model_size=arch_args.encoder_embed_dim,
factor=config.lr_factor,
warmup=config.lr_warmup,
optimizer=torch.optim.AdamW(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9, weight_decay=0.0001))
plt.plot(np.arange(1, 100000), [optimizer.rate(i) for i in range(1, 100000)])
plt.legend([f"{optimizer.model_size}:{optimizer.warmup}"])
None
```

【Training Procedure】
Training
```=
from fairseq.data import iterators
from torch.cuda.amp import GradScaler, autocast
def train_one_epoch(epoch_itr, model, task, criterion, optimizer, accum_steps=1):
# 每次讀取一個新的 epoch,打亂順序(shuffle=True)
# 每 accum_steps 個 sample 才做一次參數更新(gradient accumulation)
itr = epoch_itr.next_epoch_itr(shuffle=True)
itr = iterators.GroupedIterator(itr, accum_steps)
# AMP + 梯度累積訓練
stats = {"loss": []}
scaler = GradScaler() # automatic mixed precision (amp)
model.train()
progress = tqdm.tqdm(itr, desc=f"train epoch {epoch_itr.epoch}", leave=False)
for samples in progress:
model.zero_grad()
accum_loss = 0
sample_size = 0
# gradient accumulation: update every accum_steps samples
for i, sample in enumerate(samples):
if i == 1:
# emptying the CUDA cache after the first step can reduce the chance of OOM
torch.cuda.empty_cache()
sample = utils.move_to_cuda(sample, device=device)
target = sample["target"]
sample_size_i = sample["ntokens"]
sample_size += sample_size_i
# mixed precision training
with autocast():
net_output = model.forward(**sample["net_input"])
lprobs = F.log_softmax(net_output[0], -1)
loss = criterion(lprobs.view(-1, lprobs.size(-1)), target.view(-1))
# logging
accum_loss += loss.item()
# back-prop
scaler.scale(loss).backward()
# 梯度更新
scaler.unscale_(optimizer)
optimizer.multiply_grads(1 / (sample_size or 1.0)) # (sample_size or 1.0) handles the case of a zero gradient
gnorm = nn.utils.clip_grad_norm_(model.parameters(), config.clip_norm) # grad norm clipping prevents gradient exploding
scaler.step(optimizer)
scaler.update()
# logging
loss_print = accum_loss/sample_size
stats["loss"].append(loss_print)
progress.set_postfix(loss=loss_print)
if config.use_wandb:
wandb.log({
"train/loss": loss_print,
"train/grad_norm": gnorm.item(),
"train/lr": optimizer.rate(),
"train/sample_size": sample_size,
})
loss_print = np.mean(stats["loss"])
logger.info(f"training loss: {loss_print:.4f}")
return stats
```
Validation & Inference
```=
# fairseq's beam search generator
# given a model and an input sequence, produce translation hypotheses by beam search
sequence_generator = task.build_generator([model], config)
def decode(toks, dictionary):
# 這個 helper function 用來把張量(token IDs)轉回人類可讀的文字
s = dictionary.string(
toks.int().cpu(),
config.post_process,
)
return s if s else "<unk>"
# Run beam search decoding on the sample and, for each example, return
# the source (src), the hypothesis (hyp), and the reference (ref)
def inference_step(sample, model):
gen_out = sequence_generator.generate([model], sample)
srcs = []
hyps = []
refs = []
for i in range(len(gen_out)):
# for each sample, collect the input, hypothesis and reference, later be used to calculate BLEU
srcs.append(decode(
utils.strip_pad(sample["net_input"]["src_tokens"][i], task.source_dictionary.pad()),
task.source_dictionary,
))
hyps.append(decode(
gen_out[i][0]["tokens"], # 0 indicates using the top hypothesis in beam
task.target_dictionary,
))
refs.append(decode(
utils.strip_pad(sample["target"][i], task.target_dictionary.pad()),
task.target_dictionary,
))
return srcs, hyps, refs
```
```=
import shutil
import sacrebleu
def validate(model, task, criterion, log_to_wandb=True):
logger.info('begin validation')
# 建立驗證集 iterator
# batch size = 1(推論階段通常不會做 batching)
itr = load_data_iterator(task, "valid", 1, config.max_tokens, config.num_workers).next_epoch_itr(shuffle=False)
stats = {"loss":[], "bleu": 0, "srcs":[], "hyps":[], "refs":[]}
srcs = []
hyps = []
refs = []
model.eval()
progress = tqdm.tqdm(itr, desc=f"validation", leave=False)
with torch.no_grad():
for i, sample in enumerate(progress):
# validation loss
# 前處理與 forward 推論
sample = utils.move_to_cuda(sample, device=device)
net_output = model.forward(**sample["net_input"])
lprobs = F.log_softmax(net_output[0], -1)
target = sample["target"]
sample_size = sample["ntokens"]
# 計算損失
loss = criterion(lprobs.view(-1, lprobs.size(-1)), target.view(-1)) / sample_size
progress.set_postfix(valid_loss=loss.item())
stats["loss"].append(loss)
# Beam search 推論 + decode
s, h, r = inference_step(sample, model)
srcs.extend(s)
hyps.extend(h)
refs.extend(r)
# BLEU 分數評估
tok = 'zh' if task.cfg.target_lang == 'zh' else '13a'
stats["loss"] = torch.stack(stats["loss"]).mean().item()
stats["bleu"] = sacrebleu.corpus_bleu(hyps, [refs], tokenize=tok) # 計算BLEU score
stats["srcs"] = srcs
stats["hyps"] = hyps
stats["refs"] = refs
# WandB logging(可選)
if config.use_wandb and log_to_wandb:
wandb.log({
"valid/loss": stats["loss"],
"valid/bleu": stats["bleu"].score,
}, commit=False)
# 印出隨機一筆翻譯範例(用來 debug/展示)
showid = np.random.randint(len(hyps))
logger.info("example source: " + srcs[showid])
logger.info("example hypothesis: " + hyps[showid])
logger.info("example reference: " + refs[showid])
#
logger.info(f"validation loss:\t{stats['loss']:.4f}")
logger.info(stats["bleu"].format())
return stats
```
Save and Load Model Weights
```=
from pathlib import Path
# config.savedir is assumed to be '/workspace/checkpoints/rnn'
config.savedir = '/workspace/checkpoints/rnn' # set it here in case it has not been set yet
savedir = Path(config.savedir).absolute()
# confirm the path
print(savedir)
# create the folder if it does not exist
savedir.mkdir(parents=True, exist_ok=True)
```
```=
# Calls the validate() function defined above
def validate_and_save(model, task, criterion, optimizer, epoch, save=True):
stats = validate(model, task, criterion)
bleu = stats['bleu']
loss = stats['loss']
if save:
# save epoch checkpoints
config.savedir = '/workspace/checkpoints/rnn'
savedir = Path(config.savedir).absolute()
savedir.mkdir(parents=True, exist_ok=True)
print(savedir)
# 儲存 checkpoint,包括:
# 模型參數
# 當前 epoch 的 BLEU 和 loss
# 優化器的 step 數(這是你自訂的 _step)
check = {
"model": model.state_dict(),
"stats": {"bleu": bleu.score, "loss": loss},
"optim": {"step": optimizer._step}
}
# 可選保留最新的模型或歷史版本
torch.save(check, savedir/f"checkpoint{epoch}.pt")
shutil.copy(savedir/f"checkpoint{epoch}.pt", savedir/f"checkpoint_last.pt")
logger.info(f"saved epoch checkpoint: {savedir}/checkpoint{epoch}.pt")
# 儲存翻譯輸出(source + hypothesis)s
with open(savedir/f"samples{epoch}.{config.source_lang}-{config.target_lang}.txt", "w") as f:
for s, h in zip(stats["srcs"], stats["hyps"]):
f.write(f"{s}\t{h}\n")
# 儲存最佳模型(highest BLEU)
if getattr(validate_and_save, "best_bleu", 0) < bleu.score:
validate_and_save.best_bleu = bleu.score
torch.save(check, savedir/f"checkpoint_best.pt")
# 刪除舊模型(只保留最新 N 個)
del_file = savedir / f"checkpoint{epoch - config.keep_last_epochs}.pt"
if del_file.exists():
del_file.unlink()
return stats
def try_load_checkpoint(model, optimizer=None, name=None):
name = name if name else "checkpoint_last.pt"
checkpath = Path(config.savedir)/name
# 載入上次訓練的模型狀態
if checkpath.exists():
check = torch.load(checkpath)
model.load_state_dict(check["model"])
stats = check["stats"]
step = "unknown"
if optimizer != None:
optimizer._step = step = check["optim"]["step"]
logger.info(f"loaded checkpoint {checkpath}: step={step} loss={stats['loss']} bleu={stats['bleu']}")
else:
logger.info(f"no checkpoints found at {checkpath}!")
```
【Main】
Training loop
```=
model = model.to(device=device)
criterion = criterion.to(device=device)
```
```=
# Log basic information about the training task
logger.info("task: {}".format(task.__class__.__name__))
logger.info("encoder: {}".format(model.encoder.__class__.__name__))
logger.info("decoder: {}".format(model.decoder.__class__.__name__))
logger.info("criterion: {}".format(criterion.__class__.__name__))
logger.info("optimizer: {}".format(optimizer.__class__.__name__))
logger.info(
"num. model params: {:,} (num. trained: {:,})".format(
sum(p.numel() for p in model.parameters()),
sum(p.numel() for p in model.parameters() if p.requires_grad),
)
)
logger.info(f"max tokens per batch = {config.max_tokens}, accumulate steps = {config.accum_steps}")
```
```=
print(f"Start from epoch: {config.start_epoch}")
print(f"Train until epoch: {config.max_epoch}")
print(f"Total epochs: {config.max_epoch - config.start_epoch + 1}")
```

```=
# Start training
epoch_itr = load_data_iterator(task, "train", config.start_epoch, config.max_tokens, config.num_workers)
try_load_checkpoint(model, optimizer, name=config.resume)
while epoch_itr.next_epoch_idx <= config.max_epoch:
# train for one epoch
train_one_epoch(epoch_itr, model, task, criterion, optimizer, config.accum_steps)
stats = validate_and_save(model, task, criterion, optimizer, epoch=epoch_itr.epoch)
logger.info("end of epoch {}".format(epoch_itr.epoch))
epoch_itr = load_data_iterator(task, "train", epoch_itr.next_epoch_idx, config.max_tokens, config.num_workers)
```
Submission
```=
# Use the official Fairseq script to average the parameters of the last 5 checkpoints, which improves stability and the BLEU score
checkdir=config.savedir
!python ./fairseq/scripts/average_checkpoints.py \
--inputs {checkdir} \
--num-epoch-checkpoints 5 \
--output {checkdir}/avg_last_5_checkpoint.pt
```
Confirm model weights used to generate submission
```=
# checkpoint_last.pt : latest epoch
# checkpoint_best.pt : highest validation bleu
# avg_last_5_checkpoint.pt: the average of last 5 epochs
# load the model parameters from config.savedir/avg_last_5_checkpoint.pt
# no optimizer is passed here, so the optimizer state is ignored; that is fine for inference only
try_load_checkpoint(model, name="avg_last_5_checkpoint.pt")
validate(model, task, criterion, log_to_wandb=False)
None
```
Generate Prediction
```=
def generate_prediction(model, task, split="test", outfile="./prediction.txt"):
task.load_dataset(split=split, epoch=1)
itr = load_data_iterator(task, split, 1, config.max_tokens, config.num_workers).next_epoch_itr(shuffle=False)
idxs = []
hyps = []
model.eval()
progress = tqdm.tqdm(itr, desc=f"prediction")
with torch.no_grad():
for i, sample in enumerate(progress):
# validation loss
sample = utils.move_to_cuda(sample, device=device)
# do inference
s, h, r = inference_step(sample, model)
hyps.extend(h)
idxs.extend(list(sample['id']))
# sort based on the order before preprocess
hyps = [x for _,x in sorted(zip(idxs,hyps))]
with open(outfile, "w") as f:
for h in hyps:
f.write(h+"\n")
```
```=
generate_prediction(model, task)
```
```=
with open(data_dir + '/' + dataset_name + '/new_test.clean.zh', 'r') as f:
references = [line.strip() for line in f]
with open('/workspace/prediction.txt', 'r') as f:
predictions = [line.strip() for line in f]
print(f"Number of reference sentences: {len(references)}")
print(f"Number of predicted sentences: {len(predictions)}")
```

```=
count = 0
for i, (ref, pred) in enumerate(zip(references, predictions)):
if ref != pred:
print(f"[{i}] REF: {ref}")
print(f"[{i}] PRED: {pred}")
print()
count += 1
if count >= 10:
break
```

```=
!pip install sacrebleu
import sacrebleu
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU score: {bleu.score:.2f}")
```