# 【Hung-yi Lee Machine Learning - L5 : Transformer Sequence-to-sequence 】

:::info
- References: [2021 Spring](https://speech.ee.ntu.edu.tw/~hylee/ml/2021-spring.php), [2022 Spring](https://speech.ee.ntu.edu.tw/~hylee/ml/2022-spring.php), [2023 Spring](https://speech.ee.ntu.edu.tw/~hylee/ml/2023-spring.php), [2024 Spring](https://speech.ee.ntu.edu.tw/~hylee/genai/2024-spring.php), [2025](https://speech.ee.ntu.edu.tw/~hylee/ml/2025-spring.php)
- Knowledge Distillation
- Lecture 5 : Transformer Sequence-to-sequence
- HW 5 : Transformer
:::

<br/>

## Knowledge Distillation

Knowledge distillation is a model-compression technique: a trained "teacher" model teaches a smaller "student" model to imitate its outputs, so the model shrinks while performance stays reasonably good.

![截圖 2025-04-11 凌晨12.45.53](https://hackmd.io/_uploads/By5RSdS0yl.png)

>PS Softmax review: the output probabilities sum to 1.
>![截圖 2025-04-11 晚上9.03.34](https://hackmd.io/_uploads/HkUV7cLCkg.png)

Dividing every logit by a temperature T before the softmax smooths the teacher's output distribution, which helps the student learn.

![截圖 2025-04-11 晚上9.07.29](https://hackmd.io/_uploads/ByruE9LAke.png)

A model can also be compressed using fewer bits (quantization), weight clustering, or Binary Networks.

![截圖 2025-04-11 晚上9.18.45](https://hackmd.io/_uploads/rJuTLcL0kx.png)
![截圖 2025-04-11 晚上9.21.54](https://hackmd.io/_uploads/r1MYP9LR1g.png)
![截圖 2025-04-11 晚上9.22.39](https://hackmd.io/_uploads/By1hP98Cye.png)

Depthwise Separable Convolution is an efficient convolution operation that reduces computation and parameter count without significantly sacrificing accuracy.

(Review) In a standard CNN with 2 input channels, each filter also spans 2 channels, and the number of filters is unrestricted, so the number of input channels need not equal the number of output channels.

![截圖 2025-04-11 晚上9.25.27](https://hackmd.io/_uploads/Byaqd9IAkl.png)

In a Depthwise Convolution, the number of filters equals the number of channels and each filter handles exactly one channel, so the number of input channels must equal the number of output channels.

![截圖 2025-04-11 晚上9.35.42](https://hackmd.io/_uploads/HJili9UCJl.png)
![截圖 2025-04-11 晚上9.36.31](https://hackmd.io/_uploads/S1XeocLR1l.png)
![截圖 2025-04-11 晚上9.39.08](https://hackmd.io/_uploads/rkN9s9ICJg.png)

A Depthwise Convolution cannot capture relationships across channels, so it is followed by a Pointwise Convolution (filters constrained to 1x1).

![截圖 2025-04-11 晚上9.40.14](https://hackmd.io/_uploads/HkjInc8Ryl.png)
![截圖 2025-04-11 晚上9.45.14](https://hackmd.io/_uploads/Bkhxa9URyl.png)
![截圖 2025-04-12 凌晨1.01.09](https://hackmd.io/_uploads/H1wyoaLRye.png)
![截圖 2025-04-12 凌晨1.02.02](https://hackmd.io/_uploads/S1hMs6LCkx.png)
![截圖 2025-04-12 凌晨1.03.53](https://hackmd.io/_uploads/BktFjaIA1l.png)
![截圖 2025-04-12 凌晨1.04.14](https://hackmd.io/_uploads/SyCcsaLRJe.png)
![截圖 2025-04-12 凌晨1.04.39](https://hackmd.io/_uploads/B1v2s6IRyg.png)

<br/>

## Lecture 5 : Transformer Sequence-to-sequence

The Transformer is a model architecture built entirely on attention (self-attention). It excels at sequence data and has replaced the traditional RNN and LSTM.

- Self-Attention: lets the model "attend" to the key parts of the input sequence. For example, when translating a sentence, the model can tell that "he" refers to "John" rather than someone else. It is computed from three vectors, Query (Q), Key (K), and Value (V); the output is a weighted average of the Values, with weights given by the similarity between Query and Key (a minimal sketch appears after the figures below).
- Multi-Head Attention: run self-attention several times in parallel (multiple heads); each head can focus on a different aspect.
- Positional Encoding: the Transformer lacks the RNN's built-in notion of order, so positional information is injected via a set of sin/cos encodings that tell the model where each token sits in the sequence.

![截圖 2025-04-12 下午3.46.41](https://hackmd.io/_uploads/HJzd9cDAyl.png)
![截圖 2025-04-12 下午3.47.32](https://hackmd.io/_uploads/B14j95PAJe.png)
![截圖 2025-04-12 下午3.48.42](https://hackmd.io/_uploads/SktyoqwAkx.png)

Encoder / Decoder: the Encoder is responsible for understanding the input; the Decoder is responsible for generating the output.

![截圖 2025-04-12 下午6.21.09](https://hackmd.io/_uploads/S1ViAnDRJg.png)
![截圖 2025-04-12 下午6.22.09](https://hackmd.io/_uploads/B1W1k6wCJe.png)
![截圖 2025-04-12 晚上9.39.01](https://hackmd.io/_uploads/SynZak_0Jx.png)
![截圖 2025-04-12 晚上9.53.40](https://hackmd.io/_uploads/S1NOel_A1x.png)
![截圖 2025-04-12 晚上9.54.17](https://hackmd.io/_uploads/S1_cllOC1g.png)
![截圖 2025-04-12 晚上9.56.06](https://hackmd.io/_uploads/H1IZZeu01e.png)
![截圖 2025-04-12 晚上9.58.51](https://hackmd.io/_uploads/rygn-lu01e.png)
![截圖 2025-04-12 晚上10.02.58](https://hackmd.io/_uploads/ByMoGgdRyx.png)
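To make the Q/K/V description concrete, here is a minimal single-head scaled dot-product attention sketch in PyTorch. All dimensions here are made up for illustration; the lecture/HW code uses fairseq modules instead.

```=
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # the weights come from the similarity between Query and Key,
    # scaled by sqrt(d_k) to keep the softmax well-behaved
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (T, S) similarity matrix
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                             # weighted average of the Values

# toy example: a sequence of 4 tokens with 8-dim Q/K/V
x = torch.randn(4, 8)
Wq, Wk, Wv = (torch.randn(8, 8) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # torch.Size([4, 8])
```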
CTC (Connectionist Temporal Classification) is a loss function for training a model to predict a sequence correctly even when the alignment of the output is unknown. Take speech recognition: the input is, say, 1200 time steps of acoustic features and the output is "hello" (5 letters). Without knowing which frames correspond to which letters, CTC lets the model learn by itself how to map the long sequence onto the short one: it enumerates every path that collapses to the correct output (e.g., "h∅e∅l∅l∅o" and "hheel∅loo" both collapse to "hello"; note the blank between the two l's is required, since adjacent repeats merge, as the sketch below shows).

![截圖 2025-04-13 凌晨1.31.43](https://hackmd.io/_uploads/Hkz97mdA1e.png)
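A minimal sketch of CTC's collapse rule only (not the full loss): merge consecutive repeats, then drop blanks.

```=
def ctc_collapse(path, blank="∅"):
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("h∅e∅l∅l∅o"))  # hello (the blank keeps the double l)
print(ctc_collapse("hheel∅loo"))  # hello
print(ctc_collapse("hheelloo"))   # helo -- adjacent repeats merge without a blank
```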
<br/>

## HW 5 : Transformer

![截圖 2025-04-13 下午3.36.20](https://hackmd.io/_uploads/ry2iKkY0yx.png)

- Paired data : TED2020, English originals aligned with their Chinese translations
- Monolingual data : Chinese-only text

The goal is to train an RNN or Transformer translation model. Python needs to be downgraded (for fairseq compatibility):

```=
!python3 --version
```

![截圖 2025-05-04 下午1.24.46](https://hackmd.io/_uploads/BybNqOVele.png)

```=
!pip install 'torch>=1.6.0' editdistance matplotlib sacrebleu sacremoses sentencepiece tqdm wandb
```

Download fairseq:

```=
!git clone https://github.com/pytorch/fairseq.git
```

```=
!pip install pip==24.0
!cd fairseq && git checkout 3f6ba43
!pip install --upgrade /workspace/fairseq
```

```=
import sys
import pdb
import pprint
import logging
import os
import random

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils import data
import numpy as np
import tqdm.auto as tqdm
from pathlib import Path
from argparse import Namespace
from fairseq import utils
import matplotlib.pyplot as plt
```

```=
seed = 33
random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
```

Download and extract the dataset:

```=
from pathlib import Path

data_dir = '/workspace/DATA/rawdata'
dataset_name = 'ted2020'
urls = (
    "https://github.com/figisiwirf/ml2023-hw5-dataset/releases/download/v1.0.1/ml2023.hw5.data.tgz",
    "https://github.com/figisiwirf/ml2023-hw5-dataset/releases/download/v1.0.1/ml2023.hw5.test.tgz"
)
file_names = (
    'ted2020.tgz',  # train & dev
    'test.tgz',     # test
)
prefix = Path(data_dir).absolute() / dataset_name
print(prefix)

# create the directory
prefix.mkdir(parents=True, exist_ok=True)

for u, f in zip(urls, file_names):
    path = prefix / f
    if not path.exists():
        # use curl instead of wget
        !curl -L {u} -o {path}
    if path.suffix == ".tgz":
        !tar -xvf {path} -C {prefix}
    elif path.suffix == ".zip":
        !unzip -o {path} -d {prefix}

# check that the extracted files exist, then rename them
if (prefix / 'raw.en').exists() and (prefix / 'raw.zh').exists():
    !mv {prefix/'raw.en'} {prefix/'train_dev.raw.en'}
    !mv {prefix/'raw.zh'} {prefix/'train_dev.raw.zh'}
    print("raw files moved successfully")
else:
    print("extracted raw.en or raw.zh not found!")

if (prefix / 'test.en').exists() and (prefix / 'test.zh').exists():
    !mv {prefix/'test.en'} {prefix/'test.raw.en'}
    !mv {prefix/'test.zh'} {prefix/'test.raw.zh'}
    print("test files moved successfully")
else:
    print("extracted test.en or test.zh not found!")
```

![截圖 2025-05-04 下午1.33.20](https://hackmd.io/_uploads/HJWEnuVlxx.png)
![截圖 2025-05-04 下午1.33.41](https://hackmd.io/_uploads/BkHShONggl.png)
![截圖 2025-05-04 下午1.35.14](https://hackmd.io/_uploads/SkKs2OEgll.png)

Language

```=
src_lang = 'en'
tgt_lang = 'zh'
# file prefixes produced by the renaming above (the cells below rely on these names)
data_prefix = f'{prefix}/train_dev.raw'
test_prefix = f'{prefix}/test.raw'

with open(f"{data_prefix}.{src_lang}", "r", encoding="utf-8") as file:
    lines = file.readlines()
    for line in lines[:30]:
        print(line.strip())

with open(f"{data_prefix}.{tgt_lang}", "r", encoding="utf-8") as file:
    lines = file.readlines()
    for line in lines[:30]:
        print(line.strip())
```

![截圖 2025-05-04 下午1.35.58](https://hackmd.io/_uploads/rkUCnuVxle.png)
![截圖 2025-05-04 下午1.36.04](https://hackmd.io/_uploads/BJEypdNxle.png)

Preprocess files

```=
# clean the English–Chinese sentence pairs in the corpus
import re

def strQ2B(ustring):
    """Full width -> half width"""
    ss = []
    for s in ustring:
        rstring = ""
        for uchar in s:
            inside_code = ord(uchar)  # code point of each character in s
            if inside_code == 12288:  # full-width space (U+3000) -> half-width space (U+0020)
                inside_code = 32
            elif (inside_code >= 65281 and inside_code <= 65374):
                # other full-width characters (65281-65374): subtract 65248 to get the half-width form
                inside_code -= 65248
            rstring += chr(inside_code)  # append the converted character to the new string
        ss.append(rstring)
    return ''.join(ss)

def clean_s(s, lang):
    if lang == 'en':
        s = re.sub(r"\([^()]*\)", "", s)  # remove text inside parentheses
        s = s.replace('-', '')            # remove '-'
        s = re.sub('([.,;!?()\"])', r' \1 ', s)  # pad punctuation with spaces
    elif lang == 'zh':
        s = strQ2B(s)  # full width -> half width
        s = re.sub(r"\([^()]*\)", "", s)
        s = s.replace(' ', '')
        s = s.replace('—', '')
        s = s.replace('“', '"')
        s = s.replace('”', '"')
        s = s.replace('_', '')
        s = re.sub('([。,;!?()\"~「」])', r' \1 ', s)
    s = ' '.join(s.strip().split())
    return s

def len_s(s, lang):
    if lang == 'zh':
        return len(s)       # for Chinese, each character counts as a token
    return len(s.split())   # for English, count whitespace-separated words

def clean_corpus(prefix, l1, l2, ratio=9, max_len=1000, min_len=1):
    # ratio: drop pairs whose length ratio is too high (likely misaligned)
    if Path(f'{prefix}.clean.{l1}').exists() and Path(f'{prefix}.clean.{l2}').exists():
        print(f'{prefix}.clean.{l1} & {l2} exists. skipping clean.')
        return
    with open(f'{prefix}.{l1}', 'r') as l1_in_f:
        with open(f'{prefix}.{l2}', 'r') as l2_in_f:
            with open(f'{prefix}.clean.{l1}', 'w') as l1_out_f:
                with open(f'{prefix}.clean.{l2}', 'w') as l2_out_f:
                    for s1 in l1_in_f:
                        s1 = s1.strip()
                        s2 = l2_in_f.readline().strip()  # read line by line
                        s1 = clean_s(s1, l1)
                        s2 = clean_s(s2, l2)
                        s1_len = len_s(s1, l1)
                        s2_len = len_s(s2, l2)
                        if min_len > 0:  # drop sentences that are too short
                            if s1_len < min_len or s2_len < min_len:
                                continue
                        if max_len > 0:  # drop sentences that are too long
                            if s1_len > max_len or s2_len > max_len:
                                continue
                        if ratio > 0:  # one side much longer than the other -> likely misaligned
                            if s1_len/s2_len > ratio or s2_len/s1_len > ratio:
                                continue
                        print(s1, file=l1_out_f)
                        print(s2, file=l2_out_f)
```

```=
clean_corpus(data_prefix, src_lang, tgt_lang)
clean_corpus(test_prefix, src_lang, tgt_lang, ratio=-1, min_len=-1, max_len=-1)  # clean test data without any length filtering
```

```=
print(f"{data_prefix}.clean.{src_lang}")
print(f"{data_prefix}.clean.{tgt_lang}")
print(f"{test_prefix}.clean.{src_lang}")
print(f"{test_prefix}.clean.{tgt_lang}")
```

![截圖 2025-05-04 下午1.38.00](https://hackmd.io/_uploads/SkKST_4lxe.png)
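A quick sanity check of the cleaning rules above (the example sentences are made up):

```=
# hypothetical inputs to verify clean_s / strQ2B behave as described
print(clean_s("Hello, world (aside) -- ok!", 'en'))  # -> 'Hello , world ok !'
print(clean_s("今天 天氣 很好!(測試)", 'zh'))       # spaces removed, full-width chars converted, parenthesized text dropped
print(strQ2B("ABC 123"))                          # -> 'ABC 123'
```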
Split into train / valid / test sets:

```=
valid_ratio = 0.01  # 3000~4000 would suffice
test_ratio = 0.005
train_ratio = 1 - valid_ratio - test_ratio
```

```=
print(f'{prefix}/train.raw.en')              # original, deleted after extraction
print(f'{prefix}/train_dev.raw.en')          # extracted
print(f'{data_prefix}.clean.{src_lang}')     # cleaned
print(prefix/f'new_train.clean.{src_lang}')  # to be re-split below
```

![截圖 2025-05-04 下午1.40.18](https://hackmd.io/_uploads/Hk4ATdNglx.png)

```=
if (prefix/f'new_train.clean.{src_lang}').exists() \
and (prefix/f'new_train.clean.{tgt_lang}').exists() \
and (prefix/f'new_valid.clean.{src_lang}').exists() \
and (prefix/f'new_valid.clean.{tgt_lang}').exists() \
and (prefix/f'new_test.clean.{src_lang}').exists() \
and (prefix/f'new_test.clean.{tgt_lang}').exists():
    print(f'train/valid/test splits exists. skipping split.')
else:
    line_num = sum(1 for line in open(f'{data_prefix}.clean.{src_lang}'))  # split from the cleaned files
    labels = list(range(line_num))  # build a shuffled index sequence (randomize the order)
    random.shuffle(labels)
    for lang in [src_lang, tgt_lang]:
        train_f = open(os.path.join(data_dir, dataset_name, f'new_train.clean.{lang}'), 'w')
        valid_f = open(os.path.join(data_dir, dataset_name, f'new_valid.clean.{lang}'), 'w')
        test_f = open(os.path.join(data_dir, dataset_name, f'new_test.clean.{lang}'), 'w')
        count = 0
        for line in open(f'{data_prefix}.clean.{lang}', 'r'):
            p = labels[count] / line_num
            if p < train_ratio:
                train_f.write(line)
            elif p < train_ratio + valid_ratio:
                valid_f.write(line)
            else:
                test_f.write(line)
            count += 1
        train_f.close()
        valid_f.close()
        test_f.close()
```

```=
if (prefix/f'new_train.clean.{src_lang}').exists() \
and (prefix/f'new_train.clean.{tgt_lang}').exists() \
and (prefix/f'new_valid.clean.{src_lang}').exists() \
and (prefix/f'new_valid.clean.{tgt_lang}').exists() \
and (prefix/f'new_test.clean.{src_lang}').exists() \
and (prefix/f'new_test.clean.{tgt_lang}').exists():
    print(f'train/valid/test splits exists. skipping split.')

for split in ['new_train', 'new_valid', 'new_test']:
    for lang in [src_lang, tgt_lang]:
        file_path = prefix / f'{split}.clean.{lang}'
        with open(file_path, 'r') as f:
            line_count = sum(1 for _ in f)
        print(f'{split}.{lang}: {line_count} lines')
```

![截圖 2025-05-04 下午1.44.38](https://hackmd.io/_uploads/SkjACuElle.png)

```=
# Train a SentencePiece subword model
# to mitigate the OOV (out-of-vocabulary) problem in machine translation:
# unseen words (new terms, typos, ...) are common, and splitting words into
# subword units (roots, affixes, characters) alleviates this.
# sentencepiece (developed by Google) supports common algorithms such as
# unigram (a probabilistic model) and BPE (byte-pair encoding, frequency-based merges)
import sentencepiece as spm
vocab_size = 8000
if (prefix/f'spm{vocab_size}.model').exists():
    print(f'{prefix}/spm{vocab_size}.model exists. skipping spm_train.')
else:
    spm.SentencePieceTrainer.train(
        input=','.join([f'{prefix}/new_train.clean.{src_lang}',
                        f'{prefix}/new_valid.clean.{src_lang}',
                        f'{prefix}/new_train.clean.{tgt_lang}',
                        f'{prefix}/new_valid.clean.{tgt_lang}']),
        model_prefix=prefix/f'spm{vocab_size}',  # produces two files: spm8000.model (model) and spm8000.vocab (vocabulary)
        vocab_size=vocab_size,
        character_coverage=1,
        model_type='unigram',  # probabilistic model, or 'bpe'
        input_sentence_size=1e6,  # randomly sample one million sentences for training: enough, and faster
        shuffle_input_sentence=True,
        normalization_rule_name='nmt_nfkc_cf',  # normalization rule designed for MT: full->half width, case folding, etc.
    )
```
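A quick round-trip to see what the trained subword model does (assuming spm8000.model was produced above; the example sentence and its segmentation are illustrative):

```=
# load the model trained above and encode/decode a sample sentence
sp = spm.SentencePieceProcessor(model_file=str(prefix/f'spm{vocab_size}.model'))
pieces = sp.encode("machine translation is fun", out_type=str)
print(pieces)             # subword pieces, e.g. something like ['▁machine', '▁transl', 'ation', ...]
print(sp.decode(pieces))  # the pieces join back into the original text
```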
```=
spm_model = spm.SentencePieceProcessor(model_file=str(prefix/f'spm{vocab_size}.model'))
in_tag = {
    'train': 'new_train.clean',
    'valid': 'new_valid.clean',
    'test': 'new_test.clean',
    'test_ori': 'test.raw.clean',
}
for split in ['train', 'valid', 'test', 'test_ori']:
    for lang in [src_lang, tgt_lang]:
        out_path = prefix/f'{split}.{lang}'
        if out_path.exists():
            print(f"{out_path} exists. skipping spm_encode.")
        else:
            with open(prefix/f'{split}.{lang}', 'w') as out_f:
                with open(prefix/f'{in_tag[split]}.{lang}', 'r') as in_f:
                    for line in in_f:
                        line = line.strip()
                        tok = spm_model.encode(line, out_type=str)
                        print(' '.join(tok), file=out_f)
```

![截圖 2025-05-04 下午1.46.23](https://hackmd.io/_uploads/BkccytNgxe.png)

Binarize the data with fairseq

```=
# convert to the binary format fairseq expects
# prep: the data source
from pathlib import Path
binpath = Path('/workspace/DATA/data-bin', dataset_name)
if binpath.exists():
    print(binpath, "exists, will not overwrite!")
else:
    cmd = f"""
    python -m fairseq_cli.preprocess \\
        --source-lang {src_lang} \\
        --target-lang {tgt_lang} \\
        --trainpref {prefix}/train \\
        --validpref {prefix}/valid \\
        --testpref {prefix}/test \\
        --destdir {binpath} \\
        --joined-dictionary \\
        --workers 2
    """
    print("Running command:\n", cmd)
    os.system(cmd)  # run the command directly from Python
```

Configuration for Experiments

```=
# config for training the translation model with fairseq
config = Namespace(
    datadir = "/workspace/DATA/data-bin/ted2020",
    savedir = "/workspace/checkpoints/rnn",
    source_lang = "en",
    target_lang = "zh",
    num_workers=2,
    max_tokens=8192,
    accum_steps=2,  # gradient accumulation steps
    # learning rate and optimizer
    lr_factor=2.,
    lr_warmup=4000,
    # gradient clipping
    clip_norm=1.0,
    # number of epochs
    max_epoch=30,
    start_epoch=1,
    # beam size: larger may improve quality, but slows decoding
    beam=5,
    # maximum output length = 1.2 * source length + 10
    max_len_a=1.2,
    max_len_b=10,
    # automatically strips symbols such as ▁ and </s>
    post_process = "sentencepiece",
    # keep the checkpoints of the last 5 epochs
    keep_last_epochs=5,
    resume=None,  # train from scratch
    # logging: if set to True, pip install wandb and log in first
    use_wandb=False,
)
```

Logging

```=
# log output format
logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level="INFO",  # "DEBUG" "WARNING" "ERROR"
    stream=sys.stdout,
)
proj = "hw5.seq2seq"
logger = logging.getLogger(proj)
if config.use_wandb:
    import wandb
    wandb.init(project=proj, name=Path(config.savedir).stem, config=config)
```

```=
logger.info("Start training...")
logger.warning("Something may go wrong.")
```

CUDA Environment

```=
cuda_env = utils.CudaEnvironment()
utils.CudaEnvironment.pretty_print_cuda_env_list([cuda_env])
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
```

Borrow the TranslationTask from fairseq

```=
# set up and test fairseq's TranslationTask, which reads the binarized data (data-bin)
from fairseq.tasks.translation import TranslationConfig, TranslationTask

task_cfg = TranslationConfig(
    data=config.datadir,             # data directory (data-bin)
    source_lang=config.source_lang,  # source language (e.g. "en")
    target_lang=config.target_lang,  # target language (e.g. "zh")
    train_subset="train",            # name of the training split
    required_seq_len_multiple=8,     # pad lengths to a multiple of 8 to speed up training
    dataset_impl="mmap",             # memory-map the data for faster reads
    upsample_primary=1,              # upsampling ratio of the primary data (usually 1)
)
# setting up the task from task_cfg provides, among other things:
# - data loading methods
# - the task.source_dictionary and task.target_dictionary objects
# - the beam search decoder
task = TranslationTask.setup_task(task_cfg)
```

```=
# reads the dataset made of train.en-zh.en and train.en-zh.zh from data-bin/ted2020
logger.info("loading data for epoch 1")
task.load_dataset(split="train", epoch=1, combine=True)
task.load_dataset(split="valid", epoch=1)
```

```=
sample = task.dataset("valid")[1]
pprint.pprint(sample)
pprint.pprint(
    "Source: " + \
    task.source_dictionary.string(
        sample['source'],
        config.post_process,
    )
)
pprint.pprint(
    "Target: " + \
    task.target_dictionary.string(
        sample['target'],
        config.post_process,
    )
)
```

![截圖 2025-05-04 下午3.56.26](https://hackmd.io/_uploads/rJ1TacVegx.png)
Dataset Iterator

```=
def load_data_iterator(task, split, epoch=1, max_tokens=4000, num_workers=1, cached=True):
    batch_iterator = task.get_batch_iterator(
        dataset=task.dataset(split),
        max_tokens=max_tokens,
        max_sentences=None,
        max_positions=utils.resolve_max_positions(
            task.max_positions(),
            max_tokens,
        ),
        ignore_invalid_inputs=True,
        seed=seed,
        num_workers=num_workers,
        epoch=epoch,
        disable_iterator_cache=not cached,
        # Set this to False to speed up. However, if set to False, changing max_tokens beyond
        # first call of this method has no effect.
    )
    return batch_iterator

demo_epoch_obj = load_data_iterator(task, "valid", epoch=1, max_tokens=20, num_workers=1, cached=False)
demo_iter = demo_epoch_obj.next_epoch_itr(shuffle=True)
sample = next(demo_iter)
sample
```

![截圖 2025-05-04 下午3.58.04](https://hackmd.io/_uploads/SJnzA9Ngle.png)

Each batch produced by the iterator has the following structure:

```=
batch = {
    "id": id,                    # id for each example
    "nsentences": len(samples),  # batch size (sentences)
    "ntokens": ntokens,          # batch size (tokens)
    "net_input": {
        "src_tokens": src_tokens,    # sequence in source language
        "src_lengths": src_lengths,  # sequence length of each example before padding
        "prev_output_tokens": prev_output_tokens,  # right-shifted target, as mentioned above
    },
    "target": target,            # target sequence
}
```

Model Architecture

```=
from fairseq.models import (
    FairseqEncoder,
    FairseqIncrementalDecoder,
    FairseqEncoderDecoderModel
)
```

Encoder

```=
class RNNEncoder(FairseqEncoder):
    def __init__(self, args, dictionary, embed_tokens):
        super().__init__(dictionary)
        self.embed_tokens = embed_tokens  # the embedding layer (nn.Embedding) that maps token IDs to vectors

        # store the hyperparameters on the instance
        self.embed_dim = args.encoder_embed_dim
        self.hidden_dim = args.encoder_ffn_embed_dim
        self.num_layers = args.encoder_layers

        self.dropout_in_module = nn.Dropout(args.dropout)
        self.rnn = nn.GRU(
            self.embed_dim,        # embedding output dimension
            self.hidden_dim,       # GRU hidden dimension
            self.num_layers,       # number of stacked GRU layers
            dropout=args.dropout,  # regularization: randomly zero some activations
            batch_first=False,     # input format is [seq_len, batch, dim]
            bidirectional=True     # bidirectional RNN: produces both forward and backward results
        )
        self.dropout_out_module = nn.Dropout(args.dropout)

        # index of the padding token, used later for masking
        self.padding_idx = dictionary.pad()

    # merge the final hidden states of the bidirectional RNN (forward and backward directions)
    def combine_bidir(self, outs, bsz: int):
        out = outs.view(self.num_layers, 2, bsz, -1).transpose(1, 2).contiguous()  # swap the direction and batch dims
        return out.view(self.num_layers, bsz, -1)  # concatenate the two directions -> hidden * 2

    def forward(self, src_tokens, **unused):
        bsz, seqlen = src_tokens.size()  # batch size and sentence length

        # map each token to a vector and apply dropout
        x = self.embed_tokens(src_tokens)
        x = self.dropout_in_module(x)

        # B x T x C -> T x B x C
        # reshape into the format the RNN expects: [seq_len, batch_size, embed_dim]
        x = x.transpose(0, 1)

        # pass thru bidirectional RNN
        # bidirectional -> initialize 2 x num_layers hidden states,
        # feed the embedded input and the initial hidden state into the GRU, then apply dropout
        h0 = x.new_zeros(2 * self.num_layers, bsz, self.hidden_dim)
        x, final_hiddens = self.rnn(x, h0)
        outputs = self.dropout_out_module(x)
        # outputs = [sequence len, batch size, hid dim * directions]
        # hidden =  [num_layers * directions, batch size, hid dim]

        # since the encoder is bidirectional, concatenate the hidden states of the two directions
        final_hiddens = self.combine_bidir(final_hiddens, bsz)
        # hidden = [num_layers x batch x num_directions*hidden]

        # build the padding mask
        encoder_padding_mask = src_tokens.eq(self.padding_idx).t()
        return tuple(
            (
                outputs,  # seq_len x batch x hidden
                final_hiddens,  # num_layers x batch x num_directions*hidden
                encoder_padding_mask,  # seq_len x batch
            )
        )

    # for beam search: the decoder reorders the encoder output according to the beams' new order;
    # reorder outputs, final_hiddens and the mask along the batch dimension by new_order
    def reorder_encoder_out(self, encoder_out, new_order):
        return tuple(
            (
                encoder_out[0].index_select(1, new_order),
                encoder_out[1].index_select(1, new_order),
                encoder_out[2].index_select(1, new_order),
            )
        )
```
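A quick shape check of the encoder with toy hyperparameters (the tiny dimensions are made up for illustration; this assumes the task set up earlier):

```=
# instantiate the encoder with toy dims and confirm the output shapes
toy_args = Namespace(encoder_embed_dim=32, encoder_ffn_embed_dim=64,
                     encoder_layers=1, dropout=0.0)
toy_emb = nn.Embedding(len(task.source_dictionary), 32, task.source_dictionary.pad())
toy_enc = RNNEncoder(toy_args, task.source_dictionary, toy_emb)
toy_tokens = torch.randint(4, 100, (2, 7))  # a batch of 2 sentences, 7 tokens each
outs, hiddens, mask = toy_enc(toy_tokens)
print(outs.shape)     # torch.Size([7, 2, 128])  seq_len x batch x 2*hidden
print(hiddens.shape)  # torch.Size([1, 2, 128])  num_layers x batch x 2*hidden
print(mask.shape)     # torch.Size([7, 2])       seq_len x batch
```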
Attention

```=
# bias: whether the linear layers include a bias term
class AttentionLayer(nn.Module):
    def __init__(self, input_embed_dim, source_embed_dim, output_embed_dim, bias=False):
        super().__init__()
        self.input_proj = nn.Linear(input_embed_dim, source_embed_dim, bias=bias)
        self.output_proj = nn.Linear(
            input_embed_dim + source_embed_dim, output_embed_dim, bias=bias
        )

    def forward(self, inputs, encoder_outputs, encoder_padding_mask):
        # inputs: T x B x dim (decoder states)
        # encoder_outputs: S x B x dim, where S is the source length and dim is source_embed_dim
        # padding mask: S x B, marking padded positions in the source (True = padding, False = valid)

        # convert all to batch first
        inputs = inputs.transpose(1, 0)  # (T, B, dim) -> (B, T, dim)
        encoder_outputs = encoder_outputs.transpose(1, 0)  # B x S x dim
        encoder_padding_mask = encoder_padding_mask.transpose(1, 0)  # B x S

        # project the decoder input through input_proj to the same dimension as the encoder outputs
        x = self.input_proj(inputs)

        # compute attention scores
        # (B, T, dim) x (B, dim, S) = (B, T, S)
        attn_scores = torch.bmm(x, encoder_outputs.transpose(1, 2))

        # if an encoder_padding_mask is given, handle the scores at padded positions
        if encoder_padding_mask is not None:
            # leveraging broadcast: B x S -> (B, 1, S)
            # set the scores at padded positions to -inf so their softmax weight becomes 0
            encoder_padding_mask = encoder_padding_mask.unsqueeze(1)
            attn_scores = (
                attn_scores.float()
                .masked_fill_(encoder_padding_mask, float("-inf"))
                .type_as(attn_scores)
            )  # FP16 support: cast to float and back

        # softmax over the last dim (source dimension S) to get the attention weights
        attn_scores = F.softmax(attn_scores, dim=-1)

        # weighted sum of the encoder outputs -> context vector
        x = torch.bmm(attn_scores, encoder_outputs)  # (B, T, dim)

        # concatenate the context vector with the original decoder input along the feature dim
        x = torch.cat((x, inputs), dim=-1)
        x = torch.tanh(self.output_proj(x))  # concat + linear + tanh

        # restore shape (B, T, dim) -> (T, B, dim)
        # the concatenated vector is projected to output_embed_dim with a tanh activation
        return x.transpose(1, 0), attn_scores
```
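A toy forward pass through the AttentionLayer above to confirm the expected shapes (all sizes here are made up):

```=
# decoder length T=5, source length S=7, batch B=2
attn = AttentionLayer(input_embed_dim=8, source_embed_dim=16, output_embed_dim=8)
dec_states = torch.randn(5, 2, 8)               # T x B x input_dim
enc_outs = torch.randn(7, 2, 16)                # S x B x source_dim
pad_mask = torch.zeros(7, 2, dtype=torch.bool)  # no padding in this toy example
ctx, scores = attn(dec_states, enc_outs, pad_mask)
print(ctx.shape)     # torch.Size([5, 2, 8])  T x B x output_dim
print(scores.shape)  # torch.Size([2, 5, 7])  B x T x S, each row sums to 1
```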
Decoder

```=
class RNNDecoder(FairseqIncrementalDecoder):
    def __init__(self, args, dictionary, embed_tokens):
        super().__init__(dictionary)
        self.embed_tokens = embed_tokens

        # sanity check: the encoder and decoder must have the same number of RNN layers
        assert args.decoder_layers == args.encoder_layers, f"""seq2seq rnn requires that encoder
        and decoder have same layers of rnn. got: {args.encoder_layers, args.decoder_layers}"""
        # sanity check: the decoder hidden dim must be twice the encoder hidden dim
        # (the encoder is bidirectional, so its two directions are concatenated)
        assert args.decoder_ffn_embed_dim == args.encoder_ffn_embed_dim*2, f"""seq2seq-rnn requires
        that decoder hidden to be 2*encoder hidden dim. got: {args.decoder_ffn_embed_dim, args.encoder_ffn_embed_dim*2}"""

        # read the decoder hyperparameters from args
        self.embed_dim = args.decoder_embed_dim
        self.hidden_dim = args.decoder_ffn_embed_dim
        self.num_layers = args.decoder_layers

        # a unidirectional GRU layer
        self.dropout_in_module = nn.Dropout(args.dropout)
        self.rnn = nn.GRU(
            self.embed_dim,
            self.hidden_dim,
            self.num_layers,
            dropout=args.dropout,
            batch_first=False,
            bidirectional=False
        )
        # an instance of the AttentionLayer defined above
        self.attention = AttentionLayer(
            self.embed_dim, self.hidden_dim, self.embed_dim, bias=False
        )
        self.dropout_out_module = nn.Dropout(args.dropout)

        # if the GRU hidden dim differs from the embedding dim, project the hidden
        # states down to the embedding dim; otherwise set this to None
        if self.hidden_dim != self.embed_dim:
            self.project_out_dim = nn.Linear(self.hidden_dim, self.embed_dim)
        else:
            self.project_out_dim = None

        # if args.share_decoder_input_output_embed is True, tie the weights of the
        # input embedding (embed_tokens) and the output projection;
        # otherwise create an independent linear layer mapping
        # output_embed_dim (= embed_dim) to the vocabulary size
        if args.share_decoder_input_output_embed:
            self.output_projection = nn.Linear(
                self.embed_tokens.weight.shape[1],
                self.embed_tokens.weight.shape[0],
                bias=False,
            )
            self.output_projection.weight = self.embed_tokens.weight
        else:
            self.output_projection = nn.Linear(
                self.output_embed_dim, len(dictionary), bias=False
            )
            nn.init.normal_(
                self.output_projection.weight, mean=0, std=self.output_embed_dim ** -0.5
            )

    def forward(self, prev_output_tokens, encoder_out, incremental_state=None, **unused):
        # extract the outputs from encoder
        encoder_outputs, encoder_hiddens, encoder_padding_mask = encoder_out
        # outputs:          seq_len x batch x num_directions*hidden
        # encoder_hiddens:  num_layers x batch x num_directions*encoder_hidden
        # padding_mask:     seq_len x batch

        # at inference time, if incremental_state exists, use only the last token
        # (prev_output_tokens[:, -1:]) and restore the previous hidden state from
        # incremental_state; otherwise (training, or the first inference step)
        # use the encoder's final hidden state (encoder_hiddens) as the initial hidden state
        if incremental_state is not None and len(incremental_state) > 0:
            # if the information from last timestep is retained, we can continue from there instead of starting from bos
            prev_output_tokens = prev_output_tokens[:, -1:]
            cache_state = self.get_incremental_state(incremental_state, "cached_state")
            prev_hiddens = cache_state["prev_hiddens"]
        else:
            # incremental state does not exist: either this is training time, or the first timestep of test time
            # prepare for seq2seq: pass the encoder hidden states to the decoder
            prev_hiddens = encoder_hiddens

        bsz, seqlen = prev_output_tokens.size()  # batch size and sequence length

        # embed tokens
        x = self.embed_tokens(prev_output_tokens)
        x = self.dropout_in_module(x)

        # B x T x C -> T x B x C
        # from (batch_size, seq_len, embed_dim) to (seq_len, batch_size, embed_dim)
        x = x.transpose(0, 1)

        # decoder-to-encoder attention
        # if an attention layer is used, compute the context vectors and attention scores
        if self.attention is not None:
            x, attn = self.attention(x, encoder_outputs, encoder_padding_mask)

        # pass thru unidirectional RNN to produce the new hidden states
        x, final_hiddens = self.rnn(x, prev_hiddens)
        # outputs = [sequence len, batch size, hid dim]
        # hidden =  [num_layers * directions, batch size, hid dim]
        x = self.dropout_out_module(x)

        # if hidden_dim differs from embed_dim, project the GRU output down through project_out_dim
        if self.project_out_dim != None:
            x = self.project_out_dim(x)

        # project to the vocabulary size through output_projection
        x = self.output_projection(x)

        # T x B x C -> B x T x C
        # from sequence-first (seq_len, batch_size, vocab_size) to batch-first (batch_size, seq_len, vocab_size)
        x = x.transpose(1, 0)

        # for incremental decoding, cache the hidden states of the current timestep (final_hiddens)
        cache_state = {
            "prev_hiddens": final_hiddens,
        }
        self.set_incremental_state(incremental_state, "cached_state", cache_state)

        return x, None

    # reorder the incremental state during beam search
    def reorder_incremental_state(
        self,
        incremental_state,
        new_order,
    ):
        cache_state = self.get_incremental_state(incremental_state, "cached_state")
        prev_hiddens = cache_state["prev_hiddens"]
        prev_hiddens = [p.index_select(0, new_order) for p in prev_hiddens]
        # update the cached state and save it back into incremental_state
        cache_state = {
            "prev_hiddens": torch.stack(prev_hiddens),
        }
        self.set_incremental_state(incremental_state, "cached_state", cache_state)
        return
```
Seq2Seq

```=
# sequence-to-sequence model:
# fairseq's FairseqEncoderDecoderModel, composed of an Encoder plus a Decoder
class Seq2Seq(FairseqEncoderDecoderModel):
    def __init__(self, args, encoder, decoder):
        super().__init__(encoder, decoder)
        self.args = args

    def forward(
        self,
        src_tokens,
        src_lengths,
        prev_output_tokens,
        return_all_hiddens: bool = True,
    ):
        """
        Run the forward pass for an encoder-decoder model.
        """
        # the Encoder encodes the input
        encoder_out = self.encoder(
            src_tokens, src_lengths=src_lengths, return_all_hiddens=return_all_hiddens
        )
        # the Decoder produces logits (classification scores) from the encoder output
        logits, extra = self.decoder(
            prev_output_tokens,
            encoder_out=encoder_out,
            src_lengths=src_lengths,
            return_all_hiddens=return_all_hiddens,
        )
        return logits, extra
```

Model Initialization

```=
# # HINT: transformer architecture
# from fairseq.models.transformer import (
#     TransformerEncoder,
#     TransformerDecoder,
# )

def build_model(args, task):
    """ build a model instance based on hyperparameters """
    src_dict, tgt_dict = task.source_dictionary, task.target_dictionary

    # token embeddings
    encoder_embed_tokens = nn.Embedding(len(src_dict), args.encoder_embed_dim, src_dict.pad())
    decoder_embed_tokens = nn.Embedding(len(tgt_dict), args.decoder_embed_dim, tgt_dict.pad())

    # encoder decoder
    # HINT: TODO: switch to TransformerEncoder & TransformerDecoder
    encoder = RNNEncoder(args, src_dict, encoder_embed_tokens)
    decoder = RNNDecoder(args, tgt_dict, decoder_embed_tokens)

    # sequence to sequence model
    model = Seq2Seq(args, encoder, decoder)

    # initialization for seq2seq model is important, requires extra handling
    def init_params(module):
        from fairseq.modules import MultiheadAttention
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if module.bias is not None:
                module.bias.data.zero_()
        if isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        if isinstance(module, MultiheadAttention):
            module.q_proj.weight.data.normal_(mean=0.0, std=0.02)
            module.k_proj.weight.data.normal_(mean=0.0, std=0.02)
            module.v_proj.weight.data.normal_(mean=0.0, std=0.02)
        if isinstance(module, nn.RNNBase):
            for name, param in module.named_parameters():
                if "weight" in name or "bias" in name:
                    param.data.uniform_(-0.1, 0.1)

    # weight initialization
    model.apply(init_params)
    return model
```

Architecture Related Configuration

```=
arch_args = Namespace(
    encoder_embed_dim=256,
    encoder_ffn_embed_dim=512,
    encoder_layers=1,
    decoder_embed_dim=256,
    decoder_ffn_embed_dim=1024,
    decoder_layers=1,
    share_decoder_input_output_embed=True,
    dropout=0.3,
)

# # HINT: these patches on parameters for Transformer
# def add_transformer_args(args):
#     args.encoder_attention_heads=4
#     args.encoder_normalize_before=True
#     args.decoder_attention_heads=4
#     args.decoder_normalize_before=True
#     args.activation_fn="relu"
#     args.max_source_positions=1024
#     args.max_target_positions=1024
#     # patches on default parameters for Transformer (those not set above)
#     from fairseq.models.transformer import base_architecture
#     base_architecture(arch_args)

# add_transformer_args(arch_args)
```
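Following the HINT comments above, a sketch of what the Transformer switch could look like. This assumes the fairseq version cloned earlier, where `TransformerEncoder`/`TransformerDecoder` take `(args, dictionary, embed_tokens)`; the helper name `build_transformer_model` is made up, and hyperparameter choices are up to you.

```=
# a sketch of the HINT above: swap the RNN modules for fairseq's Transformer ones
from fairseq.models.transformer import (
    TransformerEncoder,
    TransformerDecoder,
)

def build_transformer_model(args, task):
    src_dict, tgt_dict = task.source_dictionary, task.target_dictionary
    encoder_embed_tokens = nn.Embedding(len(src_dict), args.encoder_embed_dim, src_dict.pad())
    decoder_embed_tokens = nn.Embedding(len(tgt_dict), args.decoder_embed_dim, tgt_dict.pad())
    # same constructor pattern as the RNN modules: (args, dictionary, embed_tokens)
    encoder = TransformerEncoder(args, src_dict, encoder_embed_tokens)
    decoder = TransformerDecoder(args, tgt_dict, decoder_embed_tokens)
    return Seq2Seq(args, encoder, decoder)

# arch_args must first be patched with the Transformer-specific fields
# (attention heads, pre-norm, relu, max positions, fairseq defaults), i.e.:
# add_transformer_args(arch_args)
# model = build_transformer_model(arch_args, task)
```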
```=
if config.use_wandb:
    wandb.config.update(vars(arch_args))
```

```=
model = build_model(arch_args, task)
logger.info(model)
```

Loss: Label Smoothing Regularization

```=
class LabelSmoothedCrossEntropyCriterion(nn.Module):
    def __init__(self, smoothing, ignore_index=None, reduce=True):
        super().__init__()
        self.smoothing = smoothing
        self.ignore_index = ignore_index
        self.reduce = reduce

    def forward(self, lprobs, target):
        if target.dim() == lprobs.dim() - 1:
            target = target.unsqueeze(-1)
        # nll: negative log likelihood, the cross-entropy when target is one-hot.
        # the following line is the same as F.nll_loss
        nll_loss = -lprobs.gather(dim=-1, index=target)
        # reserve some probability for other labels: when calculating cross-entropy,
        # this is equivalent to summing the log probs of all labels
        smooth_loss = -lprobs.sum(dim=-1, keepdim=True)
        if self.ignore_index is not None:
            pad_mask = target.eq(self.ignore_index)
            nll_loss.masked_fill_(pad_mask, 0.0)
            smooth_loss.masked_fill_(pad_mask, 0.0)
        else:
            nll_loss = nll_loss.squeeze(-1)
            smooth_loss = smooth_loss.squeeze(-1)
        if self.reduce:
            nll_loss = nll_loss.sum()
            smooth_loss = smooth_loss.sum()
        # when calculating cross-entropy, add the loss of the other labels
        eps_i = self.smoothing / lprobs.size(-1)
        loss = (1.0 - self.smoothing) * nll_loss + eps_i * smooth_loss
        return loss

# generally, 0.1 is good enough
criterion = LabelSmoothedCrossEntropyCriterion(
    smoothing=0.1,
    ignore_index=task.target_dictionary.pad(),
)
```
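A toy check of the criterion on a hypothetical 4-word vocabulary with 2 target tokens, compared against the unsmoothed NLL loss:

```=
# made-up logits for 2 tokens over a 4-word vocabulary
toy_lprobs = F.log_softmax(torch.tensor([[2.0, 0.5, 0.1, 0.1],
                                         [0.2, 0.2, 3.0, 0.1]]), dim=-1)
toy_target = torch.tensor([0, 2])
toy_criterion = LabelSmoothedCrossEntropyCriterion(smoothing=0.1)
# smoothed loss = 0.9 * NLL + (0.1/4) * sum of -log p over all labels,
# which penalizes over-confident predictions
print(toy_criterion(toy_lprobs, toy_target))
print(F.nll_loss(toy_lprobs, toy_target, reduction='sum'))  # the smoothing=0 baseline
```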
Optimizer: Adam + lr scheduling

```=
# a wrapper around an optimizer (such as Adam) that changes the learning rate
# dynamically with the step count, instead of using a fixed learning rate
class NoamOpt:
    "Optim wrapper that implements rate."
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0

    @property
    def param_groups(self):
        return self.optimizer.param_groups

    def multiply_grads(self, c):
        """Multiplies grads by a constant *c*."""
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.grad.data.mul_(c)

    # update the learning rate and take one optimization step
    def step(self):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()

    # compute the learning rate at the current step
    def rate(self, step=None):
        "Implement `lrate` above"
        if step is None:
            step = self._step
        return 0 if not step else self.factor * \
            (self.model_size ** (-0.5) * min(step ** (-0.5), step * self.warmup ** (-1.5)))
```
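The `rate()` method implements the Noam schedule from "Attention Is All You Need": a linear warmup for the first `warmup` steps, followed by inverse-square-root decay:

$$
lr(step) = factor \cdot d_{\text{model}}^{-0.5} \cdot \min\left(step^{-0.5},\ step \cdot warmup^{-1.5}\right)
$$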
Scheduling Visualized

```=
# this schedule is widely used for Transformer models because it helps stabilize training
optimizer = NoamOpt(
    model_size=arch_args.encoder_embed_dim,
    factor=config.lr_factor,
    warmup=config.lr_warmup,
    optimizer=torch.optim.AdamW(model.parameters(), lr=0, betas=(0.9, 0.98),
                                eps=1e-9, weight_decay=0.0001))
plt.plot(np.arange(1, 100000), [optimizer.rate(i) for i in range(1, 100000)])
plt.legend([f"{optimizer.model_size}:{optimizer.warmup}"])
None
```

![截圖 2025-05-04 下午4.16.44](https://hackmd.io/_uploads/Sy2ufoNele.png)

【Training Procedure】

Training

```=
from fairseq.data import iterators
from torch.cuda.amp import GradScaler, autocast

def train_one_epoch(epoch_itr, model, task, criterion, optimizer, accum_steps=1):
    # read a fresh epoch with shuffled order (shuffle=True);
    # parameters are updated only every accum_steps samples (gradient accumulation)
    itr = epoch_itr.next_epoch_itr(shuffle=True)
    itr = iterators.GroupedIterator(itr, accum_steps)

    # AMP + gradient-accumulation training
    stats = {"loss": []}
    scaler = GradScaler()  # automatic mixed precision (amp)

    model.train()
    progress = tqdm.tqdm(itr, desc=f"train epoch {epoch_itr.epoch}", leave=False)
    for samples in progress:
        model.zero_grad()
        accum_loss = 0
        sample_size = 0
        # gradient accumulation: update every accum_steps samples
        for i, sample in enumerate(samples):
            if i == 1:
                # emptying the CUDA cache after the first step can reduce the chance of OOM
                torch.cuda.empty_cache()

            sample = utils.move_to_cuda(sample, device=device)
            target = sample["target"]
            sample_size_i = sample["ntokens"]
            sample_size += sample_size_i

            # mixed precision training
            with autocast():
                net_output = model.forward(**sample["net_input"])
                lprobs = F.log_softmax(net_output[0], -1)
                loss = criterion(lprobs.view(-1, lprobs.size(-1)), target.view(-1))

                # logging
                accum_loss += loss.item()
                # back-prop
                scaler.scale(loss).backward()

        scaler.unscale_(optimizer)
        optimizer.multiply_grads(1 / (sample_size or 1.0))  # (sample_size or 1.0) handles the case of a zero gradient
        gnorm = nn.utils.clip_grad_norm_(model.parameters(), config.clip_norm)  # grad norm clipping prevents gradient exploding

        scaler.step(optimizer)
        scaler.update()

        # logging
        loss_print = accum_loss/sample_size
        stats["loss"].append(loss_print)
        progress.set_postfix(loss=loss_print)
        if config.use_wandb:
            wandb.log({
                "train/loss": loss_print,
                "train/grad_norm": gnorm.item(),
                "train/lr": optimizer.rate(),
                "train/sample_size": sample_size,
            })

    loss_print = np.mean(stats["loss"])
    logger.info(f"training loss: {loss_print:.4f}")
    return stats
```

Validation & Inference

```=
# fairseq's beam search generator:
# given the model and an input sequence, produce translation hypotheses by beam search
sequence_generator = task.build_generator([model], config)

def decode(toks, dictionary):
    # this helper turns a tensor of token IDs back into human-readable text
    s = dictionary.string(
        toks.int().cpu(),
        config.post_process,
    )
    return s if s else "<unk>"

def inference_step(sample, model):
    # run beam search decoding on the sample and return, for each example,
    # the source (src), the hypothesis (hyp) and the reference (ref)
    gen_out = sequence_generator.generate([model], sample)
    srcs = []
    hyps = []
    refs = []
    for i in range(len(gen_out)):
        # for each sample, collect the input, hypothesis and reference, later used to calculate BLEU
        srcs.append(decode(
            utils.strip_pad(sample["net_input"]["src_tokens"][i], task.source_dictionary.pad()),
            task.source_dictionary,
        ))
        hyps.append(decode(
            gen_out[i][0]["tokens"],  # 0 indicates using the top hypothesis in beam
            task.target_dictionary,
        ))
        refs.append(decode(
            utils.strip_pad(sample["target"][i], task.target_dictionary.pad()),
            task.target_dictionary,
        ))
    return srcs, hyps, refs
```

```=
import shutil
import sacrebleu

def validate(model, task, criterion, log_to_wandb=True):
    logger.info('begin validation')
    # build the validation-set iterator (the 1 here is the epoch argument)
    itr = load_data_iterator(task, "valid", 1, config.max_tokens, config.num_workers).next_epoch_itr(shuffle=False)

    stats = {"loss": [], "bleu": 0, "srcs": [], "hyps": [], "refs": []}
    srcs = []
    hyps = []
    refs = []

    model.eval()
    progress = tqdm.tqdm(itr, desc=f"validation", leave=False)
    with torch.no_grad():
        for i, sample in enumerate(progress):
            # validation loss: preprocessing and forward pass
            sample = utils.move_to_cuda(sample, device=device)
            net_output = model.forward(**sample["net_input"])

            lprobs = F.log_softmax(net_output[0], -1)
            target = sample["target"]
            sample_size = sample["ntokens"]
            # compute the loss
            loss = criterion(lprobs.view(-1, lprobs.size(-1)), target.view(-1)) / sample_size
            progress.set_postfix(valid_loss=loss.item())
            stats["loss"].append(loss)

            # beam search inference + decoding
            s, h, r = inference_step(sample, model)
            srcs.extend(s)
            hyps.extend(h)
            refs.extend(r)

    # BLEU evaluation
    tok = 'zh' if task.cfg.target_lang == 'zh' else '13a'
    stats["loss"] = torch.stack(stats["loss"]).mean().item()
    stats["bleu"] = sacrebleu.corpus_bleu(hyps, [refs], tokenize=tok)  # compute the BLEU score
    stats["srcs"] = srcs
    stats["hyps"] = hyps
    stats["refs"] = refs

    # wandb logging (optional)
    if config.use_wandb and log_to_wandb:
        wandb.log({
            "valid/loss": stats["loss"],
            "valid/bleu": stats["bleu"].score,
        }, commit=False)

    # print one random translation example (for debugging / demonstration)
    showid = np.random.randint(len(hyps))
    logger.info("example source: " + srcs[showid])
    logger.info("example hypothesis: " + hyps[showid])
    logger.info("example reference: " + refs[showid])

    # logger.info(f"validation loss:\t{stats['loss']:.4f}")
    logger.info(stats["bleu"].format())
    return stats
```
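sacrebleu computes corpus-level BLEU directly on detokenized strings; a tiny made-up example with the same 'zh' tokenizer used above:

```=
import sacrebleu
toy_hyps = ["今天天氣很好", "我喜歡機器學習"]
toy_refs = [["今天天氣真好", "我喜歡機器學習"]]  # one reference stream, aligned with the hypotheses
print(sacrebleu.corpus_bleu(toy_hyps, toy_refs, tokenize='zh').score)  # 0-100, higher is better
```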
savedir/f"checkpoint{epoch}.pt") shutil.copy(savedir/f"checkpoint{epoch}.pt", savedir/f"checkpoint_last.pt") logger.info(f"saved epoch checkpoint: {savedir}/checkpoint{epoch}.pt") # 儲存翻譯輸出(source + hypothesis)s with open(savedir/f"samples{epoch}.{config.source_lang}-{config.target_lang}.txt", "w") as f: for s, h in zip(stats["srcs"], stats["hyps"]): f.write(f"{s}\t{h}\n") # 儲存最佳模型(highest BLEU) if getattr(validate_and_save, "best_bleu", 0) < bleu.score: validate_and_save.best_bleu = bleu.score torch.save(check, savedir/f"checkpoint_best.pt") # 刪除舊模型(只保留最新 N 個) del_file = savedir / f"checkpoint{epoch - config.keep_last_epochs}.pt" if del_file.exists(): del_file.unlink() return stats def try_load_checkpoint(model, optimizer=None, name=None): name = name if name else "checkpoint_last.pt" checkpath = Path(config.savedir)/name # 載入上次訓練的模型狀態 if checkpath.exists(): check = torch.load(checkpath) model.load_state_dict(check["model"]) stats = check["stats"] step = "unknown" if optimizer != None: optimizer._step = step = check["optim"]["step"] logger.info(f"loaded checkpoint {checkpath}: step={step} loss={stats['loss']} bleu={stats['bleu']}") else: logger.info(f"no checkpoints found at {checkpath}!") ``` 【Main】 Training loop ```= model = model.to(device=device) criterion = criterion.to(device=device) ``` ```= # 設定用來確認訓練任務的基本資訊 logger.info("task: {}".format(task.__class__.__name__)) logger.info("encoder: {}".format(model.encoder.__class__.__name__)) logger.info("decoder: {}".format(model.decoder.__class__.__name__)) logger.info("criterion: {}".format(criterion.__class__.__name__)) logger.info("optimizer: {}".format(optimizer.__class__.__name__)) logger.info( "num. model params: {:,} (num. trained: {:,})".format( sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad), ) ) logger.info(f"max tokens per batch = {config.max_tokens}, accumulate steps = {config.accum_steps}") ``` ```= print(f"Start from epoch: {config.start_epoch}") print(f"Train until epoch: {config.max_epoch}") print(f"Total epochs: {config.max_epoch - config.start_epoch + 1}") ``` ![截圖 2025-05-04 下午4.23.09](https://hackmd.io/_uploads/ryfZNoElge.png) ```= # 開始訓練 epoch_itr = load_data_iterator(task, "train", config.start_epoch, config.max_tokens, config.num_workers) try_load_checkpoint(model, optimizer, name=config.resume) while epoch_itr.next_epoch_idx <= config.max_epoch: # train for one epoch train_one_epoch(epoch_itr, model, task, criterion, optimizer, config.accum_steps) stats = validate_and_save(model, task, criterion, optimizer, epoch=epoch_itr.epoch) logger.info("end of epoch {}".format(epoch_itr.epoch)) epoch_itr = load_data_iterator(task, "train", epoch_itr.next_epoch_idx, config.max_tokens, config.num_workers) ``` Submission ```= # 使用 Fairseq 官方提供的腳本來做「模型參數平均」,也就是將最近的 5 個 checkpoint 做平均,以提升模型穩定性和 BLEU 分數 checkdir=config.savedir !python ./fairseq/scripts/average_checkpoints.py \ --inputs {checkdir} \ --num-epoch-checkpoints 5 \ --output {checkdir}/avg_last_5_checkpoint.pt ``` Confirm model weights used to generate submission ```= # checkpoint_last.pt : latest epoch # checkpoint_best.pt : highest validation bleu # avg_last_5_checkpoint.pt: the average of last 5 epochs # config.savedir/avg_last_5_checkpoint.pt 載入模型參數 # 但這裡沒帶 optimizer,所以會忽略掉 optimizer 狀態,沒關係,如果只是 inference 沒問題。 try_load_checkpoint(model, name="avg_last_5_checkpoint.pt") validate(model, task, criterion, log_to_wandb=False) None ``` Generate Prediction ```= def generate_prediction(model, task, split="test", 
outfile="./prediction.txt"): task.load_dataset(split=split, epoch=1) itr = load_data_iterator(task, split, 1, config.max_tokens, config.num_workers).next_epoch_itr(shuffle=False) idxs = [] hyps = [] model.eval() progress = tqdm.tqdm(itr, desc=f"prediction") with torch.no_grad(): for i, sample in enumerate(progress): # validation loss sample = utils.move_to_cuda(sample, device=device) # do inference s, h, r = inference_step(sample, model) hyps.extend(h) idxs.extend(list(sample['id'])) # sort based on the order before preprocess hyps = [x for _,x in sorted(zip(idxs,hyps))] with open(outfile, "w") as f: for h in hyps: f.write(h+"\n") ``` ```= generate_prediction(model, task) ``` ```= with open(data_dir + '/' + dataset_name + '/new_test.clean.zh', 'r') as f: references = [line.strip() for line in f] with open('/workspace/prediction.txt', 'r') as f: predictions = [line.strip() for line in f] print(f"Number of reference sentences: {len(references)}") print(f"Number of predicted sentences: {len(predictions)}") ``` ![截圖 2025-05-04 下午4.56.56](https://hackmd.io/_uploads/r1uy2oNelg.png) ```= count = 0 for i, (ref, pred) in enumerate(zip(references, predictions)): if ref != pred: print(f"[{i}] REF: {ref}") print(f"[{i}] PRED: {pred}") print() count += 1 if count >= 10: break ``` ![截圖 2025-05-04 下午4.57.52](https://hackmd.io/_uploads/HJBQhjExxe.png) ```= !pip install sacrebleu import sacrebleu bleu = sacrebleu.corpus_bleu(predictions, [references]) print(f"BLEU score: {bleu.score:.2f}") ```