# ML2021Spring-hw4 Solution Notes

## Problem Description

- Dataset: VoxCeleb1
- Task: speaker identification with a Transformer Encoder
- 69,438 labeled utterances for training; one tenth is split off for validation
- 6,000 unlabeled utterances for testing
- 600 classes (600 speakers)

## Results

![Screenshot_20240207_103458](https://hackmd.io/_uploads/rymD3Dli6.png)

Passed the strong baseline (private: 0.95333, public: 0.95404).

## What did I do?

First, the original network:

```python=
import torch
import torch.nn as nn


class Classifier(nn.Module):
    def __init__(self, d_model=80, n_spks=600, dropout=0.1):
        super().__init__()
        # Project the 40-dim input features into d_model dimensions.
        self.prenet = nn.Linear(40, d_model)
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, dim_feedforward=256, nhead=2
        )
        # self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=2)
        self.pred_layer = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, n_spks),
        )

    def forward(self, mels):
        # out: (batch size, length, d_model)
        out = self.prenet(mels)
        # The encoder layer expects (length, batch size, d_model).
        out = out.permute(1, 0, 2)
        out = self.encoder_layer(out)
        # out: (batch size, length, d_model)
        out = out.transpose(0, 1)
        # mean pooling over the time dimension
        stats = out.mean(dim=1)
        # out: (batch size, n_spks)
        out = self.pred_layer(stats)
        return out
```

The TA's walkthrough hinted that the model might be too complex: try reducing `d_model` and the number of multi-head attention heads, or even simplify the linear layers.

My first attempt went the opposite way instead: I increased `d_model` to 100 and `nhead` to 4, inserted BatchNorm between the linear layers, and stacked the encoder layer twice. The model became:

```python=
class Classifier(nn.Module):
    # TODO: d_model
    def __init__(self, d_model=100, n_spks=600, dropout=0.1):
        super().__init__()
        # Project the dimension of features from that of input into d_model.
        self.prenet = nn.Linear(40, d_model)
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, dim_feedforward=256, nhead=4
        )
        self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=2)
        self.pred_layer = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.BatchNorm1d(d_model),
            nn.ReLU(),
            nn.Linear(d_model, n_spks),
        )

    def forward(self, mels):
        # out: (batch size, length, d_model)
        out = self.prenet(mels)
        # out: (length, batch size, d_model)
        out = out.permute(1, 0, 2)
        # The encoder expects features in the shape of (length, batch size, d_model).
        out = self.encoder(out)
        # out: (batch size, length, d_model)
        out = out.transpose(0, 1)
        # mean pooling
        stats = out.mean(dim=1)
        # out: (batch size, n_spks)
        out = self.pred_layer(stats)
        return out
```

This passed the medium baseline and was already not far from the strong baseline.

Looking at the training metrics, the training accuracy and validation accuracy were close, and the training accuracy had not yet reached 100%, so I figured the model still had room to improve.

So I tweaked the training procedure to load the best checkpoint from the previous run before training starts:

```python=
# load the latest best model
model.load_state_dict(torch.load('model.ckpt'))
print("[Info]: Loaded the trained model.", flush=True)
```

and added an option to decide whether to stop once the last training step is reached:

```python=
if (step + 1) == total_steps:
    yn = input("\nContinue Training? [y/n] ")
    if yn == 'y':
        total_steps += 10000
```

After restarting training from the current best model like this several times, it passed the strong baseline.

***Probably the fastest I have ever gotten a good result.***

Note: this assignment uses learning rate warm-up, so at the very start of a run that reloads the checkpoint, the accuracy is slightly lower than when the previous round saved it. By the end of the run, however, the result is better than the previous round, and this is also more effective than pressing `y` to keep training within the same run.
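For context on that note, here is a minimal sketch of the kind of warm-up schedule it refers to: the learning rate is scaled up linearly for the first steps and then decayed along a cosine curve. This is my own illustration built on `torch.optim.lr_scheduler.LambdaLR`; the helper name `get_warmup_cosine_schedule` and the step counts in the usage comments are assumptions, not the exact scheduler from the homework's sample code.

```python=
import math

from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def get_warmup_cosine_schedule(optimizer, warmup_steps, total_steps):
    # Scale the base learning rate linearly from 0 to 1 during warm-up,
    # then decay it back toward 0 along a half cosine curve.
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    return LambdaLR(optimizer, lr_lambda)


# Hypothetical usage: `model` is the Classifier defined above.
# optimizer = AdamW(model.parameters(), lr=1e-3)
# scheduler = get_warmup_cosine_schedule(optimizer, warmup_steps=1000, total_steps=70000)
# Call scheduler.step() once per training step, right after optimizer.step().
```

Because the schedule restarts from step 0 whenever the notebook is rerun, a run that reloads the checkpoint always goes through the low-learning-rate warm-up phase again, which matches the small accuracy dip described in the note above.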