# ML2022 HW

Notes on keywords and key points; find further resources and practice later.

https://speech.ee.ntu.edu.tw/~hylee/ml/2022-spring.php
https://github.com/virginiakm1988/ML2022-Spring
https://www.kaggle.com/competitions/ml2022spring-hw1/overview
https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset

[toc]

## Week 1

- Function with unknown parameters
- Define loss from training data
  - mean absolute error (MAE)
  - mean square error (MSE)
  - if y (estimate) and y_bar (ground truth) are both probability distributions -> use cross-entropy
  - error surface
- Optimization
  - gradient descent
  - local minimum vs. global minimum
    - local minima are a fake issue... other issues matter in real cases
- Linear model limitation -> model bias
- Piecewise linear curve
  - approximate a continuous curve by a piecewise linear curve
  - real curve = constant + sum of a set of f(x); f(x) could be a sigmoid (hard or soft sigmoid)
- Optimization of the new model (deep learning)
  - 1 update: see 1 batch
  - 1 epoch: see all batches once (N=10000, B=10 -> 1 epoch = 1000 updates)
- AlexNet (2012)
- VGG (2014)
- GoogleNet (2014)
- Residual Net
- Overfitting

### HW1

Task description:
- one-hot encoding
- mean square error (MSE)
- fix the random seed (helps reproduce experiment results)
- train_valid_split
- predict (output csv)
- optimizer: SGD

Improvement ideas ([reference](https://github.com/1am9trash/HUNG_YI_LEE_ML_2021/blob/main/hw/hw1/hw1_code.ipynb)):
- train/valid dataset split
  - switch to a random split
  - the dataset is small, so split several times to keep outliers out of valid (try: k-fold) (first priority)
- feature selection
  - inspect feature distributions with pandas
    - how to judge a dataset's distribution?
  - analyze the correlation between each feature and the target (day-4 tested_positive)
    - select the 34 features with corr > 0.5 for training
    - select the 24 features with corr > 0.85 for training
    - results: the original model gets train loss ~1.7, valid loss ~2.2; loss is high and overfitting is obvious. Training on the corr > 0.5 features gives train loss ~1.1, valid loss ~1.2, so both the loss and the overfitting improve markedly. Narrowing further to corr > 0.85 raises train loss slightly versus corr > 0.5 while valid loss stays the same, so the train/valid gap shrinks.
  - highly correlated feature selection
  - xgboost
  - only tested_positive results
- Loss function
  - try MSE, RMSE, MAE
    - MSE: 15k epochs, train/valid loss = 1.113/1.119
    - RMSE: 16k epochs, train/valid loss = 1.086/1.217
    - MAE: 16k epochs, train/valid loss = 0.83/0.93
      - outliers?
  - these numbers are not aligned; pick a single loss when evaluating models
    - losses from different loss functions are not comparable, so fix MSE first to compare models
  - L1/L2 regularization to avoid overfitting
    - watch the gap between train loss and valid loss
    - exp1: loss function RMSE, optim SGD + weight_decay=0.01 -> train/valid loss = 1.07/1.28; the gap shows no clear improvement over the optimizer without weight decay
  - outliers?
    - how to detect outliers / search "outlier"
- Network structure
  - add layers
  - change the width
- Hyperparameters
  - tune the batch size
  - lower the learning rate (lower priority)
  - optimizer: Adam (?!!!)
- Normalization
  - normalize with statistics computed over all the data
    - where in the code should this change go??
  - batch normalization

Experiment log

0707
- migrated from Colab to a desktop, GPU Quadro P1000 / CUDA 10.1
- installed CUDA, cuDNN, pytorch-gpu

0708

Simple baseline
- sample code
  - public score = 1.46290 / private score = 1.49254

Medium baseline
- feature extraction
  - 24 features (corr > 0.85)
    - public score = 0.977 / private score = 1.04362 (this already passes the strong baseline!)
  - 4 features (tested_positive from the four preceding days)
    - public score = 1.05383 / private score = 1.14928 (slightly worse)

Strong baseline
- Optimizer
  - SGD
    - weight_decay = 1e-4 (said to be equivalent to L2 regularization)
    - how should the weight-decay value be set?
    - what learning rate suits SGD?
  - Adam
    - no big improvement
    - learning rate = 1e-4 (below 1e-4 training stalls)
  - AdamW
    - said to be a newer technique
- Feature selection (see the sketch after this list)
  - fine-tuned the feature-selection parameters (previously had a stray -1)
  - currently filtered with pandas corr()
  - (pending) sklearn.feature_selection.SelectKBest
  - (pending) LightGBM
  - (pending) xgboost
- Model structure
  - switched to a shallow network (3 layers, counting the input and output layers)
    - (input_dim, 128) --> LeakyReLU --> (128, 1)
  - a larger hidden layer was not better; how to decide the hidden layer size?
- Batch size
  - < 256 shows no obvious benefit

![](https://i.imgur.com/iP1HsUj.png)
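A minimal sketch of the pandas `corr()` filtering described above, assuming the Kaggle HW1 layout of `covid.train.csv` (an `id` column plus the target `tested_positive` as the last column; the file and column names are assumptions, not verified here):

```python
# Sketch: correlation-based feature selection, assuming the Kaggle HW1 csv layout.
import pandas as pd

df = pd.read_csv('covid.train.csv')
target = df.columns[-1]  # assumed: last column is the target tested_positive

# Absolute Pearson correlation of every column against the target.
corr = df.corr()[target].abs()

# Keep features whose |corr| with the target exceeds the threshold,
# excluding the target itself and the id column (column name assumed).
threshold = 0.85
feat_cols = [c for c in corr.index
             if corr[c] > threshold and c not in (target, 'id')]
print(f'{len(feat_cols)} features selected:', feat_cols)

# Column indices to feed into the Dataset's feature slicing.
feat_idx = [df.columns.get_loc(c) for c in feat_cols]
```

On the weight-decay note above: for plain SGD, `weight_decay=wd` in `torch.optim.SGD` adds `wd * w` to each gradient, which is exactly the gradient of an L2 penalty `(wd/2) * ||w||^2`, so the "equivalent to L2 regularization" claim holds for SGD (though not exactly for Adam, which is what AdamW addresses).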
References
- [Homework walkthrough reference](https://github.com/Joshuaoneheart/ML2022_all_A_plus)
- [Attention Is All You Need](https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)

## HW2

Task description:
- Phoneme recognition: classify and label speech frames
- Cut the raw audio into 25 ms frames, one every 10 ms (so frames overlap), and extract MFCC features
  - the mel-frequency cepstrum (MFC) represents the short-term spectrum of an audio signal
  - "cepstrum" is "spectrum" with its first four letters reversed
  - mel-frequency cepstral coefficients (MFCCs) are the key coefficients that make up the MFC; they are also the key speech features that can be fed into a model for training
  - MFCC accounts for the ear's differing sensitivity to different frequencies and yields a 39-dimensional feature
  - MFCC extraction takes the raw speech signal as input and outputs MFCC features
    - pre-emphasis boosts the high-frequency components
    - a Hamming window improves continuity between frames

![](https://i.imgur.com/x9lFZjq.png)
![](https://i.imgur.com/mdGh0T4.png)

- Multi-class classification
  - a phoneme is the smallest unit of sound that distinguishes meaning in human language
- Dataset
  - audio files with MFCC already extracted, saved as .pt files; each loads as an N x 39 tensor
  - training file IDs (train_split.txt)
  - training file IDs + labels (train_labels.txt)
  - testing file IDs (test_split.txt)
- Preprocess
  - read the data listed in train_split.txt and merge it
  - the sample code provides a concat_feat function; e.g., with n_frames=3 it grabs one frame before and one after and concatenates them with the current frame

Experiment notes

Speech recognition needs too much background knowledge to fill in... skipping that for now.
[Speech Recognition](/qGC3NzKnT-yoNWe1IauqqQ)

Simple baseline
- sample code

Medium baseline
- Model structure
  - narrower but deeper (hidden_layers=6, hidden_dim=1024, epochs=10)
    - Train Acc: 0.803183 Loss: 0.597770 | Val Acc: 0.671314 Loss: 1.180700
    - ![](https://i.imgur.com/SUSS4ko.png)
  - wider but shallower (hidden_layers=2, hidden_dim=1700, epochs=10)
    - Train Acc: 0.739729 Loss: 0.809540 | Val Acc: 0.678171 Loss: 1.038599
    - ![](https://i.imgur.com/hnWrO1K.png)
  - with the deeper network, training accuracy rises and training loss falls, so the optimization direction is right; but validation loss rises instead, which points to overfitting. A very deep network is not needed here; widening alone already helps.
  - add dropout to reduce overfitting
    - Dropout(0.25), hidden_layers=3, hidden_dim=2048, batch_size=2048, lr=0.001, epochs=30
    - Train Acc: 0.724341 Loss: 0.865447 | Val Acc: 0.726264 Loss: 0.873864
    - ![](https://i.imgur.com/2aaspyB.png)
    - overfitting improves clearly, and this passes the medium baseline

Strong baseline
- Model structure
  - Dropout(0.5), hidden_layers=6, other params unchanged
  - add batch normalization (2048)
    - Train Acc: 0.732883 Loss: 0.832678 | Val Acc: 0.747805 Loss: 0.789694
    - ![](https://i.imgur.com/35ELtFS.png)
    - a slight gain, but still short of the strong baseline
  - Dropout(0.25 or 0.75), other params unchanged
    - 0.25 -> Train Acc: 0.864038 Loss: 0.386619 | Val Acc: 0.741175 Loss: 0.997037
    - ![](https://i.imgur.com/yq5yK3o.png)
  - found a problem in the network layout; changed each block to FC -> BN -> ReLU -> Dropout(0.5)
  - swap the BN and Dropout positions
    - Train Acc: 0.737737 Loss: 0.817285 | Val Acc: 0.747595 Loss: 0.791525
    - ![](https://i.imgur.com/soIF7JR.png)
    - rationale: BN should sit right after the FC layer, with ReLU after BN (one purpose of BN is to keep the activation's input from drifting too far). Dropout should also come after BN; otherwise half of BN's inputs are zero, and the batch mean and standard deviation diverge from those of the whole training set. In practice, though... swapping the order made little difference. (A sketch of this block ordering follows at the end of these notes.)

References
- [Speech dataset: LibriSpeech ASR corpus](https://www.openslr.org/12/)
- [Mel-frequency cepstrum](https://zh.wikipedia.org/zh-tw/%E6%A2%85%E7%88%BE%E5%80%92%E9%A0%BB%E8%AD%9C)
- [MFCC](https://ithelp.ithome.com.tw/m/articles/10267054)
- [Homework walkthrough reference](https://www.bilibili.com/video/BV1Fq4y137YL?spm_id_from=333.999.0.0)
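A minimal sketch of the FC -> BN -> ReLU -> Dropout block ordering settled on above. The dimensions follow the experiments (39-dim MFCC times 3 concatenated frames, hidden_dim=2048, hidden_layers=6); the class count n_classes=41 and the class name `PhonemeClassifier` are assumptions, not taken from the sample code:

```python
# Sketch of the hidden-block ordering discussed above: FC -> BN -> ReLU -> Dropout.
# input_dim = 39 MFCC dims x n_frames; n_classes = 41 is an assumed phoneme count.
import torch
import torch.nn as nn

def hidden_block(in_dim, out_dim, p=0.5):
    return nn.Sequential(
        nn.Linear(in_dim, out_dim),
        nn.BatchNorm1d(out_dim),  # normalize the pre-activation, right after FC
        nn.ReLU(),                # activation then sees a centered, scaled input
        nn.Dropout(p),            # after BN, so dropped zeros don't skew batch stats
    )

class PhonemeClassifier(nn.Module):
    def __init__(self, input_dim=39 * 3, hidden_dim=2048, hidden_layers=6, n_classes=41):
        super().__init__()
        blocks = [hidden_block(input_dim, hidden_dim)]
        blocks += [hidden_block(hidden_dim, hidden_dim) for _ in range(hidden_layers - 1)]
        blocks.append(nn.Linear(hidden_dim, n_classes))  # logits; pair with CrossEntropyLoss
        self.net = nn.Sequential(*blocks)

    def forward(self, x):
        return self.net(x)

model = PhonemeClassifier()
print(model(torch.randn(8, 39 * 3)).shape)  # torch.Size([8, 41])
```

Moving `nn.Dropout` in front of `nn.BatchNorm1d` inside `hidden_block` reproduces the "swap BN & Dropout" variant tested above.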