# ML2022 HW
Notes on keywords and key points. Further resources and practice will be needed afterwards.
https://speech.ee.ntu.edu.tw/~hylee/ml/2022-spring.php
https://github.com/virginiakm1988/ML2022-Spring
https://www.kaggle.com/competitions/ml2022spring-hw1/overview
https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset
[toc]
## Week 1
- Function with Unknown Params
- Define Loss from Training data
- mean absolute error (MAE)
- mean square error (MSE)
- if y (estimate) and y_bar (ground truth) are both probability distributions, use cross-entropy
- error surface
- Optimization
- gradient descent
- local minimum vs global minimum
- local minima are mostly a fake issue; other issues dominate in real cases
- linear model limitation -> model bias
- Piecewise linear curve
- Approximate a continuous curve by a piecewise linear curve
- real curve = constant + sum of a set of functions f(x), each of which could be a sigmoid (hard or soft)
- Optimization of the new model (deep learning)
- 1 update: parameters are updated once per batch
- 1 epoch: all batches have been seen once
(N=10000, B=10, 1 epoch = 1000 updates)
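The update/epoch arithmetic above can be checked with a few lines. This is a minimal sketch; `updates_per_epoch` is a hypothetical helper, and the `drop_last` flag only mirrors the idea of discarding an incomplete final batch.

```python
import math

def updates_per_epoch(n_examples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of parameter updates needed to see every batch once."""
    if drop_last:
        # An incomplete final batch is discarded.
        return n_examples // batch_size
    # An incomplete final batch still counts as one update.
    return math.ceil(n_examples / batch_size)

print(updates_per_epoch(10000, 10))  # 1000, the case in the note above
print(updates_per_epoch(10000, 32))  # 313, the last batch holds only 16 examples
```

A smaller batch size means more updates per epoch on the same data.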
- AlexNet(2012)
- VGG(2014)
- GoogLeNet(2014)
- Residual Net
- Overfitting
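The three loss definitions listed earlier in this week's notes (MAE, MSE, cross-entropy) can be written out in plain Python. This is a sketch using the notes' naming (y = estimate, y_bar = ground truth); the `eps` clamp is an assumption added to avoid `log(0)`.

```python
import math

def mae(y, y_bar):
    """Mean absolute error between estimates y and ground truth y_bar."""
    return sum(abs(a - b) for a, b in zip(y, y_bar)) / len(y)

def mse(y, y_bar):
    """Mean square error between estimates y and ground truth y_bar."""
    return sum((a - b) ** 2 for a, b in zip(y, y_bar)) / len(y)

def cross_entropy(y, y_bar, eps=1e-12):
    """Cross-entropy when y and y_bar are both probability distributions."""
    # eps is an assumption: it keeps log() finite if a predicted probability is 0.
    return -sum(t * math.log(max(p, eps)) for p, t in zip(y, y_bar))

print(mae([1.0, 2.0, 5.0], [1.0, 2.0, 3.0]))
print(mse([1.0, 2.0, 5.0], [1.0, 2.0, 3.0]))
print(cross_entropy([0.5, 0.25, 0.25], [1.0, 0.0, 0.0]))
```

MSE penalizes the single large error more heavily than MAE, which is one reason the two values cannot be compared directly.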
### HW1
Task description:
- One-hot encoding
- Mean Square Error (MSE)
- fix random seed (helps reproduce experiment results)
- train_valid_split
- predict (output CSV)
- optimizer: SGD
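The seed and train_valid_split items above can be sketched together. This is not the homework's sample code: both function names are hypothetical, only index lists are produced, and a real PyTorch run would also need to seed the framework's own generators. A k-fold variant is included as one way to split several times.

```python
import random

def train_valid_split(n_samples, valid_ratio, seed):
    """Shuffle indices with a fixed seed and split them into train/valid."""
    rng = random.Random(seed)          # local generator, so the split is reproducible
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_valid = int(n_samples * valid_ratio)
    return indices[n_valid:], indices[:n_valid]

def k_fold_indices(n_samples, k, seed):
    """Yield (train, valid) index lists; every sample is in valid exactly once."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

train_idx, valid_idx = train_valid_split(n_samples=100, valid_ratio=0.2, seed=0)
print(len(train_idx), len(valid_idx))
```

Averaging the validation loss over the k folds makes the comparison less sensitive to which rows happen to land in the validation set.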
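Ideas for improvement ([reference](https://github.com/1am9trash/HUNG_YI_LEE_ML_2021/blob/main/hw/hw1/hw1_code.ipynb))
- train/valid dataset split
- switch to a random split
- the dataset is small, so split several times to avoid outliers ending up in the validation set (try: k-fold) (first priority)
- feature selection
- inspect feature distributions with pandas
- how to assess the dataset distribution?
- analyze the correlation between each feature and the target (day4 tested_positive)
- train on the 34 features with higher correlation (corr>0.5)
- train on the 24 features with higher correlation (corr>0.85)
- Result: the original model had train loss~1.7, valid loss~2.2; the loss was high and overfitting was obvious
Training on the more correlated features (corr>0.5) gave train loss~1.1, valid loss~1.2; both loss and overfitting improved clearly
Narrowing further to corr>0.85, train loss rose slightly compared with corr>0.5 while valid loss stayed the same, so the train/valid gap shrank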
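The correlation-based selection described above might look roughly like the following. The notes use pandas `corr()`; this sketch uses NumPy on synthetic data instead, and `select_by_correlation`, the use of the absolute value, and the toy columns are all assumptions for illustration.

```python
import numpy as np

def select_by_correlation(features, target, threshold):
    """Return indices of columns whose |Pearson r| with the target exceeds threshold."""
    selected = []
    for col in range(features.shape[1]):
        r = np.corrcoef(features[:, col], target)[0, 1]
        if abs(r) > threshold:   # absolute value is an assumption; negative correlation also counts
            selected.append(col)
    return selected

# Synthetic stand-in for the real dataset (not the homework data).
rng = np.random.default_rng(0)
target = rng.normal(size=200)
features = np.column_stack([
    target + 0.1 * rng.normal(size=200),  # strongly correlated with the target
    target + 1.0 * rng.normal(size=200),  # moderately correlated
    rng.normal(size=200),                 # unrelated noise
])

print(select_by_correlation(features, target, threshold=0.5))
print(select_by_correlation(features, target, threshold=0.85))
```

Raising the threshold keeps fewer columns, which matches the 34 versus 24 feature counts recorded above.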
- highly correlated features selection
- xgboost
- only test_positive results
- Loss function
- try MSE, RMSE, MAE
- MSE: 15k epoch, train/valid loss=1.113/1.119
- RMSE: 16k epoch, train/valid loss=1.086/1.217
- MAE: 16k epoch, train/valid loss=0.83/0.93
- outlier?
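- the metrics are not aligned; pick one of them when evaluating models
- values from different loss functions cannot be compared, so fix MSE for now when comparing models
- L1/L2 regularization to avoid overfitting
- watch the gap between train loss and valid loss
- exp1: loss function: RMSE, optim: SGD + weight decay=0.01, train/valid loss=1.07/1.28; the gap did not improve significantly compared with the optimizer without weight decay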
- outlier?
- how to detect outliers / search "outlier"
- Network Structure
- add more layers
- change the width
- Hyperparameters
- tune batch size
- lower the learning rate (lower priority)
- optimizer: Adam (?!!!)
- Normalization
- normalize the data using the mean computed over all the data
- where should this be changed??
- batch normalization
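One possible answer to the "where" question is to normalize the feature matrix once, before it is wrapped in a Dataset. This is a sketch under assumptions: the notes mention only the mean, while this version also divides by the standard deviation, and it takes the statistics from the training rows only and reuses them for the validation and test rows. `fit_normalizer` and `normalize` are hypothetical names.

```python
import numpy as np

def fit_normalizer(train):
    """Per-feature mean and standard deviation of the training rows."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0   # constant columns: avoid division by zero
    return mean, std

def normalize(x, mean, std):
    """Apply statistics that were computed elsewhere (typically on train)."""
    return (x - mean) / std

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(1000, 4))   # synthetic data
valid = rng.normal(loc=50.0, scale=10.0, size=(200, 4))

mean, std = fit_normalizer(train)
train_n = normalize(train, mean, std)
valid_n = normalize(valid, mean, std)   # same statistics, no refit
print(train_n.mean(axis=0).round(3), train_n.std(axis=0).round(3))
```

Batch normalization differs in that the statistics are computed per mini-batch inside the network rather than once over the dataset.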
Experiment log
0707
- moved from Colab to a desktop, GPU Quadro P1000 / CUDA 10.1
- installed CUDA, cuDNN, pytorch-gpu
0708
Simple baseline
- sample code
- public score=1.46290/ private score=1.49254
Medium baseline
- feature extraction
- 24 features (corr>0.85)
- public score=0.977/ private score=1.04362
(this actually already passes the strong baseline!)
- 4 features (tested_positive of the previous four days)
- public score=1.05383/ private score=1.14928
(slightly worse)
Strong baseline
- Optimizer
- SGD
- weight_decay = 1e-4 (reportedly equivalent to L2 regularization)
- how should the wd parameter be set?
- what learning rate suits SGD?
- Adam
- no major improvement
- learning rate = 1e-4 (training stalls below 1e-4)
- AdamW
- reportedly a newer technique
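The "weight decay is equivalent to L2 regularization" remark can be checked numerically for plain SGD. The sketch below uses a toy one-parameter loss (an assumption, not the homework model) and compares one step with weight decay added to the gradient against one step on the loss with an explicit penalty of (wd/2)·w².

```python
import math

def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def penalized_loss(w, wd):
    """Toy loss plus an explicit L2 penalty of (wd / 2) * w^2."""
    return loss(w) + 0.5 * wd * w * w

def numeric_grad(f, w, h=1e-6):
    """Central finite-difference estimate of df/dw."""
    return (f(w + h) - f(w - h)) / (2.0 * h)

def sgd_step_weight_decay(w, lr, wd):
    """SGD step where wd * w is added to the gradient of the plain loss."""
    return w - lr * (numeric_grad(loss, w) + wd * w)

def sgd_step_l2(w, lr, wd):
    """SGD step on the explicitly penalized loss, with no separate decay term."""
    return w - lr * numeric_grad(lambda v: penalized_loss(v, wd), w)

w0, lr, wd = 1.0, 0.1, 1e-4
print(sgd_step_weight_decay(w0, lr, wd), sgd_step_l2(w0, lr, wd))
print(math.isclose(sgd_step_weight_decay(w0, lr, wd), sgd_step_l2(w0, lr, wd), abs_tol=1e-6))
```

The two steps match for SGD. With adaptive optimizers such as Adam the penalty gradient is rescaled along with the rest of the gradient, so the two are no longer the same thing, which is the motivation for AdamW's decoupled decay.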
- Feature selection
- fine-tuned the feature selection parameter (it previously had an extra -1)
- currently filtering with pandas corr()
- (pending)sklearn.feature_selection.SelectKBest
- (pending)LightGBM
- (pending)xgboost
- Model structure
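- switched to a shallow network (3 layers, including the input and output layers)
- (input_dim, 128) --> LeakyReLU --> (128, 1)
- a larger hidden layer is not better; how should the hidden layer size be chosen?
- Batch size
- <256: no clear advantage observed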
![](https://i.imgur.com/iP1HsUj.png)
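For reference, the shallow structure recorded above, (input_dim, 128) --> LeakyReLU --> (128, 1), could be written in PyTorch roughly as follows. This is a sketch, not the code actually used: `ShallowRegressor` is a hypothetical name, the learning rate of 1e-3 is a placeholder (the notes leave it open), and the random tensors stand in for the 24 selected features.

```python
import torch
from torch import nn

class ShallowRegressor(nn.Module):
    """(input_dim, 128) -> LeakyReLU -> (128, 1), as listed in the notes."""

    def __init__(self, input_dim, hidden_dim=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        # (batch, 1) -> (batch,) so the output matches the target shape
        return self.layers(x).squeeze(1)

torch.manual_seed(0)                      # fixed seed for reproducibility
model = ShallowRegressor(input_dim=24)    # 24 = number of selected features
criterion = nn.MSELoss()
# lr is a placeholder value; weight_decay=1e-4 is the setting from the notes
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)

x = torch.randn(256, 24)                  # random stand-in for one batch
y = torch.randn(256)
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()                          # one parameter update
print(loss.item())
```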
References
- [Homework approach reference](https://github.com/Joshuaoneheart/ML2022_all_A_plus)
- [Attention is All you need](https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)
## HW2
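Task description:
- phoneme recognition and labeling of speech
- cut the raw audio into 25 ms frames every 10 ms (frames overlap) and extract MFCC features
- the mel-frequency cepstrum (MFC) can be used to represent the spectrum of short-term audio
- the first four letters of "cepstrum" are the first four letters of "spectrum" reversed
- mel-frequency cepstral coefficients (MFCCs) are the set of coefficients that make up the MFC; they are key speech features and can be fed to a model for training
- MFCC accounts for the human ear's varying sensitivity to different frequencies and yields a 39-dimensional feature
- MFCC extraction takes the raw speech signal as input and outputs MFCC features
- pre-emphasis highlights high-frequency features
- the Hamming window increases continuity between frames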
![](https://i.imgur.com/x9lFZjq.png)
![](https://i.imgur.com/mdGh0T4.png)
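The framing described above (a 25 ms frame every 10 ms) implies a simple frame count per utterance. This is a sketch that assumes no padding at the end of the signal; `num_frames` is a hypothetical helper, not part of the homework code.

```python
def num_frames(duration_ms, frame_ms=25, hop_ms=10):
    """Count full frames of frame_ms taken every hop_ms, without padding."""
    if duration_ms < frame_ms:
        return 0
    return 1 + (duration_ms - frame_ms) // hop_ms

print(num_frames(1000))   # 98 frames for one second of audio
print(25 - 10)            # 15 ms of overlap between consecutive frames
```

Each frame then becomes one 39-dimensional MFCC vector, so an utterance of N frames gives an Nx39 matrix.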
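- Multi-class classification
- a phoneme is the smallest unit of sound in human language that can distinguish meaning
- Dataset
- audio files already converted to MFCC and saved as .pt files; each loads as an Nx39 tensor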
- training file ID (train_split.txt)
- training file ID + label (train_labels.txt)
- testing file ID (test_split.txt)
- Preprocess
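- read the data listed in train_split.txt and merge it
- the sample code provides a concat_feat function. For example, n_frames=3 concatenates each frame with one frame before and one frame after it.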
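The neighbor concatenation described above can be illustrated as follows. This is not the sample code's `concat_feat`: `concat_neighbors` is a hypothetical stand-in written with NumPy, and repeating the first and last frame at the borders is an assumption about how edges are handled.

```python
import numpy as np

def concat_neighbors(feats, n_frames):
    """Concatenate each frame with its neighbors: (N, D) -> (N, D * n_frames)."""
    assert n_frames % 2 == 1, "n_frames must be odd so the frame sits in the middle"
    half = n_frames // 2
    # Border handling (assumption): repeat the first/last frame as padding.
    padded = np.concatenate(
        [np.repeat(feats[:1], half, axis=0), feats, np.repeat(feats[-1:], half, axis=0)],
        axis=0,
    )
    # Window i holds, for every frame t, the frame at offset (i - half) from t.
    windows = [padded[i:i + len(feats)] for i in range(n_frames)]
    return np.concatenate(windows, axis=1)

feats = np.arange(10).reshape(5, 2)    # 5 frames with 2 dims, instead of Nx39
out = concat_neighbors(feats, n_frames=3)
print(out.shape)                       # (5, 6)
print(out[2])                          # frames 1, 2, 3 side by side
```

With the real features, n_frames=3 would turn each 39-dimensional frame into a 117-dimensional input.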
Experiment approach
Speech recognition needs too much background knowledge to catch up on... skipping for now [Speech Recognition](/qGC3NzKnT-yoNWe1IauqqQ)
Simple baseline
- sample code
Medium baseline
- Model structure
- Narrower but deeper (Hidden_layer=6, hidden_dim=1024, epoch=10)
- Train Acc: 0.803183 Loss: 0.597770 | Val Acc: 0.671314 loss: 1.180700
- ![](https://i.imgur.com/SUSS4ko.png)
- Wider but shallower (Hidden_layer=2, hidden_dim=1700, epoch=10)
- Train Acc: 0.739729 Loss: 0.809540 | Val Acc: 0.678171 loss: 1.038599
- ![](https://i.imgur.com/hnWrO1K.png)
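With the deeper network, training acc rises and loss falls, so optimization is heading in the right direction
But validation loss goes up instead, so it is probably overfitting
A very deep network is not needed to get results; widening it is enough
- add dropout to try to reduce overfitting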
- Dropout(0.25), Hidden_layer=3, Hidden_dim=2048, batch_size=2048, lr=0.001, epoch=30
- Train Acc: 0.724341 Loss: 0.865447 | Val Acc: 0.726264 loss: 0.873864
- ![](https://i.imgur.com/2aaspyB.png)
Overfitting improved clearly, and this also passes the medium baseline
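As a reminder of what the dropout layer does, here is a minimal NumPy sketch of inverted dropout: units are zeroed with probability p during training and the survivors are scaled by 1/(1-p), while evaluation leaves the input untouched. The function name and arguments are hypothetical.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    if not training or p == 0.0:
        return x                       # evaluation: identity
    keep = rng.random(x.shape) >= p    # True for the units that survive
    return x * keep / (1.0 - p)        # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
x = np.ones(100000)
train_out = dropout(x, p=0.25, training=True, rng=rng)
eval_out = dropout(x, p=0.25, training=False, rng=rng)
print(train_out.mean(), eval_out.mean())
```

Because a different random subset of units is dropped on every batch, the network cannot rely on any single unit, which tends to narrow the train/valid gap.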
Strong baseline
- Model structure
- Dropout(0.5), Hidden_layer=6, other params unchanged
- add Batch normalization(2048)
- Train Acc: 0.732883 Loss: 0.832678 | Val Acc: 0.747805 loss: 0.789694
- ![](https://i.imgur.com/35ELtFS.png)
A slight improvement, still a bit short of the strong baseline
- Dropout(0.25 or 0.75), other params unchanged
- 0.25 -> Train Acc: 0.864038 Loss: 0.386619 | Val Acc: 0.741175 loss: 0.997037
- ![](https://i.imgur.com/yq5yK3o.png)
Found a problem in the network architecture; changed it to FC->BN->ReLU->Dropout(0.5)
- Switch BN & Dropout position
- Train Acc: 0.737737 Loss: 0.817285 | Val Acc: 0.747595 loss: 0.791525
- ![](https://i.imgur.com/soIF7JR.png)
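The reason is that BN should come after FC, and ReLU after BN (one purpose of BN is to keep the activation's input from drifting too far)
Dropout should also come after BN; otherwise half of BN's inputs would be 0, and the mean and standard deviation would differ from those of the whole training data
But in practice... swapping the order made little difference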
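The FC->BN->ReLU->Dropout ordering discussed above could be expressed in PyTorch along these lines. This is a sketch rather than the actual homework model: `fc_block` and `FrameClassifier` are hypothetical names, `n_classes=41` is a placeholder for the number of phoneme labels in the dataset, and the example uses a small hidden size so it runs quickly (the notes used 2048).

```python
import torch
from torch import nn

def fc_block(in_dim, out_dim, p_drop):
    """One block in the order FC -> BN -> ReLU -> Dropout."""
    return nn.Sequential(
        nn.Linear(in_dim, out_dim),
        nn.BatchNorm1d(out_dim),   # normalizes the input of the activation
        nn.ReLU(),
        nn.Dropout(p_drop),        # last, so BN never sees the zeroed units
    )

class FrameClassifier(nn.Module):
    def __init__(self, input_dim, n_classes, hidden_layers, hidden_dim, p_drop=0.5):
        super().__init__()
        dims = [input_dim] + [hidden_dim] * hidden_layers
        blocks = [fc_block(dims[i], dims[i + 1], p_drop) for i in range(hidden_layers)]
        self.net = nn.Sequential(*blocks, nn.Linear(hidden_dim, n_classes))

    def forward(self, x):
        return self.net(x)   # raw logits, one per class

# Small placeholder sizes; input_dim = 39 MFCC dims * 3 concatenated frames.
model = FrameClassifier(input_dim=39 * 3, n_classes=41, hidden_layers=2, hidden_dim=64)
model.eval()                 # disables dropout and uses BN running statistics
with torch.no_grad():
    logits = model(torch.randn(8, 39 * 3))
print(logits.shape)
```

Calling `model.train()` before training and `model.eval()` before validation matters here, since both BN and Dropout behave differently in the two modes.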
References
[Speech recognition dataset - LibriSpeech ASR corpus](https://www.openslr.org/12/)
[Mel-frequency cepstrum](https://zh.wikipedia.org/zh-tw/%E6%A2%85%E7%88%BE%E5%80%92%E9%A0%BB%E8%AD%9C)
[MFCC](https://ithelp.ithome.com.tw/m/articles/10267054)
[Homework approach reference](https://www.bilibili.com/video/BV1Fq4y137YL?spm_id_from=333.999.0.0)