ML2022 HW

Notes on keywords and key points; follow up later with additional resources and practice.

https://speech.ee.ntu.edu.tw/~hylee/ml/2022-spring.php
https://github.com/virginiakm1988/ML2022-Spring
https://www.kaggle.com/competitions/ml2022spring-hw1/overview
https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset

Week 1

  • Function with Unknown Params
  • Define Loss from Training data
    • mean absolute error (MAE)
    • mean square error (MSE)
    • if y (estimate) & y_bar (ground truth) are both probability distributions -> use cross-entropy
    • error surface
  • Optimization
    • gradient descent
    • local minimum vs global minimum
    • local minimum is often a fake issue; other issues matter more in real cases
    • linear model limitation -> model bias
  • Piecewise linear curve
    • Approximate a continuous curve by a piecewise linear curve
    • real curve = constant + sum of a set of f(x); f could be a sigmoid function (hard or soft sigmoid)
  • Optimization of the new model (deep learning)
    • 1 update: parameters updated after seeing 1 batch
    • 1 epoch: all batches seen once
      (N=10000, B=10, 1 epoch=1000 updates; see the sketch after this list)
  • AlexNet(2012)
  • VGG(2014)
  • GoogleNet(2014)
  • Residual Net
  • Overfitting
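
The piecewise-linear idea and the batch/epoch bookkeeping above can be made concrete with a small sketch. This is a hedged illustration on synthetic 1-D data (not course code): the model is a constant plus a sum of soft sigmoids, trained with minibatch gradient descent.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic toy data: N = 10000 samples of a noisy curve.
N, B = 10000, 10
x = torch.linspace(-3, 3, N).unsqueeze(1)
y = torch.sin(x) + 0.1 * torch.randn(N, 1)

# y = b + sum_i c_i * sigmoid(b_i + w_i * x):
# Linear(1, k) -> Sigmoid -> Linear(k, 1) is exactly this sum of soft sigmoids.
model = nn.Sequential(nn.Linear(1, 16), nn.Sigmoid(), nn.Linear(16, 1))

loader = DataLoader(TensorDataset(x, y), batch_size=B, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

for epoch in range(3):            # 1 epoch = every batch seen once
    for xb, yb in loader:         # 1 update per batch
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
# With N = 10000 and B = 10, each epoch performs 1000 updates.
```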

HW1

Task description:

  • One-hot encoding
  • Mean Square Error (MSE)
  • fix random seed (helps reproduce experiment results)
  • train_valid_split (see the sketch after this list)
  • predict (output CSV)
  • optimizer: SGD
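
A minimal sketch of the seed-fixing and train/valid-split steps listed above; the function names and arguments here are illustrative assumptions, not necessarily identical to the sample code.

```python
import numpy as np
import torch
from torch.utils.data import random_split

def same_seed(seed):
    """Fix random seeds so experiment results can be reproduced."""
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

def train_valid_split(data_set, valid_ratio, seed):
    """Randomly split a dataset into training and validation subsets."""
    valid_size = int(valid_ratio * len(data_set))
    train_size = len(data_set) - valid_size
    return random_split(data_set, [train_size, valid_size],
                        generator=torch.Generator().manual_seed(seed))
```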

Improvement ideas (for reference)

  • train/valid dataset split
    • switch to a random split
    • the dataset is small, so split several times to avoid outliers all landing in the valid set (try: k-fold) (first priority)
  • feature selection
    • use pandas to inspect feature distributions
      • how to judge the dataset distribution?
    • analyze the correlation between each feature and the target (day-4 tested_positive) (see the sketch after this list)
    • select the 34 features with higher correlation (corr > 0.5) for training
    • select the 24 features with higher correlation (corr > 0.85) for training
    • results: original model train loss ~1.7, valid loss ~2.2; loss is high and overfitting is obvious.
      Training on the corr > 0.5 features: train loss ~1.1, valid loss ~1.2; both loss and overfitting improve noticeably.
      Narrowing further to the corr > 0.85 features: train loss rises slightly compared to corr > 0.5 while valid loss stays the same, so the train/valid gap shrinks.
    • highly correlated feature selection
      • xgboost
      • only the tested_positive features
  • Loss function
    • try MSE, RMSE, MAE
    • MSE: 15k epochs, train/valid loss=1.113/1.119
    • RMSE: 16k epochs, train/valid loss=1.086/1.217
    • MAE: 16k epochs, train/valid loss=0.83/0.93
    • outliers?
    • the numbers are not aligned across loss functions; pick one of them when evaluating models
    • different loss functions cannot be compared directly, so fix MSE for now when comparing models (see the loss sketch after this list)
  • L1/L2 regularization to avoid overfitting
    • monitor the gap between train loss and valid loss
    • exp1: loss function RMSE, optim: SGD + weight_decay=0.01, train/valid loss=1.07/1.28; the gap shows no clear improvement over the optimizer without weight decay
    • outliers?
    • how to detect outliers / search for "outlier"
  • Network Structure
    • add more layers
    • change the width
  • Hyperparameters
    • tune the batch size
    • lower the learning rate (lower priority)
    • optimizer: Adam (?!!!)
  • Normalization
    • normalize the data using the mean (and std) computed over the full dataset (see the sketch after this list)
    • where in the code should this be changed?
    • batch normalization
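
A sketch of the corr-based feature selection and the normalization idea above, assuming the training data sits in a pandas DataFrame whose target column is named tested_positive; the file path, column name, and threshold are assumptions.

```python
import pandas as pd

train_df = pd.read_csv('covid.train.csv')   # path is an assumption
target = 'tested_positive'                  # assumed name of the day-4 target column

# Keep features whose absolute correlation with the target exceeds a threshold.
corr = train_df.corr()[target].abs()
selected = corr[corr > 0.85].index.drop(target).tolist()
print(f'{len(selected)} features selected:', selected)

x_train = train_df[selected].values
y_train = train_df[target].values

# Normalize with mean/std statistics; the note above suggests computing them
# over the full dataset, then reusing the same statistics for valid/test data.
mean, std = x_train.mean(axis=0), x_train.std(axis=0)
x_train = (x_train - mean) / std
```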
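
For the MSE / RMSE / MAE comparison above, a minimal sketch of the three losses in PyTorch; RMSE is just the square root of MSE, and the numbers are only comparable within the same loss.

```python
import torch
import torch.nn as nn

pred = torch.tensor([1.0, 2.0, 3.0])
truth = torch.tensor([1.5, 2.0, 2.0])

mse = nn.MSELoss()(pred, truth)    # mean((pred - truth)^2), penalizes outliers heavily
rmse = torch.sqrt(mse)             # square root of MSE, same scale as the target
mae = nn.L1Loss()(pred, truth)     # mean(|pred - truth|), less sensitive to outliers
print(mse.item(), rmse.item(), mae.item())
```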

Experiment log
0707

  • Moved from the Colab platform to a desktop machine, GPU Quadro P1000 / CUDA 10.1
  • Installed CUDA, cuDNN, and the GPU build of PyTorch

0708
Simple baseline

  • sample code
  • public score=1.46290/ private score=1.49254

Medium baseline

  • feature extraction
  • 24 features (corr>0.85)
    • public score=0.977/ private score=1.04362
      (this actually already passes the strong baseline!)
  • 4 features (tested_positive from the first four days)
    • public score=1.05383/ private score 1.14928
      (slightly worse)

Strong baseline

  • Optimizer

    • SGD
      • weight_decay = 1e-4 (said to be equivalent to L2 regularization)
      • how should the weight decay value be set?
      • what learning rate suits SGD?
    • Adam
      • no significant improvement
      • learning rate = 1e-4 (below 1e-4, training barely moves)
    • AdamW
      • said to be a newer technique
  • Feature selection

    • fine-tune the feature-selection parameters (there was an extra -1 before)
    • currently filtering with pandas corr()
    • (pending) sklearn.feature_selection.SelectKBest
    • (pending) LightGBM
    • (pending) xgboost
  • Model structure

    • switched to a shallow network (3 layers, including the input and output layers)
    • (input_dim, 128) > LeakyReLU > (128, 1) (see the sketch after this list)
    • a larger hidden layer is not necessarily better; how should the number of hidden units/layers be decided?
  • Batch size

    • < 256: no clear advantage observed
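
A sketch of the shallow network and optimizer settings tried above; the layer sizes, weight_decay, and learning rates follow the notes, while the class name, momentum, and remaining details are assumptions.

```python
import torch
import torch.nn as nn

class ShallowModel(nn.Module):
    """(input_dim, 128) -> LeakyReLU -> (128, 1), as noted above."""
    def __init__(self, input_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.LeakyReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.layers(x).squeeze(1)   # (B, 1) -> (B,)

model = ShallowModel(input_dim=24)         # 24 selected features (corr > 0.85)

# SGD with weight_decay adds an L2 penalty on the weights (hence "equivalent to
# L2 regularization"); lr and momentum here are assumed values.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=1e-4)
# Alternative from the notes: Adam with lr = 1e-4.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```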


References

HW2

Task description:

  • Phoneme recognition and labeling of speech

  • Slice the raw audio into 25 ms frames every 10 ms (frames overlap) and run MFCC feature extraction

  • The mel-frequency cepstrum (MFC) is a representation of the short-term spectrum of audio

  • The first four letters of "cepstrum" are the first four letters of "spectrum" reversed

  • Mel-frequency cepstral coefficients (MFCCs) are the key coefficients that make up an MFC; they are key speech features and can be fed into a model for training

  • MFCC takes into account the human ear's different sensitivity to different frequencies and produces a 39-dimensional feature

  • MFCC extraction takes the raw speech signal as input and outputs MFCC features (see the MFCC sketch after this list)

  • Pre-emphasis is used to emphasize the high-frequency components

  • The Hamming window is used to improve continuity between frames

  • Multi-class classification

  • A phoneme is the smallest unit of sound that distinguishes meaning in a human language

  • Dataset

    • audio files with MFCC already extracted, saved separately as .pt files; after loading, each is an N x 39 tensor
    • training file ID (train_split.txt)
    • training file ID + label (train_labels.txt)
    • testing file ID (test_split.txt)
  • Preprocess

    • read the data listed in train_split.txt and concatenate it
    • the sample code provides a concat_feat function; e.g. with n_frames=3 it takes one frame before and one frame after and concatenates them (see the sketch after this list)
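
The preprocessing bullet above mentions concatenating neighbouring frames; below is a simplified sketch of what a concat_feat-style function plausibly does (an illustrative reimplementation, not the sample code itself).

```python
import torch

def concat_feat(x, concat_n):
    """Concatenate each frame with its neighbours.

    x: (T, 39) tensor of MFCC frames for one utterance.
    concat_n: total number of frames to concatenate (odd); e.g. 3 means
              one frame before + the frame itself + one frame after.
    Returns a (T, concat_n * 39) tensor.
    """
    assert concat_n % 2 == 1
    half = concat_n // 2
    # Pad by repeating the first/last frame so boundary frames also have neighbours.
    padded = torch.cat([x[:1].repeat(half, 1), x, x[-1:].repeat(half, 1)], dim=0)
    shifted = [padded[i:i + x.size(0)] for i in range(concat_n)]
    return torch.cat(shifted, dim=1)

feat = torch.randn(100, 39)          # 100 frames of 39-dim MFCC
print(concat_feat(feat, 3).shape)    # torch.Size([100, 117])
```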
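
For the MFCC background earlier in this task description: the homework data is already preprocessed, but as a rough sketch, a 39-dimensional feature (13 MFCCs plus delta and delta-delta) can be computed with librosa along these lines. The file path is a placeholder, and the settings differ slightly from the pipeline described above (librosa uses a Hann window by default and no pre-emphasis).

```python
import librosa
import numpy as np

# Load a waveform (placeholder path).
y, sr = librosa.load('example.wav', sr=16000)

# 25 ms frames every 10 ms, as described above.
n_fft = int(0.025 * sr)           # 400 samples per frame
hop_length = int(0.010 * sr)      # 160-sample hop

# 13 MFCCs per frame; librosa applies the mel filterbank and DCT internally.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=hop_length)

# First- and second-order deltas give the remaining 26 dims -> 39-dim feature.
delta1 = librosa.feature.delta(mfcc, order=1)
delta2 = librosa.feature.delta(mfcc, order=2)
feat = np.concatenate([mfcc, delta1, delta2], axis=0)   # shape (39, n_frames)
print(feat.shape)
```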

Experiment ideas
Speech recognition needs too much background knowledge to fill in, so skipping it for now (Speech Recognition)
Simple baseline

  • sample code

Medium baseline

  • Model structure
    • Narrower but deeper (Hidden_layer=6, hidden_dim=1024, epoch=10)

      • Train Acc: 0.803183 Loss: 0.597770 | Val Acc: 0.671314 loss: 1.180700
    • Wider but shallower (Hidden_layer=2, hidden_dim=1700, epoch=10)

      • Train Acc: 0.739729 Loss: 0.809540 | Val Acc: 0.678171 loss: 1.038599

      • With the deeper network, training acc rises and loss drops, so the optimization direction is correct,
        but validation loss goes up instead, so there is probably overfitting.
        A very deep network is not needed for results; increasing the width is enough
    • add dropout to try to improve the overfitting (configuration sketched after this list)

    • Dropout(0.25), Hidden_layer=3, Hidden_dim=2048, batch_size=2048, lr=0.001, epoch=30

    • Train Acc: 0.724341 Loss: 0.865447 | Val Acc: 0.726264 loss: 0.873864


    • Overfitting clearly improved, and this passes the medium baseline
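
A sketch of the configuration that passed the medium baseline above (hidden_layers=3, hidden_dim=2048, Dropout(0.25)); the class name, argument names, and the 41-class output are assumptions rather than the exact sample-code definitions.

```python
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, input_dim, output_dim=41,   # output_dim = number of phoneme classes (assumed 41)
                 hidden_layers=3, hidden_dim=2048, dropout=0.25):
        super().__init__()
        layers = [nn.Linear(input_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
        for _ in range(hidden_layers - 1):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
        layers += [nn.Linear(hidden_dim, output_dim)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# e.g. 3 concatenated frames of 39-dim MFCC -> 117-dim input
model = Classifier(input_dim=117)
```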

Strong baseline

  • Model structure
    • Dropout(0.5), Hidden_layer=6, other params kept the same
    • add Batch Normalization (2048)
    • Train Acc: 0.732883 Loss: 0.832678 | Val Acc: 0.747805 loss: 0.789694

    • slight improvement; still a bit short of the strong baseline
    • Dropout(0.25 or 0.75), other params kept the same
    • 0.25 -> Train Acc: 0.864038 Loss: 0.386619 | Val Acc: 0.741175 loss: 0.997037

    • found a problem with the network architecture; changed it to FC -> BN -> ReLU -> Dropout(0.5)
    • Switched the positions of BN & Dropout
    • Train Acc: 0.737737 Loss: 0.817285 | Val Acc: 0.747595 loss: 0.791525

    • The reason: BN should come right after the FC layer, and ReLU should come after BN (one purpose of BN is to keep the inputs to the activation from drifting too far).
      Dropout should also come after BN, otherwise half of the inputs BN computes over would be zero, and the computed mean and std would differ from those of the whole training data.
      In practice, swapping the order made little difference (see the sketch after this list).
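
A minimal sketch of the layer ordering discussed above, with one hidden block arranged as FC -> BN -> ReLU -> Dropout (an illustration, not the sample code).

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """FC -> BatchNorm -> ReLU -> Dropout: BN normalizes the linear output before
    the activation, and Dropout zeros units only after normalization, so BN's
    statistics are not computed over dropped-out (zeroed) inputs."""
    def __init__(self, input_dim, output_dim, dropout=0.5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(input_dim, output_dim),
            nn.BatchNorm1d(output_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.block(x)
```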

References
Speech recognition dataset: LibriSpeech ASR corpus
Mel-frequency cepstrum
MFCC
Homework approach references