ML2022 HW

Notes on keywords and key points; follow up later with additional resources and practice.

https://speech.ee.ntu.edu.tw/~hylee/ml/2022-spring.php
https://github.com/virginiakm1988/ML2022-Spring
https://www.kaggle.com/competitions/ml2022spring-hw1/overview
https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset

Week 1

  • Function with Unknown Params
  • Define Loss from Training data
    • mean absolute error (MAE)
    • mean square error (MSE)
    • if y (estimate) & y_bar (ground truth) are both probability distributions -> use cross-entropy
    • error surface
  • Optimization
    • gradient descent
    • local minimum vs global minimum
    • local minimum is often a fake issue; other issues matter more in real cases
    • linear model limitation -> model bias
  • Piecewise linear curve
    • Approximate a continuous curve by a piecewise linear curve
    • real curve = constant + sum of a set of f(x); f could be a sigmoid function (hard or soft sigmoid)
  • Optimization of the new model (deep learning)
    • 1 update: parameters updated after seeing 1 batch
    • 1 epoch: all batches seen once
      (N=10000, B=10, 1 epoch=1000 updates; see the sketch after this list)
  • AlexNet(2012)
  • VGG(2014)
  • GoogleNet(2014)
  • Residual Net
  • Overfitting
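
The piecewise-linear idea and the batch/epoch bookkeeping above can be made concrete with a small sketch. This is a hedged illustration on synthetic 1-D data (not course code): the model is a constant plus a sum of soft sigmoids, trained with minibatch gradient descent.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic toy data: N = 10000 samples of a noisy curve.
N, B = 10000, 10
x = torch.linspace(-3, 3, N).unsqueeze(1)
y = torch.sin(x) + 0.1 * torch.randn(N, 1)

# y = b + sum_i c_i * sigmoid(b_i + w_i * x):
# Linear(1, k) -> Sigmoid -> Linear(k, 1) is exactly this sum of soft sigmoids.
model = nn.Sequential(nn.Linear(1, 16), nn.Sigmoid(), nn.Linear(16, 1))

loader = DataLoader(TensorDataset(x, y), batch_size=B, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

for epoch in range(3):            # 1 epoch = every batch seen once
    for xb, yb in loader:         # 1 update per batch
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
# With N = 10000 and B = 10, each epoch performs 1000 updates.
```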

HW1

Task description:

  • One-hot encoding
  • Mean Square Error (MSE)
  • fix random seed (helps reproduce experiment results)
  • train_valid_split (see the sketch after this list)
  • predict (output CSV)
  • optimizer: SGD
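
A minimal sketch of the seed-fixing and train/valid-split steps listed above; the function names and arguments here are illustrative assumptions, not necessarily identical to the sample code.

```python
import numpy as np
import torch
from torch.utils.data import random_split

def same_seed(seed):
    """Fix random seeds so experiment results can be reproduced."""
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

def train_valid_split(data_set, valid_ratio, seed):
    """Randomly split a dataset into training and validation subsets."""
    valid_size = int(valid_ratio * len(data_set))
    train_size = len(data_set) - valid_size
    return random_split(data_set, [train_size, valid_size],
                        generator=torch.Generator().manual_seed(seed))
```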

Improvement ideas (for reference)

  • train/valid dataset split
    • switch to a random split
    • the dataset is small, so split several times to avoid outliers all landing in the valid set (try: k-fold) (first priority)
  • feature selection
    • use pandas to inspect feature distributions
      • how to judge the dataset distribution?
    • analyze the correlation between each feature and the target (day-4 tested_positive) (see the sketch after this list)
    • select the 34 features with higher correlation (corr > 0.5) for training
    • select the 24 features with higher correlation (corr > 0.85) for training
    • results: original model train loss ~1.7, valid loss ~2.2; loss is high and overfitting is obvious.
      Training on the corr > 0.5 features: train loss ~1.1, valid loss ~1.2; both loss and overfitting improve noticeably.
      Narrowing further to the corr > 0.85 features: train loss rises slightly compared to corr > 0.5 while valid loss stays the same, so the train/valid gap shrinks.
    • highly correlated feature selection
      • xgboost
      • only the tested_positive features
  • Loss function
    • try MSE, RMSE, MAE
    • MSE: 15k epochs, train/valid loss=1.113/1.119
    • RMSE: 16k epochs, train/valid loss=1.086/1.217
    • MAE: 16k epochs, train/valid loss=0.83/0.93
    • outliers?
    • the numbers are not aligned across loss functions; pick one of them when evaluating models
    • different loss functions cannot be compared directly, so fix MSE for now when comparing models (see the loss sketch after this list)
  • L1/L2 regularization to avoid overfitting
    • monitor the gap between train loss and valid loss
    • exp1: loss function RMSE, optim: SGD + weight_decay=0.01, train/valid loss=1.07/1.28; the gap shows no clear improvement over the optimizer without weight decay
    • outliers?
    • how to detect outliers / search for "outlier"
  • Network Structure
    • add more layers
    • change the width
  • Hyperparameters
    • tune the batch size
    • lower the learning rate (lower priority)
    • optimizer: Adam (?!!!)
  • Normalization
    • normalize the data using the mean (and std) computed over the full dataset (see the sketch after this list)
    • where in the code should this be changed?
    • batch normalization
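
A sketch of the corr-based feature selection and the normalization idea above, assuming the training data sits in a pandas DataFrame whose target column is named tested_positive; the file path, column name, and threshold are assumptions.

```python
import pandas as pd

train_df = pd.read_csv('covid.train.csv')   # path is an assumption
target = 'tested_positive'                  # assumed name of the day-4 target column

# Keep features whose absolute correlation with the target exceeds a threshold.
corr = train_df.corr()[target].abs()
selected = corr[corr > 0.85].index.drop(target).tolist()
print(f'{len(selected)} features selected:', selected)

x_train = train_df[selected].values
y_train = train_df[target].values

# Normalize with mean/std statistics; the note above suggests computing them
# over the full dataset, then reusing the same statistics for valid/test data.
mean, std = x_train.mean(axis=0), x_train.std(axis=0)
x_train = (x_train - mean) / std
```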
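
For the MSE / RMSE / MAE comparison above, a minimal sketch of the three losses in PyTorch; RMSE is just the square root of MSE, and the numbers are only comparable within the same loss.

```python
import torch
import torch.nn as nn

pred = torch.tensor([1.0, 2.0, 3.0])
truth = torch.tensor([1.5, 2.0, 2.0])

mse = nn.MSELoss()(pred, truth)    # mean((pred - truth)^2), penalizes outliers heavily
rmse = torch.sqrt(mse)             # square root of MSE, same scale as the target
mae = nn.L1Loss()(pred, truth)     # mean(|pred - truth|), less sensitive to outliers
print(mse.item(), rmse.item(), mae.item())
```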

Experiment log
0707

  • Moved from the Colab platform to a desktop machine, GPU Quadro P1000 / CUDA 10.1
  • Installed CUDA, cuDNN, and the GPU build of PyTorch

0708
Simple baseline

  • sample code
  • public score=1.46290/ private score=1.49254

Medium baseline

  • feature extraction
  • 24 features (corr>0.85)
    • public score=0.977/ private score=1.04362
      (this actually already passes the strong baseline!)
  • 4 features (tested_positive from the first four days)
    • public score=1.05383/ private score 1.14928
      (slightly worse)

Strong baseline

  • Optimizer

    • SGD
      • weight_decay = 1e-4 (said to be equivalent to L2 regularization)
      • how should the weight decay value be set?
      • what learning rate suits SGD?
    • Adam
      • no significant improvement
      • learning rate = 1e-4 (below 1e-4, training barely moves)
    • AdamW
      • said to be a newer technique
  • Feature selection

    • fine-tune the feature-selection parameters (there was an extra -1 before)
    • currently filtering with pandas corr()
    • (pending) sklearn.feature_selection.SelectKBest
    • (pending) LightGBM
    • (pending) xgboost
  • Model structure

    • switched to a shallow network (3 layers, including the input and output layers)
    • (input_dim, 128) > LeakyReLU > (128, 1) (see the sketch after this list)
    • a larger hidden layer is not necessarily better; how should the number of hidden units/layers be decided?
  • Batch size

    • < 256: no clear advantage observed
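
A sketch of the shallow network and optimizer settings tried above; the layer sizes, weight_decay, and learning rates follow the notes, while the class name, momentum, and remaining details are assumptions.

```python
import torch
import torch.nn as nn

class ShallowModel(nn.Module):
    """(input_dim, 128) -> LeakyReLU -> (128, 1), as noted above."""
    def __init__(self, input_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.LeakyReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.layers(x).squeeze(1)   # (B, 1) -> (B,)

model = ShallowModel(input_dim=24)         # 24 selected features (corr > 0.85)

# SGD with weight_decay adds an L2 penalty on the weights (hence "equivalent to
# L2 regularization"); lr and momentum here are assumed values.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=1e-4)
# Alternative from the notes: Adam with lr = 1e-4.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```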


References

HW2

Task description:

  • Phoneme recognition and labeling of speech

  • Slice the raw audio into 25 ms frames every 10 ms (frames overlap) and run MFCC feature extraction

  • The mel-frequency cepstrum (MFC) is a representation of the short-term spectrum of audio

  • The first four letters of "cepstrum" are the first four letters of "spectrum" reversed

  • Mel-frequency cepstral coefficients (MFCCs) are the key coefficients that make up an MFC; they are key speech features and can be fed into a model for training

  • MFCC takes into account the human ear's different sensitivity to different frequencies and produces a 39-dimensional feature

  • MFCC extraction takes the raw speech signal as input and outputs MFCC features (see the MFCC sketch after this list)

  • Pre-emphasis is used to emphasize the high-frequency components

  • The Hamming window is used to improve continuity between frames

  • Multi-class classification

  • A phoneme is the smallest unit of sound that distinguishes meaning in a human language

  • Dataset

    • audio files with MFCC already extracted, saved separately as .pt files; after loading, each is an N x 39 tensor
    • training file ID (train_split.txt)
    • training file ID + label (train_labels.txt)
    • testing file ID (test_split.txt)
  • Preprocess

    • read the data listed in train_split.txt and concatenate it
    • the sample code provides a concat_feat function; e.g. with n_frames=3 it takes one frame before and one frame after and concatenates them (see the sketch after this list)
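
The preprocessing bullet above mentions concatenating neighbouring frames; below is a simplified sketch of what a concat_feat-style function plausibly does (an illustrative reimplementation, not the sample code itself).

```python
import torch

def concat_feat(x, concat_n):
    """Concatenate each frame with its neighbours.

    x: (T, 39) tensor of MFCC frames for one utterance.
    concat_n: total number of frames to concatenate (odd); e.g. 3 means
              one frame before + the frame itself + one frame after.
    Returns a (T, concat_n * 39) tensor.
    """
    assert concat_n % 2 == 1
    half = concat_n // 2
    # Pad by repeating the first/last frame so boundary frames also have neighbours.
    padded = torch.cat([x[:1].repeat(half, 1), x, x[-1:].repeat(half, 1)], dim=0)
    shifted = [padded[i:i + x.size(0)] for i in range(concat_n)]
    return torch.cat(shifted, dim=1)

feat = torch.randn(100, 39)          # 100 frames of 39-dim MFCC
print(concat_feat(feat, 3).shape)    # torch.Size([100, 117])
```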
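
For the MFCC background earlier in this task description: the homework data is already preprocessed, but as a rough sketch, a 39-dimensional feature (13 MFCCs plus delta and delta-delta) can be computed with librosa along these lines. The file path is a placeholder, and the settings differ slightly from the pipeline described above (librosa uses a Hann window by default and no pre-emphasis).

```python
import librosa
import numpy as np

# Load a waveform (placeholder path).
y, sr = librosa.load('example.wav', sr=16000)

# 25 ms frames every 10 ms, as described above.
n_fft = int(0.025 * sr)           # 400 samples per frame
hop_length = int(0.010 * sr)      # 160-sample hop

# 13 MFCCs per frame; librosa applies the mel filterbank and DCT internally.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=hop_length)

# First- and second-order deltas give the remaining 26 dims -> 39-dim feature.
delta1 = librosa.feature.delta(mfcc, order=1)
delta2 = librosa.feature.delta(mfcc, order=2)
feat = np.concatenate([mfcc, delta1, delta2], axis=0)   # shape (39, n_frames)
print(feat.shape)
```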

Experiment ideas
Speech recognition needs too much background knowledge to fill in, so skipping it for now (Speech Recognition)
Simple baseline

  • sample code

Medium baseline

  • Model structure
    • Narrower but deeper (Hidden_layer=6, hidden_dim=1024, epoch=10)

      • Train Acc: 0.803183 Loss: 0.597770 | Val Acc: 0.671314 loss: 1.180700
    • Wider but shallower (Hidden_layer=2, hidden_dim=1700, epoch=10)

      • Train Acc: 0.739729 Loss: 0.809540 | Val Acc: 0.678171 loss: 1.038599

      • With the deeper network, training acc rises and loss drops, so the optimization direction is correct,
        but validation loss goes up instead, so there is probably overfitting.
        A very deep network is not needed for results; increasing the width is enough
    • add dropout to try to improve the overfitting (configuration sketched after this list)

    • Dropout(0.25), Hidden_layer=3, Hidden_dim=2048, batch_size=2048, lr=0.001, epoch=30

    • Train Acc: 0.724341 Loss: 0.865447 | Val Acc: 0.726264 loss: 0.873864


    • Overfitting clearly improved, and this passes the medium baseline
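
A sketch of the configuration that passed the medium baseline above (hidden_layers=3, hidden_dim=2048, Dropout(0.25)); the class name, argument names, and the 41-class output are assumptions rather than the exact sample-code definitions.

```python
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, input_dim, output_dim=41,   # output_dim = number of phoneme classes (assumed 41)
                 hidden_layers=3, hidden_dim=2048, dropout=0.25):
        super().__init__()
        layers = [nn.Linear(input_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
        for _ in range(hidden_layers - 1):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
        layers += [nn.Linear(hidden_dim, output_dim)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# e.g. 3 concatenated frames of 39-dim MFCC -> 117-dim input
model = Classifier(input_dim=117)
```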

Strong baseline

  • Model structure
    • Dropout(0.5), Hidden_layer=6, other params kept the same
    • add Batch Normalization (2048)
    • Train Acc: 0.732883 Loss: 0.832678 | Val Acc: 0.747805 loss: 0.789694

    • slight improvement; still a bit short of the strong baseline
    • Dropout(0.25 or 0.75), other params kept the same
    • 0.25 -> Train Acc: 0.864038 Loss: 0.386619 | Val Acc: 0.741175 loss: 0.997037

    • found a problem with the network architecture; changed it to FC -> BN -> ReLU -> Dropout(0.5)
    • Switched the positions of BN & Dropout
    • Train Acc: 0.737737 Loss: 0.817285 | Val Acc: 0.747595 loss: 0.791525

    • The reason: BN should come right after the FC layer, and ReLU should come after BN (one purpose of BN is to keep the inputs to the activation from drifting too far).
      Dropout should also come after BN, otherwise half of the inputs BN computes over would be zero, and the computed mean and std would differ from those of the whole training data.
      In practice, swapping the order made little difference (see the sketch after this list).
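
A minimal sketch of the layer ordering discussed above, with one hidden block arranged as FC -> BN -> ReLU -> Dropout (an illustration, not the sample code).

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """FC -> BatchNorm -> ReLU -> Dropout: BN normalizes the linear output before
    the activation, and Dropout zeros units only after normalization, so BN's
    statistics are not computed over dropped-out (zeroed) inputs."""
    def __init__(self, input_dim, output_dim, dropout=0.5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(input_dim, output_dim),
            nn.BatchNorm1d(output_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.block(x)
```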

References
Speech recognition dataset: LibriSpeech ASR corpus
Mel-frequency cepstrum
MFCC
Homework approach references