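# Machine Learning Notes - 2

## 4. Auxiliary Techniques (Resampling)

When to use:
1. No model assumptions
2. No closed-form solution
3. Model selection: test-set prediction error
4. Building ensemble learners (bagging, via the bootstrap)
5. Obtaining the standard error of a parameter
6. Tuning hyperparameters (values you set yourself), e.g. learning rate (affects convergence speed)

Drawback: computationally expensive

### 1. Cross-validation

**Ways to estimate test error**

**1. Adjust the training error to account for complexity**, e.g. $C_p$ statistic, AIC, BIC for variable selection

:pushpin: Adding variables never increases the training error, but the added variables may be useless; the criteria above penalize model size so that useless variables are dropped (coefficient set to 0)

**2. Validation-set approach** (one split only)

![](https://i.imgur.com/Ud9QAP2.png)

Drawbacks:
1. **High variance**: there is only one split, so the MSE can swing widely depending on which points land in the validation set
2. **Overestimates test error**: the model is fit on only half of the data, so n is smaller

**3. Leave one out cross validation (LOOCV)** (hold out one point at a time, repeat n times)

![](https://i.imgur.com/4h75hhW.png)

$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}MSE_i,\quad \text{where } MSE_i = (y_i-\hat y_i)^2$$

:pushpin: The result is deterministic, because every point serves as the validation point exactly once

Drawbacks:
1. Fitting n times is slow
2. High variance on some data sets

:low_brightness: LOOCV is sometimes useful, but it typically does not shake up the data enough
:low_brightness: The estimates from the folds are highly correlated, so their average can have high variance

**Special case**: leverage shortcut (the basis of Generalized CV)

$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i-\hat y_i}{1-h_i}\right)^2,\quad h_i \text{ is the leverage}$$

Here $\hat y_i$ comes from a single fit on the full data, so this gives the LOOCV result directly for linear regression, ridge regression & smoothing splines. Generalized CV replaces each $h_i$ with the average leverage $tr(H)/n$.

[LOOCV PPT p13](https://phonchi.github.io/nsysu-math524//static_files/presentations/05_Resampling_Methods.pdf)
[Proof of LOOCV formula](https://stats.stackexchange.com/questions/164223/proof-of-loocv-formula?noredirect=1&lq=1)

**4. K-fold**

==K = 5==

![](https://i.imgur.com/NNBC3Hp.png)

$$CV_{(k)} = \sum_{i=1}^{k}\frac{n_i}{n}MSE_i,\quad \text{where } MSE_i = \frac{\sum_{j\in C_i}(y_j-\hat y_j)^2}{n_i}$$

:pushpin: If $n_i = \frac{n}{k}$, then $CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k}MSE_i$; with $k = n$ this is equivalent to LOOCV
:pushpin: Rule of thumb: K = 5 or 10 is stable

[K-fold example](https://github.com/niancigao/Kaggle_Tabular-Playground-Series---Sep-2021/blob/main/kaggle.2.ipynb)

:::warning
**Bias-variance trade-off**

LOOCV ($k = n$): smallest bias; high variance (the fitted models are nearly identical, differing by only one validation point)

K-fold (each model is trained on $\frac{k-1}{k}$ of the data): larger bias; lower variance

**Classification problems**

$$CV_{(k)} = \sum_{i=1}^{k}\frac{n_i}{n}Err_i,\quad \text{where } Err_i = \sum_{j\in C_i} \frac{I(y_j\neq\hat y_j)}{n_i}$$
:::

:question: Data: $5\times5000$. When selecting $100$ variables, how do we avoid label / target leakage? $\Rightarrow$ Select the 100 variables and fit the model together, inside each fold
:biohazard_sign: The first step (selecting 100 variables) already uses label information, so steps one and two must be done together

[Statistical Learning_1003; 31:30 min](https://www.youtube.com/watch?v=WIfR5gObyCw&list=PLHNZtBNWQ-86lQYdpRp3Xv2Mu2MSCoanV&index=2)

:question: What if the preprocessing depends on the data (e.g. standardization, one-hot encoding)?
:biohazard_sign: Compute it on the training data, then use the parameters from that computation to apply it to the validation and test data

**First fit on the train set, ==m.fit(x.tr,y.tr)==**
**Then transform the validation set, ==m.transform(x.val,y.val)==**

[Statistical Learning_1003; 35:50 min](https://www.youtube.com/watch?v=WIfR5gObyCw&list=PLHNZtBNWQ-86lQYdpRp3Xv2Mu2MSCoanV&index=2)

- GroupKFold: grouped data; drawback: it can learn overly fine-grained detail
- StratifiedKFold: imbalanced data; stratify: keeps the class proportions the same in every fold
- ShuffleSplit: shuffle: you choose the validation / train sizes and the number of repetitions yourself
- TimeSeriesSplit: expanding or sliding window; the validation set is always the future data, so the past predicts the future

[Statistical Learning_1003; 41:00 min](https://www.youtube.com/watch?v=WIfR5gObyCw&list=PLHNZtBNWQ-86lQYdpRp3Xv2Mu2MSCoanV&index=2)

### 2. Bootstrap

1. Flexible
2. Powerful
3. Bootstrap C.I.
4. Estimating the standard error of a parameter

Ideal: population $\to$ repeated sampling from the population
$\downarrow$ estimated by
Reality: one observed data set, treated as the population $\to$ resampling from the estimated population

:pushpin: The data need enough variability (Var large enough) and n large enough; each resample draws n observations with replacement

- **Block Bootstrap**
  Because a **time series, e.g. $y_t = \alpha y_{t-1} + \beta y_{t-2}$**, is not independent $\Rightarrow$ form groups of 3 consecutive observations, sample whole groups, then join them back together

:question: **For test error?**
:biohazard_sign: **No**, because the train and validation sets would share observations. With the bootstrap sample as the training set and the original sample as the validation set, about $\frac{2}{3}$ of the original data (approaching $1 - 1/e \approx 0.632$ as n grows) also appears in the bootstrap sample, so the test error would be underestimated

[on average each bootstrap sample contain roughly two thirds of observations](https://stats.stackexchange.com/questions/88980/why-on-average-does-each-bootstrap-sample-contain-roughly-two-thirds-of-observat)

:question: **For hypothesis testing?**
:biohazard_sign: **No**, because the sample distribution is treated as the population, so the resampling loses its meaning for this purpose
:biohazard_sign: In hypothesis testing we want the null distribution, whereas the bootstrap quantifies the sampling distribution
:biohazard_sign: Use a **permutation test** instead: it produces the null distribution, can be used for hypothesis testing, and is also a resampling method

- **Permutation test**
  It can also be used when a classifier has too few samples: shuffle the labels of the small-n data, which increases the number of (resampled) observations, then use a hypothesis test to check whether the classifier has really separated the target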
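
The permutation test described above can be sketched in a few lines. This is only a minimal illustration using the Python standard library: the function name `permutation_test`, the two toy groups, and the choice of "difference in means" as the test statistic are all assumptions made for the example, not something taken from the lecture.

```python
import random
from statistics import mean


def permutation_test(x, y, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Hypothetical helper for illustration. Returns (observed_diff, p_value).
    """
    rng = random.Random(seed)
    observed = mean(x) - mean(y)
    pooled = list(x) + list(y)
    n_x = len(x)
    extreme = 0
    for _ in range(n_perm):
        # Shuffling the pooled values breaks any link between value and group,
        # which is exactly what the null hypothesis claims.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_x]) - mean(pooled[n_x:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # The +1 counts the observed labelling itself as one permutation.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value


# Toy data (made up): group_a is clearly shifted upward relative to group_b.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3, 5.9]
group_b = [4.2, 4.0, 4.5, 4.8, 4.1, 4.4, 4.6, 4.3]
obs, p = permutation_test(group_a, group_b)
print(f"observed difference = {obs:.3f}, p-value = {p:.4f}")
```

The shuffled differences form the null distribution, and the p-value is the share of shuffles at least as extreme as the observed difference. For a classifier, the same idea would shuffle the labels and recompute the accuracy instead of a difference in means.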
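
Two bootstrap points from this section can be checked numerically: estimating the standard error of a parameter, and the claim that roughly 2/3 of the original observations appear in each bootstrap sample. The sketch below is an illustration under stated assumptions: `bootstrap_se` is a hypothetical helper, the data are simulated from a normal distribution, and the sample mean is used as the parameter because its textbook standard error $s/\sqrt{n}$ gives something to compare against.

```python
import math
import random
from statistics import mean, stdev


def bootstrap_se(data, stat=mean, n_boot=2000, seed=0):
    """Bootstrap standard error of `stat` (hypothetical helper).

    Also returns the average fraction of distinct original observations
    that show up in a bootstrap sample.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    unique_fractions = []
    for _ in range(n_boot):
        # Draw n indices with replacement from the observed sample.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(stat([data[i] for i in idx]))
        unique_fractions.append(len(set(idx)) / n)
    return stdev(estimates), mean(unique_fractions)


# Simulated data (an assumption for the example): 200 draws from N(10, 2^2).
data_rng = random.Random(1)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(200)]

se_boot, frac = bootstrap_se(sample)
se_formula = stdev(sample) / math.sqrt(len(sample))

print(f"bootstrap SE of the mean : {se_boot:.4f}")
print(f"formula SE  s/sqrt(n)    : {se_formula:.4f}")
print(f"avg. fraction of distinct points per resample: {frac:.3f}")
```

The two standard errors should be close, and the fraction should sit near $1 - 1/e \approx 0.632$. That overlap is why the bootstrap sample and the original sample cannot be used directly as a train / validation pair.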