# ML Final Project Day 1

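CNN, MLP, SVM, GBM, MLP (pixel)

## G1

* Downscale the images to 20 * 20 pixels, though this introduces noise
  * Solution?
  * CNN
* Preprocessing - Features: Rows_cols_sum
  * Yields 20 * 20 pixels
  * The column sums are not very informative; they only show black or white (because the digit 8 is slanted)
* Preprocessing - Features: Convolutional Processing
  * Extracts the features from the image

### Model Method

* Convolutional Layer + ANN (?)
  * Trained on the CNN cp features
  * Two kinds of CNN structures
  * The complex model takes more time but is also more accurate (about a 1% difference)
  * Showed the confusion matrix
  * Bottleneck: continually adding layers eventually stops improving accuracy. Possible fixes (which we did not try):
    * Distort and rotate the images, then train on the processed images
    * Convert the one-dimensional images to color?
    * Direction of flow of the pixel values?
    * Find the image's center of mass (?)
* Run RFC on the CNN features to select features - extract the top 50, 150, 300 to compare
* SVM
  * Effect of the value of C on results & time
* Random Forest on the 28 * 28 images
* Features from the preprocessed 20 * 20 images

### Ensemble Methods: bagging & boosting

* Modern algorithms all reach 96 to 97 accuracy, so we tried this approach to see whether it could do better.
* Different training process: parallel vs. sequential
* Different subsampling (with or without weights)
* Way of ensembling: majority vote? Weighted estimators
* Main purpose: reduce the disagreement between different classifiers' predictions, and so reduce prediction error
* Example: showed model complexity and error rate
* GBM parameters
  * estimator = 50, split 500, depth 8, sqrt, subsample 0.7
  * 28 * 28 with values in 0~1
* Finding: a row classifier can solve the row sum / column sum issue.
* More classifiers, more accuracy
* Adding layers to the CNN model does not guarantee better performance
* 2 references

### Teacher feedback

* No need to list training error; list only validation error
* Difference between a report & homework

## G2

### Preprocessing

* Normalization
* Standardization

### Feature dimensionality reduction (fewer features, faster training)

* PCA: unsupervised
* LDA: supervised - better
* Then training

### Method & Architecture (SVM difference)

* Linear SVM
* SVM with linear kernel
* SVM with RBF kernel - best, and not the most time-consuming (P.S. are linear functions a subset of nonlinear functions?)
* Classifier, accuracy, time taken (difference between discriminative & generative models)
* Naive Bayes: strong assumptions; 70,000 samples (large); an imperfect match may lead to poor results. With dimensionality reduction, the more rigorous feature vectors improve accuracy.
* Logistic Regression: iterative (good results but slow); does dimensionality reduction greatly increase the running time?
* Differences in the effects of PCA vs. LDA dimensionality reduction
* CNN
  * .......
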
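
As a minimal illustration of the two G2 preprocessing steps listed above (normalization and standardization), the sketch below contrasts them on a single feature. It is a pure-Python example written for these notes, not G2's actual code; the function names `normalize` and `standardize` and the sample pixel values are assumptions.

```python
import math


def normalize(values):
    """Min-max normalization: rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: shift to zero mean, scale to unit (population) std."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical intensities of one pixel position across four images.
pixels = [0, 64, 128, 255]
print(normalize(pixels))    # values bounded in [0, 1]
print(standardize(pixels))  # values centered on 0, not bounded
```

Normalization keeps the values bounded, while standardization centers them; which one helps more probably depends on the classifier and on whether PCA or LDA follows.
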
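### Teacher feedback

* Every report needs a story (one that non-experts can follow) and assumptions; ask whether it would make someone willing to hand the project to you.
* Numbers can lie; they are meaningful only with a baseline for comparison. (Present evidence on specific aspects.)
* Start by comparing the simple methods, eliminate some, then compare against the better ones.

## G3

### Preprocessing

* Image binarization and resizing
* Deskewing: corrects handwriting that slants left or right

### Feature extraction

* Histogram of Oriented Gradients descriptor
* Resize to 16 * 16
* Normalization
* HOG

### SVM

* Fastest
* Linear, RBF

### Random Forest

* OpenCV 3.3
* Lowest accuracy

### CNN

* ReLU: neurons going inactive makes it faster
* Sigmoid
* Differences after raising the dropout rate
* Are the effects of tuning the parameters meaningful? (The differences are small.)
* Therefore, use a GAN model to generate fake data to improve CNN training: at first it was just curiosity about how training on realistic-looking fake data would turn out.
* Q: Does it produce features that the real data does not have (?) A: For this dataset it seems fine.
* Q: The GAN adds a noise vector, which can lead to wrong labels at classification time. A: At testing time the fake data is mixed with real data, so these problems may get diluted.
* Q: Where do the labels for the fake data come from? A: Labeled by hand.
* Q: Have you thought about boxing (cropping out) real-world data?
* Accuracy 98
* Is there a way to make fake data that fools the machine?
* data automation????
* The digit sits in the center, so shifting it slightly left or right gives a new image.

### Teacher feedback

* Could explore shifting the data around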
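
The shifting idea raised in the G3 discussion and in the teacher's feedback can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not G3's code: `shift_image`, `augment`, the toy 3 * 5 image, and its label are all hypothetical, and a real pipeline would operate on the 28 * 28 (or resized) digit images.

```python
def shift_image(image, dx, fill=0):
    """Return a copy of `image` shifted horizontally by `dx` pixels.

    Positive dx moves the content right, negative dx moves it left.
    Pixels pushed past the edge are dropped; vacated pixels get `fill`.
    """
    width = len(image[0])
    shifted = []
    for row in image:
        if dx >= 0:
            pad = min(dx, width)
            new_row = [fill] * pad + row[: width - pad]
        else:
            pad = min(-dx, width)
            new_row = row[pad:] + [fill] * pad
        shifted.append(new_row)
    return shifted


def augment(image, label, shifts=(-1, 1)):
    """Return the original sample plus one shifted copy per offset."""
    samples = [(image, label)]
    for dx in shifts:
        samples.append((shift_image(image, dx), label))
    return samples


# Toy 3 * 5 binary "digit" (hypothetical data, label assumed to be 1).
digit = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

for img, lbl in augment(digit, 1):
    print(lbl, img)
```

Each shifted copy keeps the original label, so the training set grows without extra labeling; how large a shift is safe before strokes fall off the edge would need to be checked on the real images.
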