Machine Learning & Security

# Machine Learning & Security

- [**Slides (basic version)**](https://drive.google.com/file/d/1rFFmOF7BfIU0KCiiEnDHQA1Mts3evXIj/view)
- [**Slides (advanced version)**](https://drive.google.com/drive/folders/1J7WuuxGKE0oVle5WU7cnXvCq9eQfkkAl)
- [**Interpretability**](https://drive.google.com/drive/folders/1J7WuuxGKE0oVle5WU7cnXvCq9eQfkkAl)
- [**AutoML**](https://drive.google.com/drive/folders/1J7WuuxGKE0oVle5WU7cnXvCq9eQfkkAl)
- [**HackMD**](https://hackmd.io/JvssQTwyRIej5zq6JEUCVQ)

**Instructor**: 奧義智慧科技 **陳仲寬**
[**Facebook: 陳仲寬**](https://www.facebook.com/Bletchley13)
**Gmail:** bletchley13@gmail.com

---

[TOC]

## Recommended books

[大演算](https://www.books.com.tw/products/0010722761)
[Machine Learning & Security](https://www.books.com.tw/products/F014109393)

## Docker init

Think of Docker as a simple, lightweight VM.
[docker hub](https://hub.docker.com/)

Pull the mlsec Docker image:
```
sudo docker pull bletchley/mlsec:taiwanno1
```
With Ubuntu set up and Docker installed, start the container (the image is about 6 GB):
```
sudo docker run -p 8888:8888 -it bletchley/mlsec:taiwanno1 bash
```
Once the download finishes, run the following inside the container:
```
jupyter notebook --allow-root
```
Paste the URL that contains the token into your browser, after changing the host part from
http://(`container_id` or `127.0.0.1`):8888 to http://127.0.0.1:8888

To check which containers Docker is currently running:
```
sudo docker ps
```

---

## Goals

- Understand the fundamental concepts of machine learning
- Get to know the different schools of machine learning
- Work through a detailed example of network traffic analysis
- Survey recent research on machine learning and traffic analysis

## Machine Learning

### Definition

```
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
- Mitchell, 1997
```

### Data Engineering phase

How to use E (Experience) to improve P (Performance).

![Workflow](https://i.imgur.com/M9TUXtn.png)

- Data mining: find the data you need
- Data exploration: take a first pass over the data and filter out unreasonable records
- Feature generation: extract features from the remaining data
- Feature selection: once you understand the data, pick out the good features
- Cross validation: split the data into several parts and compare the results across them, which helps avoid overfitting
- Training data
- Test data

### Model Training phase

![Workflow](https://i.imgur.com/AqJocsT.png)

- Model selection
- Training data
- Model training
- Model tuning
- Resulting model

### Model Validation phase

![Workflow](https://i.imgur.com/2N6wicJ.png)

- Test data
- Ground truth (the expected answers)
- Resulting model
- Evaluate

## Model

Think of a model as an optimized function: given an input, it returns the best answer it can.

### Any machine learning algorithm consists of

- A model family/representation, which describes the universe of models from which we can choose
- A loss function, which allows us to compare different models quantitatively
- An optimization procedure, which allows us to find or improve the model according to the loss function

### Cross-validation

The data is split into **Test Data** and **Training Data**. If the two are not chosen well, the ML results will be poor.

The cross-validation used here is **k-fold validation**: the training set is split into k subsamples; one subsample is held out to validate the model and the other k - 1 are used for training. Every subsample is used for validation exactly once, and the results are averaged into a single value. This suits cases where data is scarce.

```
- Not enough data
- K parts, K-1 for training, 1 for testing
- Rotate every round
- Average performance
```

![](https://i.imgur.com/qNk5Dmx.png)

### Train, validate, test

Split the data directly into training, validation, and test sets, then train.

```
- Enough data
- E.g. 60% for training, 20% for validation, 20% for testing
- Validate: find the best model
```

### Out-of-time validation

The data depends on time, so you cannot know all of it up front.

> For example, Twitter receives new data every day.

There are two ways to train:

1. Train the model first, then split the incoming data directly into validation and test sets.
2. Train the model first, accumulate data for a while (for example, collect three days' worth), and only then test.

![](https://i.imgur.com/Qa7Pp5g.png)

## Evaluate Performance

Every ML system produces different counts of the following under different threshold settings:

- **False positive**
- **False negative**
- **True positive**
- **True negative**

:::success
One way to think about it: pick a condition, call the case where it holds the **positive class** and the case where it does not the **negative class**.
A prediction is **True** when it matches reality and **False** otherwise.
The Google Developers [documentation](https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative) explains this in more detail.
![](https://i.imgur.com/dMK2st2.png)
:::

The two False counts and the two True counts usually trade off against each other: as one rises, the other falls. Comparing these numbers lets us judge how good an ML system is. Common methods:

- Accuracy
- F-score
- Confusion matrix
- ROC/AUC

### Accuracy

![](https://i.imgur.com/4Z9ALx9.png)

### F-score

The general form of the F-score is

$F_\beta=\dfrac{(1 + \beta ^ 2)\ precision \times recall}{\beta ^ 2\ precision + recall}$

The F-score combines **recall** and **precision**, which pull against each other, into a single number. The closer it is to 1, the better the model.
The F1-score is the F-score with $\beta=1$.
Recall and precision are computed as follows:

**Recall=**$\dfrac{true\ positive}{true\ positive+false\ negative}$

**Precision=**$\dfrac{true\ positive}{true\ positive+false\ positive}$

![](https://i.imgur.com/js7vrpO.png)

### Confusion matrix

The aim is to raise the share of correct results and lower the share of errors. The two pull against each other, and the parameters can be tuned to fit the situation: in some special cases a false alarm is preferable to a missed detection.
Visualizing the data makes the correct and incorrect rates easy to read.

- heatmap

![heatmap](https://i.imgur.com/yRirHXb.png)

### ROC & AUC

- **ROC**: receiver operating characteristic curve
- **AUC**: area under the (ROC) curve

![](https://i.imgur.com/EH3ZEqC.png)

The ROC curve plots the **true positive** rate against the **false positive** rate as the threshold changes.
The larger the AUC, the better the model performs overall.

![](https://i.imgur.com/KGM4TM0.png)

## ML Models

![](https://i.imgur.com/XQh0R5g.png)

## Lab 0-1

[Exercise 1 /mlsec/frauddetect/logistic-regression-fraud-detection.ipynb](http://127.0.0.1:8888/notebooks/mlsec/frauddetect/logistic-regression-fraud-detection.ipynb)

One hot encoding
```python=
import pandas as pd
# df is the DataFrame loaded earlier in the notebook
df = pd.get_dummies(df, columns = ["paymentMethod"])
df.sample(3)
```
Training/testing set
```python=
from sklearn.model_selection import train_test_split
# Hold out 33% of the rows for testing; 'label' is the target column
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('label', axis=1), df['label'],
    test_size=0.33, random_state=17)
```
Model
```python=
from sklearn.linear_model import LogisticRegression
# Initialize and train classifier model
clf = LogisticRegression().fit(X_train, y_train)
# Make predictions on test set
y_pred = clf.predict(X_test)
```

- What you should understand
    - Basic usage of the pandas DataFrame
    - Basic usage of sklearn
    - A standard procedure for ML
    - One-hot encoding

For some data, arithmetic on the values is meaningless. Such data can be **one-hot encoded**.

> Example: a student's class number is important information, but the numeric distance between two class numbers means nothing.
> So we represent it as boolean-like columns, which makes adding and subtracting them meaningful.

## Machine learning 101

![](https://i.imgur.com/dOt1sLu.png)

- Types of machine learning use cases:
    - Regression
        - [Reference](https://brohrer.mcknote.com/zh-Hant/how_machine_learning_works/how_linear_regression_works.html)
    - Classification
        - [Reference](https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623)
    - Anomaly detection
        - [Reference](https://medium.com/@cyeninesky3/oneclass-svm-%E7%95%B0%E5%B8%B8%E6%AA%A2%E6%B8%AC%E4%BB%BB%E5%8B%99-anomaly-detection-%E7%9A%84%E7%AE%97%E6%B3%95%E7%90%86%E8%A7%A3%E8%88%87%E5%AF%A6%E8%B8%90-cf5f0bbb01c0)
    - Recommendation [web](https://www.infoq.com/presentations/relevance-recommendation-system/)

> Regression
![](https://i.imgur.com/pySHl7M.png)

> Classification
![](https://i.imgur.com/jdOejuo.png)

## Anomaly Detection

To detect anomalies, observe the period, frequency, and regularity with which the data changes. A value that breaks the pattern gets flagged. How large a deviation must be before it is flagged has to be tuned case by case.

![](https://i.imgur.com/ZPoqxh1.png)

This works more like raising a warning about unusual situations. It cannot tell whether the situation is actually good or bad.

**Method**

- Outliers vs. novelties
    - novelties: unobserved pattern in new observations not included in training data
- Simple statistics/forecasting methods
    - Exponential smoothing, Holt-Winters algorithm
- Machine learning methods
    - Elliptical envelope, density-based, clustering, SVM
    - ML similarity: use the historical data at hand to judge whether the current data is anomalous

![](https://i.imgur.com/Ma1cM7l.png)

**In the right-hand plot, the points marked in red are the exceptions.**

## Problems in ML

### Overfitting

- Learning too much
- When a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
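The overfitting definition above can be made concrete with a minimal sketch. Everything here is an assumption for illustration: the synthetic one-dimensional data, the 20% label noise, and the two toy models (`fit_threshold`, a single learned cut point, and `predict_1nn`, which simply memorizes the training set). None of it comes from the course notebooks.

```python
import random

def make_data(n, rng):
    """x is uniform in [0, 1); the true label is x > 0.5, flipped 20% of the time."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:  # label noise that a model should NOT learn
            label = 1 - label
        data.append((x, label))
    return data

def accuracy(predict, data):
    """Fraction of (x, y) pairs for which predict(x) equals y."""
    return sum(predict(x) == y for x, y in data) / len(data)

def fit_threshold(train):
    """Simple model: the single cut point with the best training accuracy."""
    candidates = [i / 20 for i in range(21)]
    return max(candidates, key=lambda t: accuracy(lambda x: int(x > t), train))

def predict_1nn(train, x):
    """Memorizing model: copy the label of the nearest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

rng = random.Random(0)
train, valid = make_data(200, rng), make_data(200, rng)
cut = fit_threshold(train)

models = {
    "threshold": lambda x: int(x > cut),
    "1-NN": lambda x: predict_1nn(train, x),
}
results = {}
for name, predict in models.items():
    results[name] = (accuracy(predict, train), accuracy(predict, valid))
    print(f"{name:9s} train={results[name][0]:.2f} valid={results[name][1]:.2f}")
```

The memorizing model reproduces every training label, noise included, so its training score is perfect by construction. Comparing that score with its score on unseen data is how the overfitting gap shows up.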
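**How to solve?**

![](https://i.imgur.com/gr1pg2d.png)

Split the training data into two parts:
**Training Set**
**Validate Set**
Only the Training Set is used for training. The Validate Set is used for validation.

[Reference](https://medium.com/%E9%9B%9E%E9%9B%9E%E8%88%87%E5%85%94%E5%85%94%E7%9A%84%E5%B7%A5%E7%A8%8B%E4%B8%96%E7%95%8C/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-ml-note-overfitting-%E9%81%8E%E5%BA%A6%E5%AD%B8%E7%BF%92-6196902481bb)

### Underfitting

- Learning too little
- An underfit model is not suitable, and this is easy to spot because it performs poorly even on the training data
- A model that can neither model the training data nor generalize to new data

[Reference](https://medium.com/@ken90242/machine-learning%E5%AD%B8%E7%BF%92%E6%97%A5%E8%A8%98-coursera%E7%AF%87-week-6-2-diagnosing-5a9e43db4593)

## School of ML

- Classified by algorithm
    - Symbolists
    - Analogizers
    - Bayesians
    - Evolutionaries
    - Connectionists

> The schools share many properties with one another.

# Symbolists

> All intelligence can be reduced to symbols, and every answer can be found through logical operations.

- The number of rules is too large to be practical, but the idea lives on => mine the rules automatically

![](https://i.imgur.com/ipEQAIL.png)

## Rule mining

Group items into sets and check whether they are associated by computing each set's support. Once every set has been scored, compute the association between the items.

![](https://i.imgur.com/EhrIuNq.png)

We could traverse every candidate, but that is very slow => pruning

![](https://i.imgur.com/HRX8BhM.png)
![](https://i.imgur.com/ayGyS3f.png)

If a set higher up is already known to have less support than the others, nothing beneath it needs to be evaluated.

## Building Decision Tree

- Ask a question at each node, then split the data into two groups

## Entropy

[Entropy (information theory), Wikipedia](https://zh.wikipedia.org/wiki/%E7%86%B5_(%E4%BF%A1%E6%81%AF%E8%AE%BA))

Measure the disorder (entropy) of the data, borrowing the concept from thermodynamics.

![](https://i.imgur.com/f4IYPpX.png)

$Entropy(S) \equiv -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus$

Compute the information gain to pick the better split: a split on attribute A is good if it lowers the entropy of S.
$S_v$ is the subset of S in which attribute A has value v.
Gain(S, A) = the original entropy minus the weighted sum of the entropies after splitting on A.

$Gain(S,A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{\left | S_v \right |}{\left | S \right |} Entropy(S_v)$

![](https://i.imgur.com/1LFsrfo.png)

[Animated walkthrough (splitting the data)](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)

![](https://i.imgur.com/wOxHcck.png)

The green region at the top and the blue region at the bottom are fairly concentrated, while the data at the lower left is more mixed, so the space is cut into three regions.

**Cutting too deep may cause overfitting.**

![](https://i.imgur.com/ur2ezKZ.png)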
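The entropy and information-gain formulas from this section can be checked with a short sketch. The four-row dataset and its attribute names (`packed`, `signed`, `label`) are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum(p * log2(p)) over the classes present in S."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain(S, A) = Entropy(S) - sum(|S_v| / |S| * Entropy(S_v)) over values v of A."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical toy samples: two malware files, two benign files.
rows = [
    {"packed": "yes", "signed": "no",  "label": "malware"},
    {"packed": "yes", "signed": "no",  "label": "malware"},
    {"packed": "no",  "signed": "no",  "label": "benign"},
    {"packed": "no",  "signed": "yes", "label": "benign"},
]

base_entropy = entropy([r["label"] for r in rows])
gain_packed = information_gain(rows, "packed", "label")
gain_signed = information_gain(rows, "signed", "label")
print(f"Entropy(S)      = {base_entropy:.3f}")
print(f"Gain(S, packed) = {gain_packed:.3f}")
print(f"Gain(S, signed) = {gain_signed:.3f}")
```

In this toy data `packed` separates the two classes completely, so its gain equals the full entropy of the set, while `signed` leaves a mixed subset behind. A tree builder would therefore split on `packed` first.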
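The accuracy gap in the figure above is fairly large, probably because the tree was cut too deep and overfit.
Aim for a model that generalizes, one that holds wherever it is applied.

## Lab 1-2

[Exercise 2 /mlsec/malware/malware-classification.ipynb](http://127.0.0.1:8888/notebooks/mlsec/malware/malware-classification.ipynb)

- What to do?
    - Use a Decision Tree to detect malware
    - Play with tree-based algorithms
    - Which algorithm is best, and why?

Executable files on Windows use the [PE](https://zh.wikipedia.org/wiki/%E5%8F%AF%E7%A7%BB%E6%A4%8D%E5%8F%AF%E6%89%A7%E8%A1%8C) format.
Executable files on Linux use the [ELF](https://zh.wikipedia.org/wiki/%E5%8F%AF%E5%9F%B7%E8%A1%8C%E8%88%87%E5%8F%AF%E9%8F%88%E6%8E%A5%E6%A0%BC%E5%BC%8F) format.

![](https://i.imgur.com/v9AmoXS.png)

The last column, **legitimate**, is the most important one: it is the column we care about.

## Lab 1-3

[Exercise 3 /mlsec/network/nsl-kdd-classification.ipynb](http://127.0.0.1:8888/notebooks/mlsec/network/nsl-kdd-classification.ipynb)

## Challenges of Symbolists

- Knowledge Acquisition
    - Set of Rules
- Limitation
    - Difficult to address uncertainty

## Supplement: Cyber Grand Challenge

Build a machine that can attack and defend automatically.

# Analogizers

The key to learning is recognizing how situations resemble one another, and from that inferring what carries over to other situations. Judgments are based on **distance** and **similarity**.

**The higher the dimensionality, the more likely overfitting becomes.**

## Regression

> regression = finding relationships between variables

![](https://i.imgur.com/RYr2LwL.png)

Find a line that minimizes the total distance from all points to the line. Variants of different degree follow.

### Linear

![](https://i.imgur.com/92wjdea.png)

### Polynomial

![](https://i.imgur.com/sWFIKfp.png)

## Model optimization

### Gradient descent

(The most common method for regression.)
Follow the slope downhill step by step until the lowest point is reached, so that **the total ERROR over all points is minimal**.

![](https://i.imgur.com/oHVefrx.png)

If the error decreases steadily, the chosen values of **alpha** and **iters** are good.

> alpha: learning rate
> iters: number of iterations
> [A good site for practicing with the learning rate :\)](https://developers.google.com/machine-learning/crash-course/training-and-test-sets/playground-exercise)

### Gradient descent

![](https://i.imgur.com/6CPOzE6.png)

## Lab 2-1

[Exercise 4 /mlsec/intro/00-linear-regression.ipynb](http://127.0.0.1:8888/notebooks/mlsec/intro/00-linear-regression.ipynb)

- What to do
    - Understand the gradient descent algorithm
    - Adjust alpha/learning rate

## Learning Rate

![](https://i.imgur.com/i4xODBt.png)

The size of the learning rate affects how stable the model's accuracy is.

- Learning rate too large
  The model knows which direction to adjust in, but it never converges.
- Learning rate too small
  Each adjustment is tiny, so tuning takes too long.

The learning rate can be adjusted each time to suit the situation.

<img style="-webkit-user-select: none;margin: auto;cursor: zoom-in;" src="http://1.bp.blogspot.com/-K_X-yud8nj8/VPmIBxwGlsI/AAAAAAAACC0/JS-h1fa09EQ/s1600/Saddle%2BPoint%2B-%2BImgur.gif" width="542" height="419">

## Lab 2-2

[Exercise 5 /mlsec/intro/01-logistic-regression.ipynb](http://127.0.0.1:8888/notebooks/mlsec/intro/01-logistic-regression.ipynb)

- What to do
    - Make a sigmoid function

## Decision Boundary

### K-nearest neighbor classifier (KNN)

![](https://i.imgur.com/8C6YPfz.png)

**Exercise**
[Exercise /mlsec/network/nsl-kdd-classification.ipynb](http://127.0.0.1:8888/notebooks/mlsec/network/nsl-kdd-classification.ipynb)

### K-means

![](https://i.imgur.com/xU2tAnd.png)

### Lab 2-3 K-means clustering

[Exercise /mlsec/intro/04-kmeans-pca.ipynb](http://127.0.0.1:8888/notebooks/mlsec/intro/04-kmeans-pca.ipynb)

### Support vector machines (SVM)

Project the data from a low-dimensional space into a higher-dimensional one, then separate it there with a plane.

> [Introductory video](https://www.youtube.com/watch?v=3liCbRZPrZA)

{%youtube 3liCbRZPrZA %}

### The Curse of Dimensionality

The lower the dimensionality, the better: in a high-dimensional space, all points are far from one another.

![](https://i.imgur.com/CaPzkIt.png)

# Bayesians

Learning is a form of probabilistic inference.
A good hypothesis keeps being updated => it keeps getting better

- Update the credibility of the **hypothesis** according to **time** and **evidence**

Beliefs are adjusted through experience. With enough evidence and enough time, the best answer can be found.

![](https://i.imgur.com/Uv9rsFj.png)

Updating the probability

![](https://i.imgur.com/9kUYK1s.png)

**The prior probability matters.**

![](https://i.imgur.com/ESO5XPv.png)

A good hypothesis keeps being updated and keeps getting better.

> Why Bayesians?
> People often make choices based on probability.
> For example, when the chance of rain is high, people are more likely to carry an umbrella.
> So Bayes' theorem can be used for machine learning.

The same machine produces different results under different environmental conditions
=> **The prior probability matters** (it is important, so it is said twice).
Without a good prior, the posterior may mislead you.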
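To make the role of the prior concrete, here is a minimal sketch of a single Bayes update for a binary hypothesis. The detector and its rates (fires on 99% of malicious events and on 1% of benign ones) are hypothetical numbers chosen for illustration.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """P(H | E) by Bayes' theorem for a binary hypothesis H."""
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1.0 - prior)
    return numerator / evidence

# Hypothetical detector: P(alert | malicious) = 0.99, P(alert | benign) = 0.01.
P_ALERT_MALICIOUS = 0.99
P_ALERT_BENIGN = 0.01

# Same evidence (one alert), three different priors for "this event is malicious".
posteriors = {}
for prior in (0.5, 0.01, 0.0001):
    posteriors[prior] = posterior(prior, P_ALERT_MALICIOUS, P_ALERT_BENIGN)
    print(f"prior={prior:<7} posterior={posteriors[prior]:.4f}")

# Updating again: yesterday's posterior becomes today's prior.
after_two_alerts = posterior(posteriors[0.0001], P_ALERT_MALICIOUS, P_ALERT_BENIGN)
print(f"after a second alert: {after_two_alerts:.4f}")
```

The evidence is identical in all three runs, yet the conclusion ranges from near certainty to about one percent depending on the prior. The last line shows the "keep updating" idea: each new piece of evidence starts from the belief left by the previous one.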
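(For now, the prior can only come from **experience** or from **guessing**.)

## Lab 3-1

[Exercise 6 /mlsec/spam/spam-fighting-blacklist.ipynb](http://127.0.0.1:8888/notebooks/mlsec/spam/spam-fighting-blacklist.ipynb)

- What to do
    - Implement a blacklist mechanism
    - Implement a Bayes detector

## Markov chain

Turn the Bayesian results into a graph.

![](https://i.imgur.com/yQtocdP.png)

Multiply the matrices.

![](https://i.imgur.com/ov31P4W.png)

## Hidden Markov

Some of the data can be observed and some cannot. The unobservable parts may also influence one another.

![](https://i.imgur.com/5auRCDr.png)

## Monte Carlo Tree Search

[wikipedia](https://zh.wikipedia.org/wiki/%E8%92%99%E7%89%B9%E5%8D%A1%E6%B4%9B%E6%A0%91%E6%90%9C%E7%B4%A2)

An algorithm used by AlphaGo (board games).

![](https://i.imgur.com/jf9I2mi.png)

### Supplement: Google PageRank

An algorithm that ranks web pages by the number and quality of the hyperlinks pointing to them.

![](https://i.imgur.com/vhGcJ1o.png)

# Evolutionaries

- Survival of the fittest
- Change and Select
- Fitness
- Mutate
- Cross-over

Start from a randomly chosen population, use fitness to decide who survives to breed the next generation, apply mutation, and so on.

![](https://i.imgur.com/tEsf1z8.png)

The most important steps are Crossover and Mutation.

![](https://i.imgur.com/NruUQXw.png)

## Genetic algorithm (GA)

![](https://i.imgur.com/hDbLGbd.png)

- Encoding
  Each chromosome is made up of many genes, which can be encoded as 0/1 bits or as real numbers.
- Generate the initial population
- Compute the fitness function
  For each chromosome, the fitness function evaluates how good the solution is and decides whether to keep it.
- Selection and reproduction
  Sort the chromosomes by fitness from highest to lowest (best to worst) and copy a fixed percentage of the best ones.
- Crossover
  From the chromosomes kept by selection and reproduction, pick two and cross them over.
- Mutation
  Set a mutation rate. A small fraction of the chromosomes have genes swapped.

# Connectionists

Imitate the human brain: design a network architecture in which many neurons pass information to one another.

- Softmax classification

![](https://i.imgur.com/A5L7D2n.png)

Every pixel is connected to every node.
**Push large values as close to 1 as possible and small values as close to 0 as possible.**

![](https://i.imgur.com/UBoscRZ.png)

Multiply by the matrix, then add the bias.

![](https://i.imgur.com/1WaOkOv.png)

The entropy should be as low as possible.

![](https://i.imgur.com/d257M4k.png)

{%youtube LeAacAzd6oY %}

![](https://imgur.com/wIvsKSc.gif)

> [name=鄭皓玶] Converted to a GIF

## Activation functions

### Sigmoid

![](https://i.imgur.com/LTUnmaj.png)
![](https://i.imgur.com/fhjHkOd.png)

The figure shows that accuracy rises slowly with sigmoid.

### ReLU (Rectified Linear Unit)

An activation function takes the weighted sum of all inputs from the previous layer, produces an output value (usually nonlinear), and passes it to the next layer. ReLU was adopted mainly to mitigate the vanishing-gradient problem. It outputs 0 for inputs below zero and rises linearly for inputs above zero.

![](https://i.imgur.com/z5LhlFS.png)
![](https://i.imgur.com/c0ucrV2.png)

The figure shows that accuracy rises quickly with ReLU.
ReLU can train a model faster than sigmoid.

### Demo Overfitting

![](https://i.imgur.com/XEPSbOR.png)

Because the learning rate is fixed, accuracy is unstable: sometimes the optimizer has already reached the goal and then steps past it.
A large gap between testing and training results may indicate **overfitting**.

{%youtube iCtkEvMEhpc %}

**Solution**

![](https://i.imgur.com/lt1ptSN.png)

Make the learning rate **variable instead of fixed**.
Adjust the learning rate according to the size of the cross entropy: when the cross entropy is small, lower the learning rate, and when it is large, raise it.

> cross entropy
![](https://i.imgur.com/d257M4k.png)

## Convolutional Neural Networks (CNN)

> CNN = Convolution + Neural Networks

![](https://i.imgur.com/GQppHNm.png)

Designed for extracting features from images.

![](https://i.imgur.com/SI0AGqQ.png)

### Convolution

![](https://i.imgur.com/NcydXm0.png)

## Recurrent Neural Networks (RNN)

![](https://i.imgur.com/0iC9d1y.png)

Designed for extracting features from text.

## Long short-term memory (LSTM)

![](https://i.imgur.com/O72athG.png)

Suited to processing and predicting important events in a time series that are separated by very long intervals and delays.

> [Introduction to RNN and LSTM](https://brohrer.mcknote.com/zh-Hant/how_machine_learning_works/how_rnns_lstm_work.html)

# Other

## PCA (Principal Component Analysis)

Remove dependencies between features so the model does not learn the wrong thing from different features that carry the same information.
If x and y are linearly related, one of them can serve as the feature, which reduces the plot to a single dimension.

- Move the origin of the axes to the center of the data, then rotate the axes so that the variance of the data along the C1 axis is maximal, that is, the projections of all n data points onto that direction are spread out as much as possible. This means more information is kept. C1 becomes the first principal component.
- C2, the second principal component: find a C2 whose covariance (correlation) with C1 is 0, so that it does not overlap with the information in C1, and along which the variance of the data is as large as possible.

![](https://i.imgur.com/XnKVnMi.png)

# Attacking models

## Exercise 1

[Exercise 1 (/mlsec/adversarial_learn/classifier-poisoning.ipynb)](http://127.0.0.1:8888/notebooks/mlsec/adversarial_learn/classifier-poisoning.ipynb)

Train a machine learning model, then attack it.
This adversarial (poisoning) attack changes the model itself so that its decisions become wrong.

**Original**

![](https://i.imgur.com/hcCm4yw.png)

**After the attack**
A few anomalous points `*` are injected into the original model so that the decision line moves.

![](https://i.imgur.com/p8JzgNN.png)

## Exercise 2

[Exercise 2 (/mlsec/adversarial_learn/binary-classifier-evasion.ipynb)](http://127.0.0.1:8888/notebooks/mlsec/adversarial_learn/binary-classifier-evasion.ipynb)

Use misleading samples to fool the model.

- vectorizer: how often each token appears
- classifier: malicious or not?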
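The vectorizer-plus-classifier pipeline in this exercise can be evaded by padding a malicious sample with tokens the classifier treats as benign. The sketch below is a toy stand-in, not the notebook's model: `WEIGHTS`, `BIAS`, and the scoring rule are hypothetical values for a linear bag-of-words classifier, picked so that a single slightly negative token can outweigh the malicious ones when repeated often enough.

```python
# Hypothetical per-token weights of a linear bag-of-words classifier.
# Positive weights push toward "malicious" (1), negative toward "normal" (0).
WEIGHTS = {"<script>": 4.0, "alert": 2.5, "t/s": -0.01}
BIAS = 0.0

def score(tokens):
    """Sum the weight of every token occurrence; unknown tokens count as 0."""
    return BIAS + sum(WEIGHTS.get(tok, 0.0) for tok in tokens)

def classify(tokens):
    """Return 1 (malicious) if the score is positive, else 0 (normal)."""
    return 1 if score(tokens) > 0 else 0

payload = ["<script>", "alert", "</script>"]
padded = payload + ["t/s"] * 1000  # same payload, padded with a benign-looking token

print("original:", classify(payload), f"score={score(payload):.2f}")
print("padded:  ", classify(padded), f"score={score(padded):.2f}")
```

The payload itself is unchanged in the padded sample. Only the token counts seen by the classifier differ, which is why a model that relies purely on token frequency is exposed to this kind of padding.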
With access to a model, you can reverse-engineer someone else's algorithm.
Stuffing a spam message with fragments that are not spam may lead the model to conclude that the message is not spam.

![](https://i.imgur.com/MMmfF6G.png)

Cell [13] finds `t/s`, the string with the lowest spam weight.
The original malicious message (1) `<script>alert(1)</script>` is misclassified as a normal message (0) once 1000 copies of `t/s` are appended.

## Exercise 3

[Exercise 3 (/mlsec/log_attack_traffic_analysis/NetworkMonitor.ipynb)](http://127.0.0.1:8888/notebooks/mlsec/log_attack_traffic_analysis/NetworkMonitor.ipynb)

# Auto ML

## Difficult Task in ML

- Data Analysis/Data Purification
- (If NN is used) Find the best network structure
- Hyperparameter Tuning

## Importance of architectures for Vision

- Designing neural network architectures is hard
- Lots of human effort goes into tuning them
- There is not a lot of intuition about how to design them well
- Can we try to learn good architectures automatically?

Try using ML to train ML.

> Before: Solution = ML expertise + Data + Computation
> Now: Solution = Data + 100\*Computation

Repeat many times, changing the parameters slightly each time, enumerate all the possible cases, and pick the best.

![](https://i.imgur.com/DsQbkbj.png)

[Exercise 1 /mlsec/preprocess/sklearn-gridsearch.ipynb](http://127.0.0.1:8888/notebooks/mlsec/preprocess/sklearn-gridsearch.ipynb)

[Exercise 2 /mlsec/preprocess/missing-values-imputer.ipynb](http://127.0.0.1:8888/notebooks/mlsec/preprocess/missing-values-imputer.ipynb)

# Interpretable Machine Learning

![](https://i.imgur.com/2bPlwZ3.png)

"The machine made that choice" is not an acceptable justification on its own: it is too irresponsible.
If we do not know what the ML model actually learned, we cannot tell where the problem lies when it goes wrong.
If we can find out what the machine is thinking, or what it has learned, this problem can be solved.

- Problems
    - Debugging the model
    - Determine whether the model is consistent with our knowledge
    - Judge the decision of a model
    - What knowledge or information the model can teach us

> We need to understand the model we trained: interpretability, or reasoning

## Explain Data

Understand the data.

- Before building any model
- Visualization for data exploration
- Exploratory data analysis

![](https://i.imgur.com/PjhU51Y.png)

## Interpretable models

Make the model simple and easy to understand from the start.

### Algorithm

- SVM
- Decision Tree
- Bayesian
- Markov Chain
- NN (may turn out too complex)
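As a minimal sketch of what "simple and easy to understand from the start" can look like, the code below fits a one-rule decision stump and prints it as a sentence a person can read. The six samples and their feature names (`entropy`, `imports`, `malicious`) are hypothetical, and `fit_stump` is an illustrative helper rather than a library function.

```python
def fit_stump(rows, features, target):
    """Return the (feature, threshold, accuracy) of the best single rule
    'feature > threshold => target is True' on the training rows."""
    best = None
    for feature in features:
        for threshold in sorted({r[feature] for r in rows}):
            correct = sum((r[feature] > threshold) == r[target] for r in rows)
            acc = correct / len(rows)
            if best is None or acc > best[2]:  # keep the first best rule found
                best = (feature, threshold, acc)
    return best

# Hypothetical samples: section entropy, number of imports, and the label.
rows = [
    {"entropy": 7.9, "imports": 5,  "malicious": True},
    {"entropy": 7.5, "imports": 40, "malicious": True},
    {"entropy": 7.2, "imports": 12, "malicious": True},
    {"entropy": 5.1, "imports": 30, "malicious": False},
    {"entropy": 4.8, "imports": 8,  "malicious": False},
    {"entropy": 6.0, "imports": 55, "malicious": False},
]

rule = fit_stump(rows, ["entropy", "imports"], "malicious")
feature, threshold, acc = rule
print(f"IF {feature} > {threshold} THEN malicious (training accuracy {acc:.2f})")
```

The whole model is the one printed rule, so anyone reviewing a decision can see exactly which feature and cut-off produced it. That is the trade being made: less expressive power in exchange for a model that explains itself.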
![](https://i.imgur.com/I7yksV7.png)

## Explain Model

- Understand what the model is thinking and what influences its decisions

![](https://i.imgur.com/9BCFH98.jpg)

- Plot the weights used in the decision and inspect which are high and which are low

![](https://i.imgur.com/n6Znv5F.png)

- Vary the inputs and see whether the model produces different results for different data

![](https://i.imgur.com/qnhBf5z.png)

## Local explanation

### LIME

![](https://i.imgur.com/V7RZ7W5.png)

Split the image into several pieces, then remove some of them and see whether the overall prediction changes.

![](https://i.imgur.com/PWvL3Wt.png)

[Exercise (/mlsec/model_reasoning/lime-explainability-spam-fighting.ipynb)](http://127.0.0.1:8888/notebooks/mlsec/model_reasoning/lime-explainability-spam-fighting.ipynb)

Feed in the data of a single email and see why it was labeled as spam.

![](https://i.imgur.com/K2l14x5.png)

### SHAP

Remove a few features and look at the difference between the two results.

- SHAP is black-box agnostic and provides local explainability via additive feature attribution
- SHAP values quantify feature importance
- They correspond to the average distance between the model's answer and the answers of all possible reduced models that omit a feature

![](https://i.imgur.com/m1XypEf.png)

## Global explanation

> Global interpretability helps us understand the relationship between each feature and the predicted value. It also gives us a sense of the direction and the numeric scale of each feature's effect on the prediction.

First train a complex classifier, then feed data into it and collect the corresponding outputs.

![](https://i.imgur.com/Lhpz1iX.png)

Then use the data collected above to build a Decision Tree, which gives a simple classifier.

![](https://i.imgur.com/NeOEZO1.png)

# Lab details

[Lab 0-1](/6uiWpQ6STKixyvs63UNu2Q)
[Lab 1-2](/rqCmvjG6QdKvnx2kW5amsA)
[Lab 1-3](/VQyx-xSMTJ-dI7NQT4wc8w)

<style> span.hidden-xs:after { content: ' × ML Security' !important; } </style>

###### tags: `ML Security`