Demo_11 - HackMD

--- title: Demo_11 tags: MF --- # Fraudulent Firms Detection via Binary Classification in Audit Risk Scores #### keywords: *fraud detection*, *binary classification*, *audit risk*, *logistic regression*, *random forest*, *SVM*, *KNN*, *neural network* :::info 陽交大資財系機器學習與金融科技期末報告，由許智超，李亦涵，鄧惠文共同編輯。最後更新時間2021/11/22。 ::: --- ## 1. Motivations I am interested in building a ***classification model*** in detecting ***fraudulent firms***, to help auditors reduce the ***audit risk***. ![](https://i2.wp.com/www.iedunote.com/img/22727/audit-risk.jpg?resize=300%2C300&quality=100&ssl=1) --- ## 2. Data Visualization (EDA) 有關各領域(sectors)的公司數量： |||| |:---:|:---:|:---:| |灌溉Irrigation (114)|公共衛生Public Health (77)|建築與道路Buildings and Roads (82)| |森林Forest (70)|一般企業Corporate (47)|畜牧業Animal Husbandry (95)| |傳播Communication (1)|電子Electrical (4)|土地Land (5)| |科學科技業Science and Technology (3)|旅遊業Tourism (1)|漁業Fisheries (41)| |工業Industries (37)|農業Agriculture (200)| --- ### 名詞解釋 - **財報審計**: 1. 財報是否有恰當地編制? 2. 公允反應財務信息? - **審計風險(audit risk)**: 審計師發表不恰當的審計意見的風險 $$審計風險(audit\ risk) = $$ $$=重大錯報風險(material\ misstatement\ risk) \times 檢查風險(detection\ risk)$$ $$= (固有風險(inherent\ risk) \times 控制風險(control\ risk)) \times 檢查風險(detection\ risk)$$ --- - **重大錯報風險(material misstatement risk)**: 1. 與審計師無關 2. 固有風險 $\times$ 控制風險 - **固有風險(inherent risk)**: > 例如:會計人員將200元記錄成200萬元的風險。 - **控制風險(control risk)**: > 例如:會計人員將200元記錄成200萬元了，而會計主管也未能即時發現並糾正的風險。 - **檢查風險(detection risk)**: 1. 只與審計師相關 2. 不可能將發生率降至0 > 例如: 會計人員將200元記錄成200萬元了，而會計主管也未能即時發現並糾正，然後審計人員測試時還是沒有發現的風險。 **條件機率！？** --- ### 變數說明 ### 固有風險因子(Inherent Risk Factors) |Variables|Columns|Descriptions| |---|---|---| |$x_A$|**PARA_A**|稽核(inspection)中的*計畫支出(planned expenditure)* 總額，與財報A(summary report A)之間，分歧(discrepancy)的總金額，單位是印度貨幣「盧比Rs」。我們代稱此為「A情況」| |$p_A$|**Score_A**|上述分歧發生的機率值，此稱「分數A」| |$risk_A$|**Risk_A**|風險分數A = A情況損失總額 $\times$ 機率值 = "PARA_A" $\times$ "Score_A"| --- |Variables|Columns|Descriptions| |---|---|---| |$x_B$|**PARA_B**|稽核(inspection)中的*非計畫支出(unplanned expenditure)* 總額，與財報B(summary report B)之間，分歧(discrepancy)的總金額，單位是印度貨幣「盧比Rs」。我們代稱此為「B情況」| |$p_B$|**Score_B**|上述分歧發生的機率值，此稱「分數B」| |$risk_B$|**Risk_B**|風險分數B = B情況損失總額 $\times$ 機率值 = "PARA_B" $\times$ "Score_B"| |<hr/>|<hr/>| |$total_{AB}$|**TOTAL**|上述兩種分歧情況加總的總損失金額，也就是"PARA_A" $+$ "PARA_B"| --- |Variables|Columns|Descriptions| |---|---|---| |$x_c$|**numbers**|歷史資料的分歧分數(historical discrepancy score)| |$p_c$|**Risk_B.1**|對應上述歷史資料分歧分數，類似機率值/權重的東西| |$risk_C$|**Risk_C**|風險分數C = 歷史分歧分數 $\times$ 對應權重 = "numbers" $\times$ "Risk_B.1"| --- |Variables|Columns|Descriptions| |---|---|---| |$x_D$|**Money_Value**|過去的審計中，錯報(misstatements)的金額總額| |$p_D$|**Score_MV**|對應上述歷史錯報金額總額，類似機率值/權重的東西| |$risk_D$|**Risk_D**|風險分數D = 歷史錯報金額總額 $\times$ 對應權重 = "Money_Value" $\times$ "Score_MV"| |<hr/>|<hr/>| |$total_{in}$|**Inherent_Risk**|就是總固有風險分數| --- ### 控制風險因子(Control Risk Factors) |Variables|Columns|Descriptions| |---|---|---| |$x_{SS}$|**Sector_score**|根據每筆資料所屬的部門(sector)，使用一些分析程序計算出來的，該部門的歷史風險分數(historical risk score)| --- |Variables|Columns|Descriptions| |---|---|---| |$x_E$|**District_Loss**|一個區域(district)在*近10年內(in the last 10 years)* 的歷史風險分數(historical risk score) |$p_E$|**PROB**|對應上述某區域10年內歷史風險分數，類似機率值/權重的東西| |$risk_E$|**RiSk_E**|風險分數E = 某區域10年內歷史風險分數 $\times$ 對應權重 = "District_Loss" $\times$ "PROB"| --- |Variables|Columns|Descriptions| |---|---|---| |$x_F$|**History**|此公司*近10年內*的歷史損失(historical loss)的平均(average)| |$p_F$|**Prob**|對應上述公司10年內的歷史損失平均，類似機率值/權重的東西| |$risk_F$|**Risk_F**|風險分數F = 公司10年內的歷史損失平均 $\times$ 對應權重 = "History" $\times$ "Prob"| |<hr/>|<hr/>| |$total_{co}$|**CONTROL_RISK**|就是總控制風險分數| --- ### 檢查風險因子(Detection Risk Factors) |Variables|Columns|Descriptions| |---|---|---| |$risk_{DR}$|**Detection_Risk**|就是檢查風險分數| --- ### 其他特徵(Other Features) |Variables|Columns|Descriptions| |---|---|---| |$x_{loc}$|**LOCATION_ID**|每筆資料所屬城市/省份的唯一ID編號| |$x_s$|**Score**|不明分數，資料來源論文並未敘明| --- |Variables|Columns|Descriptions| |---|---|---| |$y_{score}$|**Audit_Risk**|使用一些分析程序計算出來的總風險分數(total risk score)| |$y$|**Risk**|每筆資料是否為詐欺公司的二元類別變數。***This is the target feature!***| --- ### EDA部份 ```python= #匯入函式庫 import numpy as np import pandas as pd import matplotlib.pyplot as plt #這一行魔法函式可以省略每次的plt.show()，直接顯示 %matplotlib inline import seaborn as sns import os #因為我們沒有testing data，所以要切一份出來 #以下EDA將會進行於trainig data from sklearn import model_selection #圖片樣式美化 plt.style.use('dark_background') ``` ```python= #切換工作目錄 os.chdir("C:\\Users\\GL75\\OneDrive\\桌面\\機器學習與金融科技\\Project") #讀入資料 data = pd.read_csv('audit_data.csv') ``` ```python= #觀察資料格式 #顯示所有的行，為了觀察各變數名稱，以及它們之間的關係 pd.options.display.max_columns = None data.head() #展示頭5筆的資料 ``` --- 資料的樣子 ![](https://i.imgur.com/J8H9lwE.png) ![](https://i.imgur.com/YaFyqtM.png) ![](https://i.imgur.com/7CgRbdn.png) --- ```python= #列出所有行的標題，也就是變數名稱 data.columns ``` ![](https://i.imgur.com/eEc9Eb4.png) --- ### 資料前處理重新命名 ```python= #有些變數名稱不太統一，我們稍微修正，統一大小寫格式 data.rename(columns = {'RiSk_E':'Risk_E', 'LOCATION_ID':'Location_ID', 'PARA_A': 'Para_A', 'PARA_B':'Para_B', 'TOTAL': 'Total', 'PROB': 'ProbA', 'Prob': 'ProbB', 'CONTROL_RISK': 'Control_Risk'}, inplace = True) ``` 刪除變數Total ```python= #因為變數TOTAL其實就等於PARA_A+PARA_B，沒有額外貢獻資訊 #因此我們drop掉這個變數 data.drop(['Total'], axis = 1, inplace = True) #inplace = True可以直接就地修改 ``` --- 沒有缺失值 ```python= data.info() #可以看到總共有776筆資料，全部變數都沒有缺失值(null value) #除了"District_Loss", "History"以及目標變數"Risk"是整數型別(int32)，"LOCATION_ID"是某種object以外 #其他所有變數都是浮點數型別(float64) ``` ![](https://i.imgur.com/ODOlOTd.png) --- 敘述統計量 ```python= #以下列出基本的敘述統計量(descriptive statistics) data.describe() ``` ![](https://i.imgur.com/LtrkLuC.png) ![](https://i.imgur.com/bW9unBo.png) ![](https://i.imgur.com/hpJoKu3.png) --- 刪除變數"Detection_Risk": 1. mean = Q1 = Q3 = 0.5 2. standard deviation = 0 全是0.5？！by nunique() ... indeed! ```python= data['Detection_Risk'].nunique() ``` ![](https://i.imgur.com/DZNWZX1.png) ```python= data.drop(['Detection_Risk'], axis = 1, inplace = True) ``` --- 刪除變數"Location_ID": 1. Unknown Data type 2. Not helpful ```python= data['Location_ID'].dtypes ``` ![](https://i.imgur.com/BnF6Wls.png) ```python= data.drop(['Location_ID'], axis = 1, inplace = True) ``` --- 相關係數 ```python= #首先是熱力圖heatmap plt.figure(figsize = (20, 20)) sns.heatmap(data.corr(), annot = True, cmap = 'coolwarm') plt.savefig("heatmap.png") ``` ![](https://i.imgur.com/tivKlTE.png) --- 1. "Sector_score"跟其他變數之間都呈現負相關。 > 特別是"Score_A", "Score_MV", "Score" 和 "Risk"。 2. 有些變數間有極高的正相關: * **完全正相關(1或0.99)**: |||||| |:---:|:---:|:---:|:---:|:---:| |Para_A|Para_B|Score_B.1|Money_Value|Risk_F| |Risk_A|Risk_B|Risk_C|Risk_D|History| * **極高度正相關(0.9以上)**: ||||||| |:---:|:---:|:---:|:---:|:---:|:---:| |numbers|numbers|District_Loss|Score_B|Audit_Risk|Audit_Risk| |Score_B.1|Risk_C|Risk_E|Score|Para_B|Risk_B| 保留變數： > **完全正相關**: "Risk_A"、"Risk_B"、"Risk_C"、"Risk_D" 和 "Risk_F" > **極高度正相關**: "Score_B.1", "Risk_E"。至於變數"Score"因為我們不了解，因此不會隨意刪除。故"Score"與"Score_B"這組2個都保留。 3. 刪除變數"Audit_Risk": A. 受所有risk分數影響 B. 目標變數(target variable)"Risk"的判斷基準 ```python= data.drop(["Para_A","Para_B","Score_B.1","Money_Value","History","numbers","District_Loss","Audit_Risk"], axis = 1, inplace = True) ``` --- 刪除變數後的correlation heatmap -> better! ```python= plt.figure(figsize = (20, 20)) sns.heatmap(data.corr(), annot = True, cmap = 'coolwarm') plt.savefig("heatmap_deleteHighCorr.png") ``` ![](https://i.imgur.com/iUmdXxh.png) 扣除相關係數後，其他變數之間，彼此已經沒有嚴重的高相關性了。 --- 相關性：目標變數 **"Risk"** vs. 其他 ```python= #首先，將Risk這個變數獨立出來，叫做y，之後要當target用的，所以不要混在一起 y = data.Risk #除此之外的其他資料就是我們要用來分類的變數dataset，叫做x x = data.drop(['Risk'], axis = 1) x.corrwith(y).plot.bar(figsize = (20, 10), fontsize = 20, grid = True, color = ['r','g','b'] * 3).set_title("Correlation between 'Risk' and other variables", fontsize = 20) plt.savefig("risk_corr_with_others.png") ``` ![](https://i.imgur.com/X7N8Hxf.png) 如前所述，除了變數**Sector_score**和所有其他變數之間都呈現*負相關*以外，其他變數都和目標變數**Risk**之間都呈現或多或少的正相關(相關係數從0.2~0.8左右)。 --- ```python= y = pd.DataFrame({'Risk':data['Risk']}) #把y從pandas的Series轉成DataFrame #以便後續分析使用 ``` --- 樣本中的**違約比例** ```python= sns.countplot(y['Risk'], label = 'Count') ``` ![](https://i.imgur.com/6Z0LpPQ.png) ```python= y.value_counts(normalize = True) ``` ![](https://i.imgur.com/eMzHOq8.png) > About **39.3%** --- ## 3. Problem formulation and our methods $$ Audit\ Risk=f(Inherent\ Risk, Control\ Risk, Detection\ Risk)\\ = f(\sum_{i=A}^Drisk_i+\sum_{j=E}^Frisk_j+risk_{DR})=f(total\ risk \ score)=f(y_{score})\\ \begin{cases} 1 & \text{if total risk score} \ge {1}\\ 0 & \text{if total risk score < 1} \end{cases}\\ $$ $$ \text{Prediction = 1 denotes that the firm might be fraudulent, 0 otherwise.} $$ 我們想要找出模型$f()$，這個模型使用**審計風險**中的各種風險分數當成輸入，輸出一個二元預測值，代表這筆觀察值的公司**是否有詐欺風險存在**。越準確越好，使用**混淆矩陣**，以及其他有關方法測量(準確率、精確率、召回率、F1分數、ROC以及AUC等等)。 **Benchmark Method**: 伯努力試驗(*Bernoulli trials*) $n$ 次。 1. 0/1分別代表可能不是/是詐欺公司。 2. $n$ = 訓練/測試資料集長度 3. $p$ = 樣本中的詐欺公司比例 $\approx$ **39.3%** **Models Used**: 1. 羅吉斯回歸(Logistic Regression) 2. 隨機森林(Random Forest) :trophy: 3. K-近鄰演算法(K Nearest Neighbor) 4. 支持向量機(Support Vector Machine) 5. 簡單神經網路(Vanilla Neural Network) **In-sample and Out-of-sample Analysis**: ```python= from sklearn.model_selection import train_test_split ``` 1. 訓練資料：驗證資料 = 3：1 2. 用ground truth的比例去做**分層抽樣 (stratified sampling)** 3. 隨機打亂資料 (*seed = 42*) ```python= x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, stratify = y, random_state = 42) ``` --- ## 4. Analysis and Conclusion ### 切割驗證資料集請見3. problem formulation中的**In-sample and Out-of-sample Analysis** ### 資料前處理 (Data Preprocessing) 使用*sklearn*的**StandardScaler**將所有變數**標準化** > 都是連續型 ```python= from sklearn.preprocessing import StandardScaler ``` ```python= stdScaler = StandardScaler() #我們先用fit_transform()找出訓練資料的統計量值特徵，再用這些將訓練資料標準化 x_train_scaled = pd.DataFrame(stdScaler.fit_transform(x_train)) #然後再對測試資料直接用transform()，拿訓練資料的這些統計量值，對測試資料標準化。 #原因是我們假設他們出自相同分布，所以當然沒必要重新fit一次 x_test_scaled = pd.DataFrame(stdScaler.transform(x_test)) #要注意fit_transform()和transform不能倒過來用，不然會報錯！ ``` 將訓練資料以及驗證資料的ground truth都轉成**ndarray**格式 ```python= y_train_np = y_train['Risk'].to_numpy() ground_truth = y_test['Risk'].to_numpy() ``` --- ### Benchmark Method: Binomial Sampling > $n$ = 訓練/測試資料集長度 > $p$ = 39.3% ```python= np.random.seed(42) n_train = len(x_train_scaled.index) #縮放後訓練資料集的長度 n_test = len(x_test_scaled.index) #縮放後測試資料集的長度 p = y.value_counts(normalize = True)[1] #樣本中的有審計風險的公司比例約為39.3%，以此做為估計量 #從進行n次的白努力試驗(Bernoulli's Trials)，機率約為39.3%，n取決於資料集長度 benchmark_train = np.random.binomial(1, p, n_train) benchmark_test = np.random.binomial(1, p, n_test) ``` --- ### Cross Validation for Benchmark Method 之後所有方法都用**K-fold validation**，$K=10$ ```python= Kfold = model_selection.KFold(n_splits = 10, random_state = 42, shuffle = True) acc = [] for _, test_idx in Kfold.split(x_train_scaled): predicted = benchmark_train[test_idx] ground_truth = y_train['Risk'].to_numpy()[test_idx] #計算這次fold的驗證資料的準確率(accuracy) acc.append((predicted == ground_truth).sum() / len(test_idx)) ``` 驗證資料的準確率(accuracy)約為 > 50.7% ```python= acc_val = sum(acc) / len(acc) print(acc_val) ``` ![](https://i.imgur.com/kcL4FEd.png) --- ### Model Evaluation for Benchmark Method 先求出真實答案&預測結果 ```python= ground_truth = y_test['Risk'].to_numpy() predicted = benchmark_test ``` 使用*sklearn*套件中，*metrics*包裡面的**accuracy_score, f1_score, precision_score, recall_score, roc_auc_score**等方法計算**準確率、F1分數、精確率、召回率、ROC的AUC** ```python= from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score ``` 一樣，往後其他模型也都這樣做。 ```python= accuracy = accuracy_score(ground_truth, predicted) precision = precision_score(ground_truth, predicted) recall = recall_score(ground_truth, predicted) F1 = f1_score(ground_truth, predicted) ROC = roc_auc_score(ground_truth, predicted) results_benchmark = pd.DataFrame([['Benchmark Method', accuracy, acc_val, precision, recall, F1, ROC]], columns = ['Model', 'Accuracy','Cross Val Accuracy', 'Precision', 'Recall', 'F1 Score','ROC']) results_benchmark ``` ![](https://i.imgur.com/YTzDvYi.png) --- ### Logistic Regression ![](https://media.geeksforgeeks.org/wp-content/uploads/20200522215734/download35.jpg) 1. 加上L1懲罰項做正則化 > 減弱羅吉斯回歸模型的精度，提高泛化效果，避免過擬合(overfitting) 2. 隨機打亂數據 > *seed = 42* 方便比較 ```python= from sklearn.linear_model import LogisticRegression logReg = LogisticRegression(random_state = 42, penalty = 'l1', solver='liblinear') #l1 penalty logReg.fit(x_train_scaled, y_train_np) ``` --- ### Cross Validation for Logistic Regression 往後方法都透過*sklearn*套件中，*model_selection*包裡面的現成函數 **cross_val_score**來進行**交叉驗證** ```python= from sklearn.model_selection import cross_val_score ``` **10-fold validation** ```python= Kfold = model_selection.KFold(n_splits = 10, random_state = 42, shuffle = True) scoring = 'accuracy' #我們要看交叉驗證的準確率如何 ``` 驗證資料的準確率(accuracy)約為 > 96.4% ```python= acc_logReg = cross_val_score(estimator = logReg, X = x_train_scaled, y = y_train_np, cv = Kfold, scoring = scoring) acc_val = acc_logReg.mean() print(acc_val) ``` ![](https://i.imgur.com/l5ngEVG.png) --- ### Model Evaluation for Logistic Regression 先求出真實答案&預測結果 ```python= ground_truth = y_test['Risk'].to_numpy() predicted = logReg.predict(x_test_scaled) ``` 使用*sklearn*的現成函數計算performances ```python= accuracy = accuracy_score(ground_truth, predicted) precision = precision_score(ground_truth, predicted) recall = recall_score(ground_truth, predicted) F1 = f1_score(ground_truth, predicted) ROC = roc_auc_score(ground_truth, predicted) results_logReg = pd.DataFrame([['Logistic Regression', accuracy, acc_val, precision, recall, F1, ROC]], columns = ['Model', 'Accuracy','Cross Val Accuracy', 'Precision', 'Recall', 'F1 Score','ROC']) results_logReg ``` ![](https://i.imgur.com/nyMNNqB.png) --- ### Random Forest ![](https://miro.medium.com/max/602/1*iWHiPjPv0yj3RKaw0pJ7hA.png) > Random Forest = Bagging + Decision Tree --- **Ensemble Learning** a.k.a. 盲人摸象訓練術XD 有下列2種技巧，以面臨學校大考的例子說明。 **Bagging** > 重複抽樣的概念。 > 狂買參考書來寫 **Boosting** > 和bagging類似，但更注重改正錯誤 > 也是買參考書，但加強比較弱的、之寫寫錯的部分 ![](https://quantdare.com/wp-content/uploads/2016/04/bb3-800x307.png) --- 參數說明： 1. **n_estimators**: 這個隨機森林裡面要包含幾棵決策樹? > 100棵樹 2. **criterion**: 用來選決策樹的分割特徵的準則 > a. Gini Impurity > b. Information Gain (by entropy) :heavy_check_mark: 3. 隨機打亂數據 > *seed = 42* 方便比較 ```python= from sklearn.ensemble import RandomForestClassifier RandF = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 42) RandF.fit(x_train_scaled, y_train_np) ``` --- ### Cross Validation for Random Forest **10-fold validation** ```python= Kfold = model_selection.KFold(n_splits = 10, random_state = 42, shuffle = True) scoring = 'accuracy' #我們要看交叉驗證的準確率如何 ``` 驗證資料的準確率(accuracy)約為 > 99.1% ```python= acc_RandF = cross_val_score(estimator = RandF, X = x_train_scaled, y = y_train_np, cv = Kfold, scoring = scoring) acc_val = acc_RandF.mean() print(acc_val) ``` ![](https://i.imgur.com/oI2tUEZ.png) --- ### Model Evaluation for Random Forest 先求出真實答案&預測結果 ```python= ground_truth = y_test['Risk'].to_numpy() predicted = RandF.predict(x_test_scaled) ``` 使用*sklearn*的現成函數計算performances ```python= accuracy = accuracy_score(ground_truth, predicted) precision = precision_score(ground_truth, predicted) recall = recall_score(ground_truth, predicted) F1 = f1_score(ground_truth, predicted) ROC = roc_auc_score(ground_truth, predicted) results_RandF = pd.DataFrame([['Random Forest', accuracy, acc_val, precision, recall, F1, ROC]], columns = ['Model', 'Accuracy','Cross Val Accuracy', 'Precision', 'Recall', 'F1 Score','ROC']) results_RandF ``` ![](https://i.imgur.com/gooRdaA.png) 隨機森林的表現近乎完美！訓練資料完全命中！ --- ### K Nearest Neighbor (KNN) Algorithm ![](https://miro.medium.com/max/612/1*gC4Cr1MkXBMRgJSbHwHOpw.png) 參數說明：**(都使用預設參數，未做更動)** 1. **n_neighbors**: $K$的大小，就是要參考附近幾個點? > $K=5$ 2. **weights**: 附近點的權重? > 'uniform' (權重都均等) 3. **algorithm**: 使用的演算法 > a. brute force > b. kd_tree > c. ball_tree > d. auto :heavy_check_mark: 4. **leaf_size**: kd樹或ball樹的大小 > 30 (default) 5. **p = 2** > 歐式距離 6. **metric**: 度量距離的方式 > 'minkowski' (閔氏距離，就是p=2的歐式距離) ```python= from sklearn.neighbors import KNeighborsClassifier KNN = KNeighborsClassifier() KNN.fit(x_train_scaled, y_train_np) ``` --- ### Cross Validation for KNN **10-fold validation** ```python= Kfold = model_selection.KFold(n_splits = 10, random_state = 42, shuffle = True) scoring = 'accuracy' #我們要看交叉驗證的準確率如何 ``` 驗證資料的準確率(accuracy)約為 > 96.6% ```python= acc_KNN = cross_val_score(estimator = KNN, X = x_train_scaled, y = y_train_np, cv = Kfold, scoring = scoring) acc_val = acc_KNN.mean() print(acc_val) ``` ![](https://i.imgur.com/HFSMJzR.png) --- ### Model Evaluation for KNN 先求出真實答案&預測結果 ```python= ground_truth = y_test['Risk'].to_numpy() predicted = KNN.predict(x_test_scaled) ``` 使用*sklearn*的現成函數計算performances ```python= accuracy = accuracy_score(ground_truth, predicted) precision = precision_score(ground_truth, predicted) recall = recall_score(ground_truth, predicted) F1 = f1_score(ground_truth, predicted) ROC = roc_auc_score(ground_truth, predicted) results_KNN = pd.DataFrame([['K Nearest Neighbor', accuracy, acc_val, precision, recall, F1, ROC]], columns = ['Model', 'Accuracy','Cross Val Accuracy', 'Precision', 'Recall', 'F1 Score','ROC']) results_KNN ``` ![](https://i.imgur.com/q6BxjXQ.png) 雖不及隨機森林和羅吉斯回歸的效果，但仍然相當好，都有90%以上。推測可能是因為只看附近的，所以效果上才不如前述兩者。 --- ### Support Vector Machine (SVM) ![](https://cg2010studio.files.wordpress.com/2012/05/non-linear-svms.jpg?w=540) 參數說明：**(部分使用預設參數)** 1. **C**: 錯誤項的懲罰係數越大訓練樣本越準，但泛化能力降低 > $C=1$ (default) 2. **kernel**: 核函數可想像成把資料「投影」到更高維度，以方便分類的「投影函數」 > 'rbf' (default) > 'linear' :heavy_check_mark: 3. **degree, gamma, coef0**: (略) > 我們選擇linear kernel function，用不到這3個參數 4. **probability**: 是否啟用機率估計 > False (default) > True :heavy_check_mark: 5. 隨機打亂數據 > *seed = 42* 方便比較 6. 其他參數保持預設，不一一列出 ```python= from sklearn.svm import SVC SVM = SVC(kernel = "linear", probability = True, random_state = 42) SVM.fit(x_train_scaled, y_train_np) ``` --- ### Cross Validation for SVM **10-fold validation** ```python= Kfold = model_selection.KFold(n_splits = 10, random_state = 42, shuffle = True) scoring = 'accuracy' #我們要看交叉驗證的準確率如何 ``` 驗證資料的準確率(accuracy)約為 > 97.6% ```python= acc_SVM = cross_val_score(estimator = SVM, X = x_train_scaled, y = y_train_np, cv = Kfold, scoring = scoring) acc_val = acc_SVM.mean() print(acc_val) ``` ![](https://i.imgur.com/fI6zKJe.png) --- ### Model Evaluation for SVM 先求出真實答案&預測結果 ```python= ground_truth = y_test['Risk'].to_numpy() predicted = SVM.predict(x_test_scaled) ``` 使用*sklearn*的現成函數計算performances ```python= accuracy = accuracy_score(ground_truth, predicted) precision = precision_score(ground_truth, predicted) recall = recall_score(ground_truth, predicted) F1 = f1_score(ground_truth, predicted) ROC = roc_auc_score(ground_truth, predicted) results_SVM = pd.DataFrame([['Support Vector Machine', accuracy, acc_val, precision, recall, F1, ROC]], columns = ['Model', 'Accuracy','Cross Val Accuracy', 'Precision', 'Recall', 'F1 Score','ROC']) results_SVM ``` ![](https://i.imgur.com/x7tUgWJ.png) 結果略差於隨機森林，略好於KNN，和羅吉斯回歸差不多。 (RF > SVM $\approx$ LogReg > KNN) 仍然是相當不錯的分類表現，甚至都有超過95%。 --- ### Neural Network ![](https://www.tibco.com/sites/tibco/files/media_entity/2021-05/neutral-network-diagram.svg) 從**tensorflow**的**keras**匯入需要的方法，比如模型部分 ```python= from tensorflow.keras import backend as K from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense ``` --- 自定義計算精確率、召回率、F1分數的函式。(這部分參考老師的demo檔案) ```python= def recall_m(y_true, y_pred): true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) possible_positives = K.sum(K.round(K.clip(y_true, 0, 1))) recall = true_positives / (possible_positives + K.epsilon()) return recall def precision_m(y_true, y_pred): true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1))) precision = true_positives / (predicted_positives + K.epsilon()) return precision def f1_m(y_true, y_pred): precision = precision_m(y_true, y_pred) recall = recall_m(y_true, y_pred) return 2*((precision*recall)/(precision+recall+K.epsilon())) ``` --- ### 建造神經網路 1. 模型外面框架 ```python= NN = Sequential() ``` 2. 輸入層(input layer)：神經元個數 = 變數數量 = **15個** ```python= x_train_scaled.shape ``` ![](https://i.imgur.com/bgsJAFg.png) ```python= NN.add(Dense(16, activation = 'relu', input_shape = (x_train_scaled.shape[1], ))) ``` 3. 第一隱藏層(hidden layer): (程式碼請見上方) > 16個神經元 > 激活函數(activation function): ***ReLU*** 4. 第二隱藏層(hidden layer): > 128個神經元 > 激活函數: ***ReLU*** ```python= NN.add(Dense(128, activation = 'relu')) ``` 5. 輸出層(output layer) > 只有**1**個神經元 → 給出預測結果 > 激活函數: ***Sigmoid*** → 因為是0或1的*二元預測* --- ### 組裝神經網路模型 ```python= NN.compile(optimizer = "rmsprop", loss = "binary_crossentropy", metrics = ['accuracy', recall_m, precision_m, f1_m]) ``` > a. 優化器(optimizer): ***RMSProp*** > b. 損失函數(loss function): ***Binary Cross Entropy*** > c. 正確率衡量指標: 準確率***accuracy***、自定義的召回率、精確率、F1分數***recall_m, precision_m, f1_m*** --- ### 訓練神經網路 1. **20**個*epoch* 2. *batch大小* = **60** > 約訓練資料集的1/10大小 ```python= history = NN.fit(x_train_scaled, y_train_np, epochs = 20, batch_size = 60) ``` 訓練過程 ![](https://i.imgur.com/miUwYs0.png) --- ### 查看每次epoch的表現(plot it out) ```python= plt.figure(figsize = (25,5)) l = [ {'code':'loss', 'fullname':'Loss'}, {'code':'accuracy', 'fullname':'Accuracy'}, {'code':'precision_m', 'fullname':'Precision'}, {'code':'recall_m', 'fullname':'Recall'}, {'code':'f1_m', 'fullname':'F1 score'}, ] for i in range(5): plt.subplot(1, 5, i+1).set_title("Model "+l[i]["fullname"]) plt.plot(history.history[l[i]["code"]]) plt.ylabel(l[i]["fullname"]) plt.xlabel("Epoch") plt.tight_layout() plt.savefig("NN_performance_plots.png") ``` 由左至右：**Loss, Accuracy, Precision, Recall, F1 score** ![](https://i.imgur.com/0ljx028.png) 可以觀察到整體而言，損失有在下降，各項準確指標也有在上升。雖然上升過程有些微不穩定。四項準確指標都在約96-98%左右。 --- ### Model Evaluation for Neural Network 下列evaluation時的指標，我自己刻程式計算 ```python= if m['type'] == 'Neural Network': acc = (predicted == ground_truth).sum() / len(predicted) precision = ground_truth[np.where(predicted == 1)].mean() recall = predicted[np.where(ground_truth == 1)].mean() F1 = 2 * precision * recall / (precision + recall) print("accuracy:", acc) print("precision:", precision) print("recall:", recall) print("F1:", F1) print("AUC:", AUC) ``` ![](https://i.imgur.com/ioeuomW.png) --- ### ROC曲線&曲線下面積AUC 使用*sklearn*套件中，*metrics*包裡面的現成方法**roc_curve, roc_auc_score** ```python= from sklearn.metrics import roc_curve, roc_auc_score ``` 畫圖程式碼部分 ```python= y_test_np = y_test['Risk'].to_numpy() models = [ {'type': 'Benchmark', 'model': 'Benchmark'}, {'type': 'Logistice Regression', 'model': LogisticRegression(random_state = 42, penalty = 'l1', solver='liblinear')}, {'type': 'Random Forest', 'model': RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 42)}, {'type': 'K Nearest Neighbor', 'model': KNeighborsClassifier()}, {'type': 'Support Vector Machine', 'model': SVC(kernel = "linear", probability = True, random_state = 42)}, {'type': 'Neural Network', 'model': NN} ] #預先設定整張圖大小 plt.figure(figsize = (20,10)) #算出每個模型的ROC和AUC，然後把曲線疊加到圖中 for m in models: #我們需要特別處理我們手刻的benchmark，因為跟其他方法的結構不一樣，是自定義的 if m['type'] == 'Benchmark': predicted = benchmark_test #p = y.value_counts(normalize = True)[1]，樣本中有審計風險的公司比例。約為39.3% #n_test = len(x_test_scaled.index)，縮放後測試資料集的長度 pred_prob = np.array([p] * n_test) else: model = m['model'] if m['type'] == 'Neural Network': predicted = np.array([1 if model.predict(x_test_scaled)[i] >= 0.5 else 0 for i in range(len(model.predict(x_test_scaled)))], dtype = 'int64') #產生測試資料的預測結果(標籤) pred_prob = model.predict(x_test_scaled) #產生測試資料預測出的，每筆資料的各標籤的機率 else: model.fit(x_train_scaled, y_train_np) #訓練模型 predicted = model.predict(x_test_scaled) #產生測試資料的預測結果(標籤) pred_prob = model.predict_proba(x_test_scaled)[:, 1] #產生測試資料預測出的，每筆資料的各標籤的機率 #計算真陽性率(TP rate)和偽陽性率(FP rate) FP, TP, thresholds = roc_curve(y_test_np, pred_prob) #計算AUC，然後呈現在ROC圖的legend中 AUC = roc_auc_score(y_test_np, predicted) #接著，把結果疊加到ROC圖上 plt.plot(FP, TP, label = '%s ROC (area = %0.2f)' % (m['type'], AUC)) #一些圖片設定(標題、坐標軸標題等等) plt.plot([0, 1], [0, 1], 'k--') #參考線 plt.xlim([0.0, 1.0]) #限制x軸顯示範圍 plt.ylim([0.0, 1.05]) #限制y軸顯示範圍 plt.xlabel('Specificity(FP rate)') #x軸標題 plt.ylabel('Sensitivity(TP rate)') #y軸標題 plt.title('ROC Figure') #圖片標題 plt.legend() #圖片圖例 plt.savefig("ROC.png") #儲存圖片 ``` --- ### 各模型ROC&AUC結果比較 ![](https://i.imgur.com/jsHpzKN.png) --- ### :star: **Results** #### In-sample results (in percentage) This is the accuracy for validation data. |Measures |Benchmark Method|Logistic Regression|Random Forest|KNN|SVM|Neural Network| |---|:---:|:---:|:---:|:---:|:---:|:---:| | Accuracy |50.70|96.39|:crown: 99.14|96.56|97.59|97.77| #### Out-of-sample comparisons (in percentage) We consider a simple 75%-25% split on the data. In the test data, we have |Measures |Benchmark Method|Logistic Regression|Random Forest|KNN|SVM|Neural Network| |---|:---:|:---:|:---:|:---:|:---:|:---:| | Accuracy |51.55|98.45|:crown:100|96.91|99.48|97.94| | Precision|35.48|100|:crown:100|98.61|100|98.65| | Recall |28.95|96.05|:crown:100|93.42|98.68|96.05| |F1-score|31.88|97.99|:crown:100|95.95|99.34|97.33| |AUC|47.52|98.03|:crown:100|96.29|99.34|97.60| --- ### :star: **Conclusion** **Random Forest $>$ SVM $>$ Logistic Regression $>$ Neural Network $\gg$ KNN $>>>>>>>>>>$ Benchmark Method** 所以我們應該選取***隨機森林(Random Forest)***，當作我們的二元預測模型。 --- ### 混淆矩陣 (Confusion Matrix) 使用*sklearn*套件中，*metrics*包裡面的現成方法**confusion_matrix** ```python= from sklearn.metrics import confusion_matrix ``` 畫圖程式碼部分 ```python= predicted = [ {'model': 'Benchmark Method', 'predicted': benchmark_test}, {'model': 'Logistice Regression', 'predicted': logReg.predict(x_test_scaled)}, {'model': 'Random Forest', 'predicted': RandF.predict(x_test_scaled)}, {'model': 'K Nearest Neighbor', 'predicted': KNN.predict(x_test_scaled)}, {'model': 'Support Vector Machine', 'predicted': SVM.predict(x_test_scaled)}, {'model': 'Neural Network', 'predicted': np.array([1 if NN.predict(x_test_scaled)[i] >= 0.5 else 0 for i in range(len(NN.predict(x_test_scaled)))], dtype = 'int64')} ] plt.figure(figsize = (30, 20)) for i in range(len(predicted)): cm = confusion_matrix(y_test_np, predicted[i]['predicted']) plt.subplot(2, 3, i+1).set_title("Confusion Matrix of\n"+predicted[i]['model']) sns.heatmap(cm, annot = True, fmt = 'd') plt.savefig("Confusion_Matrixes.png") #annot: annotation標上數字。 #fmt = 'd': 標上去的數字取整數 ``` --- ### 各模型混淆矩陣結果比較從左上到右下： > **Benchmark Method/ Logistic Regression/ Random Forest** > **KNN/ SVM/ Neural Network** ![](https://i.imgur.com/uKn8dsu.png) 可以發現所有模型都遠勝於Benchmark Method。至於Benchmark Method外，其他5個模型之間的表現，請見***ROC&AUC***部分 --- ### Feature Importances (for random forest) 1. Random Forest = 很多 Decision Tree 2. 每棵樹都會有自己挑選的辦法 > 對每棵樹而言，分類用到的變數不同，重要性也不同 #### Q: What is the overall influence for each variable? --- 使用到隨機森林分類器的內建屬性(attribute) -> ***feature_importances_*** ```python= importances = RandF.feature_importances_ indices = np.argsort(importances)[::-1] #重新編排各個變數(特徵)，使得他們對應到上面的這個重要性順序 features = [x.columns[i] for i in indices] ``` 畫圖部分程式碼 ```python= #創造一張空圖 plt.figure() #圖片標題 plt.title("Feature Importances\n for Random Forest") #加上每個特徵重要性的柱狀圖 #range(x.shape[1])代表所有0~行數-1 plt.bar(range(x.shape[1]), importances[indices]) #把每個特徵(變數)的名字標註到x軸 plt.xticks(range(x.shape[1]), features, rotation = 90) #儲存圖片 plt.savefig("Feature Importances.png") ``` --- ### 變數/特徵的重要性排序圖 (Feature Importance Plot) ![](https://i.imgur.com/mE3xzDr.png) 可以看到最重要的分類特徵是變數**Inherent_Risk**，也就是說： **總固有風險**部分，是判斷一個公司是否可能詐欺的一項重要指標。 --- ## Future Works 如果能取得台灣公家企業(或普通企業)的風險分數資料，也可針對台灣的企業進行類似分析。 --- ## 5. References [我的資料來源 my dataset (from Kaggle)](https://www.kaggle.com/sid321axn/audit-data) [參考論文：Fraudulent Firm Classification: A Case Study of an External Audit](https://www.researchgate.net/publication/323655455_Fraudulent_Firm_Classification_A_Case_Study_of_an_External_Audit) [老師的demo檔案](https://drive.google.com/file/d/1C4tRbnbBhWcBpIsC4uaXkmbGBELegQPY/view?usp=sharing) --- ## 6. Data and Code Storage Please refer to my E3. ---