---
title: ML&F
tags: teaching
---

# ML & FinTech: Default Risk Assessment for Microloans

### Project by 黃子軒

#### keywords: microloans, default risk assessment

---

## 1. Motivations

---

[Once as many as 5,000: China's banking and insurance regulator says P2P online lending institutions have fallen to zero](https://udn.com/news/story/7333/5048403)

Microloans are a welcome option for some lower-income households and students. However, frequent defaults and the difficulty of regulating the sector have driven a large number of microloan companies out of business. This project therefore aims to build a system that helps lenders judge a customer's default risk and so reduce that risk.

---

## 2. Exploratory data analysis

Data source: https://www.kaggle.com/chenyangxxxxxxx/p2p-lending

### Data description

| Name | Description |
| --- | --- |
| Id | Borrower's id |
| Loan_amnt | Amount of loan |
| Funded_amnt | Amount of fund |
| Term | Term of loan |
| Int_rate | Interest rate |
| Installment | Monthly payment |
| Emp_length | Borrower's employment length |
| Home_ownership | Borrower's home ownership |
| Annual_inc | Borrower's annual income |
| Loan_status | Status of loan |
| Purpose | Purpose of the loan |
| Addr_state | Borrower's residence state |
| dti | Borrower's debt-to-income ratio |
| Delinq_2yrs | The frequency of the borrower's delinquencies in the last 2 years |
| earliest_cr_line | The month the borrower's earliest reported credit line was opened |
| Mths_since_last_delinq | Number of months since the borrower's last delinquency |
| open_acc | The number of open credit lines in the borrower's credit file |
| revol_bal | Total credit revolving balance |
| total_acc | The total number of credit lines currently in the borrower's credit file |
| out_prncp | Remaining outstanding principal for total amount funded |
| total_pymnt | Payments received to date for total amount funded |
| total_rec_prncp | Principal received to date |
| total_rec_int | Interest received to date |

### EDA

##### 10,000 records and 22 features in total

![](https://i.imgur.com/FjM7Waw.png)

##### Drop features that have many missing values and are less relevant, such as id, addr_state, earliest_cr_line, mths_since_last_delinq, purpose, and Unnamed: 0

##### 9,524 records and 17 features remain

![](https://i.imgur.com/8EqGO6d.png)

![](https://i.imgur.com/rpZospW.png)

##### Encode the categorical Loan_status with one-hot encoding: Current and Fully Paid count as good loans, and all other statuses count as defaulted loans

##### About 94% of the loans are good and 6% are defaulted

![](https://i.imgur.com/Hc8cpgV.png)

##### Encode the categorical term with label encoding

![](https://i.imgur.com/RAXVHGP.png)

##### Encode the categorical emp_length as an ordinal variable, ordered by length of employment

![](https://i.imgur.com/kKJsEV6.png)

##### Encode the categorical home_ownership and purpose with one-hot encoding: each category gets its own column, set to 1 if the record belongs to that category and 0 otherwise

![](https://i.imgur.com/WBlNEhS.png)

![](https://i.imgur.com/AtHpSU5.png)

##### The final dataset has 9,524 records and 34 features

![](https://i.imgur.com/GB5znJH.png)

##### Loan_status shows a relatively strong positive correlation with total_rec_prncp, total_pymnt, and out_prncp

![](https://i.imgur.com/604Bq3r.png)

---

## 3. Problem formulation

### Benchmark method

#### Logistic regression serves as the benchmark, with AUC as the evaluation metric

Y: Loan_status (1 = good loan, 0 = defaulted loan)

X: all remaining features

#### Split the data 8:2 with a train-test split

#### All data serve as the in-sample set; the test data serve as the out-of-sample set

| | In-sample | Out-of-sample |
| --- | --- | --- |
| AUC | 0.79 | 0.80 |

---

## 4. Data analysis

#### Analyze the data with RandomForest and lightgbm, with AUC as the evaluation metric

##### RandomForest

| | In-sample | Out-of-sample |
| --- | --- | --- |
| AUC | 0.97 | 0.81 |

##### lightgbm

| | In-sample | Out-of-sample |
| --- | --- | --- |
| AUC | 0.95 | 0.84 |

## 5. Conclusion

Out of sample, both lightgbm (AUC 0.84) and RandomForest (AUC 0.81) outperform the Logistic regression benchmark (AUC 0.80), with lightgbm performing best.

## 6. Reference

https://ithelp.ithome.com.tw/articles/10197461

https://www.gushiciku.cn/pl/pTU0/zh-tw

https://www.delftstack.com/zh-tw/howto/python/remove-nan-from-list-python/

## 7. Data and code

##### https://drive.google.com/drive/folders/1jufl8yPvmeKcTVYCdsK1ipeJaJ1KyoXI?usp=sharing

---
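The encoding steps described in Section 2 can be sketched in plain Python as follows. This is a minimal illustration and not the project's actual code, which lives in the shared folder above. The raw value formats (`" 36 months"`, `"10+ years"`, `"< 1 year"`), the category lists, the sample records, and the helper names `emp_length_to_years` and `encode_record` are all assumptions made for the example.

```python
import re

# Per Section 2: Current and Fully Paid are good loans, everything else is a default.
GOOD_STATUSES = {"Current", "Fully Paid"}

# Assumed category lists; the real dataset may contain other values.
TERMS = ["36 months", "60 months"]
HOME_OWNERSHIP = ["MORTGAGE", "OWN", "RENT"]


def emp_length_to_years(value):
    """Map an employment-length string to an ordinal number of years (assumed format)."""
    text = value.strip()
    if text.startswith("<"):          # "< 1 year" -> 0
        return 0
    match = re.search(r"\d+", text)   # "3 years" -> 3, "10+ years" -> 10
    if match is None:
        raise ValueError(f"unrecognised emp_length: {value!r}")
    return int(match.group())


def encode_record(record):
    """Turn one raw loan record (a dict of strings) into numeric features."""
    encoded = {
        # Target: 1 = good loan, 0 = defaulted loan.
        "loan_status": 1 if record["loan_status"] in GOOD_STATUSES else 0,
        # Label encoding: each term is replaced by its index in TERMS.
        "term": TERMS.index(record["term"].strip()),
        # Ordinal encoding by length of employment.
        "emp_length": emp_length_to_years(record["emp_length"]),
    }
    # One-hot encoding: one 0/1 column per home_ownership category.
    for category in HOME_OWNERSHIP:
        encoded[f"home_ownership_{category}"] = int(record["home_ownership"] == category)
    return encoded


# Hypothetical sample records, for illustration only.
raw = [
    {"loan_status": "Fully Paid", "term": " 36 months",
     "emp_length": "10+ years", "home_ownership": "RENT"},
    {"loan_status": "Charged Off", "term": " 60 months",
     "emp_length": "< 1 year", "home_ownership": "OWN"},
]
encoded_rows = [encode_record(r) for r in raw]
for row in encoded_rows:
    print(row)
```

With a DataFrame library the same steps are usually one call each, but writing them out by hand makes the difference between label, ordinal, and one-hot encoding explicit: the first two keep a single column, while one-hot adds one column per category, which is why the feature count grows from 17 to 34.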
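Every model in Sections 3 and 4 is scored by AUC. As a reminder of what that number means, the sketch below computes AUC directly from its ranking definition: the probability that a randomly chosen good loan receives a higher score than a randomly chosen defaulted loan, with ties counted as one half. The function name `auc_score` and the toy labels and scores are made up for illustration. In practice a library routine such as scikit-learn's `roc_auc_score` would normally be used; whether the project did so is not stated here.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    # Pairwise comparison is quadratic in the sample count; fine for a small illustration.
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = good loan, 0 = defaulted loan; scores are predicted P(good loan).
labels = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.5, 0.2]

toy_auc = auc_score(labels, scores)
print(f"AUC = {toy_auc:.3f}")  # 7 of the 8 pairs are ranked correctly -> AUC = 0.875
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one. Because roughly 94% of the loans are good, a model that always predicts "good" would reach about 94% accuracy while ranking nothing, which is why a ranking metric like AUC is a reasonable choice for this dataset.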