# ML & FinTech Project: Customer Credit Default Risk Prediction, by 李奕萱
#### Keywords: credit default risk, logistic regression, random forest, LightGBM
---
## Motivation
This project builds a **<font color="LightCoral">customer credit default risk prediction</font>** model.
### :star: The importance of the credit default prediction problem
[A review of the 2005-2006 Taiwan card debt crisis](https://www.npf.org.tw/2/3558)
Taiwan's 2005 credit card and cash card crisis shows the stakes: banks issued credit and cash cards excessively, and when cardholders could not repay, the accumulated card debt turned into bad debt for the banks.
### :star2: Improving credit default prediction models
[CTBC online loan: funds disbursed in 3 minutes](https://udn.com/news/story/7239/5754771)
To attract customers, today's lending businesses emphasize fast approval and convenient procedures, which puts the speed and accuracy of their credit default prediction models to the test.

---
## EDA
The data come from the [default of credit card clients Data Set, UCI](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).
### :star: Data description
| Feature name | Explanation |
| -------- | -------- |
| X1 | Amount of the given credit (NT dollar) |
| X2 | Gender (1=male; 2=female) |
| X3 | Education (1=graduate school; 2=university; 3=high school; 4=others) |
| X4 | Marital status (1=married; 2=single; 3=others) |
| X5 | Age (year) |
| X6-X11 | History of past payment (X6=repayment status in 2005/9; ...; X11=repayment status in 2005/4). The measurement scale for the status is: -1=pay duly; 1=payment delay for 1 month; 2=payment delay for 2 months; ...; 9=payment delay for 9 months and above |
| X12-X17 | Amount of bill statement (NT dollar) (X12=2005/9; ...; X17=2005/4) |
| X18-X23 | Amount of previous payment (NT dollar) (X18=amount paid in 2005/9; ...; X23=amount paid in 2005/4) |
### :star2: EDA

Briefly browse the data.
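
A minimal sketch of this first look, assuming the Kaggle CSV export of the dataset (the file name `UCI_Credit_Card.csv` and its path are assumptions):

```python=
import pandas as pd

# Kaggle export of the UCI dataset; file name/path are placeholders.
df = pd.read_csv('UCI_Credit_Card.csv')

print(df.shape)  # (30000, 25): ID + 23 features + target
print(df.head())
```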


There are no missing values, but there are some undocumented values in 'EDUCATION' and 'MARRIAGE' (for example, 0 has no defined meaning in 'MARRIAGE').
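
Continuing from the loading sketch above, these checks surface the undocumented codes (Kaggle column names assumed):

```python=
print(df.isnull().sum().sum())                      # 0 -> no missing values
print(df['EDUCATION'].value_counts().sort_index())  # 0, 5, 6 are undocumented
print(df['MARRIAGE'].value_counts().sort_index())   # 0 is undocumented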

About 22.1% of clients will default next month.

A higher percentage of male clients default next month.


The higher the education level, the lower the default rate.


Clients who are single have a lower percentage of defaulting next month.
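
The group-wise rates behind the last three observations can be reproduced with a `groupby`, continuing the sketch above and assuming the Kaggle target column name:

```python=
target = 'default.payment.next.month'  # Kaggle naming; the raw UCI file differs

print(df[target].mean())                 # overall default rate, about 0.221
print(df.groupby('SEX')[target].mean())  # 1 = male, 2 = female
print(df.groupby('EDUCATION')[target].mean())
print(df.groupby('MARRIAGE')[target].mean())
```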


Most clients are aged 26-35. For some ages the counts are unusually high, so the relationship between age and default is hard to observe directly.

Dividing 'AGE' into five groups, we can see that clients aged 30-40 have a somewhat lower default rate, though the differences among groups are not large.
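
The exact bin edges are not stated above, so the following binning is only illustrative (continuing the sketches above):

```python=
# Five age groups; the edges are an illustrative assumption.
target = 'default.payment.next.month'
age_group = pd.cut(df['AGE'], bins=[20, 30, 40, 50, 60, 80])
print(df.groupby(age_group)[target].mean())
```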

For the amount of bill statement and the repayment status (2005/4 - 2005/9), the correlation between two months is higher when the months are closer together.
For the amount of previous payment (2005/4 - 2005/9), however, there is no obvious correlation across months.
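
These patterns come from the pairwise correlation matrices of each feature block; a sketch with the Kaggle column names (note Kaggle names the status columns `PAY_0, PAY_2, ..., PAY_6`):

```python=
bill_cols = [f'BILL_AMT{i}' for i in range(1, 7)]            # bill statements, 2005/9 ... 2005/4
status_cols = ['PAY_0'] + [f'PAY_{i}' for i in range(2, 7)]  # repayment status
pay_cols = [f'PAY_AMT{i}' for i in range(1, 7)]              # previous payments

print(df[bill_cols].corr())    # adjacent months correlate strongly
print(df[status_cols].corr())
print(df[pay_cols].corr())     # little structure across months
```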
---
## Problem formulation and methods
### :star: Problem formulation
We would like to predict the default indicator $y\in\{0,1\}$ (default payment next month: yes=1, no=0) from $x=(x_1,x_2,\dots,x_p)$ with $p=23$.
### :star2: Benchmark method
* We use **Logistic regression** as a benchmark method
```python=
from sklearn.linear_model import LogisticRegression

# lbfgs solver with the default L2 penalty
lr = LogisticRegression(random_state=42, solver='lbfgs', max_iter=1000)
lr.fit(X_train, y_train)
```
* Split the data using **75-25 train-test split**
```python=
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
```
* Use **AUC** as the performance measure, computed as sketched below
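
A sketch of the AUC computation for the benchmark, using the predicted probability of the positive (default) class rather than hard labels:

```python=
from sklearn.metrics import roc_auc_score

# AUC is computed from the predicted default probability, not the 0/1 label.
y_score = lr.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_score))
```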
### :star2: Models we used
1. Random Forest
```python=
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=40, max_depth=17, min_samples_leaf=30,
                            min_samples_split=120, max_features=7, random_state=42)
rf.fit(X_train, y_train)
```
2. LightGBM
```python=
import lightgbm as lgb

lgbm = lgb.LGBMClassifier(num_leaves=26, max_depth=6, learning_rate=0.05, objective='binary',
                          colsample_bytree=0.65, reg_lambda=0.3, min_child_samples=70, random_state=42)
lgbm.fit(X_train, y_train)
```
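
The same out-of-sample evaluation applies to all three fitted models; a brief sketch:

```python=
from sklearn.metrics import roc_auc_score

for name, model in [('LR', lr), ('RF', rf), ('LGBM', lgbm)]:
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f'{name}: test AUC = {auc:.3f}')
```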
---
## Analysis and Conclusion
### Data Preprocessing
* Abnormal values: the undocumented codes in 'EDUCATION' and 'MARRIAGE' found during EDA
* Numerical data: Min-Max normalization
$X_{norm}=\dfrac{X-X_{min}}{X_{max}-X_{min}}\in[0, 1]$
* Categorical data: One-hot encoding
We use **K-Fold cross-validation (K=10)** to validate our models; a sketch combining it with the preprocessing follows.
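
One way to wire the preprocessing and the 10-fold validation together is a scikit-learn `Pipeline`; the split into categorical and numerical columns below is an assumption, not the project's exact choice:

```python=
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical column grouping; adjust to the actual feature names.
categorical = ['SEX', 'EDUCATION', 'MARRIAGE']
numerical = [c for c in X_train.columns if c not in categorical]

preprocess = ColumnTransformer([
    ('num', MinMaxScaler(), numerical),                            # scale to [0, 1]
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),  # one-hot encode
])

pipe = Pipeline([('prep', preprocess), ('clf', lr)])
scores = cross_val_score(pipe, X_train, y_train, cv=10, scoring='roc_auc')
print(scores.mean())  # cross-validated AUC
```
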
### :star: In-sample results
| Measures | Logistic Regression | Random Forest | LightGBM |
| -------- | -------- | -------- | ------|
| AUC | 0.719 | 0.784 | 0.785 :crown: |
### :star: Out-of-sample results
| Measures | Logistic Regression | Random Forest | LightGBM |
| -------- | -------- | -------- | ------|
| AUC | 0.601 | 0.653 | 0.654 :crown:|
### Conclusion
Both Random Forest and LightGBM show **better performance** than Logistic Regression, while Random Forest and LightGBM themselves perform almost identically.

---
## Reference
[Default of Credit Card Clients Dataset | Kaggle](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset)
[User guide -- scikit-learn 1.0.2 documentation](https://scikit-learn.org/stable/user_guide.html)
[Python API -- LightGBM 3.3.1.99 documentation](https://lightgbm.readthedocs.io/en/latest/Python-API.html#scikit-learn-api)
[[Day29] Machine Learning: Cross-Validation!](https://ithelp.ithome.com.tw/articles/10197461)
[Notes on scikit-learn random forest parameter tuning - cnblogs](https://www.cnblogs.com/pinard/p/6160412.html)
[LightGBM parameter tuning methods (step by step) - cnblogs](https://www.cnblogs.com/bjwu/p/9307344.html)