---
title: 20220912 Demo04
tags: tools
---

# Credit Scoring Model

#### keywords: credit scoring, binary classification, credit card holders

## 1. Motivations

In this project, I aim to build a credit scoring model that predicts the default probability of a borrower and helps decide whether to issue a loan to a customer.

![](https://i.imgur.com/8lfeX74.png)

## 2. Data visualization

### Data description

|Variable :construction:|Column|Descriptions|Type :construction:|
|--|--|---|---|
||ID|ID of each client||
|$x_1$|LIMIT_BAL|Amount of given credit in NT dollars (includes individual and family/supplementary credit)||
|$x_2$|SEX|Gender (1=male, 2=female)||
|$x_3$|EDUCATION|Education level (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)||
|$x_4$|MARRIAGE|Marital status (1=married, 2=single, 3=others)||
|$x_5$|AGE|Age in years||
|$x_6$|PAY_0|Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)||
|$x_7$|PAY_2|Repayment status in August, 2005 (scale same as above)||
|$x_8$|PAY_3|Repayment status in July, 2005 (scale same as above)||
|$x_9$|PAY_4|Repayment status in June, 2005 (scale same as above)||
|$x_{10}$|PAY_5|Repayment status in May, 2005 (scale same as above)||
|$x_{11}$|PAY_6|Repayment status in April, 2005 (scale same as above)||
|$x_{12}$|BILL_AMT1|Amount of bill statement in September, 2005 (NT dollar)||
|$x_{13}$|BILL_AMT2|Amount of bill statement in August, 2005 (NT dollar)||
|$x_{14}$|BILL_AMT3|Amount of bill statement in July, 2005 (NT dollar)||
|$x_{15}$|BILL_AMT4|Amount of bill statement in June, 2005 (NT dollar)||
|$x_{16}$|BILL_AMT5|Amount of bill statement in May, 2005 (NT dollar)||
|$x_{17}$|BILL_AMT6|Amount of bill statement in April, 2005 (NT dollar)||
|$x_{18}$|PAY_AMT1|Amount of previous payment in September, 2005 (NT dollar)||
|$x_{19}$|PAY_AMT2|Amount of previous payment in August, 2005 (NT dollar)||
|$x_{20}$|PAY_AMT3|Amount of previous payment in July, 2005 (NT dollar)||
|$x_{21}$|PAY_AMT4|Amount of previous payment in June, 2005 (NT dollar)||
|$x_{22}$|PAY_AMT5|Amount of previous payment in May, 2005 (NT dollar)||
|$x_{23}$|PAY_AMT6|Amount of previous payment in April, 2005 (NT dollar)||
|$y$|default.payment.next.month|Default payment (1=yes, 0=no)||

Histograms and density plots of numerical variables: These features appear to be right-skewed, so a log-transformation may be needed.

![](https://i.imgur.com/lQflpeM.png)

Histograms and density plots of numerical variables after taking logarithms: After the log-transformation, these features appear more symmetrically distributed.

![](https://i.imgur.com/t46tpqJ.png)

Pairwise scatter plots: No clear patterns between variables are visible. :construction:

![](https://i.imgur.com/gcIsgdM.png)

A heat map of pairwise correlations: Time-related variables are more highly correlated with one another. $y$ is more highly correlated with $x_6,\ldots,x_{11}$ (the repayment status variables). (:question:)

![](https://i.imgur.com/SfaVAey.png)

:construction: A table of descriptive statistics

## 3. Problem formulation and our methods

### Problem formulation

We would like to build a model that predicts $y$ from $x=(x_1,\ldots,x_p)'$ with $p=23$.

### Benchmark method

The benchmark model is logistic regression:

$$y\sim \mathrm{Bernoulli}\left(\frac{1}{1+\exp(-\beta' x)}\right),$$

with $\beta= (\beta_1,\ldots,\beta_p)'$.

### Other machine learning techniques

### Our study plan

We would like to compare the performance of logistic regression, k-nearest neighbors (KNN), a decision tree, and a neural network.

### Results

#### In-sample results

|Measures|Logistic Regression|
|---|---|
|Accuracy||
|Precision||
|Recall||
|F1-score||
|AUC||

#### Out-of-sample comparisons

We use a simple 80%/20% train/test split of the data. On the test data, we have:

|Measures|Logistic Regression|Neural Network|
|---|---|---|
|Accuracy|||
|Precision|||
|Recall|||
|F1-score|||
|AUC|||

## 4. Conclusion

## 5. Reference

1.

## 6. Data and Code
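
As a rough illustration of the benchmark method in Section 3, the sketch below evaluates the logistic default probability $1/(1+\exp(-\beta' x))$ for a single client in plain Python. The coefficient values, the three-feature example, and the helper names `signed_log1p` and `default_probability` are hypothetical and chosen only for illustration; they are not fitted estimates. The sign-preserving `log1p` transform is likewise an assumption (a plain logarithm is undefined at zero), not necessarily the transformation behind the plots in Section 2.

```python
import math


def signed_log1p(value):
    """Log-transform that is defined at zero and keeps the sign of the input."""
    return math.copysign(math.log1p(abs(value)), value)


def default_probability(beta, x):
    """Logistic model: P(y = 1 | x) = 1 / (1 + exp(-beta' x))."""
    score = sum(b * xi for b, xi in zip(beta, x))
    # Use the algebraically equivalent form for negative scores
    # so that math.exp never receives a large positive argument.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


# Hypothetical coefficients and feature values (NOT fitted to the data).
beta = [0.0, 0.5, -0.1]
x = [
    signed_log1p(20000),  # e.g. LIMIT_BAL after the log-transform
    2.0,                  # e.g. PAY_0, left on its original scale
    signed_log1p(689),    # e.g. PAY_AMT1 after the log-transform
]
p = default_probability(beta, x)
print(f"Estimated default probability: {p:.3f}")
```

In an actual analysis the coefficients would be estimated by maximum likelihood on the training data, for example with a statistics or machine learning library.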
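
The evaluation described under Results (an 80%/20% split, then accuracy, precision, recall, F1-score, and AUC) can be sketched in plain Python as follows. The function names, the 0.5 classification threshold, the fixed random seed, and the toy labels and probabilities are all assumptions made for illustration; they are not results from this project.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall and F1 at a given probability threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc(y_true, y_prob):
    """AUC as the share of (defaulter, non-defaulter) pairs ranked correctly.

    Ties count as half. This pairwise version is quadratic in the sample
    size, which is acceptable for a small illustration.
    """
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (made up for this example).
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]
metrics = classification_metrics(y_true, y_prob)
metrics["auc"] = auc(y_true, y_prob)

train_idx, test_idx = train_test_split_indices(10)
```

Because defaults are typically the minority class in credit data, precision, recall, and AUC are usually more informative than accuracy alone, which is one reason to report all five measures in the results tables.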