---
title: 20220912 Demo04
tags: tools
---
# Credit Scoring Model
#### keywords: credit scoring, binary classification, credit card holders
## 1. Motivations
In this project, I aim to build a credit scoring model that predicts the default probability of a borrower and helps decide whether to issue a loan to a customer.

## 2. Data visualization
### Data description
|Variable :construction:|Column|Description|Type :construction:|
|--|--|---|---|
||ID|ID of each client||
|$x_1$|LIMIT_BAL|Amount of given credit in NT dollars (includes individual and family/supplementary credit)||
|$x_2$|SEX|Gender (1=male, 2=female)||
|$x_3$|EDUCATION|Education level (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)||
|$x_4$|MARRIAGE|Marital status (1=married, 2=single, 3=others)||
|$x_5$|AGE|Age in years||
|$x_6$|PAY_0|Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ..., 8=payment delay for eight months, 9=payment delay for nine months and above)||
|$x_7$|PAY_2|Repayment status in August, 2005 (scale same as above)||
|$x_8$|PAY_3|Repayment status in July, 2005 (scale same as above)||
|$x_9$|PAY_4|Repayment status in June, 2005 (scale same as above)||
|$x_{10}$|PAY_5|Repayment status in May, 2005 (scale same as above)||
|$x_{11}$|PAY_6|Repayment status in April, 2005 (scale same as above)||
|$x_{12}$|BILL_AMT1|Amount of bill statement in September, 2005 (NT dollar)||
|$x_{13}$|BILL_AMT2|Amount of bill statement in August, 2005 (NT dollar)||
|$x_{14}$|BILL_AMT3|Amount of bill statement in July, 2005 (NT dollar)||
|$x_{15}$|BILL_AMT4|Amount of bill statement in June, 2005 (NT dollar)||
|$x_{16}$|BILL_AMT5|Amount of bill statement in May, 2005 (NT dollar)||
|$x_{17}$|BILL_AMT6|Amount of bill statement in April, 2005 (NT dollar)||
|$x_{18}$|PAY_AMT1|Amount of previous payment in September, 2005 (NT dollar)||
|$x_{19}$|PAY_AMT2|Amount of previous payment in August, 2005 (NT dollar)||
|$x_{20}$|PAY_AMT3|Amount of previous payment in July, 2005 (NT dollar)||
|$x_{21}$|PAY_AMT4|Amount of previous payment in June, 2005 (NT dollar)||
|$x_{22}$|PAY_AMT5|Amount of previous payment in May, 2005 (NT dollar)||
|$x_{23}$|PAY_AMT6|Amount of previous payment in April, 2005 (NT dollar)||
|$y$|default.payment.next.month|Default payment (1=yes, 0=no)||

Histograms and density plots of numerical variables: These features appear to be skewed to the right. We may need to consider a log-transformation.

Histograms and density plots of numerical variables after the log-transformation: The transformed features appear more symmetrically distributed.
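The effect of a log-transformation on right-skewed amounts can be checked numerically. The sketch below is only an illustration: the amounts are hypothetical, and `signed_log1p` is an assumed variant, $\mathrm{sign}(v)\log(1+|v|)$, chosen here so that zero or negative amounts would not break the transformation; it is not necessarily the transformation used for the plots.

```python
import math

def signed_log1p(value):
    """Return sign(value) * log(1 + |value|); defined for zero and negative amounts."""
    return math.copysign(math.log1p(abs(value)), value)

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed amounts in NT dollars (not taken from the data set).
amounts = [200, 500, 1200, 3000, 8000, 20000, 50000, 140000, 500000]
transformed = [signed_log1p(a) for a in amounts]

print(f"skewness before: {skewness(amounts):.2f}")
print(f"skewness after:  {skewness(transformed):.2f}")
```

A skewness closer to zero after the transformation is consistent with the more symmetric histograms.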

Pairwise scatter plots: It's difficult to see patterns between variables. :construction:

A heat map of pairwise correlations: Variables measured in successive months are more highly correlated with one another. Among the features, $y$ is most strongly correlated with $x_6,\ldots,x_{11}$, the repayment status variables. (:question:)
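The numbers behind such a heat map are pairwise Pearson correlations. A minimal sketch, assuming a few hypothetical toy columns in place of the real data (the values below are made up and only the column names come from the table):

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical toy columns (not taken from the data set).
columns = {
    "PAY_0": [-1, -1, 2, 1, -1, 3, 1, 2],
    "PAY_2": [-1, -1, 2, 1, -1, 2, -1, 1],
    "y": [0, 0, 1, 0, 0, 1, 0, 1],
}
names = list(columns)

# Full correlation matrix as a nested dict: corr[row][col].
corr = {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}

for r in names:
    print(r.ljust(6), " ".join(f"{corr[r][c]:6.2f}" for c in names))
```

The resulting matrix is symmetric with ones on the diagonal, so a heat map only needs one triangle of it.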

:construction: A table of descriptive statistics
## 3. Problem formulation and our methods
### Problem formulation
We would like to build a model to predict $y$ from $x=(x_1,\ldots,x_p)$ with $p=23$.
### Benchmark method
The benchmark model is logistic regression:
$$y\sim \mathrm{Bernoulli}\left(\frac{1}{1+\exp(-\beta' x)}\right),$$
with $\beta= (\beta_1,\ldots,\beta_p)'$.
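To make the benchmark concrete, here is a minimal sketch of the model probability and a maximum-likelihood fit by plain gradient ascent. The toy data, the learning rate, the iteration count, and the constant first feature (which acts as an intercept) are all assumptions for illustration; in practice a library implementation would likely be used instead.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def default_probability(beta, x):
    """P(y = 1 | x) under the logistic model, with beta and x of equal length."""
    return sigmoid(sum(b * xi for b, xi in zip(beta, x)))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Maximize the mean Bernoulli log-likelihood by gradient ascent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            err = yi - default_probability(beta, xi)  # gradient term (y - p) * x
            for j in range(p):
                grad[j] += err * xi[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy data: a constant 1 (assumed intercept) and one standardized feature.
X = [[1.0, -1.0], [1.0, -0.5], [1.0, 0.0], [1.0, 0.5], [1.0, 1.0], [1.0, 1.5]]
y = [0, 0, 0, 1, 0, 1]

beta = fit_logistic(X, y)
print("beta:", [round(b, 3) for b in beta])
print("P(default | feature = 1.5):", round(default_probability(beta, [1.0, 1.5]), 3))
```

The toy labels overlap on purpose, so the likelihood has a finite maximizer and the coefficients do not diverge.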
### Other machine learning techniques
### Our study plan
We would like to compare the performance of logistic regression, KNN, a decision tree, and a neural network.
### Results
#### In-sample results
|Measures |Logistic Regression|
|---|----|
| Accuracy ||
| Precision||
| Recall ||
|F1-score||
|AUC| |
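The measures in the table can be computed from true labels and predicted default probabilities. The sketch below assumes a classification threshold of 0.5 and uses hypothetical labels and scores; the AUC is computed by the pairwise-comparison definition, which is simple but quadratic in the sample size.

```python
def classification_metrics(y_true, scores, threshold=0.5):
    """Accuracy, precision, recall and F1 for scores thresholded into 0/1 labels."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities (not model output).
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]

metrics = classification_metrics(y_true, scores)
metrics["auc"] = auc(y_true, scores)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike the other four measures, the AUC does not depend on the threshold, which makes it useful when the cost of a wrong loan decision is not yet fixed.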
#### Out-of-sample comparisons
We consider a simple 80%-20% train-test split of the data. On the test data, we have

|Measures |Logistic Regression|Neural Network|
|---|---|---|
| Accuracy |||
| Precision|||
| Recall |||
|F1-score|||
|AUC|||
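A simple 80%-20% split can be sketched as a seeded shuffle of row indices. The function name, the seed, and the row count of 100 are assumptions for illustration; the split is purely random and not stratified by $y$.

```python
import random

def train_test_split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices reproducibly and return (train_indices, test_indices)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

# Hypothetical row count; replace with the number of rows in the data.
train_idx, test_idx = train_test_split_indices(100)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

Fixing the seed keeps the comparison between models fair, since every model is then evaluated on the same test rows.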
## 4. Conclusion
## 5. Reference
1.
## 6. Data and Code