---
title: 20220912 Demo03
tags: tools
---
# Give Me Some Credit
This demo is still under construction.
## Motivations
The goal of the "Give Me Some Credit" contest is to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.
The "Give Me Some Credit" data set was retrieved from a [Kaggle](https://www.kaggle.com/c/GiveMeSomeCredit) contest held in 2011.
## Exploratory data analysis (EDA)
All explanatory variables describe personal and credit information about the borrower. The original explanatory variables are listed below.
| Feature name | Description | Type |
| -------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------- |
| Target | Person experienced 90 days past due delinquency or worse | Binary |
| LogRevolveUtilize* | Logarithm of the total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits | Numerical |
| age | Age of borrower in years | Numerical |
| 30-59DaysPastDue | Number of times borrower has been 30-59 days past due but no worse in the last 2 years | Numerical |
| LogDebtRatio* | Logarithm of monthly debt payments, alimony, and living costs divided by monthly gross income | Numerical |
| LogMonIncome* | Monthly income in logarithm | Numerical |
| NOfCreditLineAndLoan | Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards) | Numerical |
| 90DaysLate | Number of times borrower has been 90 days or more past due | Numerical |
| NRealEstateLoans | Number of mortgage and real estate loans including home equity lines of credit | Numerical |
| 60-89DaysPastDue | Number of times borrower has been 60-89 days past due but no worse in the last 2 years | Numerical |
| NOfDependents | Number of dependents in family excluding themselves (spouse, children etc.) | Numerical |
Due to their extreme positive skewness, three variables (RevolveUtilize, DebtRatio, and MonIncome) have been log-transformed; they are marked with * above.
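The log transformation above can be sketched as follows, assuming the raw Kaggle column names (`RevolvingUtilizationOfUnsecuredLines`, `DebtRatio`, `MonthlyIncome`) and using `log1p` so that zero balances and zero incomes stay finite (the exact transform used in this demo may differ):

```python
import numpy as np
import pandas as pd

def add_log_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Replace heavily right-skewed columns with log1p-transformed versions."""
    out = df.copy()
    # log1p(x) = log(1 + x) maps 0 to 0 instead of -inf.
    out["LogRevolveUtilize"] = np.log1p(out.pop("RevolvingUtilizationOfUnsecuredLines"))
    out["LogDebtRatio"] = np.log1p(out.pop("DebtRatio"))
    out["LogMonIncome"] = np.log1p(out.pop("MonthlyIncome"))
    return out

# Tiny illustrative frame, not the real data.
df = pd.DataFrame({"RevolvingUtilizationOfUnsecuredLines": [0.0, 0.8],
                   "DebtRatio": [0.2, 0.5],
                   "MonthlyIncome": [3000.0, 9120.0]})
print(add_log_columns(df).columns.tolist())
```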
---
Pairwise scatterplots and histograms. Note that the variables RevolveUtilize, DebtRatio, and MonIncome are shown after the log transformation.

---
:construction: Provide a table of descriptive statistics of the data: mean, SD, skewness, kurtosis, min, Q1, median, Q3, and max.


## Problem formulation
### Problem definition
This is a supervised learning problem: we would like to use the explanatory variables $x$ to predict the binary target $y$ (whether the borrower experiences serious delinquency within two years).
### Measures for comparisons
We first consider an 80%-20% split to obtain the training and test data. For both the training and test data, we report accuracy, precision, recall, F1-score, and AUC for model comparison.
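The evaluation setup above can be sketched with scikit-learn; here a synthetic data set stands in for the credit data, and logistic regression stands in for whichever model is being evaluated:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data (features X, binary target y).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 80%-20% train/test split, stratified to preserve the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Report every metric on both the training and test data.
for name, X_part, y_part in [("train", X_train, y_train),
                             ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    proba = model.predict_proba(X_part)[:, 1]  # P(y=1), needed for AUC
    print(name,
          f"acc={accuracy_score(y_part, pred):.3f}",
          f"prec={precision_score(y_part, pred):.3f}",
          f"rec={recall_score(y_part, pred):.3f}",
          f"f1={f1_score(y_part, pred):.3f}",
          f"auc={roc_auc_score(y_part, proba):.3f}")
```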
## Analysis
### Data Preprocessing
Since the features are all numerical, the following preprocessing steps are implemented:
1. Z-score standardization: $\frac{x-\mu}{\sigma}$
2. Since two features have missing values, mean imputation is applied. Note that the missing ratios are provided in the summary table above.
- :question: Why do we need mean imputation? If there is missing data, the missing ratio has to be shown.
- A better approach is to use an indicator (dummy) variable to split the original column with missing data into two columns as follows. Suppose the original data is:

|Index| data|
| ---|---|
| 0 | 32.4|
| 1 | missing|
| 2 | 15.7|
| 3 |-2.2 |
Transform the data into the following:

|Index| indicator| modified data|
| ---|---|---|
| 0 | 0 |32.4|
| 1 | 1| 0 |
| 2 | 0 |15.7|
| 3 | 0 | -2.2 |
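The indicator transform in the tables above can be sketched in pandas (the helper name `split_missing` is illustrative, not part of any library):

```python
import pandas as pd

def split_missing(col: pd.Series) -> pd.DataFrame:
    """Split a column with missing values into an indicator and a zero-filled copy."""
    return pd.DataFrame({
        "indicator": col.isna().astype(int),  # 1 where the value was missing
        "modified data": col.fillna(0.0),     # missing entries replaced by 0
    })

# Reproduce the example table.
col = pd.Series([32.4, None, 15.7, -2.2])
print(split_missing(col))
```

Unlike mean imputation, this keeps the information of *which* rows were missing, so the model can learn whether missingness itself is predictive.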
### Study plan
We consider the following models:
1. Benchmark model: Logistic regression
2. SVM
3. Decision tree
4. XGBoost
5. LightGBM
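The comparison can be run with one shared loop, since all of these models expose the same scikit-learn interface. The sketch below uses a synthetic data set, and scikit-learn's `GradientBoostingClassifier` stands in for the boosted-tree models so the snippet has no third-party dependencies; `xgboost.XGBClassifier` and `lightgbm.LGBMClassifier` plug into the same loop:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),  # probability=True enables predict_proba
    "Decision tree": DecisionTreeClassifier(max_depth=5),
    # Stand-in for XGBoost / LightGBM.
    "Gradient boosting": GradientBoostingClassifier(),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: test AUC = {scores[name]:.3f}")
```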
### Results
* Training dataset (in-sample evaluation)

* Testing dataset (out-of-sample evaluation)

* Performance visualization


## Conclusion
### Summary
1. XGBoost has the best overall performance.
2. XGBoost is a tree-based ensemble model, so it should, and does, outperform a single decision tree.
### Future work
1. Seek other tailor-made techniques.
2. Try other machine learning algorithms.