---
title: Give Me Some Credit
tags: research
---
# Give Me Some Credit
:::success
Joint work of Huei-Wen Teng, Ming-Hsuan Kang, and Ian Lee.
:::
[Latest Excel file](https://docs.google.com/spreadsheets/d/1NSz0H5LsoatibPGPK_X2dhjJLpIpeHLr/edit?usp=sharing&ouid=116898982810795859728&rtpof=true&sd=true)
### Terminologies
1. Features: raw, quadratic, and cubic features.
    - Raw features are the variables obtained after creating dummy variables, z-score normalization, and log-transformation.
2. Cluster: benchmark clustering and positive clustering.
    - When $K=1$, there is no clustering.
3. All computing time is recorded in seconds.
## :apple: Ian's to-do list for 2022/2/10
:::info
**Claims to be empirically verified**
- We seek a balance between optimality and robustness.
- We would like to provide a method showing that a proper clustering with a resizer helps to improve prediction.
- LR avoids the overfitting problem.
- Equipped with positive clustering and the REG resizer with two or three clusters, LR is competitive with XGBoost.
- LR requires less computing time.
- The key ingredient is positive clustering combined with a resizer.
- Numerical results: [20220207_outputs_new.xlsx](https://docs.google.com/spreadsheets/d/1NSz0H5LsoatibPGPK_X2dhjJLpIpeHLr/edit?usp=sharing&ouid=106025386039656780705&rtpof=true&sd=true)
:::
### Sheet 0 :o:
**Aiming to display the three resizers in different feature sets**
#### Study plan
- Number of features under each polynomial transform.
|Feature set| Number of features|
|--|---|
|Raw| 12 |
|Quadratic| 67 |
|Cubic| 287 |
- Resizer table visualization
    - The resizer weights in each feature set have been standardized by z-score, so resizers from different feature sets can be compared on the same scale (a sketch follows this list).
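A minimal sketch of this standardization step, assuming the resizer weights for one feature set are stored in a NumPy array (the name `w` is hypothetical):
```python
import numpy as np

def standardize_resizer(w):
    """Z-score a vector of resizer weights so that resizers computed on
    feature sets of different sizes can be plotted on a common scale."""
    w = np.asarray(w, dtype=float)
    return (w - w.mean()) / w.std()
```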

#### Summary
- Should we list the top few resizer weights in each feature set? :question:
---
### Sheet 1
**Aiming to examine the time cost of the resizers under different polynomial transforms**
#### Study plan
| Feature | Resizer | Time (s) |
| --------------------- | ------------------ | ---- |
| raw, quadratic, cubic | **1**, REG, LR, IG | |
- There are $3\times 4 = 12$ combinations of trials.
- :o: Time cost (in seconds) of combinations of resizers and polynomial transforms
| | Raw | Quadratic | Cubic |
| ----- | ---- | --------- | ------ |
| **1** | 0 | 0 | 0 |
| REG | 0.02 | 0.13 | 0.84 |
| LR | 0.24 | 1.45 | 4.47 |
| IG | 6.82 | 37.58 | 160.95 |
#### Summary
1. The information gain resizer costs the longest time, followed by the logistic regression resizer and then the linear regression resizer.
2. For polynomial order, the cubic transformation has the most variables, so it costs the most time for all three resizers, followed by quadratic and then raw.
---
### Sheet 2
**Aiming to examine the entropy (performance) of clustering**
#### Study plan
Provide an Excel spreadsheet to summarize the entropy and time of clustering.
|Data|Feature|Cluster|Resizer|$K$|Entropy|Time (Train ONLY)|
|---|---|--|--|---|---|---|
|Train, test|Raw, Quadratic, Cubic|Benchmark, Positive|**1**, REG, LR, IG|1, 2, ..., 9|||
- We have $2\times 3\times 2\times 4\times 9 = 432$ combinations of trials.
- When $K=1$ (no clustering), we still need to record results for the different combinations of Cluster and Resizer. This is to double-check that our algorithm remains correct.
- Record which cluster each individual is assigned to, so that the clustering does not have to be repeated in the rest of the analysis.
- See how entropy varies across different combinations of clustering, resizer, and feature set.
#### New summary
:::info
- positive > benchmark
- For benchmark:
- IG > REG > LR > 1
- quadratic > cubic > raw (except IG)
- For positive:
- REG > IG > LR > 1
- hard to tell
:::
:::success
Raw: positive > benchmark
- For benchmark:
- IG > REG > LR > 1
- For positive:
- REG > IG > LR > 1
Quadratic
Cubic
:::

#### Summary for time
- Positive clustering fits the $K$-medoids algorithm only on the minority (positive) group, so its time cost is much lower; benchmark clustering applies $K$-medoids to the whole data set, which makes it far more expensive.
| $K$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-|-|-|-|-|-|-|-|-|-|
| Benchmark (s) | 1495.6 | 2644.5 | 2867.5 | 2560.1 | 2260.9 | 2005.1 | 1951.2 | 1699.4 | 1452.0 |
| Positive (s) | 19.0 | 30.8 | 25.9 | 29.0 | 26.6 | 24.9 | 22.6 | 20.8 | 20.6 |
---
### Sheet 3 :o:
**Aiming to examine the AUC and log-likelihood of cluster-then-predict**
#### Study plan
Provide the following table for the final analysis.
|Data |Feature| Cluster | Resizer | $K$ | Model | LL | AUC | Time (Train ONLY)|
|---|---|---|---|---|---|---|---|---|
|Train, Test|Raw, Quadratic, Cubic| Benchmark, Positive| **1**, REG, LR, IG |1, 2, 3| LR, XGBoost| |||
- There are $2\times 3\times 2\times 4\times 3 \times 2 = 288$ combinations of trials.
- The log-likelihood for multiple clusters is simply the sum of the separate log-likelihoods of each cluster (a sketch follows this list).
- Why does XGBoost produce smaller LLs than LR? **Ans: the negative log-likelihood is applied.**
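A minimal sketch of the summation above, assuming `y_by_cluster` and `p_by_cluster` are hypothetical lists holding, per cluster, the 0/1 labels and the predicted positive-class probabilities:
```python
import numpy as np

def total_log_likelihood(y_by_cluster, p_by_cluster, eps=1e-12):
    """Sum the Bernoulli log-likelihood over all clusters."""
    ll = 0.0
    for y, p in zip(y_by_cluster, p_by_cluster):
        p = np.clip(p, eps, 1 - eps)  # guard against log(0)
        ll += np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ll
```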
#### Summary for NLL :o:
- LR
- Positive > benchmark
- For benchmark:
- LR > REG > IG <> 1
- For positive:
- REG > LR > 1 > IG

- XGB
- positive > benchmark
- For benchmark
- 1 > IG > LR >REG
- For positive
- LR <> 1 > IG > REG

#### New summary for AUC
- LR:
- positive > benchmark
- For benchmark
- LR > REG > IG > 1
- quadratic > cubic > Raw
- For positive
- REG > LR > IG > 1
- quadratic > cubic > Raw

- XGB
- benchmark <> positive
- For benchmark
- 1 > IG > LR > REG
- cubic > quadratic > raw
- For positive
- 1 <> IG <> LR > REG
- cubic > quadratic > raw

- Logistic regression versus XGBoost?
    - Overall, logistic regression produces more stable and robust results, while XGBoost fits the training data almost perfectly and overfits. With further tuning of XGBoost, trading off performance between the training and testing data sets could yield a better AUC. Note that computing time is also an issue when applying the cluster-then-predict method.
#### Summary for time :o:
| | Raw | Quadratic | Cubic |
| --- | ----- | ------ | ------ |
| LR | 9.05 | 38.15 | 109.42 |
| XGB | 78.73 | 210.12 | 862.99 |
---
## To-do 1/5
1. Check the correctness of SVM. See [Prof. Lin's video](https://www.cupoy.com/collection/00000168FF4AF517000000036375706F795F72656C656173654355/0000016913C6B9B0000000296375706F795F72656C656173654349)
2. Use information gain as the weight.
3. Interpretations of the method (weight, cluster).
### Ian's earlier work
1. [Ian](https://hackmd.io/Qz6oNB2gR12gLQZL3CP4jQ): Ian's master thesis.
2. [Demo03](/erzE_K0uRq2DYeEOwmsISQ): demo for the ML&FinTech course demonstration.
3. [Clustering-and-Predict](https://hackmd.io/_zZr58POReOEFkywUhkk1Q): Ian's attempt to show the usefulness of clustering-and-predict.
4. [A review on the Support Vector Machine (SVM)](https://hackmd.io/aym1kWqVTCulSCkQnGgwYw).
----
## 1. Motivations
Credit scoring models usually involve an imbalanced-data problem. We would like to show that successful data mining can improve current machine learning algorithms.
## 2. Our methodology:
### 2.1 Re-scaling
Clustering algorithms are sensitive to the scale of the explanatory variables. We give each explanatory variable a weight that indicates its significance.
- After normalizing the explanatory variables with min-max or $Z$-scoring, we scale each explanatory variable by multiplying it by a specific weight to form a set of re-scaled explanatory variables.
- With this re-scaled set of explanatory variables, we perform the $K$-medoids clustering.
- How do we re-scale? We list the original one without re-scaling and provide three additional approaches; a sketch of the common pipeline follows.
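A minimal sketch of the common re-scaling pipeline, assuming `X` is a NumPy feature matrix and `w` a weight vector produced by one of the resizers below; using `KMedoids` from the scikit-learn-extra package is an implementation assumption:
```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # scikit-learn-extra package (assumption)

def rescale(X, w):
    """Z-score each explanatory variable, then multiply it by its resizer weight."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return Z * w

def cluster_rescaled(X, w, k):
    """Run K-medoids clustering on the re-scaled explanatory variables."""
    return KMedoids(n_clusters=k, random_state=0).fit_predict(rescale(X, w))
```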
#### 2.1.1. Without re-scaling
We let the weight be **1** for every explanatory variable.
#### 2.1.2. With regression
- We run a regression
$$y=\beta'x+\varepsilon,$$
where $\varepsilon\sim N(0,\sigma^2)$.
- We use the estimated $\beta$, denoted $\hat{\beta}$, as the weight to re-scale the explanatory variables.
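A minimal sketch of this REG resizer, assuming `X` and `y` are NumPy arrays and scikit-learn is used for the OLS fit:
```python
from sklearn.linear_model import LinearRegression

def reg_weights(X, y):
    """Fit y = beta'x + eps by ordinary least squares and
    return the estimated beta as the resizer weights."""
    return LinearRegression().fit(X, y).coef_
```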
#### 2.1.3 With logistic regression
- Because the target variable is binary, we consider a logistic regression:
$$y\sim Bernoulli(\sigma(\beta'x)),$$
where $Bernoulli(\cdot)$ denotes the Bernoulli distribution and $\sigma(t) = 1/(1+\exp(-t))$ is the sigmoid function.
- We use the estimated $\beta$, denoted $\hat{\beta}_{LR}$, as the weight to re-scale the explanatory variables.
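A minimal sketch of the LR resizer under the same assumptions:
```python
from sklearn.linear_model import LogisticRegression

def lr_weights(X, y):
    """Fit the logistic regression and return the estimated
    beta (one coefficient per feature) as the resizer weights."""
    return LogisticRegression(max_iter=1000).fit(X, y).coef_.ravel()
```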
#### 2.1.4 With information gain
- The weight of each explanatory variable is its information gain with respect to the target, which is computed from the entropy.
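A minimal sketch of an entropy-based information-gain weight for a single explanatory variable; discretizing a numerical variable into quantile bins is an assumption, since the note does not specify the binning:
```python
import numpy as np
import pandas as pd

def entropy(y):
    """Shannon entropy (base 2) of a binary target."""
    p = np.mean(y)
    if p == 0 or p == 1:
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def information_gain(x, y, bins=10):
    """Entropy of y minus the conditional entropy of y given x binned into quantiles."""
    df = pd.DataFrame({"x": pd.qcut(x, q=bins, duplicates="drop"), "y": y})
    cond = df.groupby("x", observed=True)["y"].apply(
        lambda g: len(g) / len(df) * entropy(g.values)).sum()
    return entropy(np.asarray(y)) - cond
```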
### 2.2 Positive-case Clustering
Because the data are extremely imbalanced, we suggest clustering according to the following steps (a sketch follows the list):
1. Split the data into positive and negative cases, denoted by $P$ and $N$, respectively.
2. Fix the number of clusters $k$. Applying the $k$-medoids algorithm, cluster the positive cases into $k$ groups $P_1, P_2, \cdots, P_k$ with centroids $C_1, C_2, \cdots, C_k$, respectively.
3. Split $N$ into $k$ groups $N_1, N_2, \cdots, N_k$ according to their distances to the centroids $C_1, C_2, \cdots, C_k$.
4. Build a model from the data $P_i \cup N_i$ for each $i$.
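A minimal sketch of these four steps, assuming `XP` and `XN` are the re-scaled feature matrices of the positive and negative cases; `KMedoids` from scikit-learn-extra is again an implementation assumption:
```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # scikit-learn-extra package (assumption)

def positive_case_clustering(XP, XN, k):
    """Cluster the positives into k groups, then assign each negative
    case to the group whose centroid (medoid) is closest."""
    km = KMedoids(n_clusters=k, random_state=0).fit(XP)
    pos_labels = km.labels_                      # defines P_1, ..., P_k
    centroids = km.cluster_centers_              # C_1, ..., C_k
    dists = np.linalg.norm(XN[:, None, :] - centroids[None, :, :], axis=2)
    neg_labels = dists.argmin(axis=1)            # defines N_1, ..., N_k
    return pos_labels, neg_labels, centroids
```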
## 3. Data descriptions
### Exploratory data analysis (EDA)
- All explanatory variables describe personal information.
- Here are the original explanatory variables.
| Notation | Feature name | Description | Type |
| ----- | -------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------- |
| $y$ | Target | Person experienced 90 days past due delinquency or worse | Binary |
| $x_{1}$ | LogRevolveUtilize* | Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits, in logarithm | Numerical |
| $x_{2}$ | age | Age of borrower in years | Numerical |
| $x_{3}$ | 30-59DaysPastDue | Number of times borrower has been 30-59 days past due but no worse in the last 2 years | Numerical |
| $x_{4}$ | LogDebtRatio* | Monthly debt payments, alimony, and living costs divided by monthly gross income, in logarithm | Numerical |
| $x_{5}$ | LogMonIncome* | Monthly income in logarithm | Numerical |
| $x_{6}$ | ISNALogMonIncome | Indicator of whether LogMonIncome is NA: 1 denotes NA and 0 denotes that the value exists | Binary |
| $x_{7}$ | NOfCreditLineAndLoan | Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards) | Numerical |
| $x_{8}$ | 90DaysLate | Number of times borrower has been 90 days or more past due | Numerical |
| $x_{9}$ | NRealEstateLoans | Number of mortgage and real estate loans including home equity lines of credit | Numerical |
| $x_{10}$ | 60-89DaysPastDue | Number of times borrower has been 60-89 days past due but no worse in the last 2 years | Numerical |
| $x_{11}$ | NOfDependents | Number of dependents in family excluding themselves (spouse, children etc.) | Numerical |
| $x_{12}$ | ISNANOfDependents | Indicator of whether NOfDependents is NA: 1 denotes NA and 0 denotes that the value exists | Binary |
* Due to their extreme positive skewness, three variables, RevolveUtilize, DebtRatio, and MonIncome, have been log-transformed.
* Because missing values exist in two features, LogMonIncome and NOfDependents, dummy variables indicating whether the value is missing are created: ISNALogMonIncome and ISNANOfDependents.
---
### Pairwise scatterplot of features
Pairwise scatterplot and histogram.
###### Note that the variables RevolveUtilize, DebtRatio, and MonIncome have been log-transformed.

---
### Descriptive Statistics
|Variable| NA (%) |Mean|Std| Min| $Q_1$|Median | Max| Skewness|Kurtosis|
|---|---|---|---|---|---|---|---|---|---|


### Missing value transformation
- Use a dummy variable to split a column with missing data into two columns.
- Suppose the original data is:
|Index| data|
| ---|---|
| 0 | 32.4|
| 1 | missing|
| 2 | 15.7|
| 3 |-2.2 |
- Transform the data into the following:
|Index| indicator| modified data|
| ---|---|---|
| 0 | 0 |32.4|
| 1 | 1| 0 |
| 2 | 0 |15.7|
| 3 | 0 | -2.2 |
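A minimal sketch of this transformation with pandas, assuming the column is named `data` (a hypothetical name):
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"data": [32.4, np.nan, 15.7, -2.2]})

df["indicator"] = df["data"].isna().astype(int)  # 1 where the value is missing
df["modified data"] = df["data"].fillna(0)       # replace missing values with 0

print(df[["indicator", "modified data"]])
```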
## 4. Study plan
### 4.1. Can a proper clustering technique reduce entropy?
- To compare the clustering results, we use the entropy from information theory.
- The entropy $h$ is defined as follows. For the $i$-th cluster, let $p_i= |P_i|/|P_i\cup N_i|$ be the proportion of positive cases. Then
$$h = -\sum_i \frac{|P_i\cup N_i|}{|P\cup N|}\left( p_i \log(p_i) + (1-p_i)\log(1-p_i)\right)$$
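A minimal sketch of this entropy, assuming `labels` holds the cluster index of every observation and `y` the 0/1 target; the natural logarithm is used here since the note does not fix the base:
```python
import numpy as np

def clustering_entropy(labels, y):
    """Size-weighted average of the binary entropy of the positive rate per cluster."""
    labels, y = np.asarray(labels), np.asarray(y)
    h = 0.0
    for c in np.unique(labels):
        yc = y[labels == c]
        p = yc.mean()
        if 0 < p < 1:
            h += len(yc) / len(y) * (-p * np.log(p) - (1 - p) * np.log(1 - p))
    return h
```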
### 4.2 Model performance comparison
- Now it is time to do the real classification and compare the results in terms of **AUC**. We first consider only logistic regression and report the AUC.
To predict an instance (a sketch follows this list):
1. Assign it to the cluster whose centroid is closest.
2. Predict with that cluster's submodel.
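A minimal sketch of this prediction step, assuming `centroids` and a list of fitted per-cluster scikit-learn-style models `submodels` come from the training stage (both names are hypothetical), and `x_new` is one re-scaled test instance:
```python
import numpy as np

def predict_instance(x_new, centroids, submodels):
    """Assign the instance to the nearest centroid, then predict
    the positive-class probability with that cluster's submodel."""
    i = int(np.argmin(np.linalg.norm(centroids - x_new, axis=1)))
    return submodels[i].predict_proba(x_new.reshape(1, -1))[0, 1]
```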
## Conclusion
## Appendix