---
title: 'HOME CREDIT DEFAULT RISK'
disqus: hackmd
---
PROPOSAL: HOME CREDIT DEFAULT RISK

---
Table of contents
[TOC]
# Introduction
Many people struggle to get loans due to insufficient or non-existent credit histories. Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience.
However, lending to these underserved borrowers carries default risk. Thus, to assess each applicant's ability to repay, I will use various statistical and machine learning methods to predict default. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower clients to be successful.
# Methodology
## Data:
### Overview
- The data is collected from Kaggle: https://www.kaggle.com/c/home-credit-default-risk/data
- There are 7 tables, as follows: application_{train|test}, bureau, bureau_balance, previous_application, POS_CASH_balance, credit_card_balance, and installments_payments.

I can see that the data can be divided into three categories:
- Applicant-level data, which contains information about the applicant, such as education, number of family members, car ownership, etc.
- Bureau-level data, which provides historical transactional information and credit balance information.
- Other data, including external data from other data sources, such as credit scores from other platforms.
- Problems: *the data is imbalanced.*

| Target | Count_Target | Ratio_Target |
| ------ | ------------ | ------------ |
| 0      | 282686       | 0.919271     |
| 1      | 24825        | 0.080729     |
- Solutions:
  - Option 1: **undersampling, oversampling, or generating synthetic data.** However, a resampling method shows the classifier the wrong proportions of the two classes during training, so a classifier trained this way can end up less accurate on the real, unchanged test data than one trained on the original dataset (a minimal resampling/class-weight sketch follows this list).
  - Option 2: find **new additional features** that help distinguish between the two classes and thereby improve the classifier's accuracy.
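
As a minimal sketch of Option 1 (and of the class-weight alternative used later in the Model section), assuming the training data is already loaded into a DataFrame `df` with the binary `TARGET` column; all names and values here are illustrative, not a final implementation:

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# df is assumed to be application_train, already loaded, with a binary TARGET column
print(df["TARGET"].value_counts(normalize=True))  # reproduces the 0.92 / 0.08 ratio above

# naive random oversampling: resample the minority class with replacement
majority = df[df["TARGET"] == 0]
minority = df[df["TARGET"] == 1]
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)
df_balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)

# alternative: keep the data unchanged and pass class weights to the model instead
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=df["TARGET"])
class_weight = {0: weights[0], 1: weights[1]}
print(class_weight)  # roughly {0: 0.54, 1: 6.19} given the counts above
```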
### What will I do?
I will analyze the bottom tables first and propagate their information upward.
- For example, the **Bureau Balance** table is connected to the **Bureau** table with the key `SK_ID_BUREAU`, and the **Bureau** table is in turn connected to the **Application Train/Test** tables with the key `SK_ID_CURR`.
- So I start by analyzing **Bureau Balance** first. Then I deduplicate/aggregate the data to one row per `SK_ID_BUREAU` and generate new variables to merge onto the **Bureau** table. After that, the same processing is applied from the **Bureau** table up to the **Application Train/Test** tables using `SK_ID_CURR`.
- The other bottom tables are analyzed in the same way to transfer their information up.
- However, first of all, there are data analysis steps for each table, including:
  - EDA (missing values, high correlations, outliers)
  - Data pre-processing (one-hot encoding, `MinMaxScaler` or `StandardScaler`)
  - Generate new features (feature engineering):
    - Interactions of different features: for example, given `AMT_ANNUITY` (the annuity of each credit loan), `AMT_INCOME_TOTAL` (total income of the applicant per annum), and `DAYS_EMPLOYED` (total days of being employed), we can create interaction features such as `AMT_INCOME_TOTAL / DAYS_EMPLOYED` and `AMT_CREDIT / AMT_INCOME_TOTAL`, which may add more information to the model and some nonlinear capability.
    - Aggregations: usually we create groups based on certain features and then extract statistical features, such as maximum, minimum, mean, and standard deviation values.
- Finally, merge all tables with **Application Train/Test** (see the sketch after this list).
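
As a rough illustration of the bottom-up flow described above, here is a minimal sketch, assuming the Kaggle CSV files are in the working directory; the specific aggregation statistics and ratio features are placeholders chosen for illustration, not the final feature set:

```python
import pandas as pd

app = pd.read_csv("application_train.csv")
bureau = pd.read_csv("bureau.csv")
bureau_balance = pd.read_csv("bureau_balance.csv")

# quick EDA: missing-value ratios and correlation of numeric features with the target
print(app.isnull().mean().sort_values(ascending=False).head(10))
print(app.select_dtypes("number").corr()["TARGET"].abs().sort_values(ascending=False).head(10))

# 1) aggregate bureau_balance to one row per SK_ID_BUREAU
bb_agg = bureau_balance.groupby("SK_ID_BUREAU").agg(
    MONTHS_BALANCE_MIN=("MONTHS_BALANCE", "min"),
    MONTHS_BALANCE_MAX=("MONTHS_BALANCE", "max"),
    MONTHS_BALANCE_COUNT=("MONTHS_BALANCE", "count"),
).reset_index()

# 2) merge onto bureau, then aggregate bureau to one row per SK_ID_CURR
bureau = bureau.merge(bb_agg, on="SK_ID_BUREAU", how="left")
bureau_agg = bureau.groupby("SK_ID_CURR").agg(
    BUREAU_LOAN_COUNT=("SK_ID_BUREAU", "count"),
    BUREAU_DEBT_SUM=("AMT_CREDIT_SUM_DEBT", "sum"),
    BUREAU_MONTHS_BALANCE_COUNT_MEAN=("MONTHS_BALANCE_COUNT", "mean"),
).reset_index()

# 3) merge the aggregated bureau features onto the application table
app = app.merge(bureau_agg, on="SK_ID_CURR", how="left")

# 4) interaction / ratio features from the examples above
app["INCOME_PER_DAY_EMPLOYED"] = app["AMT_INCOME_TOTAL"] / app["DAYS_EMPLOYED"].abs()
app["CREDIT_TO_INCOME"] = app["AMT_CREDIT"] / app["AMT_INCOME_TOTAL"]

# 5) one-hot encode the categorical columns
app = pd.get_dummies(app, dummy_na=True)
```

The same pattern would be repeated for the other bottom tables (previous_application, POS_CASH_balance, installments_payments, credit_card_balance) before training.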
## Model:
- After all features are merged into the **Application Train/Test** table, do a train/test split (`train_test_split`).
- Option 1: use Keras, defining the model with class weights to help it learn from the imbalanced data.
- Option 2: Logistic Regression
- Option 3: Decision Tree
- Option 4: Random Forest
- Option 5: LightGBM model (a minimal baseline is sketched after this list)
- Metrics: F1, precision, recall, confusion matrix, AUC.
- Loss function: binary log loss (`binary_logloss`).
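
A minimal sketch of the LightGBM baseline (Option 5) with the metrics and loss listed above, assuming `app` is the fully merged feature table from the previous section; the hyperparameter values are placeholders to be tuned later:

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

X = app.drop(columns=["TARGET", "SK_ID_CURR"])
y = app["TARGET"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = LGBMClassifier(
    objective="binary",        # uses binary_logloss as the training loss
    class_weight="balanced",   # handle the imbalance without resampling
    n_estimators=1000,
    learning_rate=0.05,        # placeholder values, tuned in the next section
)
clf.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="auc")

proba = clf.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, proba))
print(classification_report(y_test, clf.predict(X_test)))  # precision, recall, F1
```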
## Optimization:
- In the case of the binary model (see https://lightgbm.readthedocs.io/en/latest/Features.html):
- Tuning parameters:
```python
params = {'num_leaves': [],
          'max_depth': [],
          'n_estimators': [],
          # ...
          }
```
- Using RandomizedSearchCV or GridSearchCV:
```python
RandomizedSearchCV(
    estimator=clf,
    param_distributions=params,  # the parameter grid defined above
    n_iter=...,                  # number of sampled parameter settings, to be chosen
    scoring='f1',                # sklearn scorer names are lowercase
    cv=3,
    refit=False,
    random_state=...,
    verbose=10)
```
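
To actually run the search, it would be assigned to a variable and fit on the training split; a short sketch following the names above (`clf`, `params`, `X_train`, `y_train`), assuming the lists in `params` have been filled with candidate values and with placeholder numbers chosen only for illustration:

```python
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(estimator=clf, param_distributions=params,
                            n_iter=20, scoring='f1', cv=3,
                            random_state=42, verbose=10)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```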
- In the case of a neural network:
  - Tune the dense layers, batch normalization, dropout, etc. (a minimal sketch follows).
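
A minimal Keras sketch of the neural-network option with class weights (Option 1 in the Model section), assuming `X_train`/`X_test` from the split above and the `class_weight` dict from the imbalance sketch (in practice recomputed on `y_train`); layer sizes, dropout rate, epochs, and batch size are placeholder values:

```python
import tensorflow as tf

def build_model(n_features, n_dense=128, dropout=0.3):
    # small feed-forward net: Dense -> BatchNormalization -> Dropout blocks
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(n_dense, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(n_dense // 2, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",          # binary log loss, as above
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model

# assumes missing values have been imputed and features scaled beforehand
model = build_model(n_features=X_train.shape[1])
model.fit(X_train, y_train,
          validation_data=(X_test, y_test),
          epochs=20, batch_size=1024,
          class_weight=class_weight)   # the class weights from the imbalance sketch
```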
## References:
- LightGBM with simple features: https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features
- LGBMClassifier: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html
- LightGBM Classifier in Python: https://www.kaggle.com/prashant111/lightgbm-classifier-in-python
- TensorFlow for imbalanced data: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#train_the_model
## Timeline:
- Weekend [7-8/6]
  - [ ] EDA (missing values, high correlations, outliers)
  - [ ] Data pre-processing (one-hot encoding, MinMaxScaler or StandardScaler)
  - [ ] Generate new features
- Mon-Wed [9-11/6]: try models
  - [ ] LightGBM
  - [ ] Neural network
  - [ ] Oversampling and undersampling
- Thu-Fri [12-13/6]
  - [ ] Decide which model will be used
  - [ ] Debug
- Weekend [14-15/6]
  - [ ] Prepare Streamlit app and presentation