---
title: 'HOME CREDIT DEFAULT RISK'
disqus: hackmd
---

PROPOSAL: HOME CREDIT DEFAULT RISK

![](https://i.imgur.com/Hoi5H4g.png)

---

Table of Contents

[TOC]

# Introduction

Many people struggle to get loans due to insufficient or non-existent credit histories. Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. However, these underserved borrowers carry a default risk. To assess repayment ability, I will use various statistical and machine learning methods to predict default. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower clients to be successful.

# Methodology

## Data:

### Overview

- The data is collected from Kaggle ([https://www.kaggle.com/c/home-credit-default-risk/data](https://www.kaggle.com/c/home-credit-default-risk/data))
- There are 7 tables, as shown below:

![](https://i.imgur.com/kaxGnng.png)

The data can be divided into three categories:

- Applicant-level data, which contains information about the applicant, such as education, number of family members, car ownership, etc.
- Bureau-level data, which provides historical transactional information and credit balance information.
- Other data, including external data from other sources, such as credit scores from other platforms.

- Problems: *The data is imbalanced*

| Target | Count_Target | Ratio_Target |
| ------ | ------------ | ------------ |
| 0      | 282686       | 0.919271     |
| 1      | 24825        | 0.080729     |

- Solutions:
    - Option 1: **Undersampling, oversampling, and generating synthetic data.** However, when using a resampling method, we show the classifier the wrong proportions of the two classes during training. A classifier learned this way may then have lower accuracy on future real test data than a classifier trained on the unchanged dataset.
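As a concrete illustration of Option 1, random oversampling can be done with plain pandas by resampling the minority class with replacement. This is a minimal sketch on a toy frame; the real `application_train.csv` has the same `TARGET` column but ~307k rows:

```python
import pandas as pd

# Toy stand-in for application_train with the same TARGET column
# (0 = repaid, 1 = default) and a 90/10 skew.
df = pd.DataFrame({"TARGET": [0] * 9 + [1]})

counts = df["TARGET"].value_counts()
ratios = df["TARGET"].value_counts(normalize=True)
print(counts, ratios)  # reproduces the count/ratio table above in miniature

# Naive random oversampling: sample the minority class with replacement
# until both classes have the same number of rows.
majority = df[df["TARGET"] == 0]
minority = df[df["TARGET"] == 1]
oversampled = pd.concat(
    [majority, minority.sample(len(majority), replace=True, random_state=0)]
)
print(oversampled["TARGET"].value_counts())
```

Note that only the *training* split should ever be resampled; the validation and test splits must keep the true class proportions.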
    - Option 2: Find **new additional features** that can help distinguish between the two classes and thus improve classifier accuracy.

### What will I do?

I will analyze from the bottom tables up.

- For example, the **Bureau Balance** table is connected to the **Bureau** table by the key `SK_ID_BUREAU`, and the **Bureau** table is in turn connected to the **Application Train/Test** tables by the key `SK_ID_CURR`.
- So, I start by analyzing **Bureau Balance** first! Then I deduplicate the data according to the `SK_ID_BUREAU` variable and generate new variables to use in the **Bureau** table. After that, the same processing is applied from the **Bureau** table to the **Application Train/Test** tables using `SK_ID_CURR`.
- Analyze the other bottom tables in the same way to transfer information up!
- However, first of all, there are data analysis steps for each table, including:
    - EDA (missing values, high correlation, outliers)
    - Data pre-processing (one-hot encoding, MinMaxScaler or StandardScaler)
    - Generating new features (feature engineering):
        - Interactions of different features: for example, given AMT_ANNUITY (the annuity of each credit loan), AMT_INCOME_TOTAL (total income of the applicant per annum), and DAYS_EMPLOYED (total days of being employed), we can create interaction features such as AMT_INCOME_TOTAL / DAYS_EMPLOYED and AMT_CREDIT / AMT_INCOME_TOTAL, which may add more information to the model and some nonlinear capability.
        - Aggregations: usually, we create groups based on certain features, then extract statistical features such as the maximum, minimum, mean, and standard deviation.
- Finally, merge all tables with **Application Train/Test**.

## Model:

- After aggregating all features into the **Application Train/Test** table, do a train/test split.
- Option 1: Using Keras to define the model, with class weights to help the model learn from the imbalanced data.
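For the Keras option, the class weights can be derived with scikit-learn's `compute_class_weight`. A minimal sketch on toy labels with roughly the real 92/8 skew; the `model.fit` call at the end is indicative only:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with roughly the same ~92/8 skew as the real TARGET column.
y = np.array([0] * 92 + [1] * 8)

# "balanced" weighting: n_samples / (n_classes * class_count),
# so the rare default class gets a much larger weight.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
class_weight = dict(zip([0, 1], weights))
print(class_weight)  # minority class 1 is upweighted

# Passed to a Keras model roughly as:
# model.fit(X_train, y_train, class_weight=class_weight, ...)
```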
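The bottom-up roll-up described in "What will I do?" (Bureau Balance → Bureau → Application) can be sketched with pandas. The key names `SK_ID_BUREAU` / `SK_ID_CURR` and a few column names come from the Kaggle schema; the frames and values here are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for the real tables.
bureau_balance = pd.DataFrame({
    "SK_ID_BUREAU": [10, 10, 20],
    "MONTHS_BALANCE": [-1, -2, -1],
})
bureau = pd.DataFrame({
    "SK_ID_BUREAU": [10, 20],
    "SK_ID_CURR": [1, 1],
    "AMT_CREDIT_SUM": [1000.0, 500.0],
})
application = pd.DataFrame({
    "SK_ID_CURR": [1, 2],
    "AMT_INCOME_TOTAL": [50000.0, 60000.0],
    "DAYS_EMPLOYED": [1000, 2000],
})

# 1) Aggregate bureau_balance to one row per SK_ID_BUREAU.
bb_agg = (bureau_balance.groupby("SK_ID_BUREAU")["MONTHS_BALANCE"]
          .agg(["min", "max", "count"]).add_prefix("BB_").reset_index())

# 2) Push the result into bureau, then aggregate to one row per SK_ID_CURR.
bureau = bureau.merge(bb_agg, on="SK_ID_BUREAU", how="left")
bureau_agg = (bureau.groupby("SK_ID_CURR")
              .agg({"AMT_CREDIT_SUM": ["mean", "sum"], "BB_count": ["sum"]}))
bureau_agg.columns = ["_".join(c) for c in bureau_agg.columns]
bureau_agg = bureau_agg.reset_index()

# 3) Merge into the application table and add one interaction feature.
app = application.merge(bureau_agg, on="SK_ID_CURR", how="left")
app["INCOME_PER_DAY_EMPLOYED"] = app["AMT_INCOME_TOTAL"] / app["DAYS_EMPLOYED"]
print(app)
```

Applicants with no bureau history (here `SK_ID_CURR` = 2) end up with NaN in the rolled-up columns, which is one more missing-value case for the EDA step to handle.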
- Option 2: Logistic Regression - Option 3: Decision Tree - Option 4: Random Forest - Option 5: Using LightGBM model - Metric: F1, precision, recall and confusion matrix, AUC. - Loss funciton: Binary_logloss ## Optimized: - In case of binary model: (https://lightgbm.readthedocs.io/en/latest/Features.html) - Tuning parameters `params = {'num_leaves': [], 'max_depth': [], 'n_estimators': [], ...}` - Using RanmizedSearchCV or GridSearchCV `RandomizedSearchCV( estimator = clf, param_distributions = param_test, n_iter = , scoring = 'F1', cv = 3, refit = False, random_state = , verbose = 10) ` - In case of neural network: - Tuning dense layers, batchnormal, dropout, .... ## Reference: - Light LGBM with simple feature: https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features - LGBMClassifier: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html - LightGBM Classifier in Python: https://www.kaggle.com/prashant111/lightgbm-classifier-in-python - Tensorflow for imbalanced data: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#train_the_model ## Timeline: - The weekend [7-8/6] [ ] EDA(Missing value, High_correlation, Outliers) [ ] Data Pre-processing (One_hot_encoding, MinMaxscaler or Standizescaler) [ ] Generate New Features - Mon-Wed [9-11/6] Try models: [ ] LGBM, [ ] neural network, [ ] oversampling and undersampling - Thus-Fri [12-13/6] [ ] Decide which model will be used [ ] Debug - Weekend [14-15/6] [ ] Prepare streamlit and presentation