# Project solution
# Members:
* Carlos Mauricio Bula Oyuela
* Miguel Morales Velásquez
* Rafael Villca Poggian
## 1. Data Loading
Data loading is performed by a custom function in `data_loader.py` that reads a `loader_config.yaml` with the location of the files. The data is downloaded from an S3 bucket and is split across two files, in CSV and Parquet format, which are retrieved with the boto3 client. Each request to the bucket returns a byte stream in the response, which is then passed to the corresponding pandas reading function. The loader can also filter the columns of each file. Finally, a merge is performed if both file locations are specified in the config file; otherwise, only a single file is returned.
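A minimal sketch of the download step, assuming hypothetical bucket and key names; the real values come from `loader_config.yaml`:

```python
# Sketch only: bucket, keys, and column lists are placeholders that
# loader_config.yaml provides in the real implementation.
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

def read_s3_csv(bucket: str, key: str, columns=None) -> pd.DataFrame:
    """Download an object from S3 and parse it with pandas, optionally filtering columns."""
    response = s3.get_object(Bucket=bucket, Key=key)
    buffer = io.BytesIO(response["Body"].read())  # byte stream returned by the bucket
    return pd.read_csv(buffer, usecols=columns)

def read_s3_parquet(bucket: str, key: str, columns=None) -> pd.DataFrame:
    response = s3.get_object(Bucket=bucket, Key=key)
    buffer = io.BytesIO(response["Body"].read())
    return pd.read_parquet(buffer, columns=columns)
```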
## 2. Exploratory Data Analysis
We carried out the following steps for the data analysis:
* Keep only the `Charged Off`, `Default`, and `Fully Paid` loan statuses and binarize them.
* Select only individual applications and discard columns related to joint applications.
* Drop columns with a high proportion of NaNs (>30%).
* Check for uninformative columns (URLs, descriptions, employment title, and title of the application) and data leakage (e.g. last payment date).
* Group the 50 US states of the address data into 10 regions using a mapping.
* Discard zip codes, since they have too many categories (>900) and the regions already capture location.
* Plot categorical distributions grouped by target to check for differences between the classes.
* Perform Chi-squared and ANOVA tests to support the selection of categorical and numerical features (see the sketch after this list).
* Use correlations to filter the numerical selection, keeping only uncorrelated features.
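A minimal sketch of how the Chi-squared and ANOVA tests can be run, assuming a DataFrame `df` with an already binarized `target` column (the column names are illustrative):

```python
# Illustrative feature-selection tests; `df` and `target` are assumed to exist.
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

def chi2_p_value(df: pd.DataFrame, cat_col: str, target: str = "target") -> float:
    """Chi-squared test of independence between a categorical feature and the target."""
    contingency = pd.crosstab(df[cat_col], df[target])
    _, p_value, _, _ = chi2_contingency(contingency)
    return p_value

def anova_p_value(df: pd.DataFrame, num_col: str, target: str = "target") -> float:
    """One-way ANOVA comparing a numerical feature across the target classes."""
    groups = [g[num_col].dropna() for _, g in df.groupby(target)]
    _, p_value = f_oneway(*groups)
    return p_value
```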
After all the steps above we selected the following columns:
1. Loan Purpose
2. State of the Address
3. Home Ownership
4. Employment Length
5. High Credit Score
6. Debt to Income Ratio
7. Trades Opened: Past 24 Months
8. Open to Buy Bank Cards
9. Average Balance
10. High Credit Limit
11. Mortgage Accounts
12. Accounts Opened: Past 12 Months
13. Total Current Balance
14. Total Bankcard Limit
15. Loan Amount
16. Issue Date
## 3. ETL
A data transformation function is defined in `data_loader.py` that filters out joint applications and examples with missing values in the target variable. Then, the `Default` and `Charged Off` classes are combined into a single class, our class of interest, encoded as `1`, while `Fully Paid` is encoded as `0`. This frames the task as a binary classification problem, so we can use the standard tools for that kind of problem to build the solution.
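A minimal sketch of the target binarization, assuming the raw status column is called `loan_status` (the actual logic lives in `data_loader.py`):

```python
import pandas as pd

# Mapping taken from the description above; column names are illustrative.
STATUS_CODES = {"Fully Paid": 0, "Charged Off": 1, "Default": 1}

def binarize_target(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the three statuses of interest and encode them as a binary target."""
    df = df[df["loan_status"].isin(STATUS_CODES)].copy()
    df["target"] = df["loan_status"].map(STATUS_CODES)
    return df
```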
Then, during training, the data is passed through the pipeline after a train-test split. The pipeline is created by the function *create_pipeline* from `etl_pipeline.py`. It receives a configuration dictionary with the format of `config.yaml`, which is flexible enough to define which columns go through the different paths of the pipeline. First, the date and state features go through a mapping that extracts the month and groups US states into US regions. The subsequent paths apply different transformations to different types of data. The following image presents a descriptive diagram:

The ordinal and one-hot encoder paths accept categorical features, and the standard scaler path accepts numerical features. Each module keeps only the columns it needs and drops the rest. At the end, the outputs of these modules are joined and returned, as sketched below.
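A minimal sketch of a pipeline with the paths described above, assuming illustrative column names; the real columns and paths are defined in `config.yaml` and built by *create_pipeline*:

```python
# Sketch only: the column lists are placeholders, not the project configuration.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

preprocessor = ColumnTransformer(
    transformers=[
        ("ordinal", OrdinalEncoder(), ["emp_length"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["purpose", "home_ownership", "region"]),
        ("scaler", StandardScaler(), ["loan_amnt", "dti", "avg_cur_bal"]),
    ],
    remainder="drop",  # each path keeps only the columns it needs
)

pipeline = Pipeline(steps=[("preprocess", preprocessor)])  # outputs are joined column-wise
```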
## 4. Modeling
We decided to train three different models and compare them using the F1 score.
### 4.1 Baseline
A baseline Random Forest was trained with the following parameters:
* n_estimators = 10
* class_weight = `"balanced_subsample"`
* max_depth = 10
It must be noted that for this model, due to the class imbalance of the dependent variable, we performed undersampling to obtain an even ratio between the classes.
We ended up with an F1 score of 0.3907.
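A minimal sketch of the baseline with the parameters above, assuming preprocessed arrays `X_train` and `y_train`; the undersampling shown here is a simple illustration rather than the exact project code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def undersample(X: np.ndarray, y: np.ndarray, seed: int = 42):
    """Randomly drop majority-class rows until both classes have the same count."""
    rng = np.random.default_rng(seed)
    minority = 1 if (y == 1).sum() < (y == 0).sum() else 0
    majority_idx = np.flatnonzero(y != minority)
    keep = rng.choice(majority_idx, size=int((y == minority).sum()), replace=False)
    idx = np.concatenate([np.flatnonzero(y == minority), keep])
    return X[idx], y[idx]

baseline = RandomForestClassifier(
    n_estimators=10,
    max_depth=10,
    class_weight="balanced_subsample",
)
# X_bal, y_bal = undersample(X_train, y_train)
# baseline.fit(X_bal, y_bal)
```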
### 4.2 XGBoost
A boosting model based on the XGBoost library, trained with grid search. The best-performing parameters were:
* Objective = `"binary:logistic"`
* early_stopping_rounds = 20
* eval_metric = `"logloss"`
* learning_rate = 0.05
* n_estimators = 30
* max_depth = 5
* scale_pos_weight = 4
We ended up with an F1 score of 0.3941.
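A minimal sketch of the model with the best parameters listed above, assuming preprocessed splits `X_train`, `y_train`, `X_val`, `y_val` (recent xgboost versions accept these arguments in the constructor):

```python
from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    objective="binary:logistic",
    eval_metric="logloss",
    early_stopping_rounds=20,
    learning_rate=0.05,
    n_estimators=30,
    max_depth=5,
    scale_pos_weight=4,
)
# xgb_model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
```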
### 4.3 LightGBM
A boosting model based on the LightGBM library, trained with grid search. The best-performing parameters were:
* learning_rate = 0.05
* max_depth = 5
* n_estimators = 30
* objective = `"binary"`
* is_unbalance = `True`
* feature_fraction = 0.5
* bagging_fraction = 0.5
* num_leaves = 50
* bagging_freq = 15
We ended up with an F1 score of 0.3515.
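The equivalent sketch for LightGBM, again assuming preprocessed training data:

```python
from lightgbm import LGBMClassifier

lgbm_model = LGBMClassifier(
    objective="binary",
    is_unbalance=True,
    learning_rate=0.05,
    n_estimators=30,
    max_depth=5,
    num_leaves=50,
    feature_fraction=0.5,
    bagging_fraction=0.5,
    bagging_freq=15,
)
# lgbm_model.fit(X_train, y_train)
```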
### 4.4 Explanations
For the explainability of the predictions we used SHAP values, due to their ease of use and understandable results.
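A minimal sketch of the explanation step, assuming a fitted tree-based model `model` and a preprocessed feature matrix `X`:

```python
import shap

explainer = shap.TreeExplainer(model)   # explainer specialized for tree ensembles
shap_values = explainer.shap_values(X)  # per-feature contribution to each prediction
shap.summary_plot(shap_values, X)       # global view of feature importance
```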
### 4.5 Using Files
To train, run `python train.py` from `src/eda`.
Change `config.yaml` and `loader_config.yaml` accordingly.
## 5. Backend App
The server-side application was developed using FastAPI inside `src/backend/app`. It exposes three endpoints for predictions:
* `GET /api/baseline`
* `POST /api/boosting`
* `POST /api/lightgbm`
For the POST endpoints we made use of dependency injection to handle both JSON and HTML form-encoded inputs.
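A minimal sketch of one way to accept both body types through a dependency, assuming illustrative field handling; the actual implementation lives in `src/backend/app`:

```python
from fastapi import Depends, FastAPI, Request

app = FastAPI()

async def parse_payload(request: Request) -> dict:
    """Read the body as JSON or as an HTML form, depending on the Content-Type header."""
    content_type = request.headers.get("content-type", "")
    if content_type.startswith("application/json"):
        return await request.json()
    return dict(await request.form())  # form parsing requires python-multipart

@app.post("/api/boosting")
async def predict_boosting(payload: dict = Depends(parse_payload)):
    # payload holds the feature values; the model call is omitted in this sketch
    return {"received": payload}
```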
All the models were stored in `src/backend/app/models/files`, inside the app to speed up the startup process.
## 6. Deployment
The application was containerized with Docker; the built image was pushed to AWS ECR, deployed to AWS Lambda, and made accessible to the client through AWS API Gateway.

### 6.1 Replicate deployment
* Build the Docker image at `src/backend`
* Push the image to AWS ECR
* Deploy the image to AWS Lambda
* Expose the Lambda function through AWS API Gateway
## 7. Tests
The app code contains unit tests for each endpoint, and, as required, a Jupyter notebook with requests to the endpoints was added at `/tests`.
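A request like the ones in the notebook could look as follows, assuming a placeholder deployment URL and illustrative feature names:

```python
import requests

BASE_URL = "https://example.execute-api.us-east-1.amazonaws.com"  # placeholder URL

payload = {"loan_amnt": 10000, "dti": 15.3, "purpose": "credit_card"}  # illustrative fields
response = requests.post(f"{BASE_URL}/api/boosting", json=payload)
print(response.status_code, response.json())
```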