# ML - Machine Learning Project

- This project consists of the development of a data mining case study.
- The project consists of descriptive and predictive data mining tasks.

![](https://i.imgur.com/NWYkXDh.png)

[Dataset](https://data.world/lpetrocelli/czech-financial-dataset-real-anonymized-transactions)

[Challenge results](https://sorry.vse.cz/~berka/challenge/PAST/)

## BU - Business Understanding

### Analysis of requirements with the end user

- The managers of a Czech bank seek to improve their understanding of customers and need quality loan predictions in order to improve services and minimize the risk of losing money.
- The main requirement is to predict whether a loan will end successfully, based on the clients' profiles and how they behaved in the past.
- The predictions should take into account the data that is already stored, which was provided by the bank in 1999. We have data about the clients, their accounts, the loans already granted, the credit cards issued, etc.

### Definition of business goals

- The goal is to improve the speed and efficiency of the loan granting process while reducing the bank's long-term monetary losses. That means that, for each loan request, the bank needs to make the best decision it can based on the client data it has.
- For that, the bank needs us to analyse and characterize the clients it should or should not give a loan to (by analysing their attributes and previous loan results), in order to know why it should or should not lend to a given client.
- After describing the clients' profiles, we need to obtain a model that automatically predicts whether a specific loan requested by a client should be accepted by the bank, that is, whether accepting the loan will have a positive outcome.

### Translation of business goals into data mining goals

- The main data mining goals are:
    - **Descriptive task** - group clients according to the outcome of their loans and find common patterns that characterize their profiles. Define the attributes that best identify a "good" and a "bad" client.
    - **Predictive task** - predict the outcome of a loan, based on the data we have available about the client who requested it. These predictions will be based on the description of the client, so that they are easily explained by the descriptive task.

## DU - Data Understanding

### Choose which data is unnecessary to complete our goal

- In the **loan table** we have the amount, duration and payments columns, which are related by the formula **`payments * duration = amount`**. One of them is therefore redundant, and we decided to remove payments, keeping amount and duration (see the sketch below).
- In the **district table**, both the district's name and its identifier are keys to the table. One of them can be removed.
- In the **transactions table**, the k_symbol column may be merged with the operation column. For example, when the k_symbol column value is 'interest credited', the operation column value is empty.
- Disponent clients can't ask for a loan, and loans are granted to accounts rather than to clients. For that reason, we decided **not to include the disponent clients** and to keep only the owners of the accounts (including only a count of disponents per account).
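A quick way to confirm this redundancy before dropping the column is to check the relationship directly on the loan table. A minimal sketch in pandas; the file name and separator are assumptions about how the raw CSV is stored:

```python
import pandas as pd

# Load the loan table (file name and separator are assumptions).
loans = pd.read_csv("loan_train.csv", sep=";")

# payments * duration should reproduce amount; a small tolerance
# absorbs rounding in the monthly payment values.
redundant = (loans["payments"] * loans["duration"] - loans["amount"]).abs() < 1.0
print(f"{redundant.mean():.1%} of loans satisfy payments * duration == amount")

# The column carries no extra information, so drop it.
loans = loans.drop(columns=["payments"])
```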
### Graphics and Statistics

#### Loan train vs loan test

![](https://i.imgur.com/1QTJYxO.png)

Based on this we can see that our dataset is unbalanced and, as such, we may need to apply a strategy to reduce the difference between those percentages. Possible strategies include:

- Replication of unsuccessful loans
- Reduction of successful loans
- Creation of synthetic unsuccessful loans (SMOTE) - see the sketch at the end of this section

![](https://i.imgur.com/OpInY3k.png)

![](https://i.imgur.com/OYZmsFH.png)

The values presented show that loan amounts have risen from the train data to the test data. This can be explained by inflation: the train data is from 1993-96 and the test data from 1997-98. This increase probably won't be a problem, as it is small and the data in both the train and test sets seems well spread out, since both standard deviations are large and the first and third quartiles are neither close to the mean nor to the extremes.

#### Clients' age and sex

There are **2724 male clients (50.73%)** and **2645 female clients (49.26%)**, which means our data is evenly distributed between both sexes.

![](https://i.imgur.com/hn83caL.png)

![](https://i.imgur.com/Tp80qs8.png)

As we expected, there are close to no clients under 18 years old, because those people are minors, and past 60 the number of clients per age drops significantly. This implies we have a good variety of data, at least in this table.

#### Loan amount and duration

![](https://i.imgur.com/2Z0tX2y.png)

![](https://i.imgur.com/XUpi0lL.png)

The first graph correlates the duration, amount and status of a loan. It shows no direct indication that either the amount or the duration of a loan affects whether the loan will be a good or a bad one. In the second one, as expected, we can see that the payments are directly dependent on the duration and the amount of the loan: amount = duration * payments.

#### Cards

![](https://i.imgur.com/LC0c6kg.png)

![](https://i.imgur.com/Vwcohl4.png)

The card distribution is not uniform and few of the clients who asked for a loan hold a card, but every client with a card who asked for a loan had a successful loan, which is why we think this data is important.

#### Minimum and maximum account balance

![](https://i.imgur.com/nWsstr4.png)

![](https://i.imgur.com/5SpaPNe.png)

![](https://i.imgur.com/RGyCmh8.png)

Based on the previous graphs, we can see that while the maximum balance an account ever had did not have a big impact on the success of its loans, every account whose minimum balance dropped below zero at some point ended with an unsuccessful loan. Even though this might be due to the low number of cases that fit in this group, this field might be really important in our project.

#### Missing data

![](https://i.imgur.com/fW10xAH.png)

![](https://i.imgur.com/yXlNS1K.png)

In both the unemployment rate '95 and the no. of committed crimes '95 columns there is one missing value, in the "Jesenik" district. To calculate a good replacement, we compared these columns with most other columns in the districts table and found a strong relation between them and the corresponding '96 columns (see the imputation sketch below).
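To illustrate the third strategy, here is a minimal SMOTE sketch using the imbalanced-learn package. The toy dataset is an assumption standing in for our prepared loan table, and `sampling_strategy=1.0` mirrors the "equalize" setting we experimented with later in RapidMiner:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Stand-in for the prepared loan table: an imbalanced two-class toy set.
X, y = make_classification(n_samples=400, weights=[0.86, 0.14], random_state=42)

# sampling_strategy=1.0 synthesizes minority samples until both classes
# have the same size ("equalize"); values below 1.0 stop short of that.
smote = SMOTE(sampling_strategy=1.0, random_state=42)
X_res, y_res = smote.fit_resample(X, y)

print("before:", Counter(y))
print("after: ", Counter(y_res))
```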
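The replacement itself can be computed with a linear regression from the '96 column to the '95 column, fitted on the districts where both values are present. A sketch for the unemployment rate; the file name and the simplified column names are assumptions (the raw headers differ), and the same fit applied to the crime columns fills the other missing value:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# File name and column names are assumptions; the raw headers differ.
districts = pd.read_csv("district.csv", sep=";")

# Fit '96 -> '95 on the districts where both values are known.
known = districts.dropna(subset=["unemployment_95"])
model = LinearRegression().fit(known[["unemployment_96"]], known["unemployment_95"])

# Predict the single missing '95 value (Jesenik) from its '96 value.
missing = districts["unemployment_95"].isna()
districts.loc[missing, "unemployment_95"] = model.predict(
    districts.loc[missing, ["unemployment_96"]]
)
```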
## DP - Data Preparation

### Final columns

- account
    - [ ] account_id
    - [ ] district_id
    - [x] **frequency** - one-hot encoding
    - [x] **account_date** - split into year and month columns
- card
    - [ ] card_id
    - [ ] disp_id
    - [x] **card_type**
    - [x] **issued** - extracted the issue year
- transaction (see the aggregation sketch at the end of this section)
    - [ ] trans_id
    - [ ] account_id
    - [ ] trans_date
    - [x] **trans_type** - extracted the number of credits and the number of withdrawals
    - [ ] operation
    - [x] **trans_amount** - extracted the mean credit amount and the mean withdrawal amount
    - [x] **balance** - extracted the min, max and current balance
    - [ ] k_symbol
    - [x] **bank** - extracted the number of distinct banks per account_id
    - [x] **account** - extracted the number of distinct accounts per account_id
- loan
    - [ ] loan_id
    - [ ] account_id
    - [x] **loan_date** - split into year and month
    - [x] **loan_amount**
    - [x] **duration**
    - [ ] payments
    - [x] **status**
- district
    - [ ] code
    - [ ] [removed] **name**
    - [ ] [removed] **region**
    - [x] **no. of inhabitants**
    - [x] **no. of municipalities with inhabitants < 499**
    - [x] **no. of municipalities with inhabitants 500 - 1999**
    - [x] **no. of municipalities with inhabitants 2000 - 9999**
    - [x] **no. of municipalities with inhabitants > 10000**
    - [x] **no. of cities**
    - [x] **ratio of urban inhabitants**
    - [x] **average salary**
    - [x] **unemployment rate '95** - missing value calculated from a linear regression with unemployment rate '96
    - [x] **unemployment rate '96**
    - [x] **no. of entrepreneurs per 1000 inhabitants**
    - [x] **no. of committed crimes '95** - missing value calculated from a linear regression with no. of committed crimes '96
    - [x] **no. of committed crimes '96**
- disp
    - [ ] disp_id
    - [ ] client_id
    - [ ] account_id
    - [ ] type
    - [x] [new] **ownership_count** - extracted from the number of clients for the same account
- client
    - [ ] client_id
    - [x] **birth_number** - extracted 'gender' and 'age' columns (see the extraction sketch at the end of this section)
    - [ ] district_id

Other preparation notes:

- State what preprocessing is necessary (e.g. data normalization)
- Extract the client's sex from the birth number
- Clean date values
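The per-account transaction features listed above (counts of credits and withdrawals, mean amounts, min/max/current balance) reduce to a groupby aggregation. A sketch, assuming the raw type labels 'PRIJEM' (credit) and 'VYDAJ' (withdrawal) used in the dataset; the file name is an assumption:

```python
import pandas as pd

trans = pd.read_csv("trans_train.csv", sep=";")  # file name is an assumption

# In the raw data, type is 'PRIJEM' for credits and 'VYDAJ' for withdrawals.
credits = trans[trans["type"] == "PRIJEM"]
withdrawals = trans[trans["type"] == "VYDAJ"]

features = pd.DataFrame({
    "n_credits": credits.groupby("account_id").size(),
    "n_withdrawals": withdrawals.groupby("account_id").size(),
    "mean_credit": credits.groupby("account_id")["amount"].mean(),
    "mean_withdrawal": withdrawals.groupby("account_id")["amount"].mean(),
    "min_balance": trans.groupby("account_id")["balance"].min(),
    "max_balance": trans.groupby("account_id")["balance"].max(),
}).fillna(0)

# Current balance: the balance on each account's most recent transaction
# (dates are YYMMDD integers, so sorting them numerically sorts by time).
latest = trans.sort_values("date").groupby("account_id").tail(1)
features["current_balance"] = latest.set_index("account_id")["balance"]
```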
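For the client table, `birth_number` encodes both the birth date and the gender: the format is YYMMDD, and for women 50 is added to the month. A minimal extraction sketch; computing the age relative to 1999 (when the data was provided) is an assumption here, and switching to the age at the time of the loan is one of the feature engineering ideas listed further below:

```python
import pandas as pd

clients = pd.read_csv("client.csv", sep=";")  # file name is an assumption

month_raw = clients["birth_number"] // 100 % 100

# Women have 50 added to the birth month in Czech birth numbers.
clients["gender"] = (month_raw > 50).map({True: "F", False: "M"})

year = 1900 + clients["birth_number"] // 10000
month = month_raw.where(month_raw <= 50, month_raw - 50)
day = clients["birth_number"] % 100
clients["birth_date"] = pd.to_datetime(
    pd.DataFrame({"year": year, "month": month, "day": day})
)

# Age at the 1999 reference date; age at loan time is the later refinement.
clients["age"] = 1999 - year
```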
## Doubts

- Should the name and region of districts and the card type be encoded as A, B, C, etc.?
- Should dates be kept as a single date column or split into multiple columns?
- Is the year enough for cards, and year and month for loans?
- Which normalizations should we apply?
- Should the cards' missing values be replaced or kept empty?
- Should we implement SMOTE now or in RapidMiner?

## Rapid Miner

![](https://i.imgur.com/Q15VCx8.png)
First submission: Decision tree, SMOTE equalize, no mean normalization, no percentages

![](https://i.imgur.com/N3em4uT.png)
Decision tree, SMOTE 100, no mean normalization, no percentages

![](https://i.imgur.com/tpcChvl.png)
Decision tree, SMOTE with 100, no mean normalization

![](https://i.imgur.com/pc1kRZH.png)
Decision tree, SMOTE equalize, no mean normalization

![](https://i.imgur.com/MmjiV3s.png)
Decision tree, SMOTE with 200, no mean normalization

![](https://i.imgur.com/qFeVhkJ.png)
Decision tree, SMOTE with 150, no mean normalization

![](https://i.imgur.com/50gQ96p.png)
Deep learning, SMOTE with 160, no mean normalization, no percentages

![](https://i.imgur.com/0m1h424.png)
Neural net, SMOTE with 160, no mean normalization, no percentages

![](https://i.imgur.com/ueI9ubC.png)
AutoMLP, SMOTE with 160, no mean normalization, no percentages. Was worse with equalize and with 100.

![](https://i.imgur.com/HhGxoQ8.png)
AutoMLP, SMOTE with 160, with mean normalization and percentages

![](https://i.imgur.com/UhKVbgX.png)
AutoMLP, SMOTE with 160, no mean normalization

![](https://i.imgur.com/2VmsFp2.png)
Decision tree, SMOTE 160, with district evolutions - **SUBMISSION**: 0.91172

![](https://i.imgur.com/NBmjnnd.png)
Random forest, SMOTE 160 - **SUBMISSION**: 0.94609

![](https://i.imgur.com/8kfAZv6.png)
Random forest, 30 trees, information gain, NO SMOTE, no pruning, duplication of -1s - **SUBMISSION**: 0.91810

![](https://i.imgur.com/47cseZ9.png)
Random forest, 30 trees, information gain, SMOTE inside cross-validation with 100, no pruning, duplication of -1s - **SUBMISSION**: 0.94197

![](https://i.imgur.com/tsm4RV1.png)
Gradient boosted trees, SMOTE equalize, multiply 3 times, model with all train data, maximal depth 7, learning rate 0.5 - **SUBMISSION**: 0.96954 (see the sketch at the end of this document)

### ROC

Without SMOTE

![](https://i.imgur.com/j3ZpHR4.png)

With SMOTE

![](https://i.imgur.com/zjiBGLA.png)
![](https://i.imgur.com/SCkBXfS.png)
![](https://i.imgur.com/atFYiZl.png)
![](https://i.imgur.com/A0xqnDO.png)

### Feature engineering ideas

- [x] Transactions per month
- [x] Age of the account (in days/months) at the time of the loan request (then remove month and day of account creation)
- [x] Balance per month (divide mean/max balance by the number of months)
- [x] Instead of client age, use client age at the time the loan was made (then remove client age and birthdate) - in MASTER
- [x] Minimum transaction amount (credit and withdrawal)
- [x] Maximum transaction amount (credit and withdrawal)
- [x] Average transaction amount (credit and withdrawal)
- [x] Max balance / loan_amount, current balance / loan_amount, etc.
- [x] Days since the last transaction until the loan request
- [x] Balance average
- [x] Loan amount per balance (min, max and current)

### For final delivery

- [ ] Evaluate different clustering methods
- [ ] Why duplicate and then apply SMOTE?

### Cluster

- K-Means elbow graph (inverted because of RapidMiner...)

![](https://i.imgur.com/zvltnVC.png)

- K-Medoids elbow graph

![](https://i.imgur.com/aUu17k9.png)

- DBSCAN graph with 1950/5

![](https://i.imgur.com/Ebm7omz.png)
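The elbow graphs above can be reproduced outside RapidMiner. A minimal K-Means elbow sketch in scikit-learn, on stand-in data (the real input would be the prepared account/client features):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Stand-in for the prepared account features used in the cluster analysis.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)

# Inertia (within-cluster sum of squares) for k = 1..10; the "elbow" is
# the k after which the curve flattens. RapidMiner plots this inverted.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 11)]

plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("inertia")
plt.show()
```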
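For the record, the best-scoring pipeline above (gradient boosted trees, SMOTE equalize, maximal depth 7, learning rate 0.5) roughly corresponds to the following scikit-learn / imbalanced-learn sketch. RapidMiner's operator internals differ, so this is only an approximation under the same headline hyperparameters, on stand-in data, with AUC as the metric:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in data; the real X/y come from the joined and prepared tables.
X, y = make_classification(n_samples=600, weights=[0.86, 0.14], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Equalize the classes on the training split only, so the test split
# keeps the original distribution.
X_bal, y_bal = SMOTE(sampling_strategy=1.0, random_state=42).fit_resample(X_tr, y_tr)

# Same headline hyperparameters as the 0.96954 submission.
model = GradientBoostingClassifier(max_depth=7, learning_rate=0.5)
model.fit(X_bal, y_bal)

print(f"AUC: {roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]):.3f}")
```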