---
title: Project18
tags: teach:MF
---

# Machine Learning & FinTech:

## Project by Le Quoc Tuan
## Student ID: 409707005

#### Keywords: Bank failure, Financial safety, Banking system, Early detection, Financial indicators.

---

## Name of project:
**"Using machine learning to build a model for predicting failure and early detection of troubled commercial banks"**

## 1. Motivations:

The financial crisis of 2007-2009 exposed gaps in the global financial safety net. Early warning systems and models at the time failed to predict the failure of banks and, subsequently, the financial crisis (Davis and Karim, 2008; Hasan, Liu and Zhang, 2016). Since financial crises are a matter of the economic cycle, it is reasonable to expect further crises in the future. In fact, Fahlenbrach et al. (2012) show that the banking industry suffers from a persistent risk culture: banks that performed poorly in the 1998 crisis continued to do so in the 2008 crisis. Therefore, the development of better assessment and early warning models is crucial to prevent or, at least, mitigate the damage caused by potential future crises.

Several new assessment models and systems have been developed since the crisis, for instance SAFE by Oet et al. (2013), or the combination of traditional models such as CAMELS with more sophisticated quantile regressions in Shaddady and Moore (2019). Recent works such as Holopainen and Sarlin (2017) or Samitas et al. (2020) have also started to use machine learning for developing early warning systems, and find that conventional statistical models are outperformed by more sophisticated machine learning methods.

While new machine learning methods have been built for early detection and warning at the financial-crisis level, to the best of my knowledge few studies have considered machine learning at the micro level (the bank level, for instance). Hence, I am interested in using machine learning to build and compare "overall operation assessment models" for the early detection of troubled commercial banks. This could potentially help regulators identify banks in distress, take appropriate action (early intervention, special supervision, capital restructuring), and reduce the impact on the whole banking system should extreme events such as a financial crisis occur.

---

## 2. EDA

**2.1. Data and sample**

I collect quarterly data on different financial indicators of U.S. banks from the Bank Regulatory Database provided by the Federal Reserve, which is available on Wharton Research Data Services (WRDS). The database provides accounting data for bank holding companies, commercial banks, saving banks and S&L institutions. I prefer this database to the Compustat Bank data, which also provides fundamentals both quarterly and annually, because the Compustat database contains too many missing values. Using the Bank Regulatory Database also helps identify banks that failed during the sample period (defined as banks that either went out of the market or were acquired by other banks). I also examine the list of failed banks provided by the FDIC, available at https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/, to make sure that failed banks are correctly identified. The sample period covers 20 years, from 2000 to 2020. Unfortunately, the data is currently only available up to the first quarter of 2020. Nevertheless, this should not have any overly serious impact on the model specification, data training and testing.
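As a rough illustration of how the failure flag could be constructed, the pandas sketch below combines a WRDS extract with the FDIC list. The file names (`bank_regulatory_quarterly.csv`, `fdic_failed_bank_list.csv`) and column names (`bank_id`, `bank_name`, `report_date`) are placeholders, not the actual WRDS/FDIC field names, and the "exits the panel early" rule is a simplification of the identification described above.

```python
import pandas as pd

# Hypothetical file and column names; adjust to the actual WRDS/FDIC layouts.
banks = pd.read_csv("bank_regulatory_quarterly.csv", parse_dates=["report_date"])
fdic_failed = pd.read_csv("fdic_failed_bank_list.csv")

# Simplified rule: flag a bank as failed if it drops out of the panel before 2020Q1.
last_obs = banks.groupby("bank_id")["report_date"].max()
exited_early = last_obs[last_obs < "2020-01-01"].index
banks["failure"] = banks["bank_id"].isin(exited_early).astype(int)

# Cross-check against the FDIC failed-bank list (matched on name here for simplicity).
fdic_names = set(fdic_failed["bank_name"].str.upper())
banks.loc[banks["bank_name"].str.upper().isin(fdic_names), "failure"] = 1
```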
**2.2. Variables**

In this assignment, I deploy a number of financial features which can be grouped into several main aspects that reflect the overall operation of banks:

- Bank assets: size of banks, loans and investments
- Bank liabilities: deposits, borrowings
- Bank income, expenses and profitability: interest income/expenses; non-interest income/expenses; NIM, ROA, ROE
- Bank asset quality: delinquent assets, charge-offs
- Capital adequacy: CAR, Tier-1 and Tier-2 capital ratios
- Bank off-balance-sheet activities: securities lending and borrowing, unused lines of credit for real estate, underwriting, derivatives

Regarding the last two aspects, securitization of subprime mortgages and other off-balance-sheet activities are found in the literature (e.g., Piskorski et al., 2010; Demyanyk and Van Hemert, 2011) to be among the main causes of the global financial crisis of 2007-2009. Deposit insurance has also been shown to create moral hazard and increase banks' risk-taking after the U.S. Emergency Economic Stabilization Act of 2008 greatly expanded deposit insurance coverage (Lambert et al., 2017). The off-balance-sheet aspect is relatively new to the literature, and I try to bring it into my model. Detailed variable definitions are given in Table 1.

**Table 1: Variable definitions**

| Variables | Definition |
|-----------|------------|
| *Dependent variable* | |
| failure | Binary variable, equal to one if the bank fails and zero otherwise |
| *Asset-side variables* | |
| total_assets | Bank's total assets in millions of USD |
| size | Natural logarithm of the bank's total assets |
| totalinv_assets | Total investments to total assets |
| htminv_assets | Total investments in held-to-maturity securities to total assets |
| fsale_assets | Total investments in securities for sale to total assets |
| refarmcol_assets | Loans secured by farmland to total assets |
| interbank_assets | Loans to other banks to total assets |
| loanagri_assets | Loans to agriculture to total assets |
| ciloan_assets | Commercial and industrial loans to total assets |
| creditcard_assets | Credit card loans to total assets |
| otherpploan_assets | Other personal loans to total assets |
| otherloan_assets | Total other loans to total assets |
| *Liability-side variables* | |
| lev | Total liabilities to total assets |
| deposit_assets | Total deposits to total assets |
| demanddep_assets | Demand deposits to total assets |
| timedep_assets | Time deposits to total assets |
| otherresidloan_assets | Other loans to residential property to total assets |
| *Asset quality variables* | |
| npl_assets | Delinquent loans to total assets |
| loanlossalwc | Loan loss allowance to total assets |
| loanlossprov | Loan loss provision to total assets |
| chargeoff_assets | Total loan charge-offs to total assets |
| *Profitability variables* | |
| nim | Net interest margin |
| nnon_im | Non-interest margin |
| roa | Return on assets |
| *Off-balance-sheet activities* | |
| secborrowed_obs | Securities borrowed to total assets |
| seclent_obs | Securities lent to total assets |
| revolvelines_assets | Revolving real estate lines of credit to total assets |
| creditcard_line_obs | Credit card lines of credit to total assets |
| resid_recol_obs | Residential property lines of credit secured by residential property to total assets |
| resid_norecol_obs | Residential property lines of credit not secured by residential property to total assets |
| underwriting_obs | Underwriting to total assets |
| derivatives_guarantor_obs | Derivatives with the bank as guarantor to total assets |
| derivatives_beneficiary_obs | Derivatives with the bank as beneficiary to total assets |
| *Capital adequacy variables* | |
| car | Capital adequacy ratio, calculated as bank capital to risk-weighted assets |
| tier1_car | Tier-1 ratio, calculated as Tier-1 capital to risk-weighted assets |
| tier2_car | Tier-2 ratio, calculated as Tier-2 capital to risk-weighted assets |
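To make the variable construction concrete, the sketch below builds a few of the Table 1 ratios with pandas. The raw item names (`total_assets_raw`, `total_liabilities`, `delinquent_loans`, `loan_chargeoffs`, `tier1_capital`, `risk_weighted_assets`) are hypothetical placeholders rather than the actual Bank Regulatory item codes.

```python
import numpy as np
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Construct a few of the Table 1 ratios from raw balance-sheet items.

    Column names on the right-hand side are placeholders for the raw items.
    """
    out = pd.DataFrame(index=df.index)
    out["size"] = np.log(df["total_assets_raw"])                              # log of total assets
    out["lev"] = df["total_liabilities"] / df["total_assets_raw"]             # leverage
    out["npl_assets"] = df["delinquent_loans"] / df["total_assets_raw"]       # asset quality
    out["chargeoff_assets"] = df["loan_chargeoffs"] / df["total_assets_raw"]
    out["tier1_car"] = df["tier1_capital"] / df["risk_weighted_assets"]       # capital adequacy
    return out
```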
**2.3. Exploratory Data Analysis (EDA) for panel data**

Table 2 provides summary statistics of the variables used in this assignment.

**Table 2: Descriptive statistics of main variables**

| Statistics | failure | size | htminv_assets | fsale_assets | lev | refarmcol_assets | interbank_assets | loanagri_assets | ciloan_assets | creditcard_assets | otherpploan_assets | otherloan_assets | demanddep_assets | timedep_assets | otherresidloan_assets | npl_assets | loanlossalwc | loanlossprov | nim | nnon_im | roa | chargeoff_assets | secborrowed_obs | revolvelines_assets | creditcard_line_obs | resid_recol_obs | resid_norecol_obs | underwriting_obs | tier1_car |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Observations | 379783 | 379783 | 376006 | 376354 | 372805 | 365882 | 379783 | 357345 | 373522 | 379783 | 353545 | 354190 | 375431 | 354190 | 357345 | 371643 | 360842 | 360712 | 354626 | 360952 | 361026 | 357344 | 357330 | 357345 | 357345 | 354190 | 357345 | 360489 | 355693 |
| Mean | 0.4426 | 11.8810 | 0.0394 | 0.1805 | 0.8876 | 0.0366 | 0.0002 | 0.0445 | 0.1014 | 0.0002 | 0.0509 | 0.0025 | 0.1102 | 0.6994 | 0.1488 | 0.0018 | 0.0092 | 0.0020 | 0.0230 | -0.0138 | 0.0051 | 0.0019 | 0.0004 | 0.0143 | 0.0049 | 0.0267 | 0.0009 | 0.0000 | 0.1765 |
| Std | 0.4967 | 1.6158 | 0.0957 | 0.1492 | 0.0921 | 0.0551 | 0.0069 | 0.0796 | 0.0872 | 0.0008 | 0.0495 | 0.0063 | 0.0766 | 0.1234 | 0.1199 | 0.0044 | 0.0070 | 0.0039 | 0.0115 | 0.0095 | 0.0084 | 0.0123 | 0.0094 | 0.0241 | 0.0180 | 0.0414 | 0.0061 | 0.0008 | 0.1563 |
| Minimum | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0154 | 0 | 0 | 0 | -0.0008 | 0.0047 | -0.0477 | -0.0360 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0754 |
| p25% | 0 | 10.8929 | 0 | 0.0651 | 0.8793 | 0 | 0 | 0 | 0.0446 | 0 | 0.0154 | 0 | 0.0619 | 0.6623 | 0.0620 | 0.0000 | 0.0062 | 0.0001 | 0.0120 | -0.0190 | 0.0023 | 0.0001 | 0 | 0 | 0 | 0.0002 | 0 | 0 | 0.1088 |
| p50% | 0 | 11.6722 | 0 | 0.1558 | 0.9035 | 0.0102 | 0 | 0.0052 | 0.0810 | 0 | 0.0377 | 0.0004 | 0.1005 | 0.7207 | 0.1235 | 0.0002 | 0.0082 | 0.0007 | 0.0218 | -0.0126 | 0.0051 | 0.0005 | 0 | 0.0039 | 0 | 0.0123 | 0 | 0 | 0.1348 |
| p75% | 1 | 12.5874 | 0.0272 | 0.2653 | 0.9183 | 0.0534 | 0 | 0.0529 | 0.1321 | 0 | 0.0700 | 0.0017 | 0.1443 | 0.7696 | 0.2033 | 0.0017 | 0.0107 | 0.0020 | 0.0315 | -0.0070 | 0.0090 | 0.0016 | 0 | 0.0199 | 0.0004 | 0.0353 | 0 | 0 | 0.1832 |
| Maximum | 1 | 21.7132 | 1 | 1 | 6.1947 | 0.6494 | 0.9146 | 0.7227 | 0.5066 | 0.0063 | 0.2604 | 0.0427 | 1 | 0.8710 | 0.9825 | 0.2518 | 0.3673 | 0.0261 | 0.0527 | 0.0165 | 0.0302 | 5.9973 | 0.7500 | 0.9140 | 0.1575 | 1.5345 | 0.5043 | 0.1210 | 1.3183 |

The dataset used in the assignment includes around 11,000 banks (including thrifts, saving banks and other depository institutions) in the U.S., with nearly 380,000 bank-quarter observations. From Table 2, it can be seen that, over the course of 20 years, around 44% of the banks in the sample failed. A count of unique values gives:

| bank_name | year |
|-----------|------|
| 10938 | 21 |

**Figure 1: Distributions of variables used**
![](https://i.imgur.com/DwGD4ZY.png)

Although there are some outliers, the distributions of the main variables are fairly symmetric.

**Figure 2: Heatmap of the U.S. bank dataset**
![](https://i.imgur.com/YNmSaHF.png)

The heatmap shows that the correlation coefficients between the main dependent variable, failure, and the explanatory variables are not high. This is not a big problem in binary classification since (1) the dependent variable is binary, and (2) the significance of the correlation coefficients matters more. It is also reasonable to expect that each feature captures only one aspect of a bank's operation and can therefore explain only a small proportion of bank failure. Some pairs of variables are highly correlated, such as leverage and the Tier-1 capital ratio (negatively) or different types of loans (positively). However, these correlations are expected and not so high that multicollinearity should be a problem in this assignment.

**Figure 3: Pairwise scatterplot of the variables used**
*(The image is too large to display; it is enclosed with the assignment files)*

Since the number of observations is very large (nearly 400,000), the scatterplots in some cases, such as time deposits/total assets against ROA or NIM, show no clear relationship.
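A sketch of how summary statistics like Table 2 and the correlation heatmap in Figure 2 could be produced is shown below. It assumes the bank-quarter panel sits in a DataFrame with the Table 1 column names plus `bank_name` and `year`; the file name `bank_panel_features.csv` is a placeholder.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical file holding the bank-quarter panel with the Table 1 columns.
df = pd.read_csv("bank_panel_features.csv")

# Summary statistics (Table 2) and counts of unique banks/years
print(df.describe(percentiles=[0.25, 0.5, 0.75]).T)
print(df[["bank_name", "year"]].nunique())

# Correlation heatmap (Figure 2) over the numeric features
corr = df.drop(columns=["bank_name", "year"]).corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.tight_layout()
plt.show()
```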
---

## 3. Problem formulation

**3.1. Benchmark method**

My benchmark approach to modelling the prediction of bank failure is **logistic regression**. Logit regression has commonly been used to predict bank failure in the earlier literature, such as Martin (1977), Cole and Gunther (1995) or Cleary and Hebb (2016). At the time, however, the number of features (variables) used was small and consisted of combinations of variables with similar attributes. In my assignment, I consider both the new aspects of bank failure identified in the recent literature and the traditional ones. The logistic regression model is as follows:

$$Pr(Failure=1|X) = \Lambda(X'\beta),$$

in which

$$\Lambda(X'\beta) = \frac{1}{1+e^{-X'\beta}},$$

where X includes the independent variables covering the various aspects described in Section 2. Inference for the logistic model is done by Maximum Likelihood Estimation (MLE).
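A minimal sketch of this benchmark with scikit-learn is given below, assuming the training and test matrices `X_train`, `X_test` and failure labels `y_train`, `y_test` have already been prepared as described in Section 3.3; the standardization step mirrors the preprocessing used in Section 4.1.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the continuous features, then fit the logit benchmark by MLE.
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logit.fit(X_train, y_train)

# Predicted failure probabilities and class labels for the test period
prob_fail = logit.predict_proba(X_test)[:, 1]
y_pred_logit = logit.predict(X_test)
```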
**3.2. Machine Learning approaches**

The main machine learning approaches that I try in this assignment are K-Nearest Neighbors (KNN), Decision Tree and Random Forest.

*3.2.1. K-Nearest Neighbors*

**K-Nearest Neighbors (KNN)** is a supervised learning algorithm in which learning is based on how similar one data point (vector) is to others. While it is considered a simple and "lazy" machine learning approach, it is quite useful in binary classification and can capture nonlinear relationships. Another attractive characteristic of KNN is that it requires no assumptions about the data.

![](https://i.imgur.com/JocopT3.png)
source: https://www.kdnuggets.com/2020/11/most-popular-distance-metrics-knn.html

KNN also has drawbacks: high memory requirements, sensitivity to irrelevant features and sensitivity to the scale of the data.

KNN proceeds in the following steps (a short code sketch follows this list):

- Pick a value for K.
- Take the K nearest neighbors of the new data point according to their distance (measured by the Euclidean, Manhattan, Minkowski or weighted metric). In this assignment, I let Python automatically choose the appropriate distance metric.
- Among these neighbors, count the number of data points in each category and assign the new data point to the category with the most neighbors.
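A minimal KNN sketch with scikit-learn, again assuming the prepared `X_train`, `y_train` and `X_test` from Section 3.3; the value `n_neighbors=5` is illustrative rather than the tuned K.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling matters for KNN because the classifier is distance-based.
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=5))  # illustrative K
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
```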
*3.2.2. Decision Tree*

A **decision tree** is a flowchart-like tree structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The root node is the topmost node of the tree.

**Figure 4: Decision tree illustration**
![](https://i.imgur.com/GEsbTps.png)
source: https://www.datacamp.com/community/tutorials/decision-tree-classification-python

The decision tree algorithm follows these steps:

- Select the best attribute according to an Attribute Selection Measure (ASM) and place it at the root of the tree to split the records.
- Split the training set into subsets.
- Build the tree by repeating this process until one of the conditions is met: all the tuples belong to the same attribute value; there are no more remaining attributes; or there are no more instances.

The most popular Attribute Selection Measures (ASM) are information gain, gain ratio and the Gini index.

*Information gain*:

$$Info(D)=-\sum_{i=1}^m p_i\log_2 p_i$$

in which $p_i$ is the probability that an arbitrary tuple in D belongs to class $C_i$.

$$Info_A(D)=\sum_{j=1}^v \frac{|D_j|}{|D|}\times Info(D_j)$$

$$Gain(A)=Info(D)-Info_A(D)$$

where v is the number of discrete values in attribute A and $|D_j|/|D|$ acts as the weight of the j-th partition. $Info(D)$ is the average amount of information needed to identify the class label of a tuple in D, and $Info_A(D)$ is the expected information required to classify a tuple from D based on the partitioning by A. The attribute A with the highest information gain, $Gain(A)$, is chosen as the splitting attribute at node N.

*Gain ratio*:

$$SplitInfo_A(D)=-\sum_{j=1}^v \frac{|D_j|}{|D|}\times \log_2\left(\frac{|D_j|}{|D|}\right)$$

where v is the number of discrete values in attribute A and $|D_j|/|D|$ acts as the weight of the j-th partition. The gain ratio is defined as

$$GainRatio(A)=\frac{Gain(A)}{SplitInfo_A(D)}$$

The attribute with the highest gain ratio is chosen as the splitting attribute.

*Gini index*:

The Gini index is given by

$$Gini(D)=1-\sum_{i=1}^m p_i^2$$

where $p_i$ is the probability that a tuple in D belongs to class $C_i$. If a binary split on attribute A partitions the data D into $D_1$ and $D_2$, the Gini index of D is

$$Gini_A(D)=\frac{|D_1|}{|D|}Gini(D_1)+\frac{|D_2|}{|D|}Gini(D_2)$$

For a discrete-valued attribute, the subset that gives the minimum Gini index is selected as the splitting subset. For a continuous-valued attribute, each pair of adjacent values is considered as a possible split point, and the point with the smaller Gini index is chosen as the splitting point.

$$\Delta Gini(A)=Gini(D)-Gini_A(D)$$

The attribute with the minimum Gini index (equivalently, the largest reduction $\Delta Gini(A)$) is chosen as the splitting attribute.

*3.2.3. Random Forest*

A **random forest** consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction, and the class with the most votes becomes the model's prediction.

**Figure 5: Random forest illustration**
![](https://i.imgur.com/ioXkIyj.png)
Source: https://www.datacamp.com/community/tutorials/random-forests-classifier-python

The random forest algorithm is as follows:

- Select random samples from the given dataset.
- Construct a decision tree for each sample and get a prediction from each tree.
- Perform a vote over the predicted results.
- Select the prediction with the most votes as the final prediction.

Random forest is considered a highly accurate and robust method, since a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models (the **wisdom of crowds**).
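A minimal scikit-learn sketch of the two tree-based classifiers, assuming the same prepared `X_train`, `y_train` and `X_test`; the Gini criterion mirrors the Gini index formulas above, and the hyperparameters shown are illustrative defaults rather than the values tuned in the assignment.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Single tree, split on the Gini index (Section 3.2.2)
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X_train, y_train)
y_pred_tree = tree.predict(X_test)

# Ensemble of trees voting on the final class (Section 3.2.3)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
y_pred_forest = forest.predict(X_test)
```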
**3.3. Assessment of methods and other noteworthy aspects**

The Bank Regulatory data is surprisingly detailed for the 2000-2010 period (most of the missing values occur in 2011-2020), so I divide the sample into a training period of 2000-2010 and use the 2011-2020 period for testing. Some of the machine learning approaches require that all data points be present in the training data, so I deal with missing values using the "pad" (forward-fill) and "backfill" methods.

For the logistic regression as well as the other machine learning approaches, the goodness of fit of the model is assessed with the confusion matrix and related metrics, namely accuracy, precision and recall.

**Figure 6: Confusion matrix**
![](https://i.imgur.com/JjekQPl.png)
source: https://computersciencesource.wordpress.com/2010/01/07/year-2-machine-learning-confusion-matrix/

The **confusion matrix** relates the actual values in the dataset (the actual numbers of failed and surviving banks in this assignment) to the values predicted by the models. The diagonal entries are the correct predictions: TP (TN) is True Positive (True Negative), the number of truly failed (surviving) banks that are predicted correctly by the model. The higher the diagonal values, the better the model, since it classifies more observations correctly. FP (FN) is False Positive (False Negative), which represents the number of surviving (failed) banks incorrectly identified as failed (surviving) institutions.

From the confusion matrix, we can calculate other goodness-of-fit metrics for our binary classification approaches as follows:

$$\text{Accuracy}=\frac{\text{True 1} + \text{True 0}}{\text{True 1} + \text{False 1} + \text{True 0} + \text{False 0}}$$

$$\text{Precision}=\frac{\text{True 1}}{\text{True 1} + \text{False 1}}$$

$$\text{Recall}=\frac{\text{True 1}}{\text{True 1} + \text{False 0}}$$

**Accuracy** measures how often the method correctly identifies both positives and negatives. While it is an important indicator, we actually care more about correctly identified bank failures (with potential implications for policy and early responses to protect the financial system) than about the safe-and-sound banks (which we simply leave alone). Thus, the other two metrics are our main indicators of concern.

**Precision** is very important, and we want it to be as high as possible. It shows how many of the failures predicted by our methods (True 1 + False 1) are correctly identified. If precision is low, our models produce many noisy predictions. In addition, if regulators or policy-makers make decisions based on the model's forecasts, resources would be wasted on intervening in, taking action against, or restructuring perfectly sound banks (False 1).

Finally, **Recall** shows how many of the truly failed banks (True 1 + False 0) are correctly identified by our benchmark model and machine learning approaches. This is also crucial, considering that failed banks may have contagion effects on other financial institutions during a crisis, which might worsen the crisis and threaten the whole economy (Iyer and Peydro, 2011).

Since the number of features used to predict bank failure is large (28 variables), it is reasonable to assume that some features are more important than others in explaining bank failure. Too many features might hurt the efficiency of the prediction models and even reduce in-sample and out-of-sample precision. Therefore, it makes sense to consider dimensionality reduction and to use only the principal components in the logistic regressions as well as the other approaches. Thus, I also use PCA in my analysis of bank failure in this assignment.
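The evaluation setup could be sketched as follows, under these assumptions: the panel `df` holds the Table 1 features plus hypothetical `bank_id`, `bank_name`, `report_date`, `year` and `failure` columns, the split is by period as described above, and `y_pred` is the test-period prediction of whichever fitted model is being assessed.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

id_cols = ["bank_id", "bank_name", "report_date", "year", "failure"]  # hypothetical layout
features = [c for c in df.columns if c not in id_cols]

# Fill missing values within each bank's time series: pad (forward fill), then backfill
df = df.sort_values(["bank_id", "report_date"])
df[features] = df.groupby("bank_id")[features].ffill()
df[features] = df.groupby("bank_id")[features].bfill()

# Time-based split: train on 2000-2010, test on 2011-2020
train, test = df[df["year"] <= 2010], df[df["year"] >= 2011]
X_train, y_train = train[features], train["failure"]
X_test, y_test = test[features], test["failure"]

# Goodness-of-fit metrics for a fitted model's test-period predictions y_pred
print(confusion_matrix(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
```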
---

## 4. Results and analysis

In this part, I present the results of predicting U.S. bank failure using the benchmark logistic model as well as the chosen machine learning approaches. I first consider all features (even potentially irrelevant ones) and then deploy PCA to reduce the dimensionality and pick out the principal components. I expect the machine learning approaches to outperform the benchmark method, especially after PCA.

**4.1. Baseline results with all features**

Since all of the features used in the analysis are continuous variables, I first standardize the data (after dealing with the missing values). The benchmark approach, logistic regression, is used first to model and predict bank failure. Figure 7 shows the confusion matrices of the in-sample and out-of-sample tests using the benchmark approach.

**Figure 7: Confusion matrices of logistic regression**
![](https://i.imgur.com/CqHFJmw.png)

As can be seen from the figure, while the logistic regression predicts the number of surviving banks (failure = 0, True Negative) well, it does not do a very good job of capturing the number of failed banks in either the in-sample or the out-of-sample test. We can see this more clearly in the metrics reported in Table 3. The results show that logistic regression does a poor job of identifying failures of U.S. banks. While accuracy is high (over 60%), it can be attributed mostly to the accuracy in identifying surviving banks. Since we want the model to correctly identify failing banks so that they can be monitored, we should focus more on the other two metrics: precision and recall. For the in-sample test, 62.61% of all predicted failures are correct; for the out-of-sample test, however, only 21% of the predicted failures are true failures. The recall results paint a brighter picture for logistic regression: of the true failures in the data, the logit model identifies only 46% in the training data but correctly classifies nearly 55% in the test data.

I continue with the machine learning approaches on the same training and test data. The deployed methods are KNN, Decision Tree and Random Forest. The in-sample confusion matrices and metrics are shown in Figure 8:

**Figure 8: Confusion matrices of the machine learning methods: in-sample test**
![](https://i.imgur.com/eZkv8dX.png)

On the training data, the machine learning approaches seem to identify the failures and survivals quite accurately, as shown in Figure 8 and the in-sample part of Table 3.

**Table 3: Comparison of classification approaches**

| In-sample test | Logistic regression | KNN | Decision Tree | Random Forest |
|----------------|---------------------|--------|---------------|---------------|
| Accuracy | 0.6325 | 0.9459 | 1 | 1 |
| Precision | 0.6261 | 0.9422 | 1 | 1 |
| Recall | 0.4625 | 0.9378 | 1 | 1 |
| **Out-of-sample test** | **Logistic regression** | **KNN** | **Decision Tree** | **Random Forest** |
| Accuracy | 0.6125 | 0.5689 | 0.5111 | 0.7076 |
| Precision | 0.2109 | 0.1847 | 0.1574 | 0.2389 |
| Recall | 0.5482 | 0.5227 | 0.4961 | 0.407 |

However, the main focus of this assignment is how well the machine learning approaches fare on the test data (out-of-sample). Figure 9 and the second part of Table 3 provide the evidence for the machine learning methods:

**Figure 9: Confusion matrices of the machine learning methods: out-of-sample test**
![](https://i.imgur.com/67uPGUa.png)

Surprisingly, while the machine learning approaches seem to outperform the benchmark approach in the in-sample tests, they perform poorly out of sample. From the confusion matrices, one can clearly observe that the number of true failures predicted by each of the three machine learning methods (KNN: 944; Decision Tree: 896; Random Forest: 735) is lower than that of logistic regression (990). Based on the metrics in Table 3, the machine learning methods generally perform worse than logistic regression. Only Random Forest has higher accuracy than the benchmark approach (0.7076 compared with 0.6125); nevertheless, the higher accuracy comes mostly from the ability to correctly identify the surviving banks, which we are not really interested in. When comparing failure-predicting power, the Random Forest method is slightly better than logistic regression, as it reduces the number of false failure predictions, leading to slightly higher precision (24% versus 21%).

Hence, from this evidence, it seems that the machine learning approaches cannot outperform the benchmark logistic regression. This could stem from the dataset itself, since the data is not perfectly complete. Another reason could be that we are using too many features to predict the failures of U.S. banks: while some of them are relevant, others could be redundant and disturb the predictive power of the machine learning approaches. Methods such as KNN are very sensitive to irrelevant features and missing data, and "garbage in, garbage out" is a real concern when working with machine learning approaches. Therefore, I looked for ways to salvage and improve the predictive power of the machine learning approaches. In the next part, I deploy **Principal Component Analysis** to reduce the number of dimensions and transform the data accordingly, to see whether the results of the machine learning approaches improve.

**4.2. PCA and results with transformed data**

Principal Component Analysis (PCA) captures the intrinsic variability in the data and can be used to extract the most representative features. It is therefore appropriate in this case, where we are dealing with a large number of features, some of which might be irrelevant.
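A sketch of how this PCA step could be set up with scikit-learn is shown below, assuming the standardized feature matrices from Section 3.3; the scree plot and the choice of 11 components are discussed in the text that follows.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit the scaler and PCA on the training features only, then apply to both sets.
scaler = StandardScaler().fit(X_train)
pca = PCA().fit(scaler.transform(X_train))

# Scree plot (Figure 10) and eigenvalues of the standardized features
plt.plot(range(1, len(pca.explained_variance_) + 1),
         pca.explained_variance_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue")
plt.show()
print(pca.explained_variance_)

# Keep the components with eigenvalues above 1 (11 of them, as discussed below)
pca11 = PCA(n_components=11).fit(scaler.transform(X_train))
X_train_pca = pca11.transform(scaler.transform(X_train))
X_test_pca = pca11.transform(scaler.transform(X_test))
```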
I use the scree plot and the eigenvalues to identify the proper number of components to be used in the assignment. First, I fit PCA on all 28 features and draw the scree plot in Figure 10:

**Figure 10: Scree plot**
![](https://i.imgur.com/4l3vNEN.png)

The scree plot shows that from the 11th principal component onwards, the proportion of variance explained no longer changes significantly. I confirm this finding by printing the eigenvalues for all 28 components:

[2.84824604 2.39761947 2.18188574 1.62355253 1.59938311 1.41945655 1.23532715 1.17552904 1.03206914 1.01202493 1.00085801 0.99209842 0.98288084 0.95192083 0.89123117 0.88451024 0.83467573 0.80061457 0.7300205 0.71774639 0.58729431 0.56508673 0.36717429 0.35901305 0.32672882 0.25920953 0.159571 0.06434559]

We keep the principal components whose eigenvalues are greater than 1; there are 11 of them. Thus, I transform the training and test data onto the 11 principal components and run the tests again for the benchmark approach as well as the other machine learning methods. Figure 11 illustrates the confusion matrices of logistic regression:

**Figure 11: Confusion matrices after PCA: logistic regression**
![](https://i.imgur.com/dVJyHVV.png)

With the reduced features from PCA, the performance of logistic regression decreases slightly compared with the 28-feature specification in Section 4.1. The confusion matrices for the other machine learning approaches are shown in Figure 12.

**Figure 12: Confusion matrices after PCA: machine learning methods**
![](https://i.imgur.com/Yx4nh5w.png)

From the figures, the prediction results of the three machine learning approaches after the PCA transformation improve greatly compared with the original data of all 28 features. To see a clearer picture, Table 4 compares the metrics of the benchmark approach and the machine learning approaches.

**Table 4: Comparison of binary classification approaches after PCA**

| In-sample test | Logistic | KNN | Decision Tree | Random Forest |
|----------------|----------|--------|---------------|---------------|
| Accuracy | 0.6169 | 0.8578 | 1 | 1 |
| Precision | 0.6154 | 0.8544 | 1 | 1 |
| Recall | 0.405 | 0.826 | 1 | 1 |
| **Out-of-sample test** | **Logistic** | **KNN** | **Decision Tree** | **Random Forest** |
| Accuracy | 0.5154 | 0.8109 | 0.7005 | 0.8057 |
| Precision | 0.1563 | 0.4372 | 0.2964 | 0.4278 |
| Recall | 0.4845 | 0.7735 | 0.6805 | 0.7558 |

For the in-sample tests, KNN, Decision Tree and Random Forest still show high accuracy and precision, although KNN's metrics decrease slightly compared with Section 4.1. Nevertheless, all machine learning approaches see a great improvement in predictive power on the test data. The accuracy of both KNN and Random Forest is higher than 80%, and more than 40% of their predicted failures are correct (logistic regression achieves only 21% without PCA and nearly 16% with PCA), which is a remarkable change in precision. Within the sample of truly failed banks, all three machine learning approaches using PCA-transformed data manage to identify around 70% of the cases, much higher than the roughly 48% of the benchmark approach. Hence, with a better-prepared set of features, the machine learning approaches greatly improve their performance and remarkably outperform the logistic regression (benchmark) method.

---
## 5. Conclusions

In this assignment, I deploy several machine learning methods, namely K-Nearest Neighbors, Decision Tree and Random Forest, to predict the failures of U.S. banks over a 20-year period (2000 to 2020). With the raw data of 28 features covering different aspects of bank operation and performance, the machine learning approaches do not produce better out-of-sample failure predictions than the benchmark logistic regression. However, using PCA to transform the data significantly improves the accuracy, precision and predictive power of the machine learning approaches, to the point that they greatly outperform logistic regression on both in-sample and out-of-sample data.

Despite being a simple approach, KNN is surprisingly powerful and efficient at prediction. Of the other two approaches, Random Forest shows the superiority of the wisdom of crowds over a single Decision Tree, with clearly higher accuracy and precision in predicting failure among U.S. banks. Returning to our main interest in this assignment, choosing a model for bank failure prediction, the winner is KNN; the second-best model that could be considered for use in practice would be Random Forest.

From this assignment, I learned an important lesson in applying machine learning methods to binary classification: in order to deploy machine learning approaches efficiently, it is crucial to understand the pros and cons of each method and to make sure the data is of high quality.

## Reference list

1. Cleary, S. and Hebb, G.; 2016. An efficient and functional model for predicting bank distress: in and out of sample evidence. *Journal of Banking and Finance* 64, 101-111.
2. Cole, R. A. and Gunther, J. W.; 1995. Separating the likelihood and timing of bank failure. *Journal of Banking and Finance* 19(6), 1073-1089.
3. Davis, E. P. and Karim, D.; 2008. Comparing early warning systems for banking crises. *Journal of Financial Stability* 4(2), 89-120.
4. Demyanyk, Y. and Van Hemert, O.; 2011. Understanding the subprime mortgage crisis. *The Review of Financial Studies* 24(6), 1848-1880.
5. Fahlenbrach, R.; Prilmeier, R. and Stulz, R. M.; 2012. This time is the same: Using bank performance in 1998 to explain bank performance during the recent financial crisis. *The Journal of Finance* 67(6), 2139-2185.
6. Hasan, I.; Liu, L. and Zhang, G.; 2016. The determinants of global bank credit-default-swap spreads. *Journal of Financial Services Research* 50, 275-309.
7. Holopainen, M. and Sarlin, P.; 2017. Toward robust early-warning models: a horse race, ensembles and model uncertainty. *Quantitative Finance* 17(12), 1933-1963.
8. Iyer, R. and Peydro, J. L.; 2011. Interbank contagion at work: Evidence from a natural experiment. *The Review of Financial Studies* 24(4), 1337-1377.
9. Lambert, C.; Noth, F. and Schuwer, U.; 2017. How do insured deposits affect bank risk? Evidence from the 2008 Emergency Economic Stabilization Act. *Journal of Financial Intermediation* 29, 81-102.
10. Martin, D.; 1977. Early warning of bank failure: a logit regression approach. *Journal of Banking and Finance* 1, 249-276.
11. Piskorski, T.; Seru, A. and Vig, V.; 2010. Securitization and distressed loan renegotiation: evidence from the subprime mortgage crisis. *Journal of Financial Economics* 97(3), 369-397.
12. Oet, M. V.; Bianco, T.; Gramlich, D. and Ong, S. J.; 2013. SAFE: an early warning system for systemic banking risk. *Journal of Banking and Finance* 37(11), 4510-4533.
13. Samitas, A.; Kampouris, E. and Kenourgios, D.; 2020. Machine learning as an early warning system to predict financial crisis. *International Review of Financial Analysis* 71, 101507.
14. Shaddady, A. and Moore, T.; 2019. Investigation of the effects of financial regulation and supervision on bank stability: The application of CAMELS-DEA to quantile regressions. *Journal of International Financial Markets, Institutions and Money* 58, 96-116.

---

## Website reference

1. https://www.datacamp.com/community/tutorials/decision-tree-classification-python
2. https://towardsdatascience.com/k-nearest-neighbor-python-2fccc47d2a55
3. https://towardsdatascience.com/understanding-random-forest-58381e0602d2
4. https://www.datacamp.com/community/tutorials/random-forests-classifier-python
5. https://computersciencesource.wordpress.com/2010/01/07/year-2-machine-learning-confusion-matrix/
6. https://www.kdnuggets.com/2020/11/most-popular-distance-metrics-knn.html

---

## Important Files (on E3)