# 02450 Introduction to Machine Learning and Data Mining
---
# Report Project 2
**Group members**:
- Jesper Vang - 094536
- Ole Batting - s173894
- Marco Placenti - s202798
**Contribution table**:
Generally, all members contributed to all sections. The following table describes who was the responsible/main contributor for each section:
| Name | Section |
|- |- |
| Jesper Vang | 1 & 2 & 3 & 4 |
| Ole Batting | 1 & 2 & 3 & 4|
| Marco Placenti | 1 & 2 & 3 & 4|
---
## 1. Regression, part a
### 1.1
In this section we intend to predict the Baseline Histological Grading from all other features except the Baseline Histological Staging.
Baseline Histological Grading and Staging are usually measured by biopsy, so this is an attempt to predict the result of a biopsy with less invasive measures.
The PCA showed that the one-of-K coding introduced redundancy. This makes sense, since the features it was applied to are binary, and the last column of a one-of-K encoding is always linearly dependent on the others. Apart from this, there was little to be gained from the PCA.
All features have been normalized to zero mean and unit standard deviation.
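As a small illustration of this standardization (NumPy; the matrix below is made up for illustration, not our actual data):

```python
import numpy as np

# Illustrative feature matrix (rows = observations, columns = features)
X = np.array([[56.0, 29.0],
              [46.0, 33.0],
              [57.0, 28.0],
              [49.0, 34.0]])

# Standardize: subtract the column mean, divide by the column standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Each column of X_std now has zero mean and unit standard deviation
```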
### 1.2
A regularized linear regression is applied to the data, with a range of values for the regularization parameter λ: {10<sup>i</sup>, i = -2.0, -1.5, -1.0,..., 9.5}
With 10 fold cross validation the following generalization errors are estimated for the range of λ:

The figure shows little variation in the generalization error within a single fold, with some variation between folds. However, there is a minimum at λ = 10<sup>4</sup> with generalization error E<sub>test</sub> = 16.183.
The relatively large regularization parameter indicates that the best model has relatively low variance and high bias.
This is confirmed by the weights reported in the next subsection.
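The procedure can be sketched with scikit-learn's `Ridge`, where `alpha` plays the role of λ (the data here is a synthetic stand-in for our standardized features, and the variable names are ours):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # stand-in for the standardized features
y = rng.standard_normal(100)        # stand-in for Baseline Histological Grading

lambdas = np.power(10.0, np.arange(-2.0, 10.0, 0.5))  # {10^i, i = -2.0, ..., 9.5}
cv = KFold(n_splits=10, shuffle=True, random_state=0)

gen_errors = []
for lam in lambdas:
    fold_errors = []
    for train_idx, test_idx in cv.split(X):
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        residuals = y[test_idx] - model.predict(X[test_idx])
        fold_errors.append(np.mean(residuals ** 2))
    # Estimated generalization error for this lambda: mean over the 10 folds
    gen_errors.append(np.mean(fold_errors))

best_lambda = lambdas[int(np.argmin(gen_errors))]
```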
### 1.3
The RLR model predicts with the following weights:
| Feature | Weight |
|---------|--------|
|*Intercept*|9.77547475e+00|
|Age|-4.58731061e-02|
|Gender|3.27142021e-03|
|BMI|-3.73936066e-02|
|Fever|-3.89614286e-02|
|Nausea/Vomiting|-5.65415224e-02|
|Headache|-6.24458492e-03 |
|Diarrhea|4.44879754e-02|
|Fatique & generalized bone ache|-4.13577880e-02|
|Jaundice|-6.45948128e-03 |
|Epigastric pain|-6.21901166e-03 |
|WBC|6.91336795e-02|
|RBC|-1.69206005e-02|
|HGB|2.05823279e-02 |
|Plat|4.23526943e-02 |
|AST 1|-1.98311823e-02|
|ALT 1|-9.18390043e-03 |
|ALT 4|-2.67170913e-02 |
|ALT 12|-7.67489169e-03 |
|ALT 24|-2.76910624e-02|
|ALT 36|-1.20169154e-02 |
|ALT 48|3.52782908e-02 |
|ALT after 24 w|1.66782768e-02 |
|RNA Base|-1.23678834e-02|
|RNA 4|-6.11246534e-02 |
|RNA 12|-6.39455579e-02 |
|RNA EOT|-5.95580870e-02 |
|RNA EF|3.33081189e-02|
It appears that no feature contributes significantly to the prediction, and the intercept simply aims for the mean target value.
## 2 Regression, part b
### 2.1
Two-level K-fold cross-validation was applied to the data set, with the inner loop selecting among regularization parameters λ: {10<sup>i</sup>, i = -2.0, -1.5, -1.0,..., 9.5} and numbers of hidden units in an artificial neural network h: {1, 2, 3, 4, 5}, and the outer loop estimating the performance of the selected models.
The cross validation has been set to K<sub>1</sub> = K<sub>2</sub> = 5 to reduce computation time.
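Schematically, the two-level cross-validation looks as follows (a simplified NumPy sketch; `fit_fns` is a hypothetical list of model-fitting functions standing in for the ANN, RLR and baseline candidates):

```python
import numpy as np
from sklearn.model_selection import KFold

def nested_cv(X, y, fit_fns, K1=5, K2=5):
    """Two-level cross-validation: the inner loop selects a candidate by
    validation error, the outer loop estimates its generalization error
    (squared loss)."""
    outer_errors = []
    for par_idx, test_idx in KFold(n_splits=K1, shuffle=True, random_state=0).split(X):
        X_par, y_par = X[par_idx], y[par_idx]
        inner_errors = []
        for fit in fit_fns:  # inner loop: validation error of each candidate
            errs = []
            for tr_idx, val_idx in KFold(n_splits=K2, shuffle=True, random_state=0).split(X_par):
                predict = fit(X_par[tr_idx], y_par[tr_idx])
                errs.append(np.mean((y_par[val_idx] - predict(X_par[val_idx])) ** 2))
            inner_errors.append(np.mean(errs))
        best_fit = fit_fns[int(np.argmin(inner_errors))]
        predict = best_fit(X_par, y_par)  # retrain the winner on the whole partition
        outer_errors.append(np.mean((y[test_idx] - predict(X[test_idx])) ** 2))
    return outer_errors

# Hypothetical candidate: each entry maps (X, y) to a prediction function;
# this one is a mean-predicting baseline
baseline = lambda Xtr, ytr: (lambda Xt, m=ytr.mean(): np.full(len(Xt), m))

rng = np.random.default_rng(0)
X_demo, y_demo = rng.standard_normal((60, 3)), rng.standard_normal(60)
errors = nested_cv(X_demo, y_demo, [baseline])  # one test error per outer fold
```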
### 2.2
|Outer fold|ANN|<---|RLR|<---|baseline|
|-|-|-|-|-|-|
|i|h<sub>i</sub><sup>*</sup>|E<sub>i</sub><sup>test</sup>|λ<sub>i</sub><sup>*</sup>|E<sub>i</sub><sup>test</sup>|E<sub>i</sub><sup>test</sup>|
|1|1|18.305|3.162e+03|16.777|16.582|
|2|1|17.102|3.162e+03|15.437|15.392|
|3|1|18.259|3.162e+09|17.035|16.947|
|4|1|16.853|1.000e+04|16.065|16.115|
|5|1|17.193|3.162e+03|15.796|15.727|
At a glance, the performance looks poor: the ANN never beats the baseline, and the RLR outperforms it in only one fold (fold 4).
We also see that in every fold the ANN performed best with only one hidden unit.
For most folds, the selected regularization parameter is very similar to the value found in part a.
### 2.3
We run three pairwise statistical comparisons of the models using a correlated t-test at significance level α = 0.05. The null hypothesis is that there is no difference between the models; we fail to reject it when the p-value is above 0.05 and the confidence interval (CI) contains 0, and we reject it when the p-value is below 0.05 and the CI does not contain 0.
Comparison of baseline and RLR:
We get a p-value of 0.257 (25.7%) and a CI of (-0.216, 0.077). We thus cannot reject the null hypothesis of no difference in performance.
Comparison of baseline and ANN:
We get a p-value of 0.006 (0.6%) and a CI of (-2.061, -0.718). We thus reject the null hypothesis; the baseline is significantly better than the ANN.
Comparison of RLR and ANN:
We get a p-value of 0.003 (0.3%) and a CI of (-1.886, -0.755). We thus reject the null hypothesis; the RLR is significantly better than the ANN.
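A sketch of such a paired comparison with SciPy (the loss arrays here are synthetic; in the report they are the per-observation squared errors of the two models being compared):

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the per-observation losses of two models
rng = np.random.default_rng(0)
z_A = rng.normal(16.0, 2.0, 200)   # e.g. squared errors of the baseline
z_B = rng.normal(16.1, 2.0, 200)   # e.g. squared errors of the RLR

z = z_A - z_B                      # pairwise loss differences
alpha = 0.05
t_stat, p_value = stats.ttest_rel(z_A, z_B)              # two-sided paired t-test
ci = stats.t.interval(1 - alpha, df=len(z) - 1,
                      loc=z.mean(), scale=stats.sem(z))  # CI for the mean difference

# Reject the null hypothesis of equal performance if p_value < alpha
# (equivalently, if the confidence interval does not contain 0)
```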
To verify that our RLR and ANN implementations actually work as models, they were tested on a different data set (see Appendix). This confirmed that our data set is simply very difficult to perform regression on.
## 3 Classification
### 3.1 Problem Description & Chosen Algorithms
Given our dataset, the most valuable as well as achievable task is the prediction of the staging of the Hepatitis C virus given the same set of features used for the regression task.
This task is a multi-class problem, as we have four possible classes to predict, namely Portal Fibrosis (class 1), Few Septa (class 2), Many Septa (class 3) and Cirrhosis (class 4).
We did not have clear and well-defined expectations regarding the performance of the classifiers, because the exploratory data analysis showed no evidence of features clearly discriminating between classes. All features, as well as the target, showed an almost uniform distribution, and the feature distributions remained the same even when segmented by class.

Due to these peculiarities of the dataset, we chose the K Nearest Neighbors (KNN) algorithm as the method of our choice.
We believed that a Naive Bayes classifier would not perform well, as the distribution of our dataset could have confused it.
We also ruled out the Artificial Neural Network. Although it might capture the complex behaviour that seems intrinsic to our data, we have a very limited number of observations (roughly 1,300), which is too few for a neural network to learn a general function. We would also run the risk of overfitting, since exposing the same observations to the network too many times would inevitably create bias.
Hence, the decision was between KNN and a Decision Tree classifier. We opted for the former, believing that its simplicity would yield the best approximation.
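A minimal sketch of the chosen classifier with scikit-learn (the data here is a synthetic stand-in with four classes, as in the fibrosis staging problem; `n_neighbors` is the hyper-parameter K tuned by cross-validation):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 observations, 10 features, classes 1-4
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 10))
y_train = rng.integers(1, 5, 200)
X_test = rng.standard_normal((20, 10))

# K is the hyper-parameter selected in the nested cross-validation
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)   # one predicted stage per test observation
```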
### 3.2 Nested Cross-Validation
Before applying nested cross-validation for hyper-parameter tuning and estimation of the generalization error, we compiled a list of candidate hyper-parameters for each selected model. These lists were chosen to be of the same length to simplify the implementation.
|Algorithm|Hyper-parameter|Candidate values|
|-|-|-|
|KNN|K|1, 3, 5, 7, 9, 11, 13, 15|
|Logistic Regression|$\lambda$|0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 5.0, 25.0|
The results of the cross-validation were as follows:
|Outer fold|KNN|<---|LogReg|<---|baseline|
|-|-|-|-|-|-|
|i|K<sub>i</sub><sup>*</sup>|E<sub>i</sub><sup>test</sup>|λ<sub>i</sub><sup>*</sup>|E<sub>i</sub><sup>test</sup>|E<sub>i</sub><sup>test</sup>|
|1|1|0.791|25|0.770|0.741|
|2|11|0.770|0.1|0.770|0.741|
|3|1|0.741|25|0.676|0.741|
|4|11|0.748|25|0.799|0.741|
|5|1|0.741|25|0.777|0.741|
|6|9|0.703|0.1|0.717|0.739|
|7|15|0.732|25|0.703|0.739|
|8|1|0.761|25|0.674|0.739|
|9|15|0.725|0.1|0.746|0.732|
|10|5|0.732|5|0.768|0.732|
Neither model delivered good results - in fact both the KNN and the Logistic Regression perform on par with the majority-class baseline - with the Logistic Regression being <i>slightly</i> better than the KNN on average.
### 3.3 McNemar's Test
For each model, at each run on the partitioned data of the outer fold, we collected the predictions. We then concatenated the predictions together in order to perform a McNemar's test, which would tell us if one model was actually better than the others.
The results are as follows:
|Model 1|Model 2|p-value|Conf.Int. Lower Bound|Conf.Int Upper Bound|Response|
|-|-|-|-|-|-|
|KNN|Baseline|0.750|-0.0369|0.0253|No difference|
|KNN|Logistic Regression|0.820|-0.0356|0.0268|No difference|
|Baseline|Logistic Regression|0.962|-0.031| 0.0285|No difference|
In all cases, as anticipated given the poor, essentially random performance, neither the KNN nor the Logistic Regression represents an improvement over the majority-class baseline.
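The test itself can be sketched as follows (SciPy; the prediction arrays are synthetic stand-ins for the pooled outer-fold predictions, and we use the exact binomial form of McNemar's test):

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the concatenated outer-fold predictions of two models
rng = np.random.default_rng(0)
y_true = rng.integers(1, 5, 300)    # four classes, labelled 1-4
y_hat_A = rng.integers(1, 5, 300)
y_hat_B = rng.integers(1, 5, 300)

correct_A = y_hat_A == y_true
correct_B = y_hat_B == y_true
n12 = int(np.sum(correct_A & ~correct_B))   # A right, B wrong
n21 = int(np.sum(~correct_A & correct_B))   # A wrong, B right

# Exact McNemar test: a two-sided binomial test on the discordant pairs
p_value = stats.binomtest(n12, n12 + n21, 0.5).pvalue
```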
### 3.4 Logistic Regression
As per the requirements, we fit a logistic regression to the whole dataset, split into a train and a test set. We use $\lambda = 25$, since multiple folds in the nested cross-validation preferred this value for the hyperparameter.
Unsurprisingly, the result is another essentially random prediction, with an error, or misclassification rate, of $0.75$.
Logistic regression for multi-class problems computes one weight vector for each class. Each weight vector is multiplied with the input feature vector, and the resulting scalar scores are passed through a softmax function to obtain the probability that the input belongs to each class.
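As a worked illustration of this scoring step (NumPy; the numbers are illustrative, using only two features and the four classes):

```python
import numpy as np

# One weight vector (column) per class; softmax turns the scores into probabilities
W = np.array([[ 0.07, -0.06, -0.04,  0.03],   # weights for feature 1, classes 1-4
              [-0.01,  0.18, -0.03, -0.13]])  # weights for feature 2, classes 1-4
b = np.array([0.066, -0.058, -0.039, 0.030])  # intercepts, one per class
x = np.array([0.5, -1.2])                     # a standardized input vector

scores = x @ W + b                     # one scalar score per class
exp_s = np.exp(scores - scores.max())  # shift the scores for numerical stability
probs = exp_s / exp_s.sum()            # softmax: class probabilities summing to 1

predicted_class = int(np.argmax(probs)) + 1   # classes are labelled 1-4
```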
The weights for each class that we got are summed up in the following table
| Feature | Weights Class 1 | Weights Class 2 | Weights Class 3 | Weights Class 4 |
|---------|--------|--------|---------|--------|
|Age|-0.00449854|0.0644664|0.02969588|-0.08966374|
|Gender|-0.01153353|0.17603436|-0.03214119|-0.13235964|
|BMI|-0.01982349|-0.02220056|-0.01663909|0.05866314|
|Fever|0.0042284|0.01673376|-0.03225451|0.01129235|
|Nausea/Vomiting|0.05117326|-0.02593768|-0.03134782|0.00611224|
|Headache|-0.01244321|0.03666004|0.03770929|-0.06192612|
|Diarrhea|-0.00827827|0.02911074|0.05524369|-0.02060742|
|Fatique & generalized bone ache|-0.07717445|0.03485603|-0.02300283|-0.07607616|
|Jaundice|-0.0121613|0.0422731|0.02270156|0.06532125|
|Epigastric pain|0.00736047|-0.00793032|-0.03395387|-0.05281336|
|WBC|-0.02050536|0.02134535|-0.01595356|0.03452373|
|RBC|-0.00067017|-0.0285514|0.00468471|0.01511357|
|HGB|-0.01823141|0.02368913|0.03008133|0.02453686|
|Plat|-0.01220975|-0.05187925|-0.00757244|-0.03553904|
|AST 1|0.01149742|-0.05249221|0.00880215|0.07166144|
|ALT 1|0.00398356|0.03881301|-0.03482672|0.03219264|
|ALT 4|0.02485457|-0.13696401|0.1411447|-0.00796985|
|ALT 12|-0.01810768|-0.01972954|0.09669808|-0.02903526|
|ALT 24|-0.13296572|-0.03095795|0.09034594|-0.05886086|
|ALT 36|-0.01425375|-0.03566909|0.03739754|0.07357774|
|ALT 48|0.03957146|-0.08615759|0.02448796|0.01252531|
|ALT after 24 w|-0.07992948|0.15894209|-0.06529531|0.02209817|
|RNA Base|-0.14529853|0.14867774|-0.03996665|-0.01371731|
|RNA 4|0.15636006|-0.0017945|-0.13395813|0.03658744|
|*Intercept*|0.06616015|-0.05751915|-0.039126|0.030485|
As in the regression task, there is no clear evidence of any single feature being able to discriminate between classes. For classes 1 and 2, RNA Base has a somewhat larger impact, but nowhere near enough to be effective. Some features, such as RNA 4 for classes 1 and 3 and ALT 4 for classes 2 and 3, have unusually large (positive or negative) weights compared to the rest. However, we believe these coefficients are large due to random noise and not because an actual pattern has been identified.
## 4 Discussion
As we were not satisfied with the preliminary results in this part of the project, we decided to search through related publications.
While the dataset we used was only recently (2020) publicly released on UCI, we found six closely related articles, five of which were accepted within the last seven months. All the articles aim at predicting advanced liver fibrosis in chronic hepatitis C patients using some form of machine learning or ANN - thus essentially identical classification problems. For the sake of this discussion, it should be mentioned again that the dataset contains four stages of fibrosis (F1, F2, F3, F4) and that we decided to try to predict all four stages.
We noticed that almost all the publications chose to merge the four stages, using either (F1+F2), (F3+F4) or alternatives such as (F1) + (F2, F3, F4). By doing so, they reduce the problem from multi-class to binary classification, effectively asking a Yes/No question (very mild or severe fibrosis).
In S. Nandipati et al., Hepatitis C Virus (HCV) Prediction by Machine Learning Techniques (2020) [2], we noted that the group achieved similar results: their four-class problems generally reached accuracies between 21% and 28% using KNN, SVM, RF, GNB, NN, Bagging and Boosting (AdaBoost). Their results indicated to us that our low accuracy was not caused by an error in preprocessing, but perhaps by keeping all four class labels.
We then compared all the publications [1,2,3,4,5] and confirmed that our results were on par with other groups using all four class labels. However, we also found that one of the initial publications [1], from 2016, applied the entire collected data rather than the publicly available UCI data (training dataset n = 22690, test dataset n = 16877) and achieved 85.4% accuracy using KNN. We believe this is due to the data set being roughly 20 times larger, giving the model much more statistical power.
In a very recent publication from September 2020, Sarma et al. [4] successfully detected all four stages and achieved an accuracy of 97% using a supervised ANN; the results were evaluated by cross-validation using the holdout method. They used TensorFlow while we used PyTorch to build the neural net, and they added two distinct layers, "Dense" and "Dropout", which they state "helps break the classification problem". Their model achieves above 90% accuracy after 60 epochs and a further 7% with an additional 40 epochs. As this is outside the course constraints for the project, we could not replicate the results.
However, we found a conference paper by Ahammed et al. [6] that explained how a minor imbalance in the dataset affected the attainable accuracy in a four-class prediction. They proposed oversampling as a solution to the problem and reached a high accuracy of 94.4%.
We found it extremely interesting that these sparse and barely noticeable imbalances could be a cause of our unsatisfying 25% accuracy. We now understand that SMOTE (Synthetic Minority Oversampling Technique) can be applied as a data balancing technique, generating similar occurrences from the raw records. Basically, SMOTE uses a k-nearest-neighbour algorithm to generate synthetic data: it starts by picking a random sample from the minority class, finds its k nearest neighbours, and then forms a synthetic sample between the random sample and one randomly selected neighbour.

SMOTE thereby increases the number of samples in the minority classes to balance out the number of samples in the majority classes.
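The SMOTE idea described above can be sketched in plain NumPy (a simplified, illustrative implementation, not the one used in [6]):

```python
import numpy as np

def smote_sample(X_min, k=5, rng=np.random.default_rng(0)):
    """Generate one synthetic minority sample: pick a random minority point,
    then interpolate towards one of its k nearest minority neighbours."""
    i = rng.integers(len(X_min))                    # random minority sample
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]         # k nearest (skip the point itself)
    j = rng.choice(neighbours)                      # pick one neighbour at random
    gap = rng.random()                              # interpolation factor in [0, 1)
    return X_min[i] + gap * (X_min[j] - X_min[i])   # point on the connecting segment

# Hypothetical minority class with 30 samples and 4 features
X_minority = np.random.default_rng(1).standard_normal((30, 4))
synthetic = np.array([smote_sample(X_minority) for _ in range(10)])
```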
In the paper by Ahammed et al. [6], SMOTE was used eight times, generating the new samples based on the KNN strategy. This balancing and oversampling led to 1344, 1328, 1420 and 1448 patients in the portal fibrosis, few septa, many septa and cirrhosis stages respectively. In total, the initial 1385 Egyptian patients in the dataset were upsampled to 5540 "patients".
In regression, we saw that the tested models did not surpass the baseline, which suggests the data is not easily modelled with regression. However, the RLR had a fairly stable regularization parameter across the folds, as did the ANN with its number of hidden units. Of the tested models, the baseline and the RLR did not perform significantly differently, but both were significantly better than the ANN.
## Bibliography
[1] Accurate Prediction of Advanced Liver Fibrosis Using the Decision Tree Learning Algorithm in Chronic Hepatitis C Egyptian Patients, Gastroenterology Research and Practice, Volume 2016, Article ID 2636390, 7 pages. http://dx.doi.org/10.1155/2016/2636390
[2] Hepatitis C Virus (HCV) Prediction by Machine Learning Techniques, S. Nandipati et al., Application of Modelling and Simulation, Vol. 4, 89-100, March 2020. http://arqiipubl.com/ams, eISSN 2600-8084
[3] Applying Machine Learning to Evaluate for Fibrosis in Chronic Hepatitis C, Akella et al., medRxiv 2020.11.02.20224840, November 2020. doi: https://doi.org/10.1101/2020.11.02.20224840
[4] Artificial neural network model for hepatitis C stage detection, Sarma et al., EDU J. Computer & Electrical Eng. 01[01], 11-16, September 2020
[5] Comparison of Machine Learning Approaches for Prediction of Advanced Liver Fibrosis in Chronic Hepatitis C Patients, Hashem et al., IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017. DOI: 10.1109/TCBB.2017.2690848
[6] Predicting Infectious State of Hepatitis C Virus Affected Patient's Applying Machine Learning Methods, Ahammed et al., 2020 IEEE Region 10 Symposium (TENSYMP), 5-7 June 2020
## Appendix
Test of the regression implementation on a different data set: [Avocado dataset](https://www.kaggle.com/neuromusic/avocado-prices)
Results:
λ: {10<sup>i</sup>, i = -20, -19, -18,..., 2}
h: {1, 30, 60, 90, 120}
|Outer fold|ANN|<---|RLR|<---|baseline|
|-|-|-|-|-|-|
|i|h<sub>i</sub><sup>*</sup>|E<sub>i</sub><sup>test</sup>|λ<sub>i</sub><sup>*</sup>|E<sub>i</sub><sup>test</sup>|E<sub>i</sub><sup>test</sup>|
|1|60|0.073|1.000e-06|0.157|0.169|
|2|60|0.068|1.000e-06|0.148|0.159|
|3|30|0.067|1.000e-06|0.152|0.164|
|4|30|0.068|1.000e-07|0.146|0.157|
|5|30|0.072|1.000e-06|0.152|0.162|
**Baseline vs RLR:**
p-value: 2.2336177589928404e-05
CI: (0.009952460042973394, 0.012727332072279293)
**Baseline vs ANN:**
p-value: 1.7720383478908699e-06
CI: (0.08640727448834253, 0.09837829800529098)
**RLR vs ANN:**
p-value: 2.2111140388914977e-06
CI: (0.07550264267553837, 0.08660313770284243)