# 02450 Introduction to Machine Learning and Data Mining

---

# Report Project 1

**Group members**:
- Jesper Vang - 094536
- Ole Batting - s173894
- Marco Placenti - s202798

**Contribution table**:

| Name | Section |
|----------------|-----------|
| Jesper Vang | 2 & 3 |
| Ole Batting | 1 & 4 |
| Marco Placenti | 2, 3 & 4 |

### Links
- [GitHub Code Repository](https://github.com/flight505/Shared_Intro_ML)
- [Web Application](https://hepatitis-disease-prediction.herokuapp.com)

---

## 1. Dataset Description

Based on various medical information, we are interested in predicting the staging of liver fibrosis in patients. More specifically, we want to classify whether a patient with hepatitis C presents with portal fibrosis, few septa, many septa or cirrhosis. We found a dataset of Egyptian patients who underwent treatment dosages for HCV for about 18 months [here](https://archive.ics.uci.edu/ml/datasets/Hepatitis+C+Virus+%28HCV%29+for+Egyptian+patients).

[[Nasr et al. 2017]](https://www.researchgate.net/publication/323130913_A_novel_model_based_on_non_invasive_methods_for_prediction_of_liver_fibrosis) achieved 99.48% average accuracy on 5-fold cross validation with a prediction model based on the data. Their goal was to present a non-invasive alternative to serial liver biopsies in assessing fibrosis. They used a Subsumption Rule Based Classifier to evaluate rules from a discretization set, based on expert recommendations, to produce a decision tree.

We hope to produce a model that predicts staging, ideally with a reduced feature dimensionality (PCA will be applied to the full feature set). This classification task is our main machine learning aim. In addition, a regression will be performed with the histological grading as a quasi-continuous target. The prediction would then be a continuous value which, even though the target is discrete, would still be useful.
With both the classification and the regression, we hope to predict the level of liver fibrosis.

## 2. Detailed description of data and attributes

We start off with a description of all the variables in the dataset.

| Attribute | Label | Example | Discrete/Continuous | Nominal/Ordinal/Interval/Ratio |
|---------------------------------|-------------------------------------------|---------|---------------------|--------------------------------|
| Age | Age | 56 | Discrete | Ratio |
| Gender | Gender | 1 | Discrete | Nominal |
| BMI | Body mass index | 35 | Discrete | Ratio |
| Fever | Fever | 2 | Discrete | Nominal |
| Nausea/Vomting | Nausea/Vomiting | 1 | Discrete | Nominal |
| Headache | Headache | 1 | Discrete | Nominal |
| Diarrhea | Diarrhea | 1 | Discrete | Nominal |
| Fatigue & generalized bone ache | Fatigue & generalized bone ache | 2 | Discrete | Nominal |
| Jaundice | Jaundice | 2 | Discrete | Nominal |
| Epigastric pain | Epigastric pain | 2 | Discrete | Nominal |
| WBC | White blood cells | 7425 | Discrete | Ratio |
| RBC | Red blood cells | 4248807 | Discrete | Ratio |
| HGB | Hemoglobin | 14 | Discrete | Ratio |
| Plat | Platelets | 112132 | Discrete | Ratio |
| AST 1 | Aspartate transaminase ratio | 99 | Discrete | Ratio |
| ALT 1 | Alanine transaminase ratio 1 week | 84 | Discrete | Ratio |
| ALT 4 | Alanine transaminase ratio 4 weeks | 52 | Discrete | Ratio |
| ALT 12 | Alanine transaminase ratio 12 weeks | 109 | Discrete | Ratio |
| ALT 24 | Alanine transaminase ratio 24 weeks | 81 | Discrete | Ratio |
| ALT 36 | Alanine transaminase ratio 36 weeks | 5 | Discrete | Ratio |
| ALT 48 | Alanine transaminase ratio 48 weeks | 5 | Discrete | Ratio |
| ALT after 24 w | Alanine transaminase ratio after 24 weeks | 5 | Discrete | Ratio |
| RNA Base | RNA Base | 655330 | Discrete | Ratio |
| RNA 4 | RNA 4 | 634536 | Discrete | Ratio |
| RNA 12 | RNA 12 | 288194 | Discrete | Ratio |
| RNA EOT | RNA end-of-treatment | 5 | Discrete | Ratio |
| RNA EF | RNA Elongation Factor | 5 | Discrete | Ratio |
| Baseline histological Grading | Baseline histological Grading | 13 | Discrete | Interval |
| Baselinehistological staging | Baselinehistological staging | 2 | Discrete | Interval |

The dataset is well curated and has no missing values.

| Data Set Characteristics: | Attribute Characteristics: | Missing Values: |
|---------------------------|----------------------------|-----------------|
| Multivariate | Integer, Real | None |

It is also useful to see the ranges of values the attributes take.

| | count | mean | std | min | 25% | 50% | 75% | max |
|-------------------------------|-------|----------|------------|---------|---------|---------|---------|---------|
| Age | 1385 | 4.63E+01 | 8.781506 | 32 | 39 | 46 | 54 | 61 |
| Gender | 1385 | 1.49E+00 | 0.500071 | 1 | 1 | 1 | 2 | 2 |
| BMI | 1385 | 2.86E+01 | 4.076215 | 22 | 25 | 29 | 32 | 35 |
| Fever | 1385 | 1.52E+00 | 0.499939 | 1 | 1 | 2 | 2 | 2 |
| Nausea/Vomting | 1385 | 1.50E+00 | 0.500174 | 1 | 1 | 2 | 2 | 2 |
| Headache | 1385 | 1.50E+00 | 0.500165 | 1 | 1 | 1 | 2 | 2 |
| Diarrhea | 1385 | 1.50E+00 | 0.500174 | 1 | 1 | 2 | 2 | 2 |
| Fatigue&generalized_bone | 1385 | 1.50E+00 | 0.500179 | 1 | 1 | 1 | 2 | 2 |
| Jaundice | 1385 | 1.50E+00 | 0.500179 | 1 | 1 | 2 | 2 | 2 |
| Epigastric_pain | 1385 | 1.50E+00 | 5.00E-01 | 1 | 1 | 2 | 2 | 2 |
| WBC | 1385 | 7.53E+03 | 2668.22033 | 2991 | 5219 | 7498 | 9902 | 12101 |
| RBC | 1385 | 4.42E+06 | 346357.712 | 3816422 | 4121374 | 4438465 | 4721279 | 5018451 |
| HGB | 1385 | 1.26E+01 | 1.713511 | 10 | 11 | 13 | 14 | 15 |
| Plat | 1385 | 1.58E+05 | 38794.7856 | 93013 | 124479 | 157916 | 190314 | 226464 |
| AST_1 | 1385 | 8.28E+01 | 2.60E+01 | 39 | 60 | 83 | 105 | 128 |
| ALT_1 | 1385 | 8.39E+01 | 2.59E+01 | 39 | 62 | 83 | 106 | 128 |
| ALT4 | 1385 | 8.34E+01 | 26.52973 | 39 | 61 | 82 | 107 | 128 |
| ALT_12 | 1385 | 8.35E+01 | 2.61E+01 | 39 | 60 | 84 | 106 | 128 |
| ALT_24 | 1385 | 8.37E+01 | 2.62E+01 | 39 | 61 | 83 | 107 | 128 |
| ALT_36 | 1385 | 8.31E+01 | 2.64E+01 | 5 | 61 | 84 | 106 | 128 |
| ALT_48 | 1385 | 8.36E+01 | 2.62E+01 | 5 | 61 | 83 | 106 | 128 |
| ALT_after_24_w | 1385 | 3.34E+01 | 7.073569 | 5 | 28 | 34 | 40 | 45 |
| RNA_Base | 1385 | 5.91E+05 | 3.54E+05 | 11 | 269253 | 593103 | 886791 | 1201086 |
| RNA_4 | 1385 | 6.01E+05 | 3.62E+05 | 5 | 270893 | 597869 | 909093 | 1201715 |
| RNA_12 | 1385 | 2.89E+05 | 2.85E+05 | 5 | 5 | 234359 | 524819 | 3731527 |
| RNA_EOT | 1385 | 2.88E+05 | 2.65E+05 | 5 | 5 | 251376 | 517806 | 808450 |
| RNA_EF | 1385 | 2.91E+05 | 2.68E+05 | 5 | 5 | 244049 | 527864 | 810333 |
| Baseline_histological_Grading | 1385 | 9.76E+00 | 4.023896 | 3 | 6 | 10 | 13 | 16 |
| Baselinehistological_staging | 1385 | 2.54E+00 | 1.12E+00 | 1 | 2 | 3 | 4 | 4 |

## 3. Data visualization and PCA

In order to get an initial understanding of the data, we plotted the density distributions of the variables that are medical measurements, such as the number of red blood cells, ALT (alanine transaminase), RNA and so on. For simplicity of representation, we also included the age in this category even though it is not a medical measurement.

![](https://i.imgur.com/kwniGlN.png)

Overall, the features are evenly and uniformly distributed. However, a few peculiarities and potential issues came up. We noticed that some of the *alanine transaminase (ALT)* measurements had outliers in the leftmost part of the distribution. Since these are very few data points, the issue is easily addressed by replacing those values with the mean of the distribution. A bigger problem was identified in the last three measurements of the last row (reading from left to right), where about 28\% (385 out of 1385) of the records show unrealistic values. This can also be seen in the statistical summary table shown in the previous section. Due to a lack of documentation by the data curators, we could not determine why.
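The two checks described above can be sketched as follows. This is a minimal illustration on synthetic values, not the code used for this report: the function names, the cut-off of 39 (the minimum seen in the other ALT columns) and the placeholder value 5 are assumptions made for the example, as is the choice to compute the mean from the non-outlying values only.

```python
import numpy as np


def replace_low_outliers_with_mean(values, lower_bound):
    """Replace entries below `lower_bound` with the mean of the remaining entries.

    Assumption: the mean is taken over the non-outlying values only.
    """
    values = np.asarray(values, dtype=float)
    is_outlier = values < lower_bound
    if is_outlier.all():
        raise ValueError("every value is below the lower bound")
    cleaned = values.copy()
    cleaned[is_outlier] = values[~is_outlier].mean()
    return cleaned


def placeholder_fraction(values, placeholder=5):
    """Return the share of entries equal to the suspicious placeholder value."""
    values = np.asarray(values)
    return float(np.mean(values == placeholder))


# Synthetic ALT-like column: one implausibly low reading among normal ones.
alt_36 = [61, 84, 5, 106, 128]
alt_36_clean = replace_low_outliers_with_mean(alt_36, lower_bound=39)

# Synthetic RNA-like column in which 2 of 8 readings carry the placeholder.
rna_eot = [5, 251376, 517806, 5, 808450, 123456, 400000, 300000]
rna_eot_share = placeholder_fraction(rna_eot)
```

With the real dataset loaded, the same functions could be applied to the affected ALT and RNA columns to quantify how many records would be overwritten by a mean replacement.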
The aforementioned technique of replacing the outliers with the mean of the distribution does not seem suitable here, because the outliers represent a significant part of the data. Therefore, we might decide to discard these features altogether. Additionally, the hemoglobin measurement - which takes only six different values throughout the dataset - might as well be represented as a categorical variable when designing the model.

Furthermore, we checked whether any of these features could help us discriminate between classes. On the y-axis we placed the *Histological staging*, while on the x-axis we placed the same features used in the previous plot. We see no clear clusters, and the feature values are well distributed among the classes.

![](https://i.imgur.com/sM4lJsx.png)

<br>

For completeness, we also plot the counts of the categorical features. These show the distribution of genders, BMIs and several symptoms. Here, too, we see a uniform distribution and no class imbalances.

![](https://i.imgur.com/dOTbZb8.png)

Just as before, we check whether any of the symptoms, the BMI or the gender can be used to discriminate between classes.

![](https://i.imgur.com/gtQb74I.png)

We see that this is not the case for these features either.

<br>

For the classification task, our dataset does not show any sign of class imbalance. All four possible classes are roughly uniformly distributed.

![](https://i.imgur.com/CVqSYoG.png)

Although we have decided to use our dataset for a classification task, as mentioned in the introduction, it could easily be turned into a regression task. In fact, together with the *Histological staging* we are provided with the *Histological grading* as well. This measure indicates how quickly the disease is spreading - and for hepatitis is expressed through a value ranging from 1 to 16.
Since this range is four times larger than that of the staging, it is better suited for a regression task.

![](https://i.imgur.com/kmITjvJ.png)

Due to the nature of the dataset, with some attributes having very large ranges and others being categorical, some data manipulation was required before applying PCA. Eight of the twenty-seven attributes were one-hot encoded (namely, all the categorical features except the BMI). Each of these binary attributes becomes two columns, which brings the final dimensionality to thirty-five attributes (27 - 8 + 16 = 35). The data was then standardized, so that each column has zero mean and a standard deviation of one. The result of the PCA can be seen in the following plot.

![Figure 1](https://i.imgur.com/f57YRRp.png)

The analysis shows that twenty-three components are needed to retain at least 90\% of the variance in the data. Reducing the number of components - and hence the dimensionality of the data - causes an almost constant increase in information loss with each component removed. Dimensionality reduction therefore does not seem to suit the chosen data very well. The cross-correlation among features confirms this: there are no correlated attributes that would create redundancy in the dataset.

![](https://i.imgur.com/KhQK2wj.png)

In the above cross-correlation matrix we see, as expected, a perfect correlation on the main diagonal and a perfect negative correlation between the binary variables that have been one-hot encoded. We also see a slight correlation between the last three columns, for the reasons explained before: a high frequency of outliers and corrupted data, which we will remove.

## 4. Discussion of what we have learnt

We found that [[Nasr et al. 2017]](https://www.researchgate.net/publication/323130913_A_novel_model_based_on_non_invasive_methods_for_prediction_of_liver_fibrosis) achieved 99.48% average accuracy on 5-fold cross validation from a Subsumption Rule Based Classifier with staging as target.
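As a reference for the PCA procedure in Section 3, the steps (one-hot encoding, standardization, and counting the components needed to retain 90% of the variance) can be sketched as below. This is an illustrative sketch on synthetic data, not the code from our repository: the helper names `one_hot`, `standardize` and `components_for_variance` are hypothetical, and the four toy attributes only mimic the value ranges of the real ones.

```python
import numpy as np


def one_hot(column):
    """One-hot encode a 1-D array of category codes (one column per distinct code)."""
    column = np.asarray(column)
    categories = np.unique(column)
    return (column[:, None] == categories[None, :]).astype(float)


def standardize(X):
    """Scale every column to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)


def components_for_variance(X_std, threshold=0.90):
    """Return (k, cumulative): k is the smallest number of principal components
    whose cumulative explained variance reaches `threshold`."""
    # For centered data, squared singular values are proportional to the
    # variance captured by each principal component.
    singular_values = np.linalg.svd(X_std, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative


# Synthetic stand-in for the dataset (assumption: values drawn uniformly).
rng = np.random.default_rng(0)
n = 200
gender = rng.integers(1, 3, size=n)      # binary code, 1 or 2
fever = rng.integers(1, 3, size=n)       # binary code, 1 or 2
age = rng.integers(32, 62, size=n)
wbc = rng.integers(2991, 12102, size=n)

# Two binary attributes become four columns; the numeric ones stay as they are.
X = np.column_stack([one_hot(gender), one_hot(fever), age, wbc])
X_std = standardize(X)
k, cumulative = components_for_variance(X_std)
print(f"{k} of {X.shape[1]} components retain at least 90% of the variance")
```

Applied to the real data, `X` would have thirty-five columns after encoding. The SVD is used here instead of a library PCA class only to keep the sketch dependency-light.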
Overall, the dataset is very well curated despite some small inconsistencies in some of the measurements. It is still very difficult to say whether our machine learning aim will deliver good performance, since we have not seen anything that suggests an obvious discrimination criterion. In fact, all features seem to be uniformly distributed - even among the four stages we set as a target.

## 5. References

[Nasr et al. 2017] M. Nasr, M. Hamdy, S. M. Kamal and K. Elbahnasy: "A novel model based on non invasive methods for prediction of liver fibrosis", ResearchGate, December 2017.