# Transfer Learning for COVID-19 Detection

##### By Maarten van 't Wout (4341414), Joost Reniers (4368711) and Erik Vester (4305388)

![](https://i.imgur.com/IcsdQvh.png)
*[Image Source](https://www.nreionline.com/)*

**GitHub & Colab**

[![](https://i.imgur.com/PepVoFr.png)](https://github.com/jjcreniers/Covid19Detection) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1x1hfAmBbmBZwFxbEMPaMBbBxhb_Eqyt1?usp=sharing)

Over the past months the world has been gripped by the disease caused by the new coronavirus (COVID-19). In the Netherlands alone the virus has infected more than 48,000 people (as of May 11th 2020) and caused the death of more than 6,000 people ([**RIVM**](https://www.rivm.nl/coronavirus-covid-19/actueel)). The Dutch health authorities have been struggling to supply enough testing equipment to detect whether an individual is infected by the deadly virus. This lack of testing is troublesome for the fight against the spread of the virus and for the treatment of those infected. In order to combat this we have looked into creating a deep learning based solution for COVID-19 detection. Due to the lack of COVID-19 medical image data, ways have to be found to effectively apply deep learning. By using transfer learning and fine-tuning it might be possible to cope with this shortage of data. In this post we will compare two main transfer learning strategies and show promising results (spoiler: 98% accuracy on our test set). By doing this, we hope to answer the following research question:

*"Is transfer learning from a network that has been trained on data from the same specific domain - for example X-ray images - better than a network trained on a larger, more general dataset, such as ImageNet?"*

### *So, what is actually shown in this blogpost?*

In order to answer the research question, several different experiments are performed. These experiments will give an insight into which method of transfer learning - from a specific comparable task or from a broader task - yields a better model for detecting COVID-19. Furthermore, the results will tell us more about the possible importance of similarity of domains when doing transfer learning. For example, a model pre-trained on an X-ray dataset has a much higher similarity to the task of COVID-19 detection than a model pre-trained on ImageNet. Does this have an effect on the obtained results? This will be shown by two experiments with transfer learning on DenseNet models. These two methods of transfer learning are shortly introduced below and the naming for these two models will be used throughout the rest of the blogpost:

1. **ImageNet-Model:** A DenseNet121 model which is first trained on ImageNet and then used for transfer learning on our Covid-19 dataset.
2. **CheXNet-Model:** A DenseNet121 model which is first trained on ImageNet, then on the CheXpert dataset - a large dataset of chest X-ray images with 14 different classes - and only then used for transfer learning on our Covid-19 dataset.

In this blogpost we will first give a short introduction on Covid-19 detection through X-ray images and discuss some related work. This will be followed by an overview of the used datasets, after which the used network (DenseNet) will be introduced. Furthermore, an introduction to transfer learning will be given as well as an overview of how we implemented this in code for this experiment.
Hyper-parameters will be tuned for both the above mentioned ImageNet-Model and CheXNet-Model to make sure that there is a fair comparison between these two models. Lastly, the results for the best hyper-parameter settings will be discussed and the two methods of transfer learning will be compared.

# X-ray images for COVID-19 detection

Polymerase chain reaction (PCR) tests are the gold standard for virus detection and the article by [**Emily Waltz [2020]**](https://spectrum.ieee.org/the-human-os/biomedical/diagnostics/testing-tests-which-covid19-tests-are-most-accurate) tells us that PCR tests correctly identify 100% of positive samples and 96% of negative samples. However, that is in laboratory settings. It is difficult to determine what the real-world detection rate is, but the research performed by [**Adam Sturts [2020]**](https://www.mdmag.com/medical-news/comparing-rt-pcr-and-chest-ct-for-diagnosing-covid19) suggests the clinical sensitivity is about 66 to 80%. This article also suggests that the sensitivity might increase by complementing PCR tests with analysis of chest images by radiologists. Furthermore, research by [**Tao Ai et al. [2020]**](https://pubs.rsna.org/doi/10.1148/radiol.2020200642) concludes that CT scans have a high sensitivity for COVID-19 and could be used as a primary tool for detection of this virus.

To detect COVID-19 it is important to be able to differentiate between a COVID-19 infection and a regular pneumonia caused by bacteria or other viruses. The research by [**Harrison X. Bai et al. [2020]**](https://pubs.rsna.org/doi/full/10.1148/radiol.2020200823) shows that even experienced radiologists can make mistakes in distinguishing these. By using a deep neural network we hope to help improve the performance of radiologists and offer an extra method for professionals to check their diagnosis. However, the network is not meant as a replacement for radiologists and we still recommend performing PCR tests for more reliable results.

![](https://i.imgur.com/r3LegGg.png)
###### *Illustrative Examples of Chest X-Rays in Patients with Pneumonia from ["Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning"](https://www.cell.com/cell/fulltext/S0092-8674(18)30154-5)*

# Related Work

Since COVID is a worldwide pandemic, work has been done on this topic all over the world. In order to give an overview of what has already been done, we present some related work from other researchers. For example, the work in the blog post by [**Daksh Treha [2020]**](https://towardsdatascience.com/detecting-covid-19-using-deep-learning-262956b6f981) shows COVID detection on X-ray images. For this, Daksh Treha uses a dataset from Kaggle and a dataset from Joseph Paul Cohen. You will read more about these datasets later on. Besides detecting COVID, there have been researchers like [**Yuan Tian [2018]**](https://becominghuman.ai/detecting-pneumonia-with-deep-learning-3cf49b640c14) focusing on the detection of pneumonia by using Deep Learning. Here the 34-layer ResNet (ResNet34) is used, which is pre-trained on the ImageNet classification dataset and afterwards transfer learned on the Kaggle dataset mentioned before. The Kaggle dataset is used for both healthy and unhealthy lungs. Also, in the Netherlands [**Delft Imaging**](https://www.delft.care/covid-19-publications/) has developed a COVID-19 diagnosis algorithm, CAD4COVID, which can detect COVID and also diagnose the current status of the disease.
The basis for this algorithm is the tuberculosis diagnosis tool they developed for tuberculosis screening in Africa. One of the more interesting papers we came across is the one from [**Pranav Rajpurkar, Jeremy Irvin et al. [2020]**](https://stanfordmlgroup.github.io/projects/chexnet/) from the Stanford research group. It introduces a model called CheXNet, a pneumonia detection algorithm, together with a new dataset; more on this later. The blog post by [**Manu Joseph [2020]**](https://towardsdatascience.com/does-imagenet-pretraining-work-for-chest-radiography-images-covid-19-2e2d9f5f0875) is comparable to ours. Here, research has gone into determining the best architecture for COVID detection. DenseNet-121, Xception and a combination of both are pretrained on ImageNet and analyzed with Grad-CAM and t-SNE visualizations. A quote from the author: ***'On the larger point, with which we have started this whole exercise with, I think we can safely say that ImageNet pretraining does help in classification of Chest Radiography images of COVID-19 (I did try to train DenseNet without pretrained weights on the same dataset without much sucess).'***

As you can see in the table below, Manu Joseph got compelling results from his approach using ensembles. It is not within the scope of this research to ensemble multiple models, but it would be interesting to see how our results compare.

![](https://i.imgur.com/UoQeXaX.png)

Furthermore, in this blogpost we are researching whether pre-training on ImageNet or on a more task-specific dataset is better for transfer learning. Previous work by [**Minyoung Huh et al. [2016]**](https://arxiv.org/pdf/1608.08614.pdf) from UC Berkeley already researched why ImageNet is good for transfer learning. Lastly, on GitHub there is an extensive list of COVID-19 related papers which can be found here: [**COVID-19 Imaging AI Paper List**](https://github.com/HzFu/COVID19_imaging_AI_paper_list).

# Assembling the dataset

Before we can run our experiments, we first have to assemble a dataset. The final dataset is composed of two classes, each consisting of X-ray images, one with the label 'COVID' and the other with the label 'No-COVID'. This dataset is created by combining two separate datasets. One dataset is made by Joseph Paul Cohen and contains the COVID-19 X-ray images. The other dataset contains images of healthy patients and patients with non-COVID-19 related pneumonia, which are extracted from the CheXpert dataset from the Stanford Medical Center.

## Evaluated databases

Before we decided on the two datasets mentioned above, we considered 4 datasets in total:

- Joseph Paul Cohen's COVID-19 dataset
- Chest X-ray Pneumonia dataset
- National Institute of Health (NIH) dataset
- Stanford CheXpert dataset

In the following paragraphs, we will describe how these datasets were evaluated to obtain the two classes of our dataset.

### COVID

**COVID-19 Image Data Collection**

Since there is not a lot of public COVID-19 data available for chest X-ray images, Joseph Paul Cohen decided to create [**an open dataset on GitHub**](https://github.com/ieee8023/covid-chestxray-dataset) containing data of patients diagnosed with COVID-19 and data from patients with comparable diagnoses, for example SARS and MERS. Most of this data is derived from publications and the dataset is updated regularly.

**From this database 138 images containing data from COVID-19 patients are used for our dataset.**

### Non-COVID

**Chest X-ray Pneumonia dataset**

A dataset that is commonly used to train Deep Neural Nets on COVID-19 detection is the [**Chest X-ray Pneumonia**](https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia) dataset, which contains X-ray images of pneumonia caused by both viruses and bacteria. The dataset consists of labeled chest X-ray images (AP) selected from patients of one to five years old from Guangzhou Women and Children’s Medical Center. Since the images are from children, we decided not to use this dataset for our detection method. **When trained on this database, the network might become biased towards detecting children instead of non-COVID-19 related pneumonia.**

**National Institute of Health (NIH) dataset**

The [**NIH dataset**](https://www.kaggle.com/nih-chest-xrays/data) consists of 112,120 labeled X-ray images from 30,805 patients. The labels are made by analyzing the associated radiologic reports and are expected to have an accuracy of more than 90%. This is a huge dataset compared to what was previously available. However, after reading the article by [**Luke Oakden-Rayner [2019]**](https://lukeoakdenrayner.wordpress.com/2019/02/25/half-a-million-x-rays-first-impressions-of-the-stanford-and-mit-chest-x-ray-datasets/), we think the CheXpert dataset might be a better option.

**CheXpert dataset**

The CheXpert dataset is a very large dataset containing 224,316 images from 65,240 patients. The data is collected from Stanford Hospital and is derived from radiology reports dating from October 2002 to July 2017. Based on these dates it is impossible for the dataset to contain any COVID-19 data. Each report was analyzed by an automated rule-based labeler to extract observations from the text. These observations were used as a structured label for the images. The reports are labeled for the presence of 14 observations as positive, negative or uncertain. A part of the labels of the dataset was then compared with the classification of 8 certified radiologists. The dataset achieves an AUC between 0.85 and 0.97 on the different observations, which means the labels are reliable. According to the article by Luke Oakden-Rayner, the labels of the CheXpert dataset are better defined and more clinically relevant than those of the NIH dataset. There are still minor flaws in the labels of the dataset, but this seems inevitable with datasets of this size.

**From this database 680 images from healthy patients and 680 images from patients with pneumonia (viral and bacterial) are used for our dataset.**

## Processing the datasets

### Non-Covid:

By registering on [**the CheXpert website**](https://stanfordmlgroup.github.io/competitions/chexpert/) you get access to the CheXpert database. The downloaded folder contains many folders with images and one Excel file that lists all the image paths and a one-hot encoding of what is visible in each image. First, the Excel sheet is filtered twice to create two new files: one with only the image paths of the healthy patients and one with only the image paths of the patients with pneumonia. Another criterion that is filtered on is how the images were taken. Only the anteroposterior (AP) views are used; these are images taken from front to back. We chose the AP view since this is a frequently used way of evaluating severely ill patients. **By evaluating only the AP view, the network will not become biased on the position from which the X-ray images were taken.**
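
As a rough sketch (not the exact script we used), this filtering step could look as follows with pandas. The column names ('Frontal/Lateral', 'AP/PA', 'No Finding', 'Pneumonia') come from the CheXpert release; the output file for the healthy patients matches the path used in the copy script below, while the pneumonia file name is only illustrative.

```python=
import pandas as pd

# Sketch of the filtering step; column names follow the CheXpert release
df = pd.read_csv('E:/CheXpert-v1.0-small/train.csv')

# keep only frontal images taken in the anteroposterior (AP) view
ap = df[(df['Frontal/Lateral'] == 'Frontal') & (df['AP/PA'] == 'AP')]

# healthy patients: 'No Finding' marked positive
ap[ap['No Finding'] == 1.0].to_excel(
    'E:/CheXpert-v1.0-small/train_nofindings_AP.xlsx', index=False)

# patients with (non-COVID) pneumonia: 'Pneumonia' marked positive
# (illustrative output name)
ap[ap['Pneumonia'] == 1.0].to_excel(
    'E:/CheXpert-v1.0-small/train_pneumonia_AP.xlsx', index=False)
```
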
We wrote a small script that goes over all the image paths and copies the corresponding images to a specified folder. In the example below, all the healthy X-rays are copied to a folder. The same is done for the pneumonia X-rays, and both are then combined into one folder that contains the non-COVID class.

```python=
import openpyxl
import os
import shutil
import sys

path = 'E:/CheXpert-v1.0-small/train_nofindings_AP.xlsx'
wb = openpyxl.load_workbook(filename = path)
ws = wb['Sheet1']

for index, cell in enumerate(ws['A']):
    if index != 0:
        source = "E:/" + cell.value
        target = "E:/CheXpert-v1.0-small/trainsplit/nofinding_" + str(index) + ".jpg"
        print(str(index))

        # adding exception handling
        try:
            shutil.copy(source, target)
        except IOError as e:
            print("Unable to copy file. %s" % e)
            exit(1)
        except:
            print("Unexpected error:", sys.exc_info())
            exit(1)
```

### Covid:

For the Covid-19 data, a similar script was used to select all images labelled as Covid-19 which had an AP view. Most of this code was already provided in the GitHub repository of the Joseph Paul Cohen dataset and will therefore not be discussed here. Our complete dataset can also be downloaded from our GitHub.

## Solving class imbalance
##### *Further reading: [TowardsDataScience](https://towardsdatascience.com/what-to-do-when-your-classification-dataset-is-imbalanced-6af031b12a36)*

The data we have selected so far is skewed by the lack of COVID-19 images, as can be seen in the table below. This means there is a high imbalance between the two classes, COVID and non-COVID.

<center>

| Split | Positive, 'COVID' | Negative, 'Non-COVID' |
|--------------|-------------|-------------|
| Train | 83 | 815 |
| Validation | 28 | 272 |
| Test | 27 | 273 |
| **Total** | 138 | 1360 |

</center>

Our 'non-COVID' class is an order of magnitude larger than our 'COVID' class. Therefore we did some research on how we should handle this imbalance, which is caused by the positive group having few samples. Such an imbalance can be solved in three ways. One is **oversampling**: the same positive files are copied multiple times, optionally with data augmentation, until both sets are of equal size. The drawback of applying this augmentation only to the small class is that the network might overfit on the augmentations. Another method is **undersampling**: samples are removed from the dominant class until both sets are of equal size. However, by removing samples valuable information is lost. The third method is **balancing the class weights** of the loss function. By changing the weighting of the classes in the loss function, the network calculates how large an update should be for each sample. In the case of our model this means that by increasing the loss weight for the positive samples, the model will update the weights of the network evenly for both classes. Otherwise the model would aim for a high accuracy, which means classifying every input as the negative class and reaching 90% accuracy in the case of a 1/9 'COVID'/'non-COVID' sample balance.

In our case, the best of both worlds is used. **The imbalance was solved by using class weights and data augmentation is applied as well.** The data augmentation is used to increase the size of both classes.

The class weights were calculated by:

```python=
y_integers = np.argmax(labels, axis=1)
class_weights = compute_class_weight('balanced', np.unique(y_integers), y_integers)
d_class_weights = dict(enumerate(class_weights))
```

This automatically calculates the weights, which comes down to:

```python=
d_class_weights = {0: 10., 1: 1.}
```

This means: treat every instance of class 0 as 10 instances of class 1. TensorFlow and Keras can then automatically use these weights in the loss function. For our dataset, we decided to work with 10 times more non-COVID images than COVID images and we balanced the loss function according to this ratio.

## Data preprocessing

Now that the data is neatly divided into the correct classes, it is time for the next steps: preprocessing, splitting the data into training, validation and test data, and performing data augmentation. In this section an overview is given of which processing steps are performed and how this is done.

### Reading, Resizing and Coloring

The snippet below shows how to get arrays of labels and data from the dataset. In lines 1-6, the list of images is loaded and the arrays are created. In line 9 the loop over the image paths is initialized to consider all the images separately. In line 11, the label is extracted and in lines 15-17, [**OpenCV**](https://opencv.org/) is used to read the image, convert the colorspace to RGB and resize the image to 224x224 pixels. In line 25, the pixel intensities are scaled to the range [0, 1]. Using all this, different images can be used as input, which are then preprocessed to have the same structure and dimensions. Last of all, in lines 29-31, the labels are transformed: at first the labels are class names, and these lines transform them into a one-hot encoding.

```python=
# grab the list of images in our dataset directory, then initialize
# the list of data (i.e., images) and class images
print("[INFO] loading images...")
imagePaths = list(paths.list_images(args["dataset"]))
data = []
labels = []

# loop over the image paths
for imagePath in imagePaths:
    # extract the class label from the filename
    label = imagePath.split(os.path.sep)[-2]

    # load the image, swap color channels, and resize it to be a fixed
    # 224x224 pixels while ignoring aspect ratio
    image = cv2.imread(imagePath)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = cv2.resize(image, (224, 224))

    # update the data and labels lists, respectively
    data.append(image)
    labels.append(label)

# convert the data and labels to NumPy arrays while scaling the pixel
# intensities to the range [0, 1]
data = np.array(data) / 255.0
labels = np.array(labels)

# perform one-hot encoding on the labels: FIRST ENTRY IS COVID, SECOND IS NORMAL
lb = LabelBinarizer()
labels = lb.fit_transform(labels)
labels = to_categorical(labels)
```

### Data splitting

After loading all the data into the data variable, it is time to split it up. The data is split into training, validation and test data. The training data is used for training, the validation data is used to keep track of the model's performance during training and the test set is kept separate to test the performance of the final model. The splitting is done using the sklearn.model_selection.train_test_split function, which splits arrays into random train and test subsets. As can be seen, this splitting is applied twice. **The first time the data is split into 80% training data and 20% test data.
Then, 25% of that training data is used as validation data.** In train_test_split another parameter is used, random_state; this is a seed that controls the shuffling applied to the data before the split.

```python=
# partition the data into training and testing splits using 80% of
# the data for training and the remaining 20% for testing
(trainX, testX, trainY, testY) = train_test_split(data, labels,
    test_size=0.2, stratify=labels, random_state=42)

# take 25% of the training data for validation
(trainX, valX, trainY, valY) = train_test_split(trainX, trainY,
    test_size=0.25, random_state=42)  # 0.25 x 0.8 = 0.2
```

### Data augmentation: Rotating

With the code snippet below data augmentation can be performed. Data augmentation is a technique that artificially expands the size of a training dataset by creating modified versions of the images in the dataset. In our case, **the training images are augmented with random rotations of up to 15 degrees**. This is done using the ImageDataGenerator class, which supports image data augmentation in the Keras deep learning library.

```python=
# initialize the training data augmentation object
trainAug = ImageDataGenerator(
    rotation_range=15,
    fill_mode="nearest")
```

# The network

Since not a lot of X-ray images of COVID-19 patients are publicly available, we chose to build a neural network using transfer learning. In our search for established X-ray classification networks we came across CheXNet, which seems like a good candidate to transfer learn from. The following section explains what it is and what it is based on.

### CheXNet
##### *Further reading: [CheXNet GitHub Project Website](https://stanfordmlgroup.github.io/projects/chexnet/)*

Researchers from Stanford have made a convolutional neural network that takes a chest X-ray image as input and outputs the probability of pneumonia. The researchers claim that they have made *"an algorithm that can detect pneumonia from chest X-rays at a level exceeding practicing radiologists"*. The network is based on the DenseNet-121 architecture and the dataset used is the CheXpert dataset. They train their model on 98,637 chest X-rays, have a validation set of 6,351 X-rays, and test it on 420 X-rays. **As can be seen below, the model obtains good results on pneumonia detection and this is the reason this network was chosen to transfer learn from.**

Since the labels are not 100% accurate, the Stanford team had their own radiologists relabel all the test cases. In the table below, it can be seen that CheXNet performs better on the task of pneumonia detection than 3 of the 4 radiologists. The inaccuracy of the original labels can also be seen. It is a shame we could not use the test set with the improved labels for our purpose, since it is not publicly available.

| | F1-score |
|--------------|----------------|
| Radiologist 1 | 0.383 |
| Radiologist 2 | 0.356 |
| Radiologist 3 | 0.365 |
| Radiologist 4 | 0.442 |
| CheXNet | 0.435 |
| Original labels | **0.288** |

As can be concluded from this table, CheXNet seems to perform at least on par with human experts. However, a comparison with more experts would be preferable.
### DenseNet
##### *Further reading: [Original DenseNet paper](https://arxiv.org/pdf/1608.06993v3.pdf), [DenseNet Code](https://github.com/liuzhuang13/DenseNet)*

As you now know, the Stanford CheXNet research obtained cutting-edge results using the Dense Convolutional Network (DenseNet) architecture for the evaluation of chest X-rays. In order to be able to compare the results of transfer learning on ImageNet and transfer learning on the CheXpert trained CheXNet, we use the same architecture: **DenseNet-121**.

When CNNs go deeper, gradients vanish and information is lost. By connecting every layer with every other layer, a DenseNet ensures maximum information flow. Each layer has access to all preceding feature maps and therefore to the "knowledge" of the earlier layers. This can be seen in the image below. Also, by connecting layers this way fewer parameters are required, since there is no need to learn redundant feature maps. DenseNet-121 is the simplest of the DenseNets designed for the ImageNet dataset. For our transfer learning model, we will use a network pretrained on ImageNet from [Keras](https://keras.io/api/applications/densenet/). Fun fact: one of the authors of the DenseNet paper used to teach at TU Delft.

![image alt](https://i.imgur.com/hKAyMlS.png "dense" =500x300)
*A dense block with 5 layers and a growth rate of 4.*
*[Image Source](https://arxiv.org/pdf/1608.06993v3.pdf)*

# Transfer Learning and Fine Tuning
##### *Further reading: [Ruder.io - Transfer Learning](https://ruder.io/transfer-learning/), [Stanford - Transfer Learning](https://cs231n.github.io/transfer-learning/)*

A neural network normally requires a dataset of sufficient size to be trained. Our final dataset only has 138 images of the COVID class, which is not much. Because of the lack of data and computational resources, it is not feasible for us to train a large network from scratch, and such a network would be prone to overfitting. However, it is possible to use a pretrained model to transfer learn from. Since we do not have a lot of data it would be wise to choose a model that is not too complex, in order to avoid overfitting.

As Kaiming He mentioned in his paper [**Rethinking ImageNet Pre-training [2018]**](https://arxiv.org/abs/1811.08883), ImageNet pre-training shortens research cycles, leading to easier access to encouraging results, and fine-tuning from pretrained weights converges faster than training from scratch. A paper by [**Maithra Raghu [2019]**](https://arxiv.org/abs/1902.07208) from Google Brain also backs this up for medical imaging, which is what our network is used for. Furthermore, in a blog post by [**Francois Chollet [2016]**](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html) and a blog post by [**Josh Janzen [2018]**](https://joshjanzen.com/cnn-transfer-learning-vs-build-from-scratch/) it is shown that better results are obtained with a large pretrained network than with a smaller network trained from scratch. Therefore, transfer learning is ideal for our current task.

We chose to do transfer learning with DenseNet-121 pretrained on ImageNet, since it is trained on image recognition, which is similar to our radiology diagnosis task. Furthermore, the pre-trained net is already trained on millions of images, which means the network has already learned a lot of features that can be used for our purpose; especially the low-level ones can be useful for us. Also, it is not necessary to pre-train this model ourselves since pretrained models are widely available, as the short sketch below illustrates.
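
As a minimal sketch (not our full training script), a DenseNet-121 with ImageNet weights can be loaded directly from Keras; the input shape matches the 224x224 RGB images from our preprocessing step.

```python=
from keras.applications import DenseNet121

# Sketch only: load DenseNet-121 with ImageNet weights and without the
# original classification head, so a new head can be trained on our data
baseModel = DenseNet121(weights="imagenet", include_top=False,
                        input_shape=(224, 224, 3))
```
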
For the CheXNet-model, both ImageNet and the CheXpert dataset are used for pre-training. This pre-trained model can be found in our GitHub. So, transfer learning and fine-tuning are both machine learning methods that can be useful when you have insufficient data for a new domain you want handled by a neural network and there is a big pre-existing data pool that can be transferred to your problem. Below, it is explained how the methods of transfer learning and fine-tuning actually work.

## Transfer Learning and Fine Tuning, what is it?

Transfer learning is a general term and can be used in many ways, but the most common form of transfer learning is the one we use. First, layers from a previously trained model are taken and frozen, meaning the weights of these layers are kept constant to preserve the information learned during previous training. Afterwards, a new layer - or multiple new layers - is added on top of the frozen layers. This new layer is the new output layer and will be trained to make predictions on the new dataset.

The term fine-tuning is often used interchangeably with transfer learning, but to be more specific it is the last, optional step when doing transfer learning. After you have trained the new last layer on the specific task at hand, all layers are unfrozen and the whole model is re-trained on the new data. The pre-trained features are now being adapted to the new data. This has to happen with a small learning rate to prevent destroying the pre-trained features. Fine-tuning should not be done on its own, because then the randomly initialized weights of the last layer would destroy the pre-trained features. The reason fine-tuning is important is that the task on which the model was originally trained differs from the new task. So, some of the information the model has learned may not apply to our new task, and there may be information regarding the new task that the model still needs to learn from the new data.

For this research we will only apply transfer learning. However, code for fine-tuning is prepared and can be found in the appendix; a short sketch of this optional step is shown at the end of the next section. The image below shows three different transfer learning strategies. Strategy 1 can be regarded as fine-tuning. Strategy 3 is the transfer learning strategy that we have applied.

![](https://i.imgur.com/UUK8Ae5.png)
[*Image Source*](https://towardsdatascience.com/transfer-learning-from-pre-trained-models-f2393f124751)

## Transfer Learning in code

In the code below, it is shown how the new networks used for transfer learning are constructed. Initially, the baseModel is loaded: one of the DenseNet models discussed in the DenseNet section above. Then the headModel is constructed; this is the part that will be trained. AveragePooling2D is used to reduce the variance and the computational complexity. Average pooling is preferred over max pooling here, since average pooling extracts features more smoothly, whereas max pooling would extract the more extreme features, which should be avoided when transferring between domains. Flatten is then used to change the data into the correct format for the next dense layer to interpret. Transfer learning can be done by adding just one dense layer, but in this case a dense layer of 64 neurons with a ReLU activation function is used before the final output layer. This is done because with multiple stacked layers a richer set of features can be learned.
Between the last two layers a dropout layer is used; how dropout works is explained in the hyperparameter section. Finally, an output layer of 2 neurons is added with a softmax classifier that predicts the probabilities for each class label.

```python=
baseModel = load_model(args["inputModel"])

# PRINT SUMMARY OF MODEL
baseModel.summary()

# construct the head of the model that will be placed on top of
# the base model
headModel = baseModel.output
headModel = AveragePooling2D(pool_size=(4, 4))(headModel)
headModel = Flatten(name="flatten")(headModel)
headModel = Dense(64, activation="relu")(headModel)
headModel = Dropout(0.5)(headModel)
headModel = Dense(2, activation="softmax")(headModel)

# place the head FC model on top of the base model (this will become
# the actual model we will train)
model = Model(inputs=baseModel.input, outputs=headModel)
```

Now, you loop over the layers in the base model and set them to be untrainable. This freezes these layers, so only the layers in our headModel remain trainable.

```python=
for layer in baseModel.layers:
    layer.trainable = False
```

Now the model can be compiled and the final layers can be trained. The Adam optimizer is used here, which can be seen as an extension of Stochastic Gradient Descent. An important characteristic of this optimizer is that the learning rate decays, so the learning rate argument to the Adam function is only the starting rate. Binary cross-entropy loss is used, which is a good loss function when working with two classes. In the last line of the code below you can see that the class weights from the previous section about the dataset are used.

```python=
# compile our model
print("[INFO] compiling model...")
opt = Adam(lr=INIT_LR, decay=INIT_LR / EPOCHS)
model.compile(loss="binary_crossentropy", optimizer=opt,
    metrics=["accuracy"])

# train the head of the network
print("[INFO] training head...")
H = model.fit_generator(
    trainAug.flow(trainX, trainY, batch_size=BS),
    steps_per_epoch=len(trainX) // BS,
    validation_data=(valX, valY),
    validation_steps=len(valX) // BS,
    epochs=EPOCHS,
    class_weight=d_class_weights)
```
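
For completeness, here is a minimal sketch of the optional fine-tuning step described earlier. It is not part of the experiments in this post (the full version is in the appendix), and the learning rate and number of extra epochs are only illustrative assumptions.

```python=
# Sketch of optional fine-tuning (not used in our experiments): unfreeze the
# base model after the head has been trained and continue training with a
# small learning rate, so the pre-trained features are adapted, not destroyed
for layer in baseModel.layers:
    layer.trainable = True

opt = Adam(lr=1e-5)  # illustrative value, much smaller than the initial rate
model.compile(loss="binary_crossentropy", optimizer=opt,
    metrics=["accuracy"])

H_finetune = model.fit_generator(
    trainAug.flow(trainX, trainY, batch_size=BS),
    steps_per_epoch=len(trainX) // BS,
    validation_data=(valX, valY),
    epochs=10,  # illustrative number of extra epochs
    class_weight=d_class_weights)
```
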
# Hyperparameter tuning

The hyperparameters are tuned for both networks to create a fair comparison. In this section on hyperparameter tuning, it will first be discussed which parameters will be optimized and which values were tried for these parameters. Then Talos is introduced, a program that helps with hyperparameter tuning. Lastly, our tuning results are presented and discussed.

## Considered Hyperparameters

### Learning rate

The **learning rate** is important for the time it takes to compute the optimal solution. When a learning rate is chosen that is too small, it will take a long time to converge to the optimal solution, and when the learning rate is set too large the network might not converge to the optimal solution at all. An illustration of this can be seen below.

![](https://i.imgur.com/qsueQc5.png "https://www.jeremyjordan.me/nn-learning-rate/")
###### Source: https://www.jeremyjordan.me/nn-learning-rate/

In our optimization three different learning rates were used: 0.0005, 0.0001 and 0.001. The value 0.001 was used by [Adrian Rosebrock](https://www.pyimagesearch.com/2020/03/16/detecting-covid-19-in-x-ray-images-with-keras-tensorflow-and-deep-learning/), who also made a model for Covid detection. After some preliminary testing, lower learning rates often gave better results; therefore these smaller learning rates were also explored.

### Batch size

Besides the learning rate, another important hyperparameter is the **batch size**. The batch size is the number of samples the network uses per iteration to calculate the gradient and alter the weights. When using a learning algorithm called mini-batch gradient descent, the batch size used is larger than 1 and smaller than the size of the training set. By using a larger batch size, the gradient is evaluated for more samples per iteration and might be more accurate than the gradient for smaller batch sizes. Calculating the gradient of more samples at a time is computationally more expensive, but you would need fewer iterations to converge to the local minimum. However, using a larger batch size during training may result in worse generalization behavior, which results in a lower quality model. This may be due to the fact that models with large batch sizes tend to converge to sharp minimizers of the training function, according to the research done by [**Keskar et al. [2016]**](https://arxiv.org/pdf/1609.04836.pdf). The values tested for the batch size are 16, 32 and 64. These are standard batch sizes; larger batch sizes are omitted, since our dataset is small and our computational resources are limited.

##### *Further reading: [Batch size vs learning rate](https://arxiv.org/pdf/1711.00489.pdf)*

### Type of optimizer

The performance of two different optimizers is compared: the Adam optimizer and classic stochastic gradient descent (SGD). The [**Adam optimization algorithm**](https://arxiv.org/abs/1412.6980) provides a combination of the [**Adaptive Gradient Algorithm (AdaGrad)**](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) and Root Mean Square Propagation (RMSProp, proposed by [**Geoff Hinton**](https://en.wikipedia.org/wiki/Geoffrey_Hinton)). As proposed in [**'On Empirical Comparisons of Optimizers for Deep Learning'**](https://arxiv.org/abs/1910.05446), a more general optimizer should perform better than or on par with one it approximates. With this in mind we chose the **Adam optimizer**. By gradually decaying the learning rate, Adam provides results like the middle figure in the illustration above. The initial learning rate that is set is the maximum learning rate the Adam optimizer will use for each individual parameter. In preliminary testing, it turned out that the validation loss and accuracy fluctuated a lot with Adam; therefore the slower, basic SGD algorithm is also explored, since it gives smoother graphs.

### Dropout

Dropout is used to prevent overfitting and is important for our cause, since the amount of data is limited. Dropout randomly disconnects nodes between layers with the specified probability. Since transfer learning is done, the dropout only applies to the last layers. Dropout percentages for the last layers are often around 40% or 50%. A dropout of 50% also gives the highest variance for the distribution of networks. Thus, the values used for dropout are 0% and 50%. We simply test whether dropout improves the performance.

## Using Talos for a grid-search
###### ***More info on Talos: https://autonomio.github.io/docs_talos***

Talos is a program that can help with an automated workflow of hyperparameter tuning when using Keras.
The possibilities of automating this tuning with Talos are large, but in our case Talos is only used to specify all hyperparameters in a grid, run the model with all these different parameter combinations and output the performances in CSV files. How these individual steps work is explained below. One setback was that Talos cannot perform more than 9 consecutive runs with different model parameters on Google Colab without disconnecting. Therefore, the search grid had to be specified manually 18 times with different parameters to cover the full grid.

<!-- This figure gives an overview of the workflow when using the Talos optimizer. <img src="https://i.imgur.com/H6wWKXs.png" width="480"> -->
<!-- “Randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid”. As it turns out, this is also usually easier to implement. As argued by Bergstra and Bengio in [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) -->

In the snippet below, it can be seen how our total Talos grid was specified.

```python=
import talos as ta
from keras.optimizers import Adam, Nadam, SGD
from keras.activations import softmax
from keras.losses import categorical_crossentropy, logcosh

p = {'lr': [0.0001, 0.001, 0.01],
     'lr_finetune': [0.0001, 0.00001, 0.000001],
     'batch_size': [16, 32, 64],
     'epochs': [25, 50],
     'dropout': [0, 0.50],
     'optimizer': [Adam, SGD],
     'loss': ['binary_crossentropy'],
     'last_activation': ['softmax']}
```

After specifying the grid *p*, the Talos scan operation can be performed. This scan operation takes the result of the epoch with the highest value. In the snippet below, this scan operation is shown. All parameters were explained earlier, except 'model': this variable specifies the function in which you generate your model.

```python=
scan_object = ta.Scan(trainX, trainY, params=p,
                      x_val=valX, y_val=valY,
                      model=transferlearn_model,
                      experiment_name='', seed=42)
```

## Evaluation metric for tuning

Initially, our metric of choice was the F1-score, since our dataset is imbalanced and it is important to classify both normal cases and COVID cases well, but Keras removed its F1-score metric. This is a widely discussed topic online [**(source)**](https://stackoverflow.com/questions/43547402/how-to-calculate-f1-macro-in-keras). As the Keras [**release notes**](https://github.com/keras-team/keras/wiki/Keras-2.0-release-notes) explain, these are global metrics that were approximated batch-wise, which is more misleading than helpful, so they were removed altogether. To compare our models the validation accuracy is used, which is a metric that is well comparable and gives a good indication of which parameters work. After finding good models, the confusion matrices, precision, recall, F1-score, specificity and sensitivity are all calculated and discussed for both the validation and test set, as illustrated by the sketch below.
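
As a rough sketch (assuming scikit-learn and the trained Keras `model` and `LabelBinarizer` from the earlier snippets; variable names are illustrative), these metrics can be derived from the predictions as follows:

```python=
from sklearn.metrics import classification_report, confusion_matrix

# Sketch only: predict on a held-out set and derive the reported metrics
predY = model.predict(testX, batch_size=BS)
predY = np.argmax(predY, axis=1)
trueY = np.argmax(testY, axis=1)

# precision, recall and F1-score per class
print(classification_report(trueY, predY, target_names=lb.classes_))

# confusion matrix: rows are actual classes, columns are predicted classes
cm = confusion_matrix(trueY, predY)
tp, fn, fp, tn = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]  # COVID is class 0

accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)   # recall on the COVID class
specificity = tn / (tn + fp)   # recall on the normal class
```
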
## Results of the gridsearch

In this section the results of the hyperparameter tuning are presented and discussed. After doing **72 runs per model**, we obtained varying results. First, several graphs are plotted to visualize which parameters work well. Then, for both the ImageNet-model and the CheXNet-model, the best models found after at most 50 epochs are displayed in the tables below. This was only done for a maximum of 50 epochs, since after 50 epochs it is already possible to identify which models work well. There were 4 competing models for ImageNet and 3 for CheXNet. **These best models all had the same validation accuracy.** At the end of this blogpost, after the discussion, the top 25 models of both ImageNet and CheXNet can be found in the appendix.

In the ImageNet and CheXNet graphs below, the performance of the different models on both pretrained networks is plotted. The parameters batch size, learning rate, dropout and the used optimizer are varied to see which model performs well in general. The average is taken over the different numbers of epochs that were tried for these parameters and the standard deviation is shown by the error bars. First of all, it can be clearly seen that - no matter the other parameters - **Adam outperforms SGD**. This becomes clear from observing that all the top graphs outperform the bottom graphs. Secondly, it becomes clear that **the lowest learning rate is never optimal**. The second largest and largest learning rates are always better, with the former also often being outperformed by the latter. Furthermore, it can also be seen that dropout does not really influence the performance in terms of validation accuracy. This becomes clear from observing how all the left graphs are equal to the right graphs.

### ImageNet-model
<a href=https://i.imgur.com/gp7PmY2.jpg>![graph](https://i.imgur.com/gp7PmY2.jpg)</a>

### CheXNet-model
<a href=https://i.imgur.com/MwEgAfh.jpg>![graph](https://i.imgur.com/MwEgAfh.jpg)</a>

From the previous graphs we could conclude that dropout did not seem to matter and that there were no clear trends in batch size. Therefore two more specific graphs were generated, which can be seen below. In these two boxplot graphs, the validation accuracy is plotted versus the learning rate for 4 different legend entries in which the optimizer and the number of epochs are shown.

![](https://i.imgur.com/WOjS0oF.png)
![](https://i.imgur.com/lVcmrsS.png)

In these plots it can be observed, even more clearly, that the Adam optimizer is superior to the SGD optimizer and that the best validation accuracy is more often found within 50 epochs. Thus, these are the parameters we tested more extensively. When considering all models in these graphs, the accuracy obtained using the ImageNet-model is higher than the accuracy obtained by the CheXNet-model. The tables below show the best performing settings of the entire gridsearch for both of the models.

**Top 4 models using the ImageNet-model**

| epochs | val. loss | val. accuracy | val. F1-score | batch size | dropout | learning rate | optimizer |
|--------|-----------|---------------|---------------|------------|---------|---------------|-----------|
| 25 | 0.157784 | 0.966667 | 0.745726 | 64 | 0 | 0.001 | Adam |
| 50 | 0.188933 | 0.966667 | 0.816612 | 16 | 0 | 0.001 | Adam |
| 50 | 0.193035 | 0.966667 | 0.801801 | 32 | 0 | 0.001 | Adam |
| 50 | 0.154482 | 0.966667 | 0.806118 | 64 | 0 | 0.001 | Adam |

As you can see in the table above, the best model had a batch size of 64, no dropout, 25 epochs, a learning rate of 0.001 and used the Adam optimizer. The model in fourth place has the same parameters, but ran for more epochs. It becomes clear that in the top 4 the learning rate, optimizer and absence of dropout were constant and that the number of epochs and the batch size differed.

**Top 3 models using the CheXNet-model**

| epochs | val. loss | val. accuracy | val. F1-score | batch size | dropout | learning rate | optimizer |
|--------|-----------|---------------|---------------|------------|---------|---------------|-----------|
| 50 | 0.287037 | 0.899999 | 0.385737 | 16 | 0.5 | 0.0001 | Adam |
| 50 | 0.260075 | 0.899999 | 0.435801 | 16 | 0 | 0.0001 | Adam |
| 50 | 0.238133 | 0.899999 | 0.506174 | 32 | 0 | 0.001 | Adam |

The best model in the table above had a batch size of 16, used dropout, 50 epochs, a learning rate of 0.0001 and used the Adam optimizer. From this table, only one clear optimal choice becomes noticeable: the choice of the Adam optimizer, since it is the only varied parameter with the same value across the whole top three.

# Results

In this results section, we will present the results obtained by the best models found during hyperparameter tuning. These are the models that gave the highest validation accuracy. Graphs showing the training process are presented and discussed, and the performance of these models on the test set is discussed as well. This is done for both transfer learned models.

## ImageNet-model

In the graph below the performance of the model using transfer learning on ImageNet can be seen. The model is first trained on ImageNet and then trained on our COVID dataset. As you can see from the increasing validation loss, the model seems to be overfitting.

![](https://i.imgur.com/2QCcke1.png)

Since the model in the graph above is overfitting, a new run was done and the best result was obtained after 25 epochs. This graph is presented below. Here you can see that training is stopped as soon as the validation loss increases, which indicates a good number of epochs (a sketch of how this could be automated with a Keras callback follows below the graph).

![](https://i.imgur.com/z9q4Yyn.png)
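
We selected the number of epochs manually, but as a side note, stopping on an increasing validation loss could also be automated in Keras with an early-stopping callback. The sketch below was not part of our actual runs and the patience value is only illustrative.

```python=
from keras.callbacks import EarlyStopping

# Sketch only: stop training when the validation loss stops improving and
# keep the weights of the best epoch
earlyStop = EarlyStopping(monitor="val_loss", patience=5,
                          restore_best_weights=True)

H = model.fit_generator(
    trainAug.flow(trainX, trainY, batch_size=BS),
    steps_per_epoch=len(trainX) // BS,
    validation_data=(valX, valY),
    epochs=EPOCHS,
    class_weight=d_class_weights,
    callbacks=[earlyStop])
```
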
**Results on the validation set:**

| | precision | recall | F1-score | support |
|--------------|-----------|--------|----------|---------|
| covid | 1.00 | 0.71 | 0.83 | 28 |
| normal | 0.97 | 1.00 | 0.99 | 272 |
| | | | | |
| accuracy | | | 0.97 | 300 |
| macro avg | 0.99 | 0.86 | 0.91 | 300 |
| weighted avg | 0.97 | 0.97 | 0.97 | 300 |

**Confusion Matrix:**

| | Predicted Covid | Predicted Normal |
|---------------|-----------------|------------------|
| Actual Covid | 20 | 8 |
| Actual Normal | 0 | 272 |

- Accuracy: 0.973
- Sensitivity: 0.7143
- Specificity: 1.000

**Results on the test set:**

| | precision | recall | F1-score | support |
|--------------|-----------|--------|----------|---------|
| covid | 0.92 | 0.81 | 0.86 | 27 |
| normal | 0.98 | 0.99 | 0.99 | 273 |
| | | | | |
| accuracy | | | 0.98 | 300 |
| macro avg | 0.95 | 0.90 | 0.92 | 300 |
| weighted avg | 0.98 | 0.98 | 0.98 | 300 |

**Confusion Matrix:**

| | Predicted Covid | Predicted Normal |
|---------------|-----------------|------------------|
| Actual Covid | 22 | 5 |
| Actual Normal | 2 | 271 |

- Accuracy: 0.977
- Sensitivity: 0.81
- Specificity: 0.993

As you can see, the accuracies on both the validation set and the test set are quite good, 0.973 and 0.977 respectively. It is interesting to see that the precision for covid on the validation set is 1.00, which means the model never mistakes a non-COVID sample for a COVID sample.
This seems like the obvious decision for the model, since there are many more non-COVID samples than COVID samples. However, this imbalance in the samples should not be the cause, since we have balanced the class weights. On the validation set the recall for covid (the sensitivity in the confusion matrix) was only 0.71. Ideally this metric should be higher, since misclassifying Covid samples is a costly mistake: it means Covid-19 infected people are diagnosed as not being infected. The results on the test set are better than those on the validation set. Not only is the accuracy higher, but the sensitivity here is 0.81, meaning fewer Covid patients are wrongly diagnosed.

## CheXNet-model

Let us now discuss the results for the CheXNet-model, which is pre-trained on ImageNet and CheXpert. This model is therefore pre-trained on a task which is comparable to our Covid detection goal. The graph below shows the training process of training this model on our Covid dataset. At first the validation loss is lower than the training loss, as it should be. After more epochs, the validation loss touches the training loss curve. The same phenomenon happens at the top of the graph with the accuracy. Eventually, the lines at the top and at the bottom both cross multiple times in a row and training is stopped. This behaviour could indicate that we are close to overfitting and trained for the right number of epochs. However, to be sure, this model should still be trained for more epochs to see if this is true.

![](https://i.imgur.com/S3wlE3v.png)

**Results on the validation set:**

| | precision | recall | F1-score | support |
|--------------|-----------|--------|----------|---------|
| covid | 0.47 | 0.75 | 0.58 | 28 |
| normal | 0.97 | 0.91 | 0.94 | 272 |
| | | | | |
| accuracy | | | 0.90 | 300 |
| macro avg | 0.72 | 0.83 | 0.76 | 300 |
| weighted avg | 0.93 | 0.90 | 0.91 | 300 |

**Confusion Matrix:**

| | Predicted Covid | Predicted Normal |
|---------------|-----------------|------------------|
| Actual Covid | 21 | 7 |
| Actual Normal | 24 | 248 |

- Accuracy: 0.897
- Sensitivity: 0.750
- Specificity: 0.912

**Results on the test set:**

| | precision | recall | F1-score | support |
|--------------|-----------|--------|----------|---------|
| covid | 0.62 | 0.85 | 0.75 | 27 |
| normal | 0.98 | 0.95 | 0.97 | 273 |
| | | | | |
| accuracy | | | 0.94 | 300 |
| macro avg | 0.80 | 0.90 | 0.84 | 300 |
| weighted avg | 0.95 | 0.94 | 0.94 | 300 |

**Confusion Matrix:**
{border-collapse:collapse;border-spacing:0;} .tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; overflow:hidden;padding:10px 5px;word-break:normal;} .tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;} .tg .tg-wt8g{background-color:#34cdf9;font-weight:bold;text-align:center;vertical-align:top} .tg .tg-cfoz{background-color:#3166ff;color:#000000;font-weight:bold;text-align:center;vertical-align:top} .tg .tg-baqh{text-align:center;vertical-align:top} .tg .tg-eh07{background-color:#3166ff;font-weight:bold;text-align:center;vertical-align:top} </style> <table class="tg"> <thead> <tr> <th class="tg-baqh"></th> <th class="tg-baqh">Predicted Covid</th> <th class="tg-baqh">Predicted Normal</th> </tr> </thead> <tbody> <tr> <td class="tg-baqh">Actual Covid</td> <td class="tg-cfoz">23</td> <td class="tg-wt8g">4</td> </tr> <tr> <td class="tg-baqh">Actual Normal</td> <td class="tg-wt8g">14</td> <td class="tg-eh07">259</td> </tr> </tbody> </table> - Accuracy: 0.940 - sensitivity: 0.852 - specificity: 0.949 It is again interesting to see that the accuarcy on the test set, 0.940, was higher than the accuracy of 0.897 on the validation set. Furthermore, the sensitivity on the test set, 0.852, is also higher than the sensisitivity on the validation set. The same holds for the F1-score of Covid, which is also noticeably higher than the F1-score on the validation set. ## Comparison of test-set results When the results of the test-sets are compared, the best ImageNet-model is performing better considering the metrics accuracy and F1-score. The Covid F1-score on the testset with the ImageNet-model is 0.86, compared to a F1-score of only 0.75 with the CheXnet-model. The accuracy on the testset of the ImageNet-model is 0.977 compared to 0.940 on with the CheXnet-model. Thus, these results indicate that in general transfer learning on ImageNet works better than transfer learning on the CheXpert dataset. ## Reproducability Note that it might not be possible to reproduce all experiments with the exact same figures. This is caused by the non-deterministic behaviour of training in Keras. There are multiple elements which behave non deterministically. First of all, dropout and dense layers have some non-deterministic behaviour. This can be fixed by setting random seeds, as we have done in our code. However, when training on GPU, results can still be non-determinstic because of the parallel processing of operations. More information on this topic can be found [**here**](https://keras.io/getting_started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development). By downloading the dataset and files from our GitHub, you should be able to obtain comparable results! Fun fact about seeds: Why is the number 42 often used as random seed: This number is used as the answer to life in the book 'The Hitchhiker's Guide to the Galaxy' and is often used in online courses [*(source)*](https://www.quora.com/Why-do-we-choose-random-state-as-42-very-often-during-training-a-machine-learning-model). 
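For reference, below is a minimal sketch of the kind of seed-setting mentioned above (assuming a TensorFlow 2.x / Keras setup; the exact calls in our repository may differ):

```python
# Fix the random seeds used by Python, NumPy and TensorFlow so that CPU runs
# are repeatable. GPU training can still be non-deterministic (see the Keras FAQ).
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
```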
# Conclusion

Recall our research question: *"Is transfer learning from a network that has been trained on data from the same specific domain - for example X-ray images - better than a network trained on a larger, more general dataset, such as ImageNet?"*

Using the ImageNet-pretrained DenseNet-121 (ImageNet-model) and the DenseNet-121-based CheXNet (CheXNet-model), we have provided results on the detection of COVID-19 in X-ray images. We believe these results are comparable because both networks have the same architecture and both were optimized using extensive hyperparameter tuning. The results show that pretraining on a more general dataset gives a neural network a better accuracy and F1-score than pretraining on the same specific domain. However, more research is needed, as discussed below. Furthermore, we can conclude that our results are comparable to those obtained by Manu Joseph: our F1-score of 0.86 for the COVID class is close to the F1-score of 0.87 of the DenseNet-121-based model shown in the section 'Related Work'.

# Discussion

[Minyoung Huh et al.](https://arxiv.org/pdf/1608.08614.pdf) previously investigated why ImageNet-trained deep features are as good as they are. They found that networks trained on ImageNet transfer well to other tasks. Our overall findings suggest a similar result: in our empirical investigation, ImageNet was a better base dataset for transfer learning than the CheXpert dataset. Possible explanations are the importance of the number of classes or the balance between images per class and the number of classes.

As mentioned in the conclusion, the results of Manu Joseph - who tried a similar approach of using deep learning and transfer learning for Covid-19 detection in X-ray images - were only slightly better than ours. However, these results are not directly comparable since we use different datasets. As shown below, the dataset used by Manu Joseph has a test set that contains only 10 Covid-19 images. This might be too few, as misclassifying one more image would already cause large changes in the obtained F1-scores.

![](https://i.imgur.com/E1UrQNk.png)
[Image Source](https://towardsdatascience.com/does-imagenet-pretraining-work-for-chest-radiography-images-covid-19-2e2d9f5f0875)

*Sidenotes to our results:*

* Due to the lack of computational resources and the scope of this research, a CheXNet model trained for a large number of epochs (100+) has not been evaluated. Such a model could provide even better results, but it is also prone to overfitting.
* More possible ways to improve our model:
    * More aggressive data augmentation
    * Use of L1 and L2 regularization (L2 is also known as "weight decay")
    * Fine-tuning the networks, either fully or partly
* Due to some duplicate patients in our dataset, it is possible that the same patient ends up in both the training and the test set. This might influence our results.
* As the CheXNet paper shows, the labels in the CheXpert dataset are not 100% accurate. However, for such a large dataset it is the best that is available.
* A critical note by [Souradip Chakraborty](https://towardsdatascience.com/detection-of-covid-19-presence-from-chest-x-ray-scans-using-cnn-class-activation-maps-c1ab0d7c294b): *"I have seen in some analysis, people have combined the normal and pneumonia cases which I don’t find appropriate as the model will then try to ignore the between-group variance amongst those two classes and the accuracy thus obtained won’t be a true measure."*

# Appendix

#### Fine tune code

In the transfer learning and fine-tuning section, fine-tuning is explained, but the code was omitted. If you want to add fine-tuning on top of this work, the code snippet below should be pasted into the train_covid19.py file below line 158, after the original transfer-learned model. It relies on the objects defined earlier in that file (baseModel, model, trainAug, trainX, trainY, valX, valY, BS, d_class_weights).

```python=
# Now we will fine-tune the network:
# first make the whole base network trainable.
for layer in baseModel.layers:
    layer.trainable = True

# FINE-TUNING
# Set the learning rate very low and recompile the model. The number of
# fine-tuning epochs should also be kept low to avoid overfitting.
learning_rate_tune = 1e-05
epoch_tune = 30

# recompile the model with the low fine-tuning learning rate
print("[INFO] compiling model...")
opt = Adam(lr=learning_rate_tune, decay=learning_rate_tune / epoch_tune)
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])

# train the full network (not just the head) for a few more epochs
print("[INFO] fine-tuning network...")
H2 = model.fit_generator(
    trainAug.flow(trainX, trainY, batch_size=BS),
    steps_per_epoch=len(trainX) // BS,
    validation_data=(valX, valY),
    validation_steps=len(valX) // BS,
    epochs=epoch_tune,
    class_weight=d_class_weights)
```

#### Top 25 ImageNet transfer-learned models with parameters

| round_epochs | loss | accuracy | f1_score | val_loss | val_accuracy | val_f1_score | batch_size | dropout | epochs | last_activation | loss_function | lr | optimizer |
|--------------|----------|----------|----------|----------|--------------|--------------|------------|---------|--------|-----------------|---------------------|--------|-----------|
| 25 | 0.050346 | 0.980583 | 0.74285 | 0.157784 | 0.966667 | 0.745726 | 64 | 0 | 25 | softmax | binary_crossentropy | 0.001 | Adam |
| 50 | 0.025707 | 0.990686 | 0.815534 | 0.188933 | 0.966667 | 0.816612 | 16 | 0 | 50 | softmax | binary_crossentropy | 0.001 | Adam |
| 50 | 0.018955 | 0.993133 | 0.80055 | 0.193035 | 0.966667 | 0.801801 | 32 | 0 | 50 | softmax | binary_crossentropy | 0.001 | Adam |
| 50 | 0.039934 | 0.986761 | 0.805013 | 0.154482 | 0.966667 | 0.806118 | 64 | 0 | 50 | softmax | binary_crossentropy | 0.001 | Adam |
| 50 | 0.021309 | 0.990686 | 0.824394 | 0.223829 | 0.96 | 0.825315 | 16 | 0 | 50 | softmax | binary_crossentropy | 0.0005 | Adam |
| 50 | 0.052604 | 0.983912 | 0.705183 | 0.172752 | 0.96 | 0.706772 | 16 | 0.5 | 50 | softmax | binary_crossentropy | 0.0005 | Adam |
| 50 | 0.105112 | 0.960203 | 0.700455 | 0.165102 | 0.96 | 0.701333 | 16 | 0.5 | 50 | softmax | binary_crossentropy | 0.001 | Adam |
| 50 | 0.068306 | 0.975287 | 0.684064 | 0.146528 | 0.96 | 0.685933 | 64 | 0.5 | 50 | softmax | binary_crossentropy | 0.001 | Adam |
| 50 | 0.067409 | 0.975445 | 0.734919 | 0.145838 | 0.953333 | 0.73589 | 16 | 0 | 50 | softmax | binary_crossentropy | 0.0001 | Adam |
| 50 | 0.088002 | 0.966524 | 0.689139 | 0.1691 | 0.953333 | 0.690365 | 32 | 0 | 50 | softmax | binary_crossentropy | 0.0001 | Adam |
| 50 | 0.040974 | 0.992056 | 0.76348 | 0.211496 | 0.953333 | 0.76471 | 64 | 0 | 50 | softmax | binary_crossentropy | 0.0005 | Adam |
| 50 | 0.081702 | 0.966461 | 0.649169 | 0.179863 | 0.953333 | 0.650593 | 64 | 0.5 | 50 | softmax | binary_crossentropy | 0.0005 | Adam |
| 25 | 0.034936 | 0.984759 | 0.748219 | 0.223091 | 0.946667 | 0.750742 | 16 | 0 | 25 | softmax | binary_crossentropy | 0.001 | Adam |
| 25 | 0.126799 | 0.955123 | 0.60698 | 0.174798 | 0.946667 | 0.610278 | 16 | 0.5 | 25 | softmax | binary_crossentropy | 0.0005 | Adam |
| 25 | 0.166124 | 0.937339 | 0.597693 | 0.155854 | 0.946667 | 0.600324 | 32 | 0.5 | 25 | softmax | binary_crossentropy | 0.0005 | Adam |
| 25 | 0.126512 | 0.947926 | 0.5837 | 0.157063 | 0.946667 | 0.586976 | 64 | 0.5 | 25 | softmax | binary_crossentropy | 0.001 | Adam |
| 50 | 0.082887 | 0.964807 | 0.694462 | 0.210249 | 0.946667 | 0.695674 | 32 | 0.5 | 50 | softmax | binary_crossentropy | 0.0005 | Adam |
| 50 | 0.05791 | 0.974249 | 0.710851 | 0.199558 | 0.946667 | 0.711908 | 32 | 0.5 | 50 | softmax | binary_crossentropy | 0.001 | Adam |
| 25 | 0.075853 | 0.972532 | 0.72371 | 0.195727 | 0.94 | 0.725708 | 32 | 0 | 25 | softmax | binary_crossentropy | 0.0005 | Adam |
| 25 | 0.160947 | 0.940865 | 0.542317 | 0.172528 | 0.94 | 0.54583 | 64 | 0.5 | 25 | softmax | binary_crossentropy | 0.0005 | Adam |
| 50 | 0.118671 | 0.953222 | 0.639064 | 0.172585 | 0.94 | 0.640433 | 64 | 0 | 50 | softmax | binary_crossentropy | 0.0001 | Adam |
| 50 | 0.142007 | 0.940728 | 0.55227 | 0.152683 | 0.933333 | 0.553479 | 16 | 0.5 | 50 | softmax | binary_crossentropy | 0.0001 | Adam |
| 50 | 0.224797 | 0.917917 | 0.462917 | 0.232941 | 0.933333 | 0.464749 | 64 | 0.5 | 50 | softmax | binary_crossentropy | 0.0001 | Adam |
| 50 | 0.158593 | 0.95597 | 0.59361 | 0.172896 | 0.933333 | 0.59499 | 16 | 0 | 50 | softmax | binary_crossentropy | 0.0005 | SGD |
| 25 | 0.249306 | 0.917019 | 0.45587 | 0.250442 | 0.926667 | 0.459565 | 16 | 0.5 | 25 | softmax | binary_crossentropy | 0.0001 | Adam |

#### Top 25 CheXNet transfer-learned models with parameters

| round_epochs | loss | accuracy | f1_score | val_loss | val_accuracy | val_f1_score | batch_size | dropout | epochs | last_activation | loss_function | lr | optimizer |
|--------------|----------|----------|----------|----------|--------------|--------------|------------|---------|--------|-----------------|---------------------|--------|-----------|
| 25 | 0.050346 | 0.980583 | 0.74285 | 0.157784 | 0.966667 | 0.745726 | 64 | 0 | 25 | softmax | binary_crossentropy | 0.001 | Adam |
| 50 | 0.025707 | 0.990686 | 0.815534 | 0.188933 | 0.966667 | 0.816612 | 16 | 0 | 50 | softmax | binary_crossentropy | 0.001 | Adam |
| 50 | 0.018955 | 0.993133 | 0.80055 | 0.193035 | 0.966667 | 0.801801 | 32 | 0 | 50 | softmax | binary_crossentropy | 0.001 | Adam |
| 50 | 0.039934 | 0.986761 | 0.805013 | 0.154482 | 0.966667 | 0.806118 | 64 | 0 | 50 | softmax | binary_crossentropy | 0.001 | Adam |
| 50 | 0.021309 | 0.990686 | 0.824394 | 0.223829 | 0.96 | 0.825315 | 16 | 0 | 50 | softmax | binary_crossentropy | 0.0005 | Adam |
| 50 | 0.052604 | 0.983912 | 0.705183 | 0.172752 | 0.96 | 0.706772 | 16 | 0.5 | 50 | softmax | binary_crossentropy | 0.0005 | Adam |
| 50 | 0.105112 | 0.960203 | 0.700455 | 0.165102 | 0.96 | 0.701333 | 16 | 0.5 | 50 | softmax | binary_crossentropy | 0.001 | Adam |
| 50 | 0.068306 | 0.975287 | 0.684064 | 0.146528 | 0.96 | 0.685933 | 64 | 0.5 | 50 | softmax | binary_crossentropy | 0.001 | Adam |
| 50 | 0.067409 | 0.975445 | 0.734919 | 0.145838 | 0.953333 | 0.73589 | 16 | 0 | 50 | softmax | binary_crossentropy | 0.0001 | Adam |
| 50 | 0.088002 | 0.966524 | 0.689139 | 0.1691 | 0.953333 | 0.690365 | 32 | 0 | 50 | softmax | binary_crossentropy | 0.0001 | Adam |
| 50 | 0.040974 | 0.992056 | 0.76348 | 0.211496 | 0.953333 | 0.76471 | 64 | 0 | 50 | softmax | binary_crossentropy | 0.0005 | Adam |
| 50 | 0.081702 | 0.966461 | 0.649169 | 0.179863 | 0.953333 | 0.650593 | 64 | 0.5 | 50 | softmax | binary_crossentropy | 0.0005 | Adam |
| 25 | 0.034936 | 0.984759 | 0.748219 | 0.223091 | 0.946667 | 0.750742 | 16 | 0 | 25 | softmax | binary_crossentropy | 0.001 | Adam |
| 25 | 0.126799 | 0.955123 | 0.60698 | 0.174798 | 0.946667 | 0.610278 | 16 | 0.5 | 25 | softmax | binary_crossentropy | 0.0005 | Adam |
| 25 | 0.166124 | 0.937339 | 0.597693 | 0.155854 | 0.946667 | 0.600324 | 32 | 0.5 | 25 | softmax | binary_crossentropy | 0.0005 | Adam |
| 25 | 0.126512 | 0.947926 | 0.5837 | 0.157063 | 0.946667 | 0.586976 | 64 | 0.5 | 25 | softmax | binary_crossentropy | 0.001 | Adam |
| 50 | 0.082887 | 0.964807 | 0.694462 | 0.210249 | 0.946667 | 0.695674 | 32 | 0.5 | 50 | softmax | binary_crossentropy | 0.0005 | Adam |
| 50 | 0.05791 | 0.974249 | 0.710851 | 0.199558 | 0.946667 | 0.711908 | 32 | 0.5 | 50 | softmax | binary_crossentropy | 0.001 | Adam |
| 25 | 0.075853 | 0.972532 | 0.72371 | 0.195727 | 0.94 | 0.725708 | 32 | 0 | 25 | softmax | binary_crossentropy | 0.0005 | Adam |
| 25 | 0.160947 | 0.940865 | 0.542317 | 0.172528 | 0.94 | 0.54583 | 64 | 0.5 | 25 | softmax | binary_crossentropy | 0.0005 | Adam |
| 50 | 0.118671 | 0.953222 | 0.639064 | 0.172585 | 0.94 | 0.640433 | 64 | 0 | 50 | softmax | binary_crossentropy | 0.0001 | Adam |
| 50 | 0.142007 | 0.940728 | 0.55227 | 0.152683 | 0.933333 | 0.553479 | 16 | 0.5 | 50 | softmax | binary_crossentropy | 0.0001 | Adam |
| 50 | 0.224797 | 0.917917 | 0.462917 | 0.232941 | 0.933333 | 0.464749 | 64 | 0.5 | 50 | softmax | binary_crossentropy | 0.0001 | Adam |
| 50 | 0.158593 | 0.95597 | 0.59361 | 0.172896 | 0.933333 | 0.59499 | 16 | 0 | 50 | softmax | binary_crossentropy | 0.0005 | SGD |
| 25 | 0.249306 | 0.917019 | 0.45587 | 0.250442 | 0.926667 | 0.459565 | 16 | 0.5 | 25 | softmax | binary_crossentropy | 0.0001 | Adam |
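The parameter columns in the tables above correspond to the values that were varied during tuning. Purely as an illustration (not our exact tuning script), the combinations appearing in the tables can be enumerated like this, here using scikit-learn's ParameterGrid; `build_and_train_model` is a hypothetical placeholder:

```python
# Illustrative sketch: enumerate the hyperparameter combinations that appear
# in the tables above. The actual scan in our repository may be organised
# differently.
from sklearn.model_selection import ParameterGrid

param_grid = {
    "batch_size": [16, 32, 64],
    "dropout": [0, 0.5],
    "epochs": [25, 50],
    "last_activation": ["softmax"],
    "loss": ["binary_crossentropy"],
    "lr": [1e-3, 5e-4, 1e-4],
    "optimizer": ["Adam", "SGD"],
}

for params in ParameterGrid(param_grid):
    # build_and_train_model would compile the DenseNet-121 head with these
    # settings, train it, and return the validation metrics for ranking.
    # val_metrics = build_and_train_model(**params)
    print(params)
```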
