# Notes [On Interpretability of Deep Learning based Skin Lesion Classifiers using Concept Activation Vector](https://arxiv.org/pdf/2005.02000.pdf)
###### tags: `notes` `medical image analysis` `skin lesion classification` `concept activation vectors`
## Brief Outline
---
The paper states that current methods of computer-aided diagnosis are not widely accepted because their decision-making is opaque. The main aim of this paper is to examine whether a deep learning model makes decisions in a way similar to medical experts. The datasets used are two skin disease datasets, PH2 and derm7pt. Concept activation vectors (CAVs) are used to map human-understandable concepts to the RECOD image classification model. The results show that the classifier learns and encodes human-understandable concepts in its latent representation.
## Introduction
---
* Trust in AI models for medical diagnosis has been limited because the way such models learn and encode features in latent space is not well understood.
* Manual disease classification requires grasping micro-level features along with macro-level concepts, which often demands considerable medical experience.
* Related work in this field includes saliency-map visualisation, which works well on common object detection tasks but fails on complex medical image analysis tasks.
* In this paper, skin disease classification is used as the task for understanding what neural networks learn.
* The authors attempt to understand if the concepts learnt by classifiers in complex Medical Image Analysis (MIA) tasks are similar to those used by dermatologists.
* In summary, the paper presents the following:
a) A training and testing paradigm for Concept Activation Vectors (CAVs) using identically distributed data.
b) Mapping the concepts digested by the deep learning model in the form of latent vectors to human understandable information using the CAVs.
c) Examining the contributions of different dermoscopic criteria to the predictions of deep models, revealing agreement between the reasoning processes of doctors and deep models.
## Related work
To interpret the decisions made by neural network models, the following types of explanation methods are usually employed:
### Saliency-Based Neural Network Explanations
* These were among the earliest methods used to explain decisions taken by AI.
* Examples include GradCAM, SmoothGrad, Integrated Gradients, and Layer-Wise Relevance Propagation.
* These provide importance maps on a local scale.
* Although these methods are very successful on numerous datasets, they do not seem to work well on dermatology datasets.
* This might partly be due to the innate difficulty of skin lesion classification, which demands a large amount of expert knowledge to recognize complex and subtle structures.
* Another reason could be the large variation in the fine nuances of these structures, which are hard to discern yet can drastically change the diagnosis.
* In addition, the visual features related to disease in skin lesion images are usually scattered all over the image.
### Text Based Neural Network Explanations
* Textual explanation methods for neural networks can be either template-based or rule-based.
* The [MDNet](https://arxiv.org/pdf/1707.02485.pdf) paper proposed a unified network following a rule-based approach that generates diagnostic reports along with corresponding attention maps of input images, in order to increase the semantic and visual interpretability of the MIA task at hand.
### Concept-Based Neural Network Explanations
* These methods make the latent representations of neural networks interpretable for humans.
* CAVs were first introduced by [Kim et al.](https://arxiv.org/abs/1711.11279) to map human-understandable concepts, which can be drawn from various domains, to the latent space.
* Directional derivatives are used to evaluate the influence of a concept on a given prediction.
* Other notable works in this field include the paper by [Zhou et al.](https://openaccess.thecvf.com/content_ECCV_2018/papers/Antonio_Torralba_Interpretable_Basis_Decomposition_ECCV_2018_paper.pdf) which is aimed at decomposing neural networks’ activations into semantically meaningful components.
* [Ghorbani et al.](https://arxiv.org/pdf/1902.03129v1.pdf) developed a method for unsupervised clustering of object datasets by first applying segmentation of single objects and then clustering activations of object patches into semantically meaningful clusters.
* The authors state that CAVs had not previously been explored for skin lesion classification.
* Moreover, the related works mentioned above cannot be applied directly because of the spatial overlap of concepts.
* The method proposed by [Zhou et al.](https://openaccess.thecvf.com/content_ECCV_2018/papers/Antonio_Torralba_Interpretable_Basis_Decomposition_ECCV_2018_paper.pdf) requires a concept corpus, which is not readily available for skin lesion classification.
* This paper uses TCAV as its backbone.
* Instead of using general, out-of-distribution concept patches, the authors train CAVs using samples from identically distributed datasets to map human-understandable concepts to the network’s latent space.
## Background
### Concept activation vectors
* These were first introduced in the paper by [Kim et al.](https://arxiv.org/pdf/1711.11279.pdf).
* A CAV is a vector perpendicular to the decision boundary of a linear binary classifier trained on the latent space (layer activations) to separate examples of a concept from other examples.
* TCAV score: a metric that estimates the influence of a CAV on a class of input images.
* It uses the directional derivatives $S_{C,k,l}(x)$ to measure how sensitive the predictions for an entire input class are to a concept, thereby providing global explanations.
* The TCAV score is calculated by:
$\mathrm{TCAV}_{Q_{C,k,l}} = \frac{|\{x \in X_{k} : S_{C,k,l}(x) > 0\}|}{|X_{k}|}$
* As compared to saliency maps or other per-feature metrics, the TCAV score allows for quantitative evaluation of concepts on whole input classes.
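
To make the two definitions above concrete, here is a minimal sketch in plain Python. It assumes that the layer activations, and the gradients of the class logit with respect to those activations, are already available as tuples of floats. The toy values, the function names `train_cav` and `tcav_score`, and the hand-rolled logistic regression are illustrative assumptions, not the authors' implementation. Following Kim et al., the directional derivative $S_{C,k,l}(x)$ is taken as the dot product of the gradient with the CAV.

```python
import math


def train_cav(concept_acts, other_acts, epochs=200, lr=0.1):
    """Fit a logistic-regression classifier (concept vs. other activations)
    and return the unit normal of its decision boundary as the CAV."""
    dim = len(concept_acts[0])
    w = [0.0] * dim
    b = 0.0
    data = [(a, 1.0) for a in concept_acts] + [(a, 0.0) for a in other_acts]
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log-loss with respect to z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


def tcav_score(class_gradients, cav):
    """Fraction of class inputs whose directional derivative along the CAV
    is strictly positive."""
    sensitivities = [
        sum(g * v for g, v in zip(grad, cav)) for grad in class_gradients
    ]
    return sum(1 for s in sensitivities if s > 0) / len(sensitivities)


# Hypothetical 2-D activations: concept examples vs. other examples.
concept_acts = [(2.0, 0.5), (2.5, -0.5), (1.5, 0.0), (3.0, 1.0)]
other_acts = [(-2.0, -0.5), (-2.5, 0.5), (-1.5, 0.0), (-3.0, -1.0)]

# Hypothetical gradients of the class logit for four images of class k.
class_gradients = [(1.0, 0.1), (0.6, -0.1), (-0.7, 0.1), (0.9, 0.0)]

cav = train_cav(concept_acts, other_acts)
print("CAV:", cav)
print("TCAV score:", tcav_score(class_gradients, cav))
```

In a real setting the activations would have thousands of dimensions and the gradients would come from backpropagation through the network; only the final dot product and counting step carry over unchanged.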
### Dermoscopic Concepts used for Analysis
* The concepts used follow the standardized terminology agreed upon by expert dermatologists at the 3rd Consensus Conference of the International Society of Dermoscopy (IDS).
* Some of these are: a) Pigment networks b) Streaks c) Regression structures d) Dots and globules e) Blue and whitish veils f) Asymmetry g) Colour. Details on these concepts can be found in the paper.
## Materials and Method
### Model
The model used by the authors was developed by RECOD Lab, Brazil, as part of the ISBI 2017 challenge (Code: [Link](https://github.com/learningtitans/isbi2017-part3)). It combines transfer learning with extensive ensembling, using an SVM meta-layer on top of the base models. There were seven base models: three Inception models trained on the deploy set (one of which is used in this paper as the base model), three Inception models trained on the semi set, and one ResNet trained on the semi set. More details can be found [here](https://arxiv.org/pdf/1703.04819.pdf). The authors used this model directly for explainability. The set used has 9640 images, with per-image normalization applied.
### Datasets
#### Concept Training
1. PH$^2$: 200 dermoscopic images of melanocytic lesions: 80 common naevi, 80 atypical naevi and 40 melanomas. Colour and lesion segmentation masks are provided with proper annotations.
2. derm7pt (Seven-Point Checklist Dermatology Dataset): It has 1,011 clinical and dermoscopic images. Samples are assigned either a miscellaneous class or one of 4 diagnosis classes. Two of those (Melanoma and Naevi) are further divided into 13 classes. The paper uses only samples from these two classes, for a total of 823 images.
#### ISBI 2017 Challenge Dataset
The training set of the ISBI 2017 challenge has 1372 Naevi (NV) samples, 374 Melanoma (MEL) samples and 254 Seborrheic Keratosis (SK) samples. The test set has 393 NV images, 117 MEL images and 90 SK images.
NOTE: Random concept labels are assigned to ISIC archive images excluding the MEL and NV classes, because the remaining samples hardly contain concepts similar to the ones used in concept training.
### Experimental Setup
![](https://i.imgur.com/hKSdq86.png)
1. Step 1: The input image is passed through the trained Inception v4 (Paper: [Link](https://arxiv.org/pdf/1602.07261v1.pdf)) base model described earlier. Activations are extracted from the mixed_6h layer. The activations, along with the concept annotations, are passed to binary concept training. Clustering-based undersampling and stratified splitting are applied to balance the training and validation sets.
2. Step 2: The gradient is calculated with respect to the ground truth label. Together with the CAV, it is used to calculate the TCAV score, which evaluates the importance of a concept to a specific target class. To account for differences in preprocessing and classifier initialization, each classifier training is repeated 20 times on randomly sampled dataset splits.
50 random CAVs are trained per layer. Random datasets are produced by repeatedly sampling 100 random images from the ISIC dataset. The distributions of actual TCAV scores and random-concept TCAV scores are compared using a two-sided t-test with $\alpha = 0.05$.
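
The significance check in this setup can be sketched as follows. The notes only mention a two-sided t-test at $\alpha = 0.05$; the Welch (unequal-variance) variant, the numerical integration of the t density, the function names, and the TCAV values below are all assumptions made for illustration rather than details taken from the paper.

```python
import math
from statistics import mean, variance


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def two_sided_p_value(t_stat, df, steps=2000):
    """Two-sided p-value, integrating the t density over [0, |t|]
    with Simpson's rule (steps must be even)."""
    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)


def welch_t_test(a, b):
    """Welch's t statistic and its two-sided p-value for two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, two_sided_p_value(t_stat, df)


# Hypothetical TCAV scores: repeated runs for one concept vs. random CAVs.
concept_scores = [0.82, 0.79, 0.85, 0.80, 0.77, 0.83, 0.81, 0.78]
random_scores = [0.48, 0.55, 0.51, 0.44, 0.53, 0.49, 0.57, 0.46]

ALPHA = 0.05
t_stat, p_value = welch_t_test(concept_scores, random_scores)
print("t =", t_stat, "p =", p_value, "significant:", p_value < ALPHA)
```

A concept whose TCAV scores cannot be distinguished from those of random CAVs at this level would be treated as not meaningfully related to the class, which is why the random baseline is trained alongside the real concepts.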
### Results and Analysis
#### Classification Accuracy
![](https://i.imgur.com/ZevjCXb.png)
Each bar represents the classification accuracy for a concept, with its standard deviation. The red line depicts the mean baseline result from training over 50 random concept subsets.
Inference from results: The network's latent space is structured in a way that allows activations to be separated with respect to the concepts.
#### TCAV Scores
![](https://i.imgur.com/ksOF5mg.png)
![](https://i.imgur.com/IGLvvDg.png)
Values above 0.5 represent positive influence, whereas lower values indicate negative influence.