# PATHVQA: 30000+ QUESTIONS FOR MEDICAL VISUAL QUESTION ANSWERING
>[paper link](https://arxiv.org/pdf/2003.10286.pdf)
## Introduction
- The paper aims to create a pathology visual question answering (VQA) dataset that contains questions similar to those in the American Board of Pathology (ABP) tests, a medical certification exam in the US.
- Medical images are highly domain-specific and can only be interpreted by medical professionals, who are difficult and expensive to hire for creating medical VQA datasets.
- Also, medical images are very difficult to obtain due to privacy concerns.
- To counter these challenges, the authors resort to freely accessible pathology textbooks as well as online digital libraries.
- These sources provide images together with captions; the captions can then be used to generate questions.
- They have developed a semi-automated pipeline to generate question-answer pairs from each caption, and then manually checked for grammatical issues.
- The major contributions of this paper include:
	- They have created a pathology visual question answering (VQA) dataset containing 4,998 pathology images and 32,799 question-answer pairs, the first dataset for pathology VQA.
	- They developed a semi-automated pipeline to efficiently create medical VQA datasets from medical textbooks and online digital libraries.
- They apply several state-of-the-art VQA methods to their dataset and generate a set of baseline results for other researchers to benchmark with.
## Related Works
- ### Datasets
- There are 2 existing datasets for medical Visual Question Answering.
	- The *VQA-Med* dataset is built on 4,200 radiology images and has 15,292 question-answer pairs covering four categories of clinical questions: modality, plane, organ system, and abnormality.
	- *VQA-RAD* is a manually crafted dataset in which clinicians wrote the questions and answers about radiology images; it contains 3,515 questions of 11 types.
	- In contrast, the *PathVQA* dataset consists of pathology images, most of its questions are open-ended, and it contains a much larger number of questions.
## Dataset Collection
- The automated dataset collection consists of two steps:
	- Extracting images and captions from textbooks and websites.
	- Generating question-answer pairs from the captions.
- The generated pairs are checked manually for grammatical errors.
- ### Extracting Pathology Images:
	- Given the PDF of a textbook, PyPDF and PDFMiner are used to extract images and captions.
	- They use regular expressions to search for snippets with the prefix "Fig." or "Figure", followed by the figure number and the caption text (a minimal sketch follows this subsection).
	- For online pathology digital libraries such as PEIR, they use the third-party tools Requests and Beautiful Soup for the same purpose.
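A minimal sketch of the caption-extraction step, assuming the textbook PDF has already been dumped to plain text with PDFMiner; the file name and the exact regular expression are illustrative guesses, not the authors' actual code:

```python
import re
from pdfminer.high_level import extract_text  # pdfminer.six high-level API

# Hypothetical input: a pathology textbook PDF dumped to plain text.
text = extract_text("textbook_of_pathology.pdf")

# Search for snippets prefixed with "Fig." or "Figure", followed by a figure
# number and the caption text, stopping at the next blank line.
caption_pattern = re.compile(
    r"(?:Fig\.|Figure)\s*(\d+(?:\.\d+)*)\s*[:.]?\s*(.+?)(?=\n\s*\n)",
    re.DOTALL,
)

captions = {num: " ".join(body.split()) for num, body in caption_pattern.findall(text)}
for num, body in list(captions.items())[:3]:
    print(num, "->", body[:80])
```

Scraping PEIR with Requests and Beautiful Soup would follow the same pattern: download each page, then select the image tags and their caption elements.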
- ### Question Generation:
	- They perform natural language processing on the captions using the Stanford CoreNLP toolkit.
	- Sentence simplification is performed to break long sentences into short ones.
- Given the POS tags and named entities of the simplified sentences, they generate questions for them using question words like 'when', 'where', 'how' etc.
	- The question transducer mainly contains three steps (a toy illustration follows this subsection):
		- First, they decompose the main verb based on its tense. For instance, "shows" is decomposed into "does show".
		- Second, they invert the subject and the auxiliary verb of the declarative sentence to form an interrogative sentence. This step creates yes/no questions.
		- Third, they remove the target answer phrase and insert the question phrase obtained previously to generate open-ended questions of the types "what", "where", "when", "whose", "how", and "how much/how many".
- The questions and answers are further cleaned by removing extra spaces and irrelevant symbols.
- Questions that are too short or vague are removed.
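A toy, rule-based illustration of the three transducer steps on a single simplified caption. The real pipeline derives the sentence parts from Stanford CoreNLP parses; here they are hard-coded for clarity:

```python
# Simplified caption and its (assumed) parsed components.
caption = "The micrograph shows chronic inflammation in the liver."
subject = "the micrograph"              # grammatical subject
answer_phrase = "chronic inflammation"  # target answer phrase
remainder = "in the liver"

# Step 1: decompose the main verb based on its tense ("shows" -> "does show").
aux, bare_verb = "does", "show"

# Step 2: invert the subject and the auxiliary verb to get a yes/no question.
yes_no_q = f"{aux.capitalize()} {subject} {bare_verb} {answer_phrase} {remainder}?"

# Step 3: remove the answer phrase and insert a question phrase ("what")
# to get an open-ended question whose answer is the removed phrase.
open_q = f"What {aux} {subject} {bare_verb} {remainder}?"

print(yes_no_q)  # Does the micrograph show chronic inflammation in the liver?
print(open_q)    # What does the micrograph show in the liver?
print("Answer:", answer_phrase)
```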

## Dataset Statistics
- The *PathVQA* dataset contains 32,799 question-answer pairs generated from 4,998 pathology images.
- Two textbooks, *Textbook of Pathology* and *Basic Pathology*, contributed 1,670 of the images.
- The remaining 3,328 pathology images were collected from the PEIR digital library.
- Table 3 summarises the basic statistics.

- There are 7 categories of questions: what, where, when, whose, how, how much/how many, and yes/no.
- The following table gives the percentage of the dataset allotted to each question type.

- The questions cover various aspects of the visual contents, including color, location, appearance, shape, etc.
## Benchmark VQA performance
- They apply existing state-of-the-art VQA models to the PathVQA dataset.
- ### Models
- **Method 1 :** [Bilinear Attention Networks](https://arxiv.org/abs/1805.07932)
- **Method 2 :** [Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding](https://arxiv.org/abs/1606.01847)
- **Method 3 :** [Stacked Attention Networks for Image Question Answering](https://arxiv.org/abs/1511.02274)
- ### Experimental Settings
	- The text goes through standard preprocessing: removing punctuation and stop words, tokenization, and converting to lowercase.
	- The vocabulary consists of the 2,200 highest-frequency words, which are represented using GloVe embeddings pre-trained on a corpus such as Wikipedia.
- Data augmentation is applied to the images, including shifting, scaling, and shearing.
	- The Faster R-CNN feature extractor is pre-trained on Visual Genome, while the CNN extractors used by the latter two methods are pre-trained on ImageNet.
- In *Method 1*,
- the dropout rate for the linear mapping is 0.2 while for the classifier it's 0.5.
		- The initial learning rate is set to 0.005, using the Adamax optimizer.
- The batch size was set to 512.
- In *Method 2*,
- Dropout was applied to the LSTM layers with a probability of 0.4.
- They set the feature dimension to 2048 in multimodal compact bilinear pooling.
- The optimizer was Adam with an initial learning rate of 0.0001 and a mini-batch size of 32.
	- In *Method 3*,
- the number of attention layers and LSTM layers were both set to 2
- the hidden dimensionality of the LSTMs was set to 512.
- The weight parameters were learned using Stochastic Gradient Descent (SGD) with a momentum of 0.9, a learning rate of 0.1, and a mini-batch size of 100.
- For yes/no questions, they evaluate using accuracy.
- For open-ended questions, they evaluate using 3 metrics.
- [exact match](https://arxiv.org/abs/1410.0210)
- [Macro-averaged F1](https://www.researchgate.net/publication/226675412_A_Probabilistic_Interpretation_of_Precision_Recall_and_F-AC)
- [BLEU score](https://www.aclweb.org/anthology/P02-1040.pdf)
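A minimal sketch of how the three open-ended metrics might be computed for one predicted/gold answer pair; the tokenization, smoothing, and averaging used in the paper may differ:

```python
from collections import Counter
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def exact_match(pred: str, gold: str) -> float:
    """1.0 iff the predicted answer string equals the ground truth exactly."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    """F1 over answer tokens; averaging it across questions gives a macro F1."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def bleu(pred: str, gold: str) -> float:
    """Sentence-level BLEU with smoothing, since answers are short."""
    return sentence_bleu([gold.lower().split()], pred.lower().split(),
                         smoothing_function=SmoothingFunction().method1)

print(exact_match("chronic inflammation", "chronic inflammation"))       # 1.0
print(round(token_f1("acute inflammation", "chronic inflammation"), 2))  # 0.5
```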
- ### Results
	- The following table gives the accuracy of all three methods on yes/no questions.

	- Method 1 outperforms the rest; one of the primary reasons is that it uses a bottom-up attention mechanism to extract region-specific visual features.
	- This can be further verified by comparing Method 3 + Faster R-CNN with plain Method 3: the former extracts region-specific features and hence achieves higher accuracy.
	- Besides, the use of residual learning of attention and the superiority of bilinear attention over other co-attention approaches also contribute to Method 1 achieving the highest accuracy.
	- Method 3 also outperforms Method 2 because it uses multiple attention layers rather than a single one.
	- Method 3 + ResNet works better than plain Method 3 because ResNet extracts better visual features than VGGNet.
- The table below shows the [exact match](https://arxiv.org/abs/1410.0210), [Macro-averaged F1](https://www.researchgate.net/publication/226675412_A_Probabilistic_Interpretation_of_Precision_Recall_and_F-AC) and [BLEU](https://www.aclweb.org/anthology/P02-1040.pdf) scores for open ended questions.

	- These scores are low in general, which indicates that this dataset poses a challenging medical VQA task.
	- The reasons it is challenging:
		- Most questions are open-ended. The number of possible answers is $O(V^L)$, where $V$ is the vocabulary size and $L$ is the expected length of the answers. This also incurs the out-of-vocabulary issue, where words in test examples may never occur in the training examples.
		- Also, the size of PathVQA is much smaller than that of general-domain VQA datasets.
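As a rough illustration of how large that answer space is, plug in the 2,200-word vocabulary from the experimental settings and an assumed average answer length of $L = 3$: $V^L = 2200^3 \approx 1.06 \times 10^{10}$ candidate answers.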
## Suggestions for improved models
- The visual feature extractors are pre-trained on general-domain data that has little correlation with the pathology images in this dataset.
- One way to deal with this is to collect publicly available pathology images, whose domain is closer to that of the images in this dataset, and then pre-train the CNNs on these medical images (a minimal sketch follows below).
- Similarly, pre-train the word embeddings on medical literature.
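A minimal sketch of the first suggestion, assuming a recent torchvision and a hypothetical folder of labeled pathology images; the dataset path, architecture choice, and hyperparameters are illustrative, not from the paper:

```python
import torch
from torch import nn, optim
from torchvision import datasets, models, transforms

# Hypothetical folder of pathology images organized into class sub-folders.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])
pathology_data = datasets.ImageFolder("data/pathology_pretrain", transform=transform)
loader = torch.utils.data.DataLoader(pathology_data, batch_size=32, shuffle=True)

# Start from ImageNet weights, then continue pre-training on pathology images.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, len(pathology_data.classes))

optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # one pass; repeat for more epochs as needed
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

# The resulting backbone (minus the final layer) can then replace the
# general-domain visual feature extractor in the VQA models above.
```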
## Conclusion
- They build a pathology VQA dataset that contains 32,799 question-answer pairs of 7 categories, generated from 4,998 images.
- The dataset is publicly available.