# PATHVQA: 30000+ QUESTIONS FOR MEDICAL VISUAL QUESTION ANSWERING

> [paper link](https://arxiv.org/pdf/2003.10286.pdf)

## Introduction

- The paper aims to create a pathology visual question answering (VQA) dataset containing questions similar to those in the ABP tests (the American Board of Pathology certification exam in the US).
- Medical images are highly domain-specific and can only be interpreted by medical professionals, who are difficult and expensive to hire for building medical VQA datasets.
- Medical images are also hard to obtain due to privacy concerns.
- To counter these challenges, the authors resort to freely accessible pathology textbooks and online digital libraries.
- These sources provide images together with captions, and the captions can then be used to generate questions.
- They developed a semi-automated pipeline to generate question-answer pairs from each caption, which are then manually checked for grammatical issues.
- The major contributions of this paper include:
    - A pathology VQA dataset containing 4,998 pathology images and 32,799 question-answer pairs, the first dataset for pathology VQA.
    - A semi-automated pipeline to efficiently create medical VQA datasets from medical textbooks and online digital libraries.
    - Baseline results from several state-of-the-art VQA methods applied to the dataset, for other researchers to benchmark against.

## Related Works

- ### Datasets
    - There are 2 existing datasets for medical visual question answering.
    - *VQA-Med* is built on 4,200 radiology images and has 15,292 question-answer pairs covering 4 categories of clinical questions: modality, plane, organ system, and abnormality.
    - *VQA-RAD* is a manually crafted dataset whose questions and answers are given by clinicians on radiology images; it has 3,515 questions of 11 types.
    - In contrast, PathVQA consists of pathology images, most of its questions are open-ended, and it contains a greater number of questions.

## Dataset Collection

- The automated dataset collection consists of 2 steps:
    - Extracting images and captions from textbooks and websites.
    - Generating question-answer pairs from the captions.
- The generated pairs are then checked manually for grammatical errors.
- ### Extracting Pathology Images
    - Given the PDF of a textbook, PyPDF and PDFMiner are used to extract images and captions.
    - They use regular expressions to search for snippets prefixed with “Fig." or “Figure", followed by the figure number and the caption text.
    - Given an online pathology digital library such as PEIR, they use the third-party tools Requests and Beautiful Soup for the same purpose.
- ### Question Generation
    - They perform natural language processing on the captions using the Stanford CoreNLP toolkit.
    - Sentence simplification breaks long sentences into short ones.
    - Given the POS tags and named entities of the simplified sentences, they generate questions using question words such as 'when', 'where', 'how', etc.
    - The question transducer mainly contains 3 steps (a toy sketch follows this section):
        - First, they decompose the main verb based on its tense. For instance, “shows" is decomposed into “does show".
        - Second, they invert the subject and the auxiliary verb of the declarative sentence to form an interrogative sentence. This step creates yes/no questions.
        - Third, they remove the target answer phrase and insert the question phrase obtained previously to generate open-ended questions of the types “what", “where", “when", “whose", “how", and “how much/how many".
    - The questions and answers are further cleaned by removing extra spaces and irrelevant symbols.
    - Questions that are too short or vague are removed.
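To make the transducer concrete, here is a minimal, hypothetical sketch of the first two steps (main-verb decomposition plus subject-auxiliary inversion) for producing a yes/no question. The paper relies on Stanford CoreNLP POS tags and parses; this toy version substitutes a tiny hand-written verb list, and the function name and verb list are illustrative assumptions, not the authors' code.

```python
import re


def to_yes_no_question(sentence: str) -> str:
    """Turn a simple declarative caption sentence into a yes/no question by
    decomposing the main verb ("shows" -> "does ... show") and moving the
    auxiliary in front of the subject (subject-auxiliary inversion)."""
    sentence = sentence.strip().rstrip(".")
    # Tiny hand-written verb list standing in for CoreNLP POS tagging.
    match = re.search(r"\b(shows|reveals|demonstrates|contains)\b", sentence)
    if match is None:
        return sentence + "?"
    verb = match.group(1)
    base_form = verb[:-1]                        # "shows" -> "show"
    subject = sentence[:match.start()].strip()   # e.g. "The micrograph"
    remainder = sentence[match.end():].strip()   # e.g. "acute inflammation"
    if not subject:
        return sentence + "?"
    return f"Does {subject[0].lower() + subject[1:]} {base_form} {remainder}?"


print(to_yes_no_question("The micrograph shows acute inflammation."))
# -> Does the micrograph show acute inflammation?
```

The third step would additionally delete an answer phrase (e.g. "acute inflammation") and substitute a question word ("what"), which requires the POS tags and named entities mentioned above.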
![](https://i.imgur.com/WzZPro6.png)

## Dataset Statistics

- The *PathVQA* dataset contains 32,799 question-answer pairs generated from 4,998 pathology images.
- The textbooks used for collection, *Textbook of Pathology* and *Basic Pathology*, provided 1,670 images.
- The remaining 3,328 pathology images were collected from the PEIR digital library.
- Table 3 summarises the basic statistics.

![](https://i.imgur.com/SaJNiNi.png)

- There are 7 categories of questions: what, where, when, whose, how, how much/how many, and yes/no.
- The following table gives the percentage of questions in each category.

![](https://i.imgur.com/rCJQMzD.png)

- The questions cover various aspects of the visual contents, including color, location, appearance, shape, etc.

## Benchmark VQA Performance

- They apply existing state-of-the-art VQA models to the PathVQA dataset.
- ### Models
    - **Method 1:** [Bilinear Attention Networks](https://arxiv.org/abs/1805.07932)
    - **Method 2:** [Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding](https://arxiv.org/abs/1606.01847)
    - **Method 3:** [Stacked Attention Networks for Image Question Answering](https://arxiv.org/abs/1511.02274)
- ### Experimental Settings
    - The text goes through standard preprocessing: removing punctuation and stop words, tokenization, and lower-casing.
    - The vocabulary consists of the 2,200 highest-frequency words, which are represented with GloVe embeddings pre-trained on a general corpus such as Wikipedia.
    - Data augmentation is applied to the images, including shifting, scaling, and shearing.
    - Faster R-CNN is pre-trained on Visual Genome, while the other two visual feature extractors (VGGNet and ResNet) are pre-trained on ImageNet.
    - In *Method 1*:
        - The dropout rate is 0.2 for the linear mapping and 0.5 for the classifier.
        - The initial learning rate is set to 0.005 with the Adamax optimizer.
        - The batch size is set to 512.
    - In *Method 2*:
        - Dropout is applied to the LSTM layers with a probability of 0.4.
        - The feature dimension in multimodal compact bilinear pooling is set to 2048.
        - The optimizer is Adam with an initial learning rate of 0.0001 and a mini-batch size of 32.
    - In *Method 3*:
        - The number of attention layers and LSTM layers are both set to 2.
        - The hidden dimensionality of the LSTMs is set to 512.
        - The weight parameters are learned using stochastic gradient descent (SGD) with a momentum of 0.9, a learning rate of 0.1, and a mini-batch size of 100.
    - For yes/no questions, they evaluate using accuracy.
    - For open-ended questions, they evaluate using 3 metrics (simplified implementations are sketched below):
        - [Exact match](https://arxiv.org/abs/1410.0210)
        - [Macro-averaged F1](https://www.researchgate.net/publication/226675412_A_Probabilistic_Interpretation_of_Precision_Recall_and_F-AC)
        - [BLEU score](https://www.aclweb.org/anthology/P02-1040.pdf)
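The sketch below gives simplified, single-answer versions of the three open-ended metrics. These are my own minimal re-implementations, not the authors' evaluation scripts: the paper's macro-averaged F1 aggregates per-question scores over the whole test set, and it reports standard BLEU rather than the unigram approximation used here.

```python
import math
from collections import Counter


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized predicted answer equals the gold answer, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())


def token_f1(pred: str, gold: str) -> float:
    """F1 over answer tokens for a single question; macro-averaged F1 then
    averages these per-question scores across the test set."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def bleu1(pred: str, gold: str) -> float:
    """Unigram BLEU with a brevity penalty against a single reference answer;
    an approximation of the standard BLEU reported in the paper."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    precision = overlap / len(pred_tokens)
    brevity = 1.0 if len(pred_tokens) >= len(gold_tokens) else math.exp(1 - len(gold_tokens) / len(pred_tokens))
    return brevity * precision


print(exact_match("chronic inflammation", "chronic inflammation"))                  # 1.0
print(round(token_f1("acute inflammation", "acute suppurative inflammation"), 2))   # 0.8
print(round(bleu1("acute inflammation", "acute suppurative inflammation"), 2))      # ~0.61
```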
- ### Results
    - The following table gives the accuracy of all three methods on yes/no questions.
      ![](https://i.imgur.com/QQSwFoo.png)
    - Method 1 outperforms the rest; one of the primary reasons is that it uses a bottom-up method to extract region-specific visual features.
    - This can be further verified by comparing Method 3 + Faster R-CNN with Method 3: Method 3 + Faster R-CNN extracts region-specific features, hence the increased accuracy.
    - Besides, the use of residual learning of attention and the superiority of bilinear attention over other co-attention approaches also contribute to the highest accuracy of Method 1.
    - Method 3 also outperforms Method 2 because it uses multiple attention layers rather than a single one as in Method 2.
    - Method 3 + ResNet works better than Method 3 because ResNet extracts better visual features than VGGNet.
    - The table below shows the [exact match](https://arxiv.org/abs/1410.0210), [Macro-averaged F1](https://www.researchgate.net/publication/226675412_A_Probabilistic_Interpretation_of_Precision_Recall_and_F-AC), and [BLEU](https://www.aclweb.org/anthology/P02-1040.pdf) scores for open-ended questions.
      ![](https://i.imgur.com/PGlMGaQ.png)
    - These scores are low in general, which indicates that this dataset is challenging for medical VQA.
    - The dataset is challenging because:
        - Most questions are open-ended. The number of possible answers is $O(V^L)$, where $V$ is the vocabulary size and $L$ is the expected length of answers. This also incurs the out-of-vocabulary issue, where words in the test examples may never occur in the training examples.
        - The size of PathVQA is much smaller than that of general-domain VQA datasets.

## Suggestions for Improved Models

- The visual feature extractors are pre-trained on general-domain data, which has little correlation with the images in this dataset.
- One way to deal with this is to collect publicly available pathology images, whose domain is closer to that of the images in this dataset, and pre-train the CNNs on these medical images (a sketch is given at the end of these notes).
- Similarly, the word embeddings can be pre-trained on medical literature.

## Conclusion

- They build a pathology VQA dataset that contains 32,799 question-answer pairs of 7 categories, generated from 4,998 images.
- The dataset is publicly available.
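As a follow-up to the "Suggestions for Improved Models" section above, here is a minimal sketch of domain-adaptive pre-training of the visual feature extractor, written with PyTorch/torchvision as an assumed framework. The directory name, the choice of ResNet-50, the classification pre-training objective, and all hyperparameters are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Hypothetical folder of publicly available pathology images, organized into
# class sub-folders (e.g. by tissue or lesion type); not part of the paper.
PATHOLOGY_IMAGE_DIR = "data/pathology_pretrain"

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder(PATHOLOGY_IMAGE_DIR, transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Start from ImageNet weights, then continue training on pathology images so
# the extracted visual features are closer to the PathVQA domain.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, len(dataset.classes))

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # one illustrative epoch
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

# Afterwards, drop the classification head and reuse the backbone as the
# image encoder in the VQA models (e.g. Method 3).
```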