# Author Rebuttal to All Reviewers

We thank all reviewers for acknowledging the contribution of this work and their constructive feedback. We are pleased that you appreciate the following:

- "introducing MedTrinity-25M as the largest multimodal medical dataset with multigranular annotations and an automated annotation pipeline that scales medical image-text data, and its support for improving pretrained models like CLIP and LLaVA" (**Reviewer FQpc**);
- "the use of an automated pipeline incorporating medical knowledge retrieval and domain-specific models to reduce manual labeling, and the dataset's support for diverse medical tasks across multiple domains" (**Reviewer jxrU**);
- "substantially increasing training data scale for vision-language tasks with structured, multigranular annotations that provide superior detail compared to other datasets" (**Reviewer d16k**);
- "the comprehensiveness of the dataset, spanning diverse modalities, diseases, and anatomical structures, and the enrichment from advanced models that enhance annotation quality and depth" (**Reviewer DJFs**).

MedTrinity-25M aims to address the pressing need for large-scale, high-quality multimodal datasets in medical AI. Our automated pipeline efficiently scales multimodal datasets in medicine, enabling large-scale pretraining of medical AI models. The dataset's multigranular annotations further support the development of more precise and robust models. To foster progress in the field, we will release all data and code, hoping to advance the training of next-generation medical imaging foundation models.

## Reviewer FQpc

### Weakness 1 (Data Quality)

> "The quality of the generated image-text data may not be sufficiently high..."

Thank you for your detailed and insightful feedback. We have thoroughly reviewed the prompts and metadata provided to the MLLM for annotation generation and are confident in their accuracy. For your reference, we include our metadata for the ct_rate and quilt_1m datasets below:

- ct_rate: This is a chest image from CT volume with {disease} in green bounding boxes.
- quilt_1m: This is an image of histological samples of various human cancers. Each type of cell nuclei is color-coded and outlined with a bounding box. Larger bounding boxes are used to group clusters of the same type of cell nuclei.

However, we believe that the observed misidentification of modalities likely originates from the MLLM itself, which may have rewritten or misinterpreted the provided metadata. This suggests a critical limitation of MLLMs that warrants further investigation, though it is beyond the scope of this paper. We will address this limitation in future work and clarify it in the paper.

While we acknowledge that annotations generated by MLLMs may contain some noise compared to human annotations, we strongly assert that MedTrinity-25M is a highly valuable resource for advancing medical MLLMs. For instance, as shown in Table 3 of the original manuscript, our dataset led to performance improvements of 10.8%, 6.1%, and 8.3% on VQA-RAD, SLAKE, and PathVQA, respectively, even in the presence of some annotation noise.

### Weakness 2 (Result on MMMU)

> A broader range of specialized benchmarks, such as the health subset of MMMU, could provide a more robust comparison of the multimodal large models' performance.

Thank you for your suggestion. Accordingly, we compared the performance of LLaVA-Med and the proposed LLaVA-Tri on the health and medicine subsets of MMMU. We report the micro-average accuracy scores in the following table.

| | Basic Medical Science | Diagnosis and Laboratory Medicine |
|:---:|:---:|:---:|
| | 326 samples | 162 samples |
| LLaVA-Tri | 0.371 | 0.278 |
| LLaVA-Med | 0.270 | 0.259 |

As shown in the table, LLaVA-Tri significantly outperforms LLaVA-Med in all evaluated areas, indicating the effectiveness of training the model with MedTrinity-25M. The details of this experiment are as follows:

| Parameter | Value |
|:---:|:---:|
| temperature | 0.1 |
| num\_beam | 1 |
| max\_new\_tokens | 128 |
| top\_p | None |
| others | default |

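For reproducibility, these decoding settings can also be written down as a generation config. The sketch below assumes a HuggingFace-style interface with illustrative variable names; it is not our exact evaluation script.

```python
from transformers import GenerationConfig

# Decoding settings used in the MMMU comparison; all other parameters keep their defaults.
gen_config = GenerationConfig(
    temperature=0.1,
    num_beams=1,
    max_new_tokens=128,
    top_p=None,
    do_sample=True,  # assumption: temperature only takes effect when sampling is enabled
)

# Hypothetical usage with an already loaded multimodal model and preprocessed inputs:
# outputs = model.generate(**inputs, generation_config=gen_config)
```
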
### Question 1 (Choice of GPT-4V)

> Why did you choose to use GPT-4V to generate the supervised fine-tuning data for the fine-tuning process?

The choice of MLLMs is flexible and not the main focus of our paper. Our primary contribution lies in developing an automated pipeline to scale up multimodal data by generating multigranular annotations from unpaired image inputs. In our pipeline, MLLMs are tasked with producing multigranular annotations based on complex instructions. We selected GPT-4V for this purpose due to its strong ability to follow detailed instructions and generate comprehensive outputs [1]. However, it is important to note that while GPT-4V was chosen, the pipeline is designed to be flexible and can readily incorporate other MLLMs as needed.

[1] Peng, Baolin, et al. "Instruction tuning with GPT-4." arXiv preprint arXiv:2304.03277 (2023).

### Question 2 & Question 3 (Revision of Table 3)

> ...I suggested distinguishing the results presented

Thanks for your suggestion. We revised the presentation of results in Table 3 of the original manuscript with the following modifications:

1. distinguishing between models that are fine-tuned on the corresponding training sets and those evaluated directly without fine-tuning;
2. categorizing the results into two groups, for CLIP models and MLLMs respectively;
3. explicitly indicating the number of fine-tuning epochs of LLaVA-Med and LLaVA-Tri (3 epochs for all fine-tuned methods).

#### Methods Fine-tuned on the Training Set of the VQA Benchmark

**CLIP-based**

| Method | VQA-RAD Open | VQA-RAD Closed | VQA-RAD Avg | SLAKE Open | SLAKE Closed | SLAKE Avg | PathVQA Open | PathVQA Closed | PathVQA Avg |
|---|---|---|---|---|---|---|---|---|---|
| PubMedCLIP | 60.1 | 80.0 | 70.1 | 78.4 | 82.5 | 80.5 | - | - | - |
| BiomedCLIP | 67.6 | 79.8 | 73.7 | 82.1 | 89.7 | 85.9 | - | - | - |

**Non-CLIP-based**

| Method | VQA-RAD Open | VQA-RAD Closed | VQA-RAD Avg | SLAKE Open | SLAKE Closed | SLAKE Avg | PathVQA Open | PathVQA Closed | PathVQA Avg |
|---|---|---|---|---|---|---|---|---|---|
| VL Encoder–Decoder | 71.5 | 82.5 | 77.0 | - | - | - | 71.5 | 85.6 | 78.6 |
| Q2ATransformer | 79.2 | 81.2 | 80.2 | - | - | - | 54.9 | 88.9 | 71.9 |
| Prefix T. Medical LM | - | - | - | 84.3 | 82.0 | 83.2 | 40.0 | 87.0 | 63.5 |
| M2I2 | 66.5 | 83.5 | 75.0 | 74.7 | 91.1 | 82.9 | 36.3 | 88.0 | 62.2 |
| LLaVA | 50.0 | 65.1 | 57.6 | 78.2 | 63.2 | 70.7 | 7.7 | 63.2 | 35.5 |
| LLaVA-Med (fine-tuned for 3 epochs) | 55.5 | 66.5 | 61.0 | 80.5 | 64.2 | 72.4 | 35.9 | 89.2 | 62.5 |
| **LLaVA-Tri (fine-tuned for 3 epochs)** | **77.1** | **86.0** | **81.6** | **86.2** | **89.3** | **87.8** | **66.5** | **99.0** | **82.8** |

#### Methods Not Fine-tuned on the Training Set of the VQA Benchmark

| Method | VQA-RAD Open | VQA-RAD Closed | VQA-RAD Avg | SLAKE Open | SLAKE Closed | SLAKE Avg | PathVQA Open | PathVQA Closed | PathVQA Avg |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4V | 39.5 | 78.9 | 59.2 | 33.6 | 43.6 | 38.6 | - | - | - |
| LLaVA-Med | 28.2 | 61.4 | 44.8 | 39.2 | 52.2 | 45.7 | 12.3 | 54.1 | 33.2 |
| **LLaVA-Tri** | **36.9** | **62.6** | **49.7** | **24.1** | **43.4** | **33.75** | **11.2** | **59.0** | **35.10** |

### Question 4 (Explanation of the "open" and "close" terms)

> What do "open" and "close" specifically refer to?

Thank you for your question. In the context of the Visual Question Answering (VQA) tasks in our tables, the terms "open" and "close" refer to different question-and-answer formats:

- **Open**: open-ended questions where the model generates free-form text responses without predefined answer options.
- **Close**: closed-ended questions, such as multiple-choice or yes/no questions, where the model selects from a predefined set of possible answers.

We will update the table captions and the related text in the results section to clearly define these terms.

## Reviewer jxrU

### Weakness 1 & Question 4 (Scalability)

> ...domain-specific models may limit scalability when handling new modalities or emerging diseases.

Thank you for your insightful comment. Our pipeline is designed to be inherently scalable for generating Regions of Interest (ROIs) for medical images by leveraging freely available expert grounding models. For images from novel modalities or emerging diseases, we can adopt universal grounding models, such as the Segment Anything Model (SAM), to annotate ROIs in a zero-shot manner. If needed, these models can be fine-tuned efficiently using parameter-efficient fine-tuning (PEFT) methods with only a small number of samples from new domains, in a parallel manner, ensuring scalability of the grounding module to underrepresented diseases or modalities.

Currently, we have opted for domain-specific models over universal models to enhance the accuracy of ROI generation. However, it is important to emphasize that this is not the central focus of our paper. Instead, the core contribution of our work lies in the development of an automated pipeline for generating multigranular visual and textual annotations from images. The central emphasis is on the scalability and versatility of the pipeline, which can seamlessly integrate either specialized or universal models as required.

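To make the zero-shot option concrete, the sketch below prompts SAM with a coarse box to obtain an ROI mask for an image from a new modality, using the public `segment_anything` package. The checkpoint path, placeholder image, and box coordinates are illustrative only, not part of our released pipeline.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM checkpoint (path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# `image` is an RGB uint8 array (H, W, 3), e.g. a slice from a new imaging modality.
image = np.zeros((512, 512, 3), dtype=np.uint8)  # placeholder image
predictor.set_image(image)

# A coarse box prompt around the suspected lesion region, given as (x1, y1, x2, y2).
box = np.array([120, 150, 300, 340])
masks, scores, _ = predictor.predict(box=box, multimask_output=True)

# Keep the highest-scoring mask as the zero-shot ROI.
roi_mask = masks[int(np.argmax(scores))]
```
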
### Weakness 2 & Question 1 (Data Quality)

> How can a high standard of automated labeling be ensured? ... has it been validated by human experts?

We ensure the quality of the generated data through our pipeline design and quantitative validation.

In our pipeline design, we decompose the generation of multigranular annotations into two steps (see the sketch below):

1. We collect trustworthy metadata as a query to retrieve information from a comprehensive medical knowledge base (e.g., from professional resources such as PubMed), obtaining high-quality expert knowledge.
2. We use ROIs extracted from expert domain-specific models as constraints, prompting the caption model to match the texture of lesion areas with the corresponding descriptions from the retrieved medical knowledge to generate high-quality annotations.

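A minimal sketch of this two-step flow is shown below. The retriever, grounding model, and captioner are stand-ins for the actual components (e.g., a PubMed retriever, an expert grounding model, and an MLLM such as GPT-4V), and all function and field names are illustrative assumptions rather than our released interface.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    roi_boxes: list   # coarse visual annotations (bounding boxes / masks)
    description: str  # multigranular textual annotation

def annotate(image, metadata, retriever, grounding_model, captioner) -> Annotation:
    """Two-step annotation sketch: knowledge retrieval, then ROI-constrained captioning."""
    # Step 1: use trustworthy metadata (modality, organ, disease, ...) as the query
    # to retrieve expert knowledge from a medical knowledge base such as PubMed.
    query = f"{metadata.get('modality', '')} {metadata.get('organ', '')} {metadata.get('disease', '')}"
    knowledge = retriever.search(query, top_k=5)

    # Step 2: ground ROIs with an expert model, then prompt the caption model to
    # describe lesion texture and region-wise correlations under those constraints.
    roi_boxes = grounding_model.ground(image, prompt=metadata.get("disease"))
    prompt = (
        f"Metadata: {metadata}\n"
        f"Retrieved knowledge: {knowledge}\n"
        f"Describe the ROIs {roi_boxes}, their texture, and their relation to other regions."
    )
    description = captioner.generate(image, prompt)
    return Annotation(roi_boxes=roi_boxes, description=description)
```
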
In quantitative validation, we conduct human expert evaluation and LLM evaluation. As shown in Table 2 of the original manuscript, our dataset achieves scores of 0.85 and 0.86 out of 1.00 in expert and LLM evaluations, respectively, with modality, organ detection, and ROI analysis nearing perfect scores. These quantitative metrics indicate the high quality of our dataset.

### Question 2 (Comparison with Expert-Labeled Datasets)

> Has a comparison with expert-labeled datasets been considered to further quantify the quality of automated labeling?

As shown in Table 2 of the original manuscript, our dataset achieves scores of 0.85 and 0.86 out of 1.00 in expert and LLM evaluations, respectively. This suggests that its quality is comparable to that of expert-labeled datasets. Furthermore, in Figure 1, we provide qualitative examples that directly compare our dataset with expert-labeled datasets. These examples demonstrate that our dataset covers more aspects and provides more comprehensive information than the expert-labeled datasets MIMIC-CXR and SLAKE, highlighting its increased level of detail. Additionally, Figure 8(d) compares the average word count of text descriptions between our dataset and several expert-labeled datasets. Our dataset exhibits a significantly higher word count, indicating greater richness.

### Weakness 3 & Question 3 (Potential Bias)

> How does the dataset address potential biases in source data?

We appreciate your thorough concerns regarding the potential biases in data distribution. Since our method aims to construct large-scale multimodal datasets by assembling existing public medical image datasets, our dataset inevitably inherits any potential biases present in the original public data. For public medical image datasets with biases in demographics and disease distribution, we plan to implement two strategies to address these biases. First, we will uniformly sample a high-quality subset from MedTrinity-25M to reduce existing biases. Second, we will utilize our automated pipeline to generate additional data for rare diseases and underrepresented demographics, aiming to achieve a more balanced and comprehensive dataset.

### Weakness 4 & Question 5 (Test Set Leakage)

> How to ensure that MedTrinity-25M does not contain data from test set datasets?

Thank you for bringing up this important concern about potential test set leakage. As shown in Table 5 of Appendix A, we verify that MedTrinity-25M includes only the training sets of VQA-RAD, PathVQA, and SLAKE, and does not contain any data from the validation or test sets of these datasets.

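Beyond this source-level audit, overlap can also be checked at the image level. The sketch below compares content hashes of training images against a benchmark's test images; the paths and helper names are hypothetical, and exact hashing only catches byte-identical copies (near-duplicates would require perceptual hashing).

```python
import hashlib
from pathlib import Path

def image_hash(path: Path) -> str:
    """Content hash of an image file; byte-identical duplicates share the same digest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def find_overlap(train_dir: str, test_dir: str) -> set:
    """Return hashes present in both the training corpus and a benchmark test split."""
    train_hashes = {image_hash(p) for p in Path(train_dir).rglob("*.png")}
    test_hashes = {image_hash(p) for p in Path(test_dir).rglob("*.png")}
    return train_hashes & test_hashes

# Hypothetical usage: an empty result means no byte-identical test image leaked into training.
# overlap = find_overlap("medtrinity_25m/images", "vqa_rad/test/images")
# assert not overlap, f"{len(overlap)} test images found in the training set"
```
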
## Reviewer d16k

### Weakness 1 (Data Quality)

> ... MedTrinity-25M are generated by the proposed automated pipeline, raising concerns about the accuracy of the generated text ... an expert-based evaluation ... resulted in an accuracy score of 85%.

We appreciate your concerns regarding the accuracy of the generated text, as reflected in the expert evaluation score of 85%. This score represents the average across five evaluation aspects: modality, organ detection, region of interest (ROI) analysis, lesion texture, and region-wise correlations. While near-perfect scores were achieved in modality, organ detection, and ROI analysis, the accuracy in lesion texture and region-wise correlations may be less satisfactory.

Upon examining cases with relatively lower scores, we found that inaccuracies primarily stemmed from the omission of certain terminologies. We hypothesize that this issue arises from gaps in our medical knowledge base. Since the knowledge base is constructed from public resources (e.g., PubMed), it inherently reflects biases: common diseases are often described in greater detail, whereas rare diseases in specific domains may have incomplete or coarse descriptions. These gaps can result in generated descriptions that omit crucial terminologies. To address this, we plan to expand our knowledge base with a more comprehensive corpus that covers a broader range of diseases, which we believe will enhance the accuracy of the generated descriptions.

Although the results for lesion texture and region-wise correlations are slightly less satisfactory, our dataset still demonstrated significant contributions to the performance of multimodal learning models. As shown in Table 3 of the original manuscript, it improved performance by 10.8%, 6.1%, and 8.3% on the VQA-RAD, SLAKE, and PathVQA datasets, respectively, despite the presence of some annotation noise.

### Weakness 3 (Result on Report Generation)

> ... overlooking its potential application to ... such as visual report generation

Following your suggestion, we conducted ablation studies to evaluate the effectiveness of multigranular alignment in report generation on the MIMIC-CXR dataset. Due to limited time, we fine-tuned the baseline models LLaVA-Tri and LLaVA-pp [1] on a small set of 10k samples from our proposed dataset. The results are summarized in the table below:

| Model | BLEU-1 | BLEU-4 | BERT Score |
|---|---|---|---|
| LLaVA-Tri (w/ our dataset) | 28.2 | 6.9 | 32.4 |
| LLaVA-Tri (w/o our dataset) | 22.2 | 1.0 | 20.1 |
| LLaVA-pp (w/ our dataset) | 19.3 | 0.8 | 23.6 |
| LLaVA-pp (w/o our dataset) | 16.8 | 0.8 | 19.5 |

As shown in the table above, fine-tuning with a small set of our dataset significantly improves performance across the tested models. For LLaVA-Tri, BLEU-1 increased from 22.2 to 28.2, BLEU-4 from 1.0 to 6.9, and BERT Score from 20.1 to 32.4. For LLaVA-pp, BLEU-1 rose from 16.8 to 19.3 and BERT Score from 19.5 to 23.6. These results demonstrate the potential of multigranular alignment in enhancing report generation.

[1] Rasheed, H., et al. "LLaVA++: Extending visual capabilities with Llama-3 and Phi-3." (2024).

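For reference, the sketch below shows how BLEU and BERTScore can be computed for a generated report with the `nltk` and `bert-score` packages; it illustrates the metrics rather than reproducing our exact evaluation script, and the report strings are toy examples.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bertscore

reference = "The lungs are well inflated and clear. There is no pleural effusion or pneumothorax."
candidate = "The lungs are well expanded and clear. No pleural effusion or pneumothorax is present."

ref_tokens = [reference.lower().split()]
cand_tokens = candidate.lower().split()
smooth = SmoothingFunction().method1

# BLEU-1 uses unigram precision only; BLEU-4 averages 1- to 4-gram precisions.
bleu1 = sentence_bleu(ref_tokens, cand_tokens, weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu4 = sentence_bleu(ref_tokens, cand_tokens, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)

# BERTScore compares contextual embeddings of candidate and reference reports.
_, _, f1 = bertscore([candidate], [reference], lang="en")

print(f"BLEU-1: {bleu1:.3f}, BLEU-4: {bleu4:.3f}, BERTScore F1: {f1.mean().item():.3f}")
```
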
### Question 1 (Metadata)

> ...what approach is taken if the metadata lacks organ or disease labels?

To ensure that necessary information is available, we source metadata from various and easily accessible resources, including publications and websites related to the dataset. This metadata serves as the basis for retrieving detailed knowledge. In rare cases where the disease type is missing in the metadata, we utilize the available information to retrieve relevant knowledge. For example, if the metadata does not specify a disease like a brain tumor but indicates that the image is of the brain, we use this information to retrieve knowledge from our knowledge base. The retrieved knowledge will still provide sufficient detail about various brain diseases, not limited to brain tumors.

### Question 2 (Choice of GPT-4V)

> Why was GPT-4V specifically chosen?

The choice of MLLMs is flexible and not the main focus of our paper. Our primary contribution lies in developing an automated pipeline capable of scaling up multimodal data by generating multigranular annotations from unpaired image inputs. In our pipeline, MLLMs are tasked with producing multigranular annotations based on complex instructions. Therefore, we selected GPT-4V for this purpose due to its strong ability to follow detailed instructions and generate comprehensive outputs [2]. However, it is important to note that while GPT-4V was chosen for its superior performance in instruction adherence, the pipeline is designed to be flexible and can readily incorporate other MLLMs as needed.

[2] Peng, Baolin, et al. "Instruction tuning with GPT-4." arXiv preprint arXiv:2304.03277 (2023).

### Question 3 (ROI Questions)

> For images without bounding boxes or masks from the original data source, how does the pipeline generate the ROI using expert models? Is ROI generation based on organ or disease labels, and how accurate are the generated ROIs? Additionally, if an image contains multiple ROI regions, how is this managed?

For ROI generation using expert models, we utilize four different models, as illustrated in Table 6 of our original paper. For expert models that support text input, such as DINO and SAT, we use disease names as text prompts to ground lesion areas. The expert model HoverNet automatically grounds lesion areas in histopathology images, such as neoplastic or inflammatory regions. Similarly, HybridGNet in CheXmask automatically grounds the lungs and heart in chest radiography images.

To assess the accuracy of the generated ROIs, we conducted evaluations using human experts and LLMs, as shown in Table 2 of the original paper. Both evaluations achieved scores of 0.9 out of 1.0, indicating relatively high accuracy. Furthermore, for images containing multiple ROIs that encompass all disease-associated areas, we provide detailed analyses of each ROI, including texture and patterns. This local analysis contributes to the overall diagnosis.

## Reviewer DJFs

### Weakness 1 & Question 1 (Handling of Visual Data Without ROIs)

> For medical images that lack explicit ROI annotations, how does your pipeline construct the corresponding multimodal data pairs?

For medical images without explicit ROI annotations, our pipeline leverages expert models to generate ROIs. Details about the expert models used for ROI generation can be found in Table 6 in Appendix D of the original manuscript.

### Weakness 2 & Question 2 (Assessing the Impact of Multigranular Information on Med-MLLMs)

> Considering that ROIs and multigranular annotations are central to your dataset, have you conducted experiments to evaluate how these features specifically affect the performance of medical MLLMs?

Thank you for your insightful question. We have conducted detailed experiments to assess the impact of incorporating multigranular information on the performance of medical MLLMs. The results, presented in Table 4 and discussed in Section 4.2 of the original manuscript, demonstrate the advantages of integrating multigranular information into the training process. To provide a detailed illustration of how incorporating metadata, ROI, and RAG modules affects generation quality, we have included comprehensive examples in Figures 3, 4, and 5 of the original paper. These examples show that incorporating these features can significantly improve the quality of the generated annotations.

### Weakness 3 & Question 3 (Comparison with Existing Multigranular Benchmarks)

> Could you elaborate on how MedTrinity-25M compares with existing benchmarks that involve multigranular information?

Thank you for highlighting the need for a more thorough discussion. We appreciate your suggestion to include works such as GMAI-MMBench [1] and Asclepius [2]. These are indeed valuable contributions to the field, and we will incorporate a discussion of these works along with proper citations in the updated version of our manuscript. In addition, while GMAI-MMBench and Asclepius provide multi-granularity annotations, they do not explicitly offer detailed descriptions of lesion characteristics, nor is there explicit evidence that they capture or annotate relationships or correlations between different regions. We provide a comparison with these benchmarks in the table below:

| Dataset | Modality | Lesion Type | Lesion BBox/Mask | Lesion Description | Region-wise Correlations |
|:---:|:---:|:---:|:---:|:---:|:---:|
| GMAI-MMBench [1] | yes | yes | yes | no | no |
| Asclepius [2] | yes | yes | yes | no | no |
| MedTrinity-25M (Ours) | yes | yes | yes | yes | yes |

[1] Chen, P., Ye, J., Wang, G., Li, Y., Deng, Z., Li, W., ... & Qiao, Y. (2024). GMAI-MMBench: A comprehensive multimodal evaluation benchmark towards general medical AI. arXiv preprint arXiv:2408.03361.

[2] Wang, W., Su, Y., Huan, J., Liu, J., Chen, W., Zhang, Y., ... & Lyu, M. R. (2024). Asclepius: A spectrum evaluation benchmark for medical multi-modal large language models. arXiv preprint arXiv:2402.11217.

### Question 4 & Weakness 4 (Evaluation on Multigranular-Specific Tasks)

> do you have plans to test your dataset on tasks specifically designed for multigranular information? How can you demonstrate the effectiveness of your dataset's detailed annotations in improving model performance on such tasks?

To the best of our knowledge, there is no existing benchmark specifically designed to evaluate multigranular information generation. We have provided both qualitative and quantitative results on the report generation task using the MIMIC-CXR dataset, which is challenging and requires the ability to generate multigranular answers for chest radiology. The results are summarized in the table below:

| Model | BLEU-1 | BLEU-4 | BERT Score |
|---|---|---|---|
| LLaVA-Tri (with our dataset) | 28.2 | 6.9 | 32.4 |
| LLaVA-Tri (without our dataset) | 22.2 | 1.0 | 20.1 |
| LLaVA-pp (with our dataset) | 19.3 | 0.8 | 23.6 |
| LLaVA-pp (without our dataset) | 16.8 | 0.8 | 19.5 |

Here is a qualitative comparison of a sample with study ID 67 from the MIMIC-CXR dataset:

- Result (trained on multigranular annotations): The lungs are well expanded and clear. The cardiomediastinal silhouette, hilar contours, and pleural surfaces are normal. No pleural effusion or pneumothorax is present.
- Result (not trained on multigranular annotations): The cardiomediastinal silhouette is normal. There is no pleural effusion or pneumothorax. There is no focal lung consolidation. There is no acute osseous abnormality.
- Ground Truth: The lungs are well inflated and clear. The cardiomediastinal silhouette, hilar contours, and pleural surfaces are normal. There is no pleural effusion or pneumothorax.

The quantitative results show that fine-tuning with a small set of our dataset significantly improves generation performance across the tested models. The qualitative comparison further illustrates that training on multigranular annotations helps the model generate more fine-grained and professional text, aligning more closely with human reports.

### Question 5 & Weakness 5 (Addressing Unique Medical Image Presentations in RAG)

> How does your RAG approach handle the specificity and variability of individual medical image descriptions?

Our RAG approach can handle the diversity and specificity inherent in individual medical cases by employing a comprehensive knowledge base drawn from extensive medical literature, including 23.9 million biomedical articles from PubMed, covering various diseases and their known manifestations, such as those related to atelectasis. For diseases with diverse descriptions available in these articles, our knowledge base is designed to preserve and incorporate all relevant descriptions in their entirety. For instance, we are able to retrieve articles such as "Atelectasis: mechanisms, diagnosis and management" [3], which detail that atelectasis may occur in three ways: (i) airway obstruction; (ii) compression of parenchyma by extrathoracic, intrathoracic, or chest wall processes; and (iii) increased surface tension in alveoli and bronchioles. Subsequently, we utilize the LLM to align the retrieved texts describing the three manifestations of atelectasis with the visual features extracted from the medical images. Through this image-text alignment, the LLM generates annotations that accurately reflect each medical case's unique presentation, effectively capturing the specific disease manifestations.

[3] Peroni, D. G., and A. L. Boner. "Atelectasis: mechanisms, diagnosis and management." Paediatric Respiratory Reviews 1.3 (2000): 274-278.

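As an illustration of the retrieval step, the sketch below indexes a handful of knowledge-base passages with TF-IDF and returns those most relevant to a case-specific query. The passages and query are toy examples; the actual system operates over the much larger PubMed-derived corpus with its own retriever.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy knowledge-base passages; the real corpus is built from PubMed articles.
passages = [
    "Atelectasis may occur through airway obstruction.",
    "Atelectasis may result from compression of parenchyma by extrathoracic, "
    "intrathoracic, or chest wall processes.",
    "Atelectasis may be caused by increased surface tension in alveoli and bronchioles.",
    "Pneumothorax is the presence of air in the pleural space.",
]

vectorizer = TfidfVectorizer()
passage_vectors = vectorizer.fit_transform(passages)

def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Return the top-k passages most similar to the case-specific query."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, passage_vectors)[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [passages[i] for i in ranked]

# The retrieved passages are then given to the MLLM together with the image and ROIs,
# so the generated annotation can reflect the specific manifestation seen in the case.
print(retrieve("chest CT with atelectasis near the right lower lobe"))
```
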
"Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases." *IEEE CVPR*. Vol. 7. sn, 2017. 5. Kavur, A. Emre, et al. "CHAOS challenge-combined (CT-MR) healthy abdominal organ segmentation." *Medical Image Analysis* 69 (2021): 101950. 6. Ye, Jin, et al. "Sa-med2d-20m dataset: Segment anything in 2d medical imaging with 20 million masks." arXiv preprint arXiv:2311.11969 (2023). ### r1 **For Q3:** >Have you considered using statistical methods to analyze potential biases in the data distribution, rather than addressing this in a future plan? As demonstrated in Figures 8(a) and 8(b) of the original paper, we provided the distribution of modalities and biological structures. Since demographic distributions are not fully provided in the public datasets we assembled, it is challenging for us to investigate potential biases arising from this factor. Here, we further provide detailed percentages of each modality, biological structure, and disease in the following tables: **Distribution of modalities:** | Modality | MR | Histopathology | CT | Microscopy | X-Ray | Endoscopy | PET | Dermoscopy | Ultrasound | |----------------|-----------|----------------|-----------|------------|-----------|------------|-----------|------------|------------| | Percentage | 48.3160% | 23.6195% | 19.4895% | 6.6528% | 1.4649% | 0.2440% | 0.1709% | 0.0395% | 0.0027% | **Distribution of biological structures:** | Biological Structure | coccyx | scrotum | fallopian | sacrum | gonad | seminal | testes | |-----------------------|----------|----------|-----------|----------|----------|----------|----------| | Percentage | 0.0003% | 0.0012% | 0.0032% | 0.0034% | 0.0065% | 0.0076% | 0.0076% | | Biological Structure | ovaries | rectum | uterus | vagina | prostate | pelvis | bladder | |-----------------------|----------|----------|----------|----------|----------|----------|----------| | Percentage | 0.0084% | 0.0088% | 0.0142% | 0.0189% | 0.0634% | 0.1553% | 0.4770% | | Biological Structure | vas | cecum | jejunum | ileum | appendix | urethra | adrenals | |-----------------------|----------|----------|----------|----------|----------|----------|----------| | Percentage | 3.8936% | 0.0034% | 0.0057% | 0.0078% | 0.0095% | 0.0137% | 0.0147% | | Biological Structure | duodenum | mesentery | peritoneum | ureter | kidney | stomach | kidneys | |-----------------------|----------|-----------|------------|----------|----------|----------|----------| | Percentage | 0.0161% | 0.0174% | 0.0174% | 0.0213% | 0.0708% | 0.0713% | 0.2335% | | Biological Structure | colon | gastral | gallbladder | spleen | renal | liver | aorta | |-----------------------|----------|----------|-------------|----------|----------|----------|----------| | Percentage | 0.2561% | 0.3835% | 0.4045% | 0.4050% | 2.3602% | 2.8014% | 3.8473% | | Biological Structure | pancreas | fibula | humerus | tendon | femur | connective | cartilage | |-----------------------|----------|----------|----------|----------|----------|------------|-----------| | Percentage | 0.1821% | 0.0056% | 0.0109% | 0.0138% | 0.0215% | 0.0342% | 0.0362% | | Biological Structure | ligament | joint | epithelium | muscle | skin | arteries | tissue | |-----------------------|----------|----------|------------|----------|----------|----------|----------| | Percentage | 0.0364% | 0.0662% | 0.1228% | 0.1738% | 0.3031% | 0.7442% | 1.0688% | | Biological Structure | organ | rib | cell | bone | lymph | clavicle | thymus | 
|-----------------------|----------|----------|----------|----------|----------|----------|----------| | Percentage | 2.1981% | 2.6240% | 2.9955% | 3.8757% | 4.1436% | 0.0131% | 0.0276% | | Biological Structure | sternum | diaphragm | mediastinum | lungs | pleura | breast | heart | |-----------------------|----------|-----------|-------------|----------|----------|----------|----------| | Percentage | 0.0859% | 0.0891% | 2.6558% | 2.8797% | 3.1388% | 3.5896% | 3.7190% | | Biological Structure | bronchi | temple | throat | cerebrum | jaw | larynx | tonsil | |-----------------------|----------|----------|----------|----------|----------|----------|----------| | Percentage | 3.8148% | 0.0005% | 0.0007% | 0.0037% | 0.0050% | 0.0050% | 0.0058% | | Biological Structure | mouth | scalp | teeth | pituitary | tongue | mandible | saliva | |-----------------------|----------|----------|----------|-----------|----------|----------|----------| | Percentage | 0.0080% | 0.0081% | 0.0105% | 0.0109% | 0.0111% | 0.0114% | 0.0144% | | Biological Structure | pharynx | skull | maxilla | cervix | cerebellum | nasal | nose | |-----------------------|----------|----------|----------|----------|------------|----------|----------| | Percentage | 0.0151% | 0.0162% | 0.0193% | 0.0235% | 0.0237% | 0.0283% | 0.0519% | | Biological Structure | sinus | spine | nerve | eye | thyroid | face | vein | |-----------------------|----------|----------|----------|----------|----------|----------|----------| | Percentage | 0.0527% | 0.0738% | 0.1459% | 0.2174% | 0.2652% | 0.3096% | 0.3957% | | Biological Structure | artery | gland | ear | esophagus | trachea | brain | |-----------------------|----------|----------|----------|-----------|----------|----------| | Percentage | 1.0275% | 4.7914% | 5.0004% | 6.0310% | 7.6411% | 19.4752% | ### r2 **For Q3:(continue)** Distribution of diseases: | Disease | bone fracture | calc,mass | lower-grade glioma | pneumonia | pneumothorax | covid-19 | |---------------------------------|---------------|-----------|--------------------|-----------|--------------|----------| | Percentage | 0.0419% | 0.0627% | 0.0893% | 0.2590% | 0.2693% | 1.0161% | | Disease | Pediatric Bacterial Pneumonia | Pediatric Viral Pneumonia | Atelectasis | Cardiomegaly Consolidation | Edema | Emphysema | |---------------------------------|-------------------------------|---------------------------|--------------|---------------------------|-----------|------------| | Percentage | 0.8340% | 0.8340% | 1.0582% | 1.0582% | 1.0582% | 1.0582% | | Disease | Fibrosis | Hernia | Infiltration | Opacity | breast cancer | brain tumor | |---------------------------------|-----------|-----------|---------------|-----------|----------------|-------------| | Percentage | 1.0582% | 1.0582% | 1.0582% | 1.0582% | 15.6893% | 25.8410% | | Disease | astrocytoma | carcinoma | ependymoma | ganglioglioma | germinoma | glioblastoma | |---------------------------------|-------------|-----------|------------|---------------|-----------|--------------| | Percentage | 0.0400% | 0.0400% | 0.0400% | 0.0400% | 0.0400% | 0.0400% | | Disease | granuloma | medulloblastoma | meningioma | neurocytoma | oligodendroglioma | papilloma | |---------------------------------|-----------|-----------------|-----------|-------------|-------------------|-----------| | Percentage | 0.0400% | 0.0400% | 0.0400% | 0.0400% | 0.0400% | 0.0400% | | Disease | schwannoma | tuberculoma | Glioma Meningioma Pituitary | hepatocellular carcinoma | intrahepatic cholangiocarcinoma (ICC) 
| liver metastases (HM) | |---------------------------------|------------|-------------|----------------------------|-------------------------|---------------------------------------|-----------------------| | Percentage | 0.0400% | 0.0400% | 0.1879% | 0.3126% | 0.3126% | 0.3126% | | Disease | hepatic cysts (HC) | hepatic hemangioma | focal nodular hyperplasia | hepatic abscess | Leukemia | lung cancer | |---------------------------------|---------------------|--------------------|---------------------------|----------------|-----------|-------------| | Percentage | 0.3126% | 0.3126% | 0.3126% | 0.3126% | 0.1528% | 0.1056% | | Disease | cancer | Lung Cancer | Breast Cancer | Canine Lymphoma | Canine Cutaneous Mast Cell Tumor | melanoma | |---------------------------------|-----------|-------------|---------------|-----------------|--------------------------------|----------| | Percentage | 16.2363% | 0.2075% | 0.2075% | 0.2075% | 0.2075% | 0.2075% | | Disease | colorectal cancer | colon adenocarcinomas | tubular adenocarcinoma | lung adenocarcinomas | lung squamous cell carcinomas | prostate cancer | |---------------------------------|-------------------|-----------------------|-----------------------|-----------------------|---------------------------|----------------| | Percentage | 0.3154% | 0.0505% | 2.6454% | 0.1010% | 0.1010% | 0.1897% | | Disease | carcinogenic DNA damages | Cutaneous Spindle Cell neoplasms | Diabetic Retinopathy | diabetic retinopathy, cataract, glaucoma | Diabetic Retinopathy, Cataract and Glaucoma | Age related Macular Degeneration | |---------------------------------|--------------------------|---------------------------------|---------------------|----------------------------------------|---------------------------------------------|--------------------------------| | Percentage | 0.0081% | 0.5364% | 0.1880% | 0.0317% | 0.5848% | 0.0717% | | Disease | Hypertension | Pathological Myopia | polyps | esophagitis | ulcerative-colitis | melanoma | |---------------------------------|-------------|-------------------|-----------|-------------|--------------------|----------| | Percentage | 0.0717% | 0.0717% | 0.0101% | 0.0101% | 0.0101% | 0.0298% | | Disease | nevus,atypical,melanoma | Monkeypox | Actinic keratoses | intraepithelial carcinoma / Bowen's disease | basal cell carcinoma | benign keratosis-like lesions(solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl) | |---------------------------------|------------------------|----------|------------------|-------------------------------------------|---------------------|--------------------------------------------------------------------------------------------------------| | Percentage | 0.1265% | 0.0273% | 0.0273% | 0.0273% | 0.0273% | 0.0273% | | Disease | dermatofibroma | melanoma | angiomas | angiokeratomas | pyogenic granulomas | hemorrhage | |---------------------------------|----------------|----------|-----------|----------------|---------------------|------------| | Percentage | 0.0273% | 0.0273% | 0.0273% | 0.0273% | 0.0273% | 0.0273% | | Disease | melanocytic nevi | vascular lesion | Brain Hemorrhage | cervical cancer | kidney tumor | kidney stone | |---------------------------------|------------------|----------------|----------------|----------------|-------------|--------------| | Percentage | 0.0273% | 0.0273% | 18.4954% | 0.1181% | 0.4754% | 0.0131% | | Disease | liver tumor | lymph node | |---------------------------------|-------------|------------| | Percentage | 1.3116% | 
0.1738% | ### r3 **For Q3:(continue)** Furthermore, following your suggestion, we provide a statistical analysis of the data distribution in MedTrinity-25M. We selecte the following statistics to analyze the data distribution: 1. Gini Coefficient, which measures the inequality among values and is formulated as follows: $$ G=\frac{\sum_i \sum_j\left|x_i-x_j\right|}{2 n^2 \bar{x}} $$ 2. KL Divergence between the data distribution of our dataset and a uniform distribution. $$ D_{K L}(x \| p)=\sum_i x_i \log \left(\frac{x_i}{p_i}\right), \text{where } p_i=1 / n $$ 3. Normalized Entropy, which is formulated as:: $$ H_{\text {norm }}=\frac{H(x)}{\log (n)}, \text{where } H(x)=-\sum_i x_i \log \left(x_i\right) $$ The results for the distributions of modalities, biological structures, and diseases are shown in the following table: | Metric | Modalities | Biological Structures | Diseases | |---------------------|-----------|-----------|-----------| | Gini Coefficient | 0.6868 | 0.8145 | 0.8691 | | KL Divergence | 0.9151 | 1.4488 | 2.0176 | | Normalized Entropy | 0.5835 | 0.6833 | 0.5482 | From these results, we can observe certain biases within our dataset. We plan to address these biases by (1) uniformly sampling a high-quality subset from MedTrinity-25M, and (2) using our generation pipeline to generate additional data for rare diseases or modalities. ### r4 For Q5: > Table 5 only demonstrates that MEDTRINITY-25M includes the training sets of VQA-RAD, PathVQA, and SLAKE, but it does not confirm the absence of these datasets in the test set. Given that the dataset spans over 20 sources and LLaVA-Tri achieves 99% accuracy on PathVQA, have you considered conducting statistical experiments to verify this point? In our experiments, we have thoroughly reviewed the data sources for VQA-RAD, PathVQA, and SLAKE, as detailed below: | Dataset | Source | | :------- | :------------------------------- | | VQA-RAD | MedPix [1] | | PathVQA | PEIR Digital Library [2] | | SLAKE | MSD [3], ChestX-ray8 [4], CHAOS [5] | We ensured that the data sources for VQA-RAD and PathVQA do not overlap with those in MedTrinity-25M, as we have not included MedPix or the PEIR Digital Library as data sources. Furthermore, we confirmed that our other data sources do not include data from VQA-RAD or PathVQA. However, SLAKE includes two overlapping sources: MSD and CHAOS. As noted in Table 5 of Appendix A in our original paper, the SAMMed-20M dataset [6] is included as a data source. This dataset integrates multiple sources, including MSD and CHAOS. To address this issue, in our experiments, we trained our model on a subset of MedTrinity-25M that excludes data from MSD and CHAOS. This ensures that the test sets from these datasets were not present in the training data. We appreciate your pointing this out, and we will clarify this adjustment in the revised version of the paper. References: 1. MedPix: [https://medpix.nlm.nih.gov/](https://medpix.nlm.nih.gov/) 2. Jones, Kristopher N., et al. "PEIR digital library: Online resources and authoring system." Proceedings of the AMIA Symposium. American Medical Informatics Association, 2001. 3. Antonelli, Michela, et al. "The medical segmentation decathlon." *Nature communications* 13.1 (2022): 4128. 4. Wang, Xiaosong, et al. "Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases." *IEEE CVPR*. Vol. 7. sn, 2017. 5. Kavur, A. Emre, et al. "CHAOS challenge-combined (CT-MR) healthy abdominal organ segmentation." 
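For completeness, the sketch below computes these three statistics from a vector of category proportions with plain NumPy; the input distribution is a toy example.

```python
import numpy as np

def gini(p: np.ndarray) -> float:
    """Gini coefficient: mean absolute pairwise difference normalized by 2 * n^2 * mean."""
    diffs = np.abs(p[:, None] - p[None, :]).sum()
    return float(diffs / (2 * len(p) ** 2 * p.mean()))

def kl_to_uniform(p: np.ndarray) -> float:
    """KL divergence between the observed distribution and a uniform distribution."""
    u = 1.0 / len(p)
    q = p[p > 0]  # zero-probability categories contribute nothing
    return float(np.sum(q * np.log(q / u)))

def normalized_entropy(p: np.ndarray) -> float:
    """Entropy divided by log(n); 1.0 corresponds to a perfectly uniform distribution."""
    q = p[p > 0]
    return float(-(q * np.log(q)).sum() / np.log(len(p)))

# Toy example: a 4-category distribution (shares must sum to 1).
shares = np.array([0.48, 0.24, 0.19, 0.09])
print(gini(shares), kl_to_uniform(shares), normalized_entropy(shares))
```
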
### r4

**For Q5:**

> Table 5 only demonstrates that MEDTRINITY-25M includes the training sets of VQA-RAD, PathVQA, and SLAKE, but it does not confirm the absence of these datasets in the test set. Given that the dataset spans over 20 sources and LLaVA-Tri achieves 99% accuracy on PathVQA, have you considered conducting statistical experiments to verify this point?

In our experiments, we have thoroughly reviewed the data sources for VQA-RAD, PathVQA, and SLAKE, as detailed below:

| Dataset | Source |
|:---|:---|
| VQA-RAD | MedPix [1] |
| PathVQA | PEIR Digital Library [2] |
| SLAKE | MSD [3], ChestX-ray8 [4], CHAOS [5] |

We ensured that the data sources for VQA-RAD and PathVQA do not overlap with those in MedTrinity-25M, as we have not included MedPix or the PEIR Digital Library as data sources. Furthermore, we confirmed that our other data sources do not include data from VQA-RAD or PathVQA. However, SLAKE includes two overlapping sources: MSD and CHAOS. As noted in Table 5 of Appendix A in our original paper, the SAMMed-20M dataset [6] is included as a data source. This dataset integrates multiple sources, including MSD and CHAOS. To address this issue, in our experiments we trained our model on a subset of MedTrinity-25M that excludes data from MSD and CHAOS. This ensures that the test sets from these datasets were not present in the training data. We appreciate your pointing this out, and we will clarify this adjustment in the revised version of the paper.

References:

1. MedPix: [https://medpix.nlm.nih.gov/](https://medpix.nlm.nih.gov/)
2. Jones, Kristopher N., et al. "PEIR digital library: Online resources and authoring system." Proceedings of the AMIA Symposium. American Medical Informatics Association, 2001.
3. Antonelli, Michela, et al. "The medical segmentation decathlon." *Nature Communications* 13.1 (2022): 4128.
4. Wang, Xiaosong, et al. "Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases." *IEEE CVPR*, 2017.
5. Kavur, A. Emre, et al. "CHAOS challenge-combined (CT-MR) healthy abdominal organ segmentation." *Medical Image Analysis* 69 (2021): 101950.
6. Ye, Jin, et al. "Sa-med2d-20m dataset: Segment anything in 2d medical imaging with 20 million masks." arXiv preprint arXiv:2311.11969 (2023).
