TMUNLPG3 at the NTCIR-18 RadNLP Task

#### NTCIR-18 @ Tokyo
***
## TMUNLPG3 at the NTCIR-18 RadNLP Task
:::info
**Wen-Chao Yeh (speaker), Yan-Chun Hsing, Tzu-Yi Li, Nitisalapa Timsatid, Shih-Chuan Chang, Shih-Hsin Hsiao, Chu-Chun Wang, Pak-Yue Chan, Wen-Lian Hsu and Yung-Chun Chang**
:::
2025-06-13
https://hackmd.io/@wyeh/ntcir18-radnlp

---

## Agenda
***
### Who we are
### What do we need to do and achieve
### How do we accomplish this
### Conclusion
### Q&A

---

## Who are TMUNLPG3?

---

### We are ![logo](https://hackmd.io/_uploads/SJsgMTAzxl.png =156x) and domain experts
***
||||||
|:-:|:-:|:-:|:-:|:-:|
|Wen-Chao Yeh ![image](https://hackmd.io/_uploads/BkK1C9CGxl.png =64x) NTHU PhD Student|Yan-Chun Hsing ![ych](https://hackmd.io/_uploads/Sk9bNjAMel.png =64x) TMU Research Assistant|Tzu-Yi Li ![TYL](https://hackmd.io/_uploads/B13frjRfle.png =64x) TMU Health Care A. BS|Nitisalapa Timsatid ![enrich](https://hackmd.io/_uploads/HJ3_ZhAGlg.png =64x) TMU Data Science MS|Shih-Chuan Chang ![SCC](https://hackmd.io/_uploads/ByNyGh0zxe.png =64x) TMU Data Science MS|
|Shih-Hsin Hsiao ![image](https://hackmd.io/_uploads/rkPPysCGgg.png =64x) TMUH Doctor|Chu-Chun Wang ![Untitled](https://hackmd.io/_uploads/rylIQhAMle.png =80x) TMUH Senior RA|Pak-Yue Chan ![Untitled](https://hackmd.io/_uploads/rylIQhAMle.png =80x) TMU Medicine BS|Wen-Lian Hsu ![Prof. Wen-Lian Hsu](https://hackmd.io/_uploads/ryBNmsCzxe.png =64x) NTHU Professor|Yung-Chun Chang ![image](https://hackmd.io/_uploads/SJ1wmi0Gee.png =64x) TMU Professor|
||||||

---

## What do we need to do and achieve

---

## RadNLP main task
***
A multi-label document classification task: correctly determine the [T, N, and M](https://www.haigan.gr.jp/publication/guideline/examination/2022/1/0/220100000000.html) categories for each radiology report.

**T** - the size and/or extension of the primary lesion:
==T0, Tis, T1mi, T1a, T1b, T1c, T2a, T2b, T3, T4==

**N** - the extent of lymph node metastasis:
==N0, N1, N2, N3==

**M** - the extent of distant metastasis:
==M0, M1a, M1b, M1c==

---

## RadNLP sub task
***
A sentence-level document segmentation task: identify up to eight spans related to the following topics

- Omittable
- Measure
- Extension
- Atelectasis
- Satellite
- Lymphadenopathy
- Pleural
- Distant

---

### What we've achieved
***
:::success
RadNLP aims to automatically determine the TNM stage of lung cancer from radiology reports.
:::

||English Track|Japanese Track|
|:-:|:-:|:-:|
|**Main Task** TNM Staging|:first_place_medal::second_place_medal:|:seven:|
|**Sub Task** Multi-label Sentence Classification|:second_place_medal: |:four:|

---

## How do we accomplish this?

---

### Three Steps to Tackle Challenges
***
==**Dataset Analysis**==
- Distribution of the training set and validation set
- Opinions from a medical doctor, a radiologist and a pathologist

==**Methods**==
- Last year's approach?
- LLMs and pre-trained models (e.g., BERT)

==**Optimization**==
- Analyze errors
- Fine-tune prompts

---

### Opinions from medical experts, case 1
***
![image 4](https://hackmd.io/_uploads/S1CwnjwQeg.png)

---

### Opinions from medical experts, case 2
***
![image 2](https://hackmd.io/_uploads/Bkn5niw7ll.png)

---

### Opinions from medical experts, case 3
***
![image 3](https://hackmd.io/_uploads/HJxVo3ovQeg.png)

---

### From opinions to guidelines
***
The medical doctor, radiologist and pathologist
- Annotate the training dataset according to their expertise
- Compare their annotations with the released labels
- Provide an analysis of the discrepancies between the two

==Prompt Writing Guidelines==

---

### Dataset quantities
***
![Screenshot 2025-06-12 9.58.02 AM](https://hackmd.io/_uploads/S11ILjwmeg.png)

---

### TNM Classification Distribution
***
![image](https://hackmd.io/_uploads/rkrFeARfeg.png)

Note:

T: T2b and T4 are the most common labels (Train: 20, 31; Val: 9, 18), whereas T1mi and T1a are underrepresented.

N: N0 (Train: 41, Val: 26) and N2 (Train: 45, Val: 20) dominate, while N3 cases are sparse.

M: M0 cases (Train: 74, Val: 27) far outnumber the M1 subcategories, particularly M1a (Train: 0, Val: 9) and M1b (Train: 14, Val: 0).

---

### SubTask label Distribution
***
Train(918)+Val(415) = 1,333
![Screenshot 2025-06-05 4.28.27 PM](https://hackmd.io/_uploads/r1sHBRRfge.png =512x)

---

### Methods
***
Total: 5 systems
Three use LLMs; two use BERT-related models.

||English Track|
|:-:|:-:|
|**Main Task** TNM Staging|:first_place_medal: System I :second_place_medal: System II|
|**Sub Task** Multi-label Classification|:second_place_medal: System II|

---

### System I: Architecture
***
Uses an LLM with few-shot examples (report + label + reasoning)
![image](https://hackmd.io/_uploads/r170tR0zeg.png)
Image Credit: https://lena-voita.github.io/nlp_course/language_modeling.html

---

### System I: Reasoning
***
We used ChatGPT-4o to analyze the [rationale](https://github.com/nlptmu/NTCIR-18-RadNLP/blob/main/train_GPT4o_inference_en.csv) behind the TNM annotations of each radiology report in the training dataset.

>[!Tip] Reasoning from GPT-4o, Example: 1863157
>T: T1b
>N: N0
>M: M0
>Report: A nodule measuring 12 mm in diameter in the right lower lobe S8/9, increasing in size, suspected to be a known lung cancer. Reticular opacities present in the right S6, likely inflammatory changes. Calcified pleural plaques on both sides, suggesting the possibility of asbestos-related disease. No significant enlargement of the mediastinal and hilar lymph nodes or other mediastinal lesions. No pleural effusion. Gallstones.
>Reasoning: The nodule is 12mm, which fits T1b (tumor >10mm but ≤20mm). No lymph node enlargement suggests N0 (no regional lymph node metastasis). Absence of distant metastasis or pleural effusion indicates M0 (no distant metastasis). These factors collectively account for the TNM classification T1b/N0/M0.

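
The size rule GPT-4o cites in this rationale (a 12 mm nodule falls in T1b) can be written down as a small lookup. The sketch below is illustrative only: `t_category_by_size` and `SIZE_BOUNDS_MM` are hypothetical names, the cut-offs are the IASLC 8th edition size thresholds, and the function deliberately ignores invasion, atelectasis, satellite nodules and every other descriptor that can change the T category. It is not the classifier used by System I.

```python
# Inclusive upper size bounds in mm for the size-based T categories
# (IASLC 8th edition cut-offs). Size alone never yields T0, Tis or T1mi.
SIZE_BOUNDS_MM = [
    (10.0, "T1a"),
    (20.0, "T1b"),
    (30.0, "T1c"),
    (40.0, "T2a"),
    (50.0, "T2b"),
    (70.0, "T3"),
]


def t_category_by_size(diameter_mm: float) -> str:
    """Return the T category implied by tumor diameter alone (hypothetical helper)."""
    if diameter_mm <= 0:
        raise ValueError("diameter_mm must be positive")
    for upper_bound, label in SIZE_BOUNDS_MM:
        if diameter_mm <= upper_bound:
            return label
    # Anything larger than 70 mm is T4 by size.
    return "T4"


# The 12 mm nodule from example 1863157.
print(t_category_by_size(12))  # prints "T1b"
```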

---

### System I: Prompt design
***
Combines the guidelines with 7 randomly selected examples from the training set (each including the report, reasoning and answer) so that the model generates the TNM labels by following the examples.
![image](https://hackmd.io/_uploads/SJgE0xy1mgx.png)

---

### System I: Prediction
***
- a report :arrow_right: three rounds of predictions
- Hard vote on the T-stage from the above 3 predictions
- Hard vote on the N-stage from the above 3 predictions
- Hard vote on the M-stage from the above 3 predictions
- Determine the conclusive result.

---

### System II: Architecture
***
```mermaid
graph TB
    A[Input: Radiology Report] --> B[DSPy Framework]
    subgraph B[DSPy Framework]
        C[DetermineCOTandConclude Module]
        subgraph C[DetermineCOTandConclude Module]
            D[Multiple Opinion Generation]
            E[Opinion Synthesis]
            subgraph D[Multiple Opinion Generation]
                F[GPT-4 Analysis]
                G[Gemini-2 Analysis]
                F --> I[Opinion 1,3,5]
                G --> J[Opinion 2,4]
            end
            subgraph E[Opinion Synthesis]
                L[DetermineStagingConclude]
                subgraph L[DetermineStagingConclude]
                    M[T Staging Classification]
                    N[N Staging Classification]
                    O[M Staging Classification]
                end
            end
        end
    end
    B --> P[Final TNM Classification]
    P --> Q[T Category Output]
    P --> R[N Category Output]
    P --> S[M Category Output]
```

----

### ![image](https://hackmd.io/_uploads/r1hvvkmmex.png =128x)
***
[dspy.ai document](https://dspy.ai/)
```python=
from typing import Literal

import dspy

# Configure the language model that DSPy modules will call.
lm = dspy.LM('openai/gpt-4o-mini', api_key='YOUR_OPENAI_API_KEY')
dspy.configure(lm=lm)


class Classify(dspy.Signature):
    """Classify sentiment of a given sentence."""

    sentence: str = dspy.InputField()
    sentiment: Literal['positive', 'negative', 'neutral'] = dspy.OutputField()
    confidence: float = dspy.OutputField()


# Build a predictor from the signature and run it on one sentence.
classify = dspy.Predict(Classify)
classify(sentence="This book was super fun to read, though not the last chapter.")

# Possible output:
# Prediction(
#     sentiment='positive',
#     confidence=0.75
# )
```

---

### System II: Prompt Design
***
IASLC 8th ed.
lung cancer staging system

- Intern doctor 1: gpt-4o uses chain-of-thought to determine TNM staging and opinions.
- Intern doctor 2: gemini-2 uses chain-of-thought to determine TNM staging and opinions.
- Senior doctor: Reviews the staging decisions and reasoning from both intern doctors, then makes the final decision. This output serves as the answer.

---

### System II: Split Dataset
***
- Randomly split the 108 training reports into 54 few-shot reference cases and 54 validation cases.
- The original validation set and test set were retained for prediction inference and submission.
- The split is intended to prevent overfitting.

---

### System II: Optimization
***
- Used the MIPROv2 method to select 50 few-shot reference cases for automatic prompt-instruction tuning.
- The 54 validation cases were used to test which modifications performed better.
- The prompt was then optimized based on these results.

Krista Opsahl-Ong, Michael J. Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. 2024. Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs. arXiv:2406.11695 [cs].

----

### MIPROv2
***
[![miprov2](https://hackmd.io/_uploads/ryEq8kmmge.png)](https://x.com/michaelryan207/status/1804189184988713065)

---

### System II for subtask
***
- Same architecture as System II for the main task.
- LLM: Llama 3.3 70B (16-bit)
- 8 RTX 3090 GPUs (24GB VRAM each)
- 300 shots (limited by the LLM's context window)

---

### Analyze errors to refine the prompt
***
![image](https://hackmd.io/_uploads/H1RJIk1Qel.png)
Five cases of 'Tis' were misclassified as 'T1b'.

---

### Analyze errors to refine the prompt
***
Tumors without evidence of invasion should be prioritized for Tis classification, even if the size approaches criteria for other stages like T1b. Do not classify tumors as T1b based solely on size; clear pathological evidence of invasion is required. If there is uncertainty, always select the most conservative stage.
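
The 'Tis' versus 'T1b' confusion that motivated this rule is the kind of pattern a plain tally of (gold, predicted) pairs brings out. A minimal sketch, assuming gold and predicted T labels are available as two parallel lists; `confusion_pairs` is a hypothetical helper and the labels below are toy values invented for illustration, not results from the task data.

```python
from collections import Counter


def confusion_pairs(gold, pred):
    """Count (gold, predicted) label pairs where the prediction is wrong."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return Counter((g, p) for g, p in zip(gold, pred) if g != p)


# Toy labels, invented for illustration only.
gold = ["Tis", "Tis", "T1b", "T2a", "Tis"]
pred = ["T1b", "T1b", "T1b", "T2b", "Tis"]

# List the most frequent error types first.
for (g, p), count in confusion_pairs(gold, pred).most_common():
    print(f"{g} -> {p}: {count}")
```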
![image](https://hackmd.io/_uploads/rJizUk1Qxx.png)

---

## Conclusion

---

### Main task, English track
***
System-I-MT-En secured first place:
- 65.43% joint fine accuracy
- 69.14% joint coarse accuracy
- Strong individual performance on
    - T (70.37%)
    - N (91.36%)
    - M (88.89%)

---

### Sub task, English track
***
System-II-ST-En achieved second place with an overall micro F2.0 score of 93.36%.

---

### Finding
***
This success is attributed not only to the use of ==**large language models**== but also to ==**few-shot**== prompt engineering and structured ==**reasoning**== in TNM classification.

---

### Finding
***
A key advantage of our approach is the ==**integration of expert medical knowledge**==: we consulted experienced doctors, a radiologist and a pathologist to validate and refine the system.

---

### Future work
***
Our efforts validate the potential of artificial intelligence in medical document analysis, establishing a framework for future clinical decision support systems.

---

:::warning
## Question :question:
:::

---

:::success
## Thank You
:::