112-1 Intro to AI (HW2)

# 112-1 Intro to AI

Release Date: 2023/10/05
TA Hours: 2023/10/12 (Thu.) 13:00 - 14:00 (On site, 知武401)
Due Date: 2023/10/17 21:00
**No late submissions will be accepted**

## Homework 2

###### tags:`1121iai`

In this assignment, you will build a **Machine Learning** **Ensemble Model** from several **classification models** to **stratify** patients with acute myeloid leukemia (AML) into three risk groups. The models use non-clinician-initiated data: the NCRI cohort for training and validation, and an independent SG cohort for testing.

TA: 賴鴻昇 b08611048@ntu.edu.tw
Original code provider: Ming-Siang Chang (張名翔, Marcus)

## Introduction

### Medical AI

* Medical AI combines medicine, healthcare, data science, and computer science to analyze diverse medical data.
* The goal is to improve patient care and optimize healthcare using data-driven approaches.

### AML

* Acute myeloid leukemia (AML) is a life-threatening blood cancer that develops in the bone marrow and is characterized by abnormal white blood cells.
* Personalized treatment strategies for AML are necessary, and they rely heavily on accurate risk stratification.

![](https://hackmd.io/_uploads/BkbhNoWp3.png)
**Figure 1.** The simplified treatment paradigm of AML [1]

### ELN 2017

The European LeukemiaNet (ELN) recommendations provide multiple pre-defined biomarkers for the diagnosis and management of AML. In this homework, we compare the results of our ensemble model with the ELN 2017 results.

### Non-Clinician data

* In this homework, we use non-clinician-initiated data as the training input for risk stratification, because clinician-initiated data may lead to models that mirror existing clinical decisions and potentially render the risk stratification redundant.

![](https://hackmd.io/_uploads/ByciVjW62.png)
**Figure 2.** Clinician-initiated and non-clinician-initiated data are distinguished by their proximity as readouts of patients [2]

### Concordance Index (C-index)

The concordance index measures the accuracy of predictions in survival analysis by assessing how well the predicted ordering of outcomes agrees with the observed events. A higher C-index indicates better predictive accuracy.

https://zhuanlan.zhihu.com/p/485401349

### Ensemble Learning

Ensemble learning is a machine learning technique that combines multiple models to create a more robust and accurate predictor.

* **Individual models** (base models) form the building blocks of an ensemble. They can be of the same or different types, trained on the same or different subsets of data.
* Ensemble performance benefits from **diverse** base models that capture different aspects of the data and exhibit varying strengths and weaknesses.
* Ensemble models use different strategies to **aggregate** predictions from base models: Voting, Stacking, Bagging, Boosting, ...

## Sample Code and Data

https://colab.research.google.com/drive/1wPda0wHZ6LRg5lXeiOGOaVeLsdR0A-1k?usp=sharing

## Instructions

**Please set all `random_state` to `2023` if models and functions have this parameter.**

Please follow the steps to build your own model!

1. Install Dependencies
2. Complete the code blocks marked with a **TODO** comment
3. Run the code once to make sure it works! (Base: 60 points)
4. Try the following train/validation ratios: 9:1, 8:2, 7:3, 6:4, 5:5. **Report** the best **ensemble model** result **on the testing set** using **C-index** as the benchmark. (+10 points)
5. Add the missing classification models listed under Classification Models Requirement below, and make sure they work! (+20 points)
6. Using an **8:2** train/validation ratio, **report** the **C-index** ranking of the classification models and the ensemble model **on the validation set**. (+10 points)

(100 points! Congrats!)

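As a rough illustration of the two ideas used in the steps above (aggregating base-model predictions and scoring them with the C-index), here is a minimal standard-library sketch. It is not the sample notebook's implementation: `soft_vote` and `concordance_index` are hypothetical helper names, soft voting is just one of the aggregation strategies listed earlier, the class encoding (0 = lowest risk, 2 = highest risk) is an assumption, and the five patients are made-up toy data. The C-index follows the usual Harrell definition and, for simplicity, skips pairs with tied survival times.

```python
from itertools import combinations


def soft_vote(probas):
    """Average per-model class probabilities; return the argmax class per patient.

    probas[m][p][c] is the probability model m assigns to class c for patient p.
    """
    n_models = len(probas)
    n_patients = len(probas[0])
    n_classes = len(probas[0][0])
    labels = []
    for p in range(n_patients):
        mean = [sum(m[p][c] for m in probas) / n_models for c in range(n_classes)]
        labels.append(max(range(n_classes), key=mean.__getitem__))
    return labels


def concordance_index(times, events, risks):
    """Harrell's C-index, where a higher risk score should mean shorter survival."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the shorter one is an event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied risk scores count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up): survival time, event flag (1 = event, 0 = censored).
times = [5, 10, 15, 20, 25]
events = [1, 1, 0, 1, 0]

# Class probabilities from two hypothetical base models (assumed classes: 0, 1, 2).
model_a = [[0.1, 0.2, 0.7], [0.2, 0.5, 0.3], [0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.7, 0.2, 0.1]]
model_b = [[0.2, 0.2, 0.6], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1], [0.2, 0.6, 0.2], [0.8, 0.1, 0.1]]

risk_groups = soft_vote([model_a, model_b])
c_index = concordance_index(times, events, risk_groups)
print(risk_groups, c_index)
```

Because the predicted risk group takes only three values, many pairs end up with tied risk scores; counting ties as 0.5 is the common convention, but the utility used in the sample code may handle ties or the score direction differently, so use that utility for the numbers you report.
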
---

**BONUS (+10 points): Please submit a separate Python file**

Try everything you can (including changing `random_state`) to reach a **C-index > 0.62** on your testing data (external cohort). (A model without ensemble learning is also acceptable!) Please **print out** your final C-index in your code.

### Classification Models Requirement

* Decision Tree
* Random Forest (100 trees)
* XGBoost
* SVC (**linear** kernel, with **probability** enabled)
* KNN

### Performance Output

<div style="width: 120%; height: 120%;">
<img src="https://hackmd.io/_uploads/BJPaJXHbT.png" alt="Your Graph">
</div>

<div style="width: 120%; height: 120%;">
<img src="https://hackmd.io/_uploads/BkL7x7SWp.png" alt="Your Graph">
</div>

## Report Format

Please follow the format because we will use a program to grade your report. If the format is wrong but the answer is correct, **5 points** will be deducted as a penalty.

> Note that we will run your program to test it; please make sure your program is runnable and in `.py` format.

> File structure:

```
{student_id}_hw2 # In lowercase
├── {student_id}_hw2.txt
├── {student_id}_hw2.py
└── {student_id}_bonus.py
# Zip it and submit to NTU COOL -> {student_id}_hw2.zip

e.g.
b08611048_hw2
├── b08611048_hw2.txt
├── b08611048_hw2.py
└── b08611048_bonus.py # optional
-> b08611048_hw2.zip
```

Example answer for `{student_id}_hw2.txt` (please make sure you type the model names correctly):

```
5:5
Ensemble Model > Decision Tree > Random Forest > XGBoost > SVC > KNN
```

## Reference

1. https://www.airitilibrary.com/Publication/alDetailedMesh1?DocID=U0001-0505202100205400
2. https://www.nature.com/articles/s41746-021-00426-3#MOESM1
3. Dataset: [Unified classification and risk-stratification in Acute Myeloid Leukemia.](https://www.nature.com/articles/s41467-022-32103-8#MOESM1)
    * NCRI cohort (for training and validation)
    * SG cohort (for testing, external cohort)

## Acknowledgement

Thanks to B06/R10 張名翔 for helping establish the framework and utility scripts for this assignment.