# 112-1 Intro to AI
Release Date: 2023/10/05
TA Hours: 2023/10/12 (Thu.) 13:00 - 14:00 (On-site, 知武401)
Due Date: 2023/10/17 21:00
**No late submissions will be accepted**
## Homework 2
###### tags:`1121iai`
In this assignment, you will build a **Machine Learning** **Ensemble Model** from several **classification models** to **stratify** patients with acute myeloid leukemia (AML) into three risk groups. The models are trained and validated on non-clinician-initiated data from the NCRI cohort and tested on an independent SG cohort.
TA: 賴鴻昇 b08611048@ntu.edu.tw
Original code provider: Ming-Siang Chang (張名翔, Marcus)
<!--
Resource Documentation
Slides & Colab Link:
https://drive.google.com/drive/u/0/folders/1SytydSDeElSY-Gitr7A3pVOB-nrmmPhX
Github:
https://github.com/hardness1020/Medical_Data_Science_Tutorial
-->
## Introduction
### Medical AI
* Medical AI combines medicine, healthcare, data science, and computer science to analyze diverse medical data.
* The goal is to improve patient care and optimize healthcare using data-driven approaches.
### AML
* Acute myeloid leukemia (AML) is a fatal blood condition that develops in the bone marrow and is caused by abnormal white blood cells.
* Personalized treatment strategies for AML are necessary, and they rely heavily on accurate risk stratification.

**Figure 1.** The simplified treatment paradigm of AML [1]
### ELN 2017
The European LeukemiaNet (ELN) recommendations provide multiple pre-defined biomarkers for the diagnosis and management of AML. In this homework, we compare the results of our ensemble model with the ELN 2017 results.
### Non-Clinician-Initiated Data
* For risk stratification in this homework, we use non-clinician-initiated data as training input. Clinician-initiated data may lead to models that mirror existing clinical decisions, which could render the risk stratification redundant.

**Figure 2.** Clinician-initiated and non-clinician-initiated data are distinguished by their proximity as readouts of patients [2]
### Concordance Index (C-index)
The concordance index measures the accuracy of predictions in survival analysis, assessing how well predicted outcomes align with actual events. A higher C-index indicates better predictive accuracy.
https://zhuanlan.zhihu.com/p/485401349
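
To make the definition concrete, here is a minimal sketch of a pairwise (Harrell-style) C-index in plain Python. It is an illustration only: the function name `concordance_index` and the toy data are assumptions, ties in survival time are ignored for simplicity, and the utility scripts in the sample code may compute the metric differently (for example, through a survival analysis library).

```python
def concordance_index(times, events, risk_scores):
    """Simplified C-index: fraction of comparable pairs ordered correctly.

    A pair (i, j) is comparable when patient i had the event (events[i] == 1)
    and did so strictly before patient j's observed time. The pair is
    concordant when the patient who failed earlier has the higher risk score.
    Tied risk scores count as half a concordant pair.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("No comparable pairs; the C-index is undefined.")
    return concordant / comparable


# Toy data (made up for illustration): survival time, event flag, predicted risk.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.6, 0.7, 0.1]

toy_c_index = concordance_index(toy_times, toy_events, toy_risks)
print(f"C-index: {toy_c_index:.2f}")
```

On this toy data, four of the five comparable pairs are ordered correctly, so the value is 0.80. A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking.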
### Ensemble Learning
Ensemble learning is a machine learning technique that combines multiple models to create a more robust and accurate predictor.
* **Base models** are the individual models that form the building blocks of an ensemble. They can be of the same or different types, trained on the same or different subsets of data.
* Ensemble performance benefits from **diverse** base models that capture different aspects of the data and exhibit varying strengths and weaknesses.
* Ensemble models use different strategies to **aggregate** predictions from base models: Voting, Stacking, Bagging, Boosting, ...
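
As one illustration of the aggregation step, the sketch below shows soft voting: the class probabilities of each base model are averaged, and the class with the highest mean probability wins. The function name `soft_vote` and the probability values are made up for this example, and the sample code may aggregate its base models differently (for example, with learned weights).

```python
def soft_vote(probabilities_per_model):
    """Average class probabilities across models and pick the most likely class.

    probabilities_per_model[m][s][c] is the probability that model m assigns
    to class c for sample s. Returns (averaged probabilities, predicted labels).
    """
    n_models = len(probabilities_per_model)
    n_samples = len(probabilities_per_model[0])
    n_classes = len(probabilities_per_model[0][0])
    averaged = [
        [
            sum(model[s][c] for model in probabilities_per_model) / n_models
            for c in range(n_classes)
        ]
        for s in range(n_samples)
    ]
    labels = [max(range(n_classes), key=row.__getitem__) for row in averaged]
    return averaged, labels


# Hypothetical outputs of three base models for two patients and three risk groups.
model_a = [[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]]
model_b = [[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]]
model_c = [[0.5, 0.3, 0.2], [0.3, 0.5, 0.2]]

averaged, labels = soft_vote([model_a, model_b, model_c])
print(labels)
```

Here the first patient is assigned to class 0 and the second to class 2, even though `model_b` and `model_c` each disagree with the ensemble on one patient.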
## Sample Code and Data
<!--
* Remove HyperOpt
* Change loss function to simply log_loss
* (optional) Reformat # Decide which models you want to use
* trim repeated variable -> some could be global variable (e.g., CUTOFF)
-->
https://colab.research.google.com/drive/1wPda0wHZ6LRg5lXeiOGOaVeLsdR0A-1k?usp=sharing
## Instructions
**Please set all `random_state` to `2023` if models and functions have this parameter.**
Please follow the steps to build your own model!
1. Install the dependencies.
2. Complete the code blocks marked with a **TODO** comment.
3. Run the code once to make sure it works! (Base: 60 points)
4. Try the following train/validation ratios: 9:1, 8:2, 7:3, 6:4, 5:5. **Report** the best **ensemble model** result **on the testing set** using **C-index** as the benchmark. (+10 points)
5. Add the missing classification models (see Classification Models Requirement below), and make sure they work! (+20 points)
6. Using an **8:2** train/validation ratio, **report** the **C-index** ranking of the classification models and the ensemble model **on the validation set**. (+10 points)
(100 points! Congrats!)
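
The ratio sweep in the steps above could be organized as in the following sketch. It assumes scikit-learn's `train_test_split` and uses synthetic data from `make_classification` as a stand-in for the NCRI cohort; the variable names are illustrative and are not taken from the sample code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

RANDOM_STATE = 2023  # the assignment asks for random_state = 2023 everywhere

# Synthetic three-class data standing in for the real training cohort.
X, y = make_classification(
    n_samples=200,
    n_features=10,
    n_informative=5,
    n_classes=3,
    random_state=RANDOM_STATE,
)

split_sizes = {}
for train_parts, valid_parts in [(9, 1), (8, 2), (7, 3), (6, 4), (5, 5)]:
    valid_fraction = valid_parts / (train_parts + valid_parts)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X,
        y,
        test_size=valid_fraction,
        stratify=y,  # assumption: keep class proportions similar in both parts
        random_state=RANDOM_STATE,
    )
    ratio = f"{train_parts}:{valid_parts}"
    split_sizes[ratio] = (len(X_train), len(X_valid))
    # Train the models on (X_train, y_train) here, then evaluate the C-index.
    print(ratio, split_sizes[ratio])
```

Stratifying the split is a design choice made for this sketch; follow the sample code if it splits the data differently.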
---
**BONUS (+10 points): Please submit a standalone Python file**
Try anything you can (including changing `random_state`) to reach a **C-index > 0.62** on the testing data (external cohort). A solution without ensemble learning is also acceptable. Please **print** your final C-index in your code.
### Classification Models Requirement
* Decision Tree
* Random Forest (100 trees)
* XGBoost
* SVC (**linear** kernel, with **probability** enabled)
* KNN
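
A minimal sketch of how the required models might be instantiated with scikit-learn is shown below. The dictionary layout, the `RANDOM_STATE` constant, and the synthetic data are assumptions made for illustration; the sample code may organize its models differently. XGBoost is a separate package, so the sketch only adds it when the import succeeds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

try:  # XGBoost is installed separately and may be missing
    from xgboost import XGBClassifier
except ImportError:
    XGBClassifier = None

RANDOM_STATE = 2023

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=RANDOM_STATE),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE),
    "SVC": SVC(kernel="linear", probability=True, random_state=RANDOM_STATE),
    "KNN": KNeighborsClassifier(),  # KNN has no random_state parameter
}
if XGBClassifier is not None:
    models["XGBoost"] = XGBClassifier(random_state=RANDOM_STATE)

# Synthetic three-class data, used only to check that every model runs.
X, y = make_classification(
    n_samples=150,
    n_features=8,
    n_informative=4,
    n_classes=3,
    random_state=RANDOM_STATE,
)

probabilities = {}
for name, model in models.items():
    model.fit(X, y)
    probabilities[name] = model.predict_proba(X)  # one column per risk group
    print(name, probabilities[name].shape)
```

Enabling `probability=True` on `SVC` is what makes `predict_proba` available, which matters if the ensemble aggregates class probabilities.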
### Performance Output
<div style="width: 120%; height: 120%;">
<img src="https://hackmd.io/_uploads/BJPaJXHbT.png" alt="Your Graph">
</div>
<div style="width: 120%; height: 120%;">
<img src="https://hackmd.io/_uploads/BkL7x7SWp.png" alt="Your Graph">
</div>
## Report Format
Please follow the format exactly, because we will use a program to grade your report. If the format is wrong but the answer is correct, **5 points** will be deducted as a penalty.
> Note that we will run your program to test it; please make sure your program is runnable and in `.py` format.
>
Files Structure:
```
{student_id}_hw2 # In lowercase
├── {student_id}_hw2.txt
├── {student_id}_hw2.py
└── {student_id}_bonus.py
# Zip it and submit to NTU COOL
-> {student_id}_hw2.zip
e.g.
b08611048_hw2
├── b08611048_hw2.txt
├── b08611048_hw2.py
└── b08611048_bonus.py # optional
-> b08611048_hw2.zip
```
Example Answer for `{student_id}_hw2.txt` (Please make sure the model names are typed exactly as shown):
```
5:5
Ensemble Model > Decision Tree > Random Forest > XGBoost > SVC > KNN
```
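
One way to produce this text without typing errors is to build it from a dictionary of scores, as in the sketch below. It assumes that the first line is the best train/validation ratio and the second line is the ranking, as the example suggests. The function name `format_ranking` and all C-index values are made-up placeholders, not real results.

```python
def format_ranking(c_index_by_model):
    """Join model names from highest to lowest C-index with ' > '."""
    ranked = sorted(c_index_by_model, key=c_index_by_model.get, reverse=True)
    return " > ".join(ranked)


# Placeholder values for illustration only; use your own measured C-indices.
validation_c_index = {
    "Ensemble Model": 0.70,
    "Decision Tree": 0.55,
    "Random Forest": 0.66,
    "XGBoost": 0.64,
    "SVC": 0.61,
    "KNN": 0.58,
}
best_ratio = "8:2"  # placeholder for the best ratio you found

report = f"{best_ratio}\n{format_ranking(validation_c_index)}\n"
print(report, end="")

# To save the report, write `report` to {student_id}_hw2.txt, for example:
# with open("b08611048_hw2.txt", "w", encoding="utf-8") as f:
#     f.write(report)
```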
## Reference
1. https://www.airitilibrary.com/Publication/alDetailedMesh1?DocID=U0001-0505202100205400
1. https://www.nature.com/articles/s41746-021-00426-3#MOESM1
1. Dataset: [Unified classification and risk-stratification in Acute Myeloid Leukemia.](https://www.nature.com/articles/s41467-022-32103-8#MOESM1)
   * NCRI cohort (for training and validation)
   * SG cohort (for testing, external cohort)
## Acknowledgement
Thanks to Ming-Siang Chang (張名翔, B06/R10) for helping establish the framework and utility scripts for this assignment.