HIDA COVID Alpha X Team Hackathon
===============
## Outcome
- **2nd place** in the Classification Challenge - congratulations to the whole team!
## Basic Info
International virtual COVID-data challenge
- HIDA, ELLIS, IDSI, Helmholtz AI
- https://www.helmholtz-hida.de/en/activities/events/details/international-virtual-covid-data-challenge/
- Repo: https://github.com/JeniaJitsev/HIDA_COVID_Alpha_X_hackathon
## Submission
- **IMPORTANT**: deadline shifted to **16:30**
- Collaborative slides link for everyone: https://docs.google.com/presentation/d/1V7TT8YwQFYeoM8lCg3h_bdIGeqz83DCAkWswpu3wnCs/edit?usp=sharing
- In case a video is an option: instructions for the OBS workflow
- https://drive.google.com/file/d/1IXqOo9b3ugCspR7PATqhAsjhixyqIPg3/view
## Next meetup
- Proposal: Exchange before submission, 15:30
- Room: https://webconf.fz-juelich.de/b/jit-ffc-wyd
- 29.04, 10:15 https://bbb.hzdr.de/b/hel-wwv-d67
## Meetup Together Pre-Submission
- **IMPORTANT**: deadline shifted to **16:30**
- Meeting: 15:30
- Room : https://webconf.fz-juelich.de/b/jit-ffc-wyd
- Submission file
- Ultimate deadline for dropping all stuff in: 15:45
- HackathonCovidSubmissions (https://hmgubox2.helmholtz-muenchen.de/index.php/s/Hz6MKFXFw6nxFo7)
- presentation slides : https://docs.google.com/presentation/d/1V7TT8YwQFYeoM8lCg3h_bdIGeqz83DCAkWswpu3wnCs/edit?usp=sharing
### Discussion Presubmission
## Submission procedure
1. use /testSet/testSet.txt as template
2. Rename with a team name (Team_COVID_Alpha_X)
3. Task 1: Replace NaNs with imputed values
4. Task 2: Fill-in prognosis with MILD or SEVERE
Put into folder HackathonCovidSubmissions (https://hmgubox2.helmholtz-muenchen.de/index.php/s/Hz6MKFXFw6nxFo7) : before 16:00
- Slides with a short explanation of the approach and of the solution itself
- Jenia: collaborative slides link for everyone: https://docs.google.com/presentation/d/1V7TT8YwQFYeoM8lCg3h_bdIGeqz83DCAkWswpu3wnCs/edit?usp=sharing
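The checks implied by steps 3 and 4 of the submission procedure above can be sketched in Python. This is a minimal sketch, not an official validator: the delimiter, the strings treated as missing, and the `Prognosis` column name (taken from the data overview in this note) are assumptions and should be adapted to the actual layout of `testSet.txt`.

```python
import csv
import io

# Assumption: these strings (case-insensitive) mark a value that is still missing.
MISSING_MARKERS = {"", "nan", "na"}
ALLOWED_PROGNOSIS = {"MILD", "SEVERE"}


def check_submission(text, delimiter=","):
    """Return a list of human-readable problems found in a submission table."""
    problems = []
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    for row_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            if column is None:
                # DictReader stores surplus fields under the key None.
                problems.append(f"row {row_number}: more fields than header columns")
                continue
            if (value or "").strip().lower() in MISSING_MARKERS:
                problems.append(f"row {row_number}: {column} is still missing")
        prognosis = (row.get("Prognosis") or "").strip()
        if prognosis not in ALLOWED_PROGNOSIS:
            problems.append(f"row {row_number}: invalid Prognosis {prognosis!r}")
    return problems


# Hypothetical two-row example, not real hackathon data.
example = "PatientID,Age,Prognosis\nP_1,54,MILD\nP_2,NaN,UNKNOWN\n"
for problem in check_submission(example):
    print(problem)
```

An empty list means that no missing values remain and every prognosis is either MILD or SEVERE.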
## Meetup Together Discussion 29.04
- Room : https://bbb.hzdr.de/b/hel-wwv-d67
- Submission file
- 1 hour before - 15:00; ultimate deadline: 15:30
- presentation slides : can be already started describing methods; results plugging in later
### Installing relevant packages
- e.g. datawig, fancyimpute, etc.
See here: https://github.com/JeniaJitsev/HIDA_COVID_Alpha_X_hackathon/blob/main/helpers/test.md
## Task 2 Blitz Discussion 29.04
Room: https://webconf.fz-juelich.de/b/jit-ffc-wyd
- Deal with missing images
- an option: use an additional mask vector, eg [0 1 .. 0], where each entry indicates whether input is missing or not
- only images missing : a 2-dim vector eg [0 1] for case when imaging is missing, table is on
- when image is missing: let image network feature vector become zeros (or any other reasonable dummy value), do not apply any input to image encoder
- full missing data mask : an n+1-dim vector (n-dim of the table vector) with those entries that are missing indicated by 0
- nearest neighbor on training set to replace missing image input
- first usual BCE loss for disease severity prediction only
- if works, try loss including clinical data ground truth vector as additional teacher signal
- Build classifier by fine-tuning on X-Ray set
- Plug into fused architecture (using mask vector as additional input?)
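The mask-vector option discussed above can be sketched with NumPy. This is only an illustration under assumptions: the function name `build_fused_input`, the feature dimension and the dummy value 0 are hypothetical choices, and the mask uses the convention from the notes (0 = missing, 1 = present).

```python
import numpy as np


def build_fused_input(table_values, image_features, feature_dim):
    """Concatenate table values, image features and an (n+1)-dim availability mask.

    Mask convention: 0 = missing, 1 = present.
    `image_features` is None when the X-ray image is missing.
    """
    table = np.asarray(table_values, dtype=np.float32)
    table_mask = (~np.isnan(table)).astype(np.float32)
    # Missing table entries get a dummy value (0 here, an arbitrary choice).
    table_filled = np.where(np.isnan(table), 0.0, table).astype(np.float32)

    if image_features is None:
        # The image encoder is skipped; its feature vector becomes zeros.
        image_vec = np.zeros(feature_dim, dtype=np.float32)
        image_mask = np.zeros(1, dtype=np.float32)
    else:
        image_vec = np.asarray(image_features, dtype=np.float32)
        image_mask = np.ones(1, dtype=np.float32)

    mask = np.concatenate([table_mask, image_mask])  # n + 1 entries
    fused = np.concatenate([table_filled, image_vec, mask])
    return fused, mask


# Hypothetical patient: second table value missing, no image available.
fused, mask = build_fused_input([63.0, np.nan, 37.5], image_features=None, feature_dim=4)
print(mask)
print(fused.shape)
```

The fused vector could then be fed to the classifier head, so that the network can learn to ignore the dummy values whenever the corresponding mask entry is 0.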
## Access and Resources
- Accessing the dataset https://hida-hackathon2021.scc.kit.edu/data/#accessing-the-dataset
- Github repo: https://github.com/JeniaJitsev/HIDA_COVID_Alpha_X_hackathon
- REMINDER: please put your **github names** here -->
- mehdidc, stkmrc, niclaspopp, hannahrabea, shravaniCD
## Dataset
Dataset: `/hkfs/work/workspace/scratch/ej4555-hida2021/HackathonCovidData`
- create shortcut : `ln -s /hkfs/work/workspace/scratch/ej4555-hida2021/HackathonCovidData dataset`
- Using the dataset
We propose the following steps to handle the data sets:
1. Create a workspace
`ws_allocate hida_workspace 30`
2. Create a symlink to your workspace
`ln -s $(ws_find hida_workspace) $HOME/hida_workspace`
3. Copy the HackathonCovidData into your workspace
`cp -vr /hkfs/work/workspace/scratch/ej4555-hida2021/HackathonCovidData $HOME/hida_workspace/`
## Workflow
### Installing relevant packages
- e.g. datawig, fancyimpute, etc.
See here: https://github.com/JeniaJitsev/HIDA_COVID_Alpha_X_hackathon/blob/main/helpers/test.md
### Virtual env:
```bash
python -m venv hack_env
source hack_env/bin/activate
```
### Package install example:
1. Open New Terminal : File->New->Terminal
2. Create and source virtual env
```bash
python -m venv hack_env
source hack_env/bin/activate
```
3. Install a package, e.g. `pip install opencv-python`
### Working from Terminal in interactive mode
- Requires only ssh login, Jupyter not necessary (but can be used with Jupyter as well when starting Terminal there)
https://www.nhr.kit.edu/userdocs/horeka/Slurm%3A_Interactive_Jobs/
- examples of interactive sessions:
- Run an interactive bash session with 1 GPU for 1 hour
```bash
srun --partition=haicore-gpu4 --gres=gpu:1 --time=01:00:00 --pty bash -i
```
- Run a script as a batch job
```bash
sbatch --partition=haicore-gpu4 --gres=gpu:1 --time=01:00:00 my_script.sh
```
### Creating a workspace
1. Create a workspace
`ws_allocate hida_workspace 30`
2. Create a symlink to your workspace
`ln -s $(ws_find hida_workspace) $HOME/hida_workspace`
## Team
* COVID Alpha X
* members
* Helene
* Ashish
* Hannah
* Marco
* Niclas
* Shravani
* Mehdi
* coach
* Jenia
* General
* Mehdi: metrics, evaluation, testing https://github.com/JeniaJitsev/HIDA_COVID_Alpha_X_hackathon/tree/main/code_container
* Task 1
* Helene --> using `SimpleImputer` from datawig
* [Notebook for imputation](https://github.com/JeniaJitsev/HIDA_COVID_Alpha_X_hackathon/blob/main/code_testing/data_wig_impute/Notebook.ipynb)
* [imputations methods](https://github.com/JeniaJitsev/HIDA_COVID_Alpha_X_hackathon/blob/main/code_testing/data_wig_impute/impute_datawig.py)
* mean score using [tests.py](https://github.com/JeniaJitsev/HIDA_COVID_Alpha_X_hackathon/blob/main/code_container/tests.py): 0.33
* Marco -> using MICE package in R
* https://github.com/JeniaJitsev/HIDA_COVID_Alpha_X_hackathon/tree/main/MICE_Imputation
* Shravani
* Hannah --> using `IterativeImputer` from scikit-learn
* Task 2
* Mehdi -> first step: basic skeleton of repo, validation metrics
* Ashish
* Niclas
[Grand Scheme Image](https://github.com/JeniaJitsev/HIDA_COVID_Alpha_X_hackathon/blob/main/images/scheme_1.png)
## Dataloaders
- X-Ray examples:
- have a look here - a lot of data readers already implemented: https://github.com/mlmed/torchxrayvision#dataset-tools
- https://github.com/mlmed/torchxrayvision/blob/master/torchxrayvision/datasets.py
## Discussion
- Work distribution
- Subteam for clinical data, subteam for images? Or concentrate all together on baseline solution for task 1 (clinical data, no images, missing values and main predictor disease severity) first?
### Task 1 (using clinical data)
#### Infos
- Test code for datawig data imputation:
https://github.com/JeniaJitsev/HIDA_COVID_Alpha_X_hackathon/blob/main/code_testing/data_wig_impute/test_impute_datawig.py
#### Material & Ideas
Ashish: A nice idea would be to go through this link: https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e There is a cool library called datawig (by AWS).
- Marco: Here is a quick overview of the data structure, how many values are missing:
| | NAs | total |
| -------- | -------- | -------- |
| PatientID | 0 | 863 |
| ImageFile | 0 | 863 |
| Hospital | 0 | 863 |
| Age | 1 | 863 |
| Sex | 0 | 863 |
| Temp_C | 154 | 863 |
| Cough | 5 | 863 |
| DifficultyInBreathing | 4 | 863 |
| WBC | 9 | 863 |
| CRP | 33 | 863 |
| Fibrinogen | 591 | 863 |
| LDH | 136 | 863 |
| Ddimer | 621 | 863 |
| Ox_percentage | 243 | 863 |
| PaO2 | 170 | 863 |
| SaO2 | 583 | 863 |
| pH | 207 | 863 |
| CardiovascularDisease | 19 | 863 |
| RespiratoryFailure | 159 | 863 |
| Prognosis | 0 | 86 |
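An overview like the one above can be produced with pandas. The sketch below uses a small hypothetical frame instead of the real clinical table; only the column names are borrowed from the overview.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real clinical table.
df = pd.DataFrame(
    {
        "Age": [54.0, np.nan, 71.0],
        "Temp_C": [np.nan, np.nan, 38.2],
        "Prognosis": ["MILD", "SEVERE", "MILD"],
    }
)

# One row per column: number of missing values and total number of rows.
overview = pd.DataFrame({"NAs": df.isna().sum(), "total": len(df)})
print(overview)
```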
- If somebody has a Jupyter notebook on clinical data exploration and snippets of code to try - push it into [our repo](https://github.com/JeniaJitsev/HIDA_COVID_Alpha_X_hackathon), so that everyone can try it as well
Helene: I would like to try a linear model to predict the missing values based on the other patients who have those values. I think just using a distribution of the values over all patients would neglect the correlations between certain values. The simplest thing I can think of is to find the 2-3 most important features for each feature to be predicted
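A minimal NumPy sketch of this idea, under assumptions: "most important" is interpreted here as the highest absolute Pearson correlation with the target column, and the function name `impute_column_linear` is hypothetical. It is an illustration, not the code the team actually used.

```python
import numpy as np


def impute_column_linear(data, target, k=2):
    """Fill NaNs in column `target` from its k most correlated columns.

    Only rows where the target and all chosen predictors are observed are
    used for fitting; rows with missing predictors stay NaN.
    """
    data = np.array(data, dtype=float)  # work on a copy
    scores = []
    for col in range(data.shape[1]):
        if col == target:
            continue
        both = ~np.isnan(data[:, col]) & ~np.isnan(data[:, target])
        if both.sum() < 3:
            continue
        r = np.corrcoef(data[both, col], data[both, target])[0, 1]
        if not np.isnan(r):
            scores.append((abs(r), col))
    predictors = [col for _, col in sorted(scores, reverse=True)[:k]]
    if not predictors:
        return data

    X = data[:, predictors]
    complete = ~np.isnan(X).any(axis=1)
    train = complete & ~np.isnan(data[:, target])
    fill = complete & np.isnan(data[:, target])
    if train.sum() == 0 or fill.sum() == 0:
        return data

    # Ordinary least squares with an intercept term.
    A = np.column_stack([X[train], np.ones(train.sum())])
    coef, *_ = np.linalg.lstsq(A, data[train, target], rcond=None)
    data[fill, target] = np.column_stack([X[fill], np.ones(fill.sum())]) @ coef
    return data


# Hypothetical toy data: column 2 equals 2 * column 0 + 1, last value missing.
rows = [[1, 5, 3], [2, 1, 5], [3, 4, 7], [4, 2, 9], [5, 3, np.nan]]
filled = impute_column_linear(rows, target=2, k=1)
print(filled[4, 2])
```

Because the toy target is an exact linear function of column 0, the filled value lands very close to 11; on real clinical data the fit would of course be noisier.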
Jenia: MICE (see below) could be a strong baseline for the missing-value problem; it seems straightforward to use and goes beyond a simple linear model approach
- it is built into `scikit-learn` as IterativeImputer https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html
- the library it originates from offers other impute methods as well: https://github.com/iskandr/fancyimpute
- Jenia: On missing data imputation baseline
- "Another algorithm of fancyimpute that is more robust than KNN is MICE (Multiple Imputations by Chained Equations). **MICE** performs multiple regression for imputing. Use the below code snippet to run MICE,
`from fancyimpute import IterativeImputer`
`mice_impute = IterativeImputer()`
`traindatafill = mice_impute.fit_transform(traindata)`
IterativeImputer was merged into **scikit-learn** from fancyimpute. However, it can still be imported from fancyimpute."
https://github.com/iskandr/fancyimpute
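For the scikit-learn variant, a minimal sketch could look as follows. The toy matrix and the parameter values are assumptions for illustration. Note that in scikit-learn `IterativeImputer` is still marked experimental and has to be enabled with an explicit import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Hypothetical toy matrix; columns stand in for numeric clinical features.
train_data = np.array(
    [
        [63.0, 37.5, 95.0],
        [48.0, np.nan, 98.0],
        [71.0, 38.4, np.nan],
        [55.0, 36.9, 97.0],
    ]
)

# Each feature with missing values is regressed on the others, round-robin.
imputer = IterativeImputer(max_iter=10, random_state=0)
train_filled = imputer.fit_transform(train_data)
print(np.isnan(train_filled).any())
```

Categorical columns would need to be encoded numerically first, since the default estimator is a regression model.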
- Jenia: more advanced approaches (beyond the baseline) for the missing-value problem on clinical data would be generative models (GANs, VAEs), see the examples below
- An advantage of such methods could be that one would be able to train end-to-end together with an image encoder on a common final loss, but the workshop is too short for that, I assume
- Example using GANs: GAIN - a very strong work (a lot of follow ups): Jinsung Yoon, James Jordon, Mihaela van der Schaar, "GAIN: Missing Data Imputation using Generative Adversarial Nets," International Conference on Machine Learning (ICML), 2018.
Paper Link: http://proceedings.mlr.press/v80/yoon18a/yoon18a.pdf
https://github.com/jsyoon0823/GAIN
https://github.com/dhanajitb/GAIN-Pytorch
- https://openaccess.thecvf.com/content_CVPR_2020/html/Yoon_GAMIN_Generative_Adversarial_Multiple_Imputation_Network_for_Highly_Missing_Data_CVPR_2020_paper.html
- Example using VAE: Handling Incomplete Heterogeneous Data using VAEs
- https://arxiv.org/abs/1807.03653
- "Variational autoencoders (VAEs), as well as other generative models, have been shown to be efficient and accurate for capturing the latent structure of vast amounts of complex high-dimensional data. However, existing VAEs can still not directly handle data that are heterogenous (mixed continuous and discrete) or incomplete (with missing data at random), which is indeed common in real-world applications. In this paper, we propose a general framework to design VAEs suitable for fitting incomplete heterogenous data. The proposed HI-VAE includes likelihood models for real-valued, positive real valued, interval, categorical, ordinal and count data, and allows accurate estimation (and potentially imputation) of missing data. Furthermore, HI-VAE presents competitive predictive performance in supervised tasks, outperforming supervised models when trained on incomplete data."
- https://github.com/probabilistic-learning/HI-VAE
https://github.com/probabilistic-learning/HI-VAE/blob/master/script_HIVAE.sh
- Advantage: can handle any kind of values - continuous, discrete, so would fit into mixed value case in clinical data here.
### Task 2 (using X-ray Images)
#### Info
- using 320x320 resizing (as in pre-trained CheXPert models)
#### Material
- https://github.com/albumentations-team/autoalbument (game-changing augmentation)
- https://github.com/PyTorchLightning/pytorch-lightning/blob/master/notebooks/01-mnist-hello-world.ipynb (PyTorch Lightning)
- Ashish: https://github.com/facebookresearch/CovidPrognosis (by Facebook). Mehdi shared this and it looks great: a pre-trained model is already available and we can fine-tune it.
- Jenia: hint - this is unsupervised (self-supervised, via contrastive loss) pre-training, so it may be weaker and have more troubles than supervised pre-trained models (see below, eg BiT). However, unsupervised pre-training is cool, so worth trying. Maybe worth comparing a supervised pre-trained and a self-supervised pre-trained one.
- Ashish: https://github.com/jfhealthcare/Chexpert pretrained model available (top 5th solutions of https://stanfordmlgroup.github.io/competitions/chexpert/)
- Jenia: Further pre-trained models (on NIH ChestX-ray14) based on a strong paper (from Chicago hospital): https://github.com/IVPLatNU/deepcovidxr
- Marco: Review on performance of transfer learning for COVID 19 detection on X-ray images:
https://pubmed.ncbi.nlm.nih.gov/32773400/
- Marco: If we want to do lung segmentation, here are two pretrained models I found:
- https://github.com/haimingt/opacity_segmentation_covid_chest_X_ray
(Paper: https://www.medrxiv.org/content/10.1101/2020.10.19.20215483v1.full)
- https://github.com/raghavian/lungVAE
(Paper: https://arxiv.org/abs/2005.10052)
- Jenia: for transfer learning, downloading a pre-trained BiT model can be considered. BiT is currently state of the art in transfer (via supervised pre-training on classification task, ImageNet-1k, 21k).
- See papers here:
- Supervised Transfer Learning at Scale for Medical Imaging (Google Health) : https://arxiv.org/abs/2101.05913
- Big Transfer (BiT): General Visual Representation Learning: https://arxiv.org/abs/1912.11370 (original BiT paper, Google Research Labs Zuerich)
- and repo here https://github.com/google-research/big_transfer#how-to-fine-tune-bit
- Jenia: there are further pre-trained models (based on EfficientNet, ViT), also using X-ray datasets for pre-training (CheXpert)
- DenseNet-121 based models pretrained on CheXpert or MIMIC-CXR, or other large X-Ray datasets, are here: https://github.com/mlmed/torchxrayvision#models-demo-notebook
## Schedule
### April 28, 2021
- 11:00 AM in Zoom
Welcome to the International COVID-DATA Challenge
Helmholtz Information & Data Science Academy
Helmholtz AI & Ellis Netzwerk
Israel Data Science Initiative Zoom
- 11:15 AM in Zoom
Networking session – Get to know each other
- 11:25 AM in Zoom
Introduction to our challenge incl. Q&A
Helmholtz AI
Ellis Netzwerk
BRACCO Imaging/Centro Diagnostico Italiano
- 11:50 AM in Zoom
Introduction to the Computer resources by our partners
HAICORE, Karlsruhe Institute for Technology
NVIDIA
- 12:15 PM in Slack
Teambuilding via Slack & time for your challenge
- 4:00 PM in Zoom
"Coffee Break" - Short team introductions, time for questions & support by the Data Coaches
- 4:20-10:00 PM in Slack
More time for your challenges
### April 29, 2021
- 10:00 AM in Zoom
Welcome back: A brief overview of day #2
Helmholtz Information & Data Science Academy
- 10:10 AM in Slack
More time for your challenges and to prepare your submission and video
- 4:00 PM
Deadline: Submission of solution and video
- 4:30 PM in Zoom
Solutions for the ages – a short crash course on sustainable software development
Helmholtz-Zentrum Dresden-Rossendorf
- 5:00 PM in Zoom
Award ceremony & thank you
Please note: All times are in CEST