# 2022, NTU/TVGH ICD coder FL progress
2022/8/4
All liver_CT data has been uploaded to AIOT.
Path: /ifs/isilond/liver_CT/HCCCT/

2022/7/11
1. **NTU FL source code**:
https://vghteams-my.sharepoint.com/personal/ycchu5_vghtpe_gov_tw/_layouts/15/onedrive.aspx?ga=1&id=%2Fpersonal%2Fycchu5%5Fvghtpe%5Fgov%5Ftw%2FDocuments%2F2022%2C%20%E8%B3%B4%E9%A3%9B%E7%BE%86%2C%20FL%2C%20%E5%8F%B0%E5%A4%A7%2C%20%E5%8C%97%E6%A6%AE%2C%20%E4%BA%9E%E6%9D%B1%2C%20ICD%20coder%2F2022%2C%200711%2C%20run%5Ftrainer%5FFL
2. **Time**: 2018~2020
3. **Input**: VGHTPE discharge summaries
4. **Output**: ICD-10 codes (a generic model-setup sketch follows this list)
5. **Train size**: 252,282 (80%)
6. **Test size**: 31,253 (10%)
7. **Validation size**: 31,253 (10%)
8. **NV DGX-1 GPU**: V100*1, batch size = 16
9. **Leadtek AIOT GPU**: V100*1, batch size = 32
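For context, a minimal sketch of a generic multi-label setup of this kind (free-text discharge summary in, one probability per ICD-10 code out), using the transformers library pinned elsewhere in this log. The checkpoint name, label count, and threshold are placeholders; the actual NTU model and training loop are in the linked source code.
```
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholders: the real checkpoint and the size of the ICD-10 label list
# come from the NTU source code and the CM.csv label file.
CHECKPOINT = "bert-base-uncased"
NUM_LABELS = 1000

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # sigmoid + BCE, one logit per ICD code
)

text = "Discharge diagnosis: hepatocellular carcinoma ..."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_idx = (torch.sigmoid(logits)[0] > 0.5).nonzero().flatten().tolist()
print("predicted ICD-10 label indices:", predicted_idx)
```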

1. Prof. Lai (賴老師):
TVGH trained the DNN model on a machine with 8 GPUs,
and the result was worse than with a single GPU.
This defies expectations: the run time not improving is one thing, but a worse result is hard to understand.
2022/7/6
FL data & label paths (a loading sketch follows below):
train: /home/vghtpe/FL/icd-10/data/CM/train.csv
test: /home/vghtpe/FL/icd-10/data/CM/test.csv
label: /home/vghtpe/FL/icd-10/label/CM.csv
Added liver_CT data
Location: /raid/liver_CT/2017
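A minimal sketch of loading these files with pandas, assuming they are plain CSV; the actual column layout is defined by the preprocessing script, so the printout is just a sanity check.
```
import pandas as pd

# Paths as listed above on the TVGH machine.
TRAIN_CSV = "/home/vghtpe/FL/icd-10/data/CM/train.csv"
TEST_CSV = "/home/vghtpe/FL/icd-10/data/CM/test.csv"
LABEL_CSV = "/home/vghtpe/FL/icd-10/label/CM.csv"

train_df = pd.read_csv(TRAIN_CSV)
test_df = pd.read_csv(TEST_CSV)
labels_df = pd.read_csv(LABEL_CSV)

# Sanity check before handing the files to run_trainer.py.
print("train:", train_df.shape, "test:", test_df.shape)
print("columns:", list(train_df.columns))
print("ICD-10 labels:", len(labels_df))
```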

2022/6/30
1. DGX-1 GPU: V100*1

2. Leadtek GPU: V100*1

2022/6/29
FL model training command (a CLI sketch follows below):
`python run_trainer.py --fp16 --task CM --train`

Input data

Label
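A sketch of the command-line interface implied by the command above. Only --fp16, --task, and --train are taken from the log; the real run_trainer.py in the NTU source code may define these options differently.
```
import argparse

def parse_args():
    # Inferred from "python run_trainer.py --fp16 --task CM --train".
    parser = argparse.ArgumentParser(description="ICD-10 coder trainer (CLI sketch)")
    parser.add_argument("--task", type=str, default="CM",
                        help="coding task, e.g. CM for ICD-10-CM")
    parser.add_argument("--train", action="store_true",
                        help="run the training loop")
    parser.add_argument("--fp16", action="store_true",
                        help="enable mixed-precision training")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"task={args.task} train={args.train} fp16={args.fp16}")
```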

2022/6/23
Chia-Wei Ao
/raid/liver_CT/EDAH_2004-2020_AllData.xlsx

/raid/liver_CT/EDAH_2004-2020_AllData_output.xlsx


Warren Tseng
/raid/liver_CT/EDDH/

/raid/liver_CT/EDCH1/

2022/6/21
NV PoC
Topic / goals / progress / resources / issues
Topic: liver CT + NLP, classification
Data source: liver CT + free-text reports
- [ ] 1. NeMo (Chia-Wei Ao 5/30)
Issue: text classification using the BioMegatron model
https://github.com/NVIDIA/NeMo/blob/main/tutorials/nlp/Text_Classification_Sentiment_Analysis.ipynb
The model is currently not usable.
- [ ] 2. MONAI, liver (Warren Tseng)
Completed by Chunghwa Telecom
https://www.notion.so/DGX-1-3D-Slicer-76f11308dc6e4d5dacdea4b6e63fbaf9
References:
https://www.kaggle.com/datasets/andrewmvd/lits-png
https://www.frontiersin.org/articles/10.3389/fonc.2021.669437/full

- [x] 3. Parabricks (Kuan-Liang Liu)
Next step: complete the gene comparison between normal and cancer cells.
https://developer.nvidia.com/blog/scale-cancer-genome-sequencing-analysis-and-variant-annotation-using-nvidia-clara-parabricks-3-8/
Completed on 6/15.
1. https://docs.google.com/presentation/d/1EECfxkqSlq5NnHYL6_JGD2IwTpU4yGB0YScH9n6YNZw/edit#slide=id.g133fc76a4ca_0_0
2. https://docs.google.com/spreadsheets/d/1ve9bm1nUqiu1Ta2ek9W676mibSEcBUJ1uF-4hR8q8So/edit#gid=1476765300
2022/6/21
1. Leadtek GPU: V100*1
F1: 0.50, Epoch: 60

2. DGX-1 GPU: V100*1
F1: 0.45, Epoch: 10

2022/6/20
Result with 8 GPUs:
F1: 0.479
Epoch: 66


2022/6/18
1. TVGH ran training on multiple GPUs; F1 is still 0.

2. Tai-Liang's (泰良) suggestion: try a larger batch size, starting at 32 per GPU and working downward; if F1 stays 0 within the first 10 epochs, stop that run (see the sketch after this list).
3. Tai-Liang's suggestion: try a single GPU. run_trainer_FL already pins a single GPU; set the batch size to 32 and run `python run_trainer_FL.py`.
4. [The code used is at this URL](https://vghteams-my.sharepoint.com/:f:/g/personal/ycchu5_vghtpe_gov_tw/EqfvggcswEdAjljc4ifV-4wBLz72U_HU5KIIUfqRwacDRw?e=Lh5hg0)
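A rough sketch of the suggested debugging procedure (single GPU, batch sizes tried downward from 32). The --batch_size flag is hypothetical; check run_trainer_FL.py for the real option name, and read F1 from the trainer's own logs to apply the stop-after-10-epochs rule.
```
import os
import subprocess

# Pin a single V100 so run_trainer_FL.py only sees one GPU.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")

for batch_size in (32, 16, 8):
    cmd = ["python", "run_trainer_FL.py", "--batch_size", str(batch_size)]  # hypothetical flag
    print("running:", " ".join(cmd))
    result = subprocess.run(cmd, env=env)
    if result.returncode == 0:
        break  # keep the first batch size that runs to completion
```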
2022/6/14 Yuanchia
1. train

2022/6/10 Yuanchia
1. train

2. val

3. external test

4. running

2022/5/27 Yuanchia
1. Tai-Liang provided a new script. The generated train.csv is the training set, test.csv is the validation set, and external_test.csv is the testing set; external_test.csv is not used during training. [Link](https://vghteams-my.sharepoint.com/:u:/g/personal/ycchu5_vghtpe_gov_tw/EYBMKdpNed5BkUAQPjxz5mUBAqS2NEdnAoa7TliSZfsxrg?e=UZTpir)
PS: 1. Provided preprocessing code for the 8:1:1 split to replace the original preprocessing (a sketch of the agreed rules follows this entry).
Please remove records whose label is blank, as before. Those texts most likely do have corresponding ICD codes but were simply not reported for other reasons, so feeding them in would teach the model the wrong thing.
4. Peifu: before splitting, the data are sorted chronologically from oldest to newest (admission time ascending).
Sky: OK, '住院日期' (admission date) will be changed to TVGH's column name.

3. Tai-Liang: set up the FL connection; test with `nc -v 140.112.30.196 9999`
Sky: will check with the network team on Monday: 10.221.252.173 (DGX1) -> 140.112.30.196 9999
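A minimal sketch of the preprocessing rules agreed above: drop blank-label rows, sort by admission date ascending, then split 8:1:1 in time order into train / validation / external test. The column names ('住院日期', 'label') are placeholders to be replaced with each hospital's actual field names; the real preprocessing code is at the link above.
```
import pandas as pd

def split_8_1_1(df: pd.DataFrame,
                date_col: str = "住院日期",   # replace with the hospital's column name
                label_col: str = "label") -> tuple:
    # Drop records with a blank label: they are likely unreported rather than
    # truly code-free, and would teach the model the wrong thing.
    df = df[df[label_col].notna() & (df[label_col].astype(str).str.strip() != "")]

    # Sort chronologically, oldest to newest (admission date ascending).
    df = df.sort_values(date_col).reset_index(drop=True)

    # 8:1:1 split in time order: train / validation / external (hold-out) test.
    n = len(df)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    train = df.iloc[:n_train]
    val = df.iloc[n_train:n_train + n_val]
    external_test = df.iloc[n_train + n_val:]
    return train, val, external_test

# Example usage (hypothetical input file):
# train, val, ext = split_8_1_1(pd.read_csv("all_records.csv"))
# train.to_csv("train.csv", index=False)
# val.to_csv("test.csv", index=False)           # "test.csv" is the validation set
# ext.to_csv("external_test.csv", index=False)  # not used during training
```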
2022/5/21 Yuanchia
1. After discussion, the professor suggested not removing repeated admissions: treat every admission as a single event, and ask the senior colleague to check how much data that yields. (FEMH (亞東) has roughly 100,000 records over four years and NTU roughly 200,000 over four years, but TVGH only 73,682?)
Sky: found the problem. The data imported into the research dataset had only been updated through 2018/5; I think refreshing it should fix this. Once the new dataset is ready, we can discuss whether to change the approach.
3. After the FL server is set up, we still need to retrain synchronously from scratch (the parameter weights are averaged every 5 epochs, which affects the subsequent training).
Sky: OK, will ask Sheng-Che (聖哲) to assist.
5. Sheng-Che found that one label from TVGH in the union label list looks problematic; it appears to be an ICD-9 code. He will confirm with you how to handle it.
Sky: OK, will ask Sheng-Che to assist.
7. For the formal training later, the data will first be split by time 1:4 to carve out a testing (hold-out) set, and the remaining 4/5 will be used for train + validation.
Sky: please provide the code that needs to be revised.
2022/5/20 Peifu
1. After discussion, the professor suggested not removing repeated admissions: treat every admission as a single event, and ask the senior colleague to check how much data that yields. (FEMH has roughly 100,000 records over four years and NTU roughly 200,000 over four years, but TVGH only 73,682?)
3. After the FL server is set up, we still need to retrain synchronously from scratch (the parameter weights are averaged every 5 epochs, which affects the subsequent training).
5. Sheng-Che found that one label from TVGH in the union label list looks problematic; it appears to be an ICD-9 code. He will confirm with you how to handle it.
7. For the formal training later, the data will first be split by time 1:4 to carve out a testing (hold-out) set, and the remaining 4/5 will be used for train + validation.
2022/5/18 Yuanchia
1. TVGH has 73,682 records in total. About 23,000 of them are records of patients with multiple admissions, so they are not true duplicates; the current logic keeps only the report from the last admission. Doesn't this model handle the time-series aspect? Peifu Chen (陳沛甫): no, it does not handle time series; we also removed the repeated admissions.
2. FEMH's F1 is 0.79 (originally 0.80); NTU's F1 is 0.75 (originally 0.77). ("Originally" refers to train + validation on the hospital's own mixed data; that mixed-data F1 is the higher one.) By Tai-Liang Ho (何泰良).
2022/5/17 Yuanchia
1. TVGH, discharge summaries, 2016~2020
1. Raw data: 73,682
2. After de-duplication: total: 50,640; train: 40,842; test: 9,798 (a de-duplication sketch follows the metrics block below)
3. After data cleaning: total: 48,086; train: 38,744; test: 9,342
```
{
"epoch": 100.0,
"eval_f1": 0.6025973064969661,
"eval_loss": 168.53293600171463,
"eval_precision": 0.7142794491711253,
"eval_recall": 0.5211172969859857,
"eval_roc_auc_score": 0.7605350259521738,
"eval_runtime": 74.3425,
"eval_samples_per_second": 125.648,
"eval_steps_per_second": 0.982
}
```
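The de-duplication above kept only the last admission's report per patient (per the 2022/5/18 note; the later 2022/5/21 decision was to keep every admission as its own record). A minimal pandas sketch, where the patient-ID and admission-date column names are placeholders.
```
import pandas as pd

def keep_last_admission(df: pd.DataFrame,
                        id_col: str = "patient_id",       # placeholder column name
                        date_col: str = "admission_date") -> pd.DataFrame:
    # Sort by admission date, then keep only each patient's most recent record.
    df = df.sort_values(date_col)
    return df.drop_duplicates(subset=id_col, keep="last").reset_index(drop=True)
```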

2022/5/16 by Sheng-Che (聖哲)
1. Regular expression that extracts only the "Discharge Diagnosis" (出院診斷) section from TVGH notes; completed, see the [link](https://drive.google.com/file/d/1rjaaW2L1vmHRhp3Y9GTcgCQ_PhawOEJ8/view?usp=sharing).

2022/5/13 by Peifu (沛甫)
1. Generate the TVGH label list and send it to Peifu when done; [link](https://www.dropbox.com/s/m7cxtiev2zypmb3/preprocess_lables.py?dl=0)
2. Asked Peifu to add a script: a regular expression that extracts only the "Discharge Diagnosis" (出院診斷) section from TVGH notes; completed, see the [link](https://www.dropbox.com/s/amv11h2n212f20i/preprocess_txt.py?dl=0) (a rough sketch follows below).
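A rough sketch of the kind of regular expression used to pull out only the "Discharge Diagnosis" block from a discharge summary. The section-header spelling and the terminating header are assumptions; the actual pattern is in preprocess_txt.py at the link above.
```
import re

# Assumed layout: a "Discharge Diagnosis" header, then free text, ending at the
# next ALL-CAPS section header or the end of the note. The real header names
# and casing in TVGH discharge summaries may differ.
DISCHARGE_DX = re.compile(
    r"Discharge Diagnosis\s*:?\s*(?P<body>.*?)(?=\n[A-Z][A-Z ]{2,}:|\Z)",
    re.DOTALL,
)

def extract_discharge_diagnosis(note: str) -> str:
    match = DISCHARGE_DX.search(note)
    return match.group("body").strip() if match else ""
```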

2022/5/13 Yuanchia
1. NeMo + ICD classification demo video
ref: https://www.youtube.com/watch?v=m2pvdcvQUus&list=PL5uCDOVJqgeuaB0i1MbVS0k2mW83Pmxf5&index=6
2. The text classification example is the closest match, but it is recommended to replace the model with BioBERT or BioMegatron:
ref: https://github.com/NVIDIA/NeMo/blob/main/tutorials/nlp/Text_Classification_Sentiment_Analysis.ipynb
3. NVIDIA FLARE: successfully synchronized the parts trained separately by two clients through the server.
https://www.notion.so/NVIDIA-FLARE-0faaf13495ea4991997b7c7190c48fe2
2022/5/13 Jueni

Currently training on DGX-1. (If multiple GPU cards are available, GPU = 0,1,2,3...; try whether they can run together at the same time. A sketch of pinning GPUs is below.)
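A small sketch of exposing a chosen subset of the DGX-1's GPUs before launching the trainer, using the training command recorded in the 2022/6/29 entry; which GPUs to list is the only choice being made here.
```
import os
import subprocess

# Expose a subset of the DGX-1's GPUs, e.g. all 8 ("0,1,2,3,4,5,6,7") or a
# single card ("0"). CUDA_VISIBLE_DEVICES must be set before CUDA initializes.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0,1,2,3")

subprocess.run(["python", "run_trainer.py", "--fp16", "--task", "CM", "--train"],
               env=env, check=True)
```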

---
DGX1 Package
```
Package Version Editable project location
----------------------------- ------------------------- -------------------------
absl-py 1.0.0
alabaster 0.7.12
alembic 1.7.7
apex 0.1
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
astroid 2.11.2
asttokens 2.0.5
attrs 21.4.0
audioread 2.1.9
Babel 2.9.1
backcall 0.2.0
backports.functools-lru-cache 1.6.4
beautifulsoup4 4.10.0
black 22.1.0
bleach 4.1.0
blis 0.7.5
brotlipy 0.7.0
cachetools 5.0.0
catalogue 2.0.6
certifi 2021.10.8
cffi 1.15.0
chardet 4.0.0
charset-normalizer 2.0.9
click 8.0.3
cloudpickle 2.0.0
codecov 2.1.12
colorama 0.4.4
commonmark 0.9.1
conda 4.11.0
conda-build 3.21.8
conda-package-handling 1.7.3
configparser 5.2.0
coverage 6.3.1
cryptography 36.0.1
cucim 22.2.1
cudf 21.12.0a0+293.g0930f712e6
cugraph 21.12.0a0+95.g4b8c1330
cuml 21.12.0a0+116.g4ce5bd609
cupy-cuda115 9.6.0
cycler 0.11.0
cymem 2.0.6
Cython 0.29.27
dask 2021.11.2
dask-cuda 21.12.0
dask-cudf 21.12.0a0+293.g0930f712e6
databricks-cli 0.16.4
dataclasses 0.8
debugpy 1.5.1
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.4
distributed 2021.11.2
docker 5.0.3
docker-pycreds 0.4.0
docutils 0.16
einops 0.4.1
entrypoints 0.3
executing 0.8.2
expecttest 0.1.3
fastrlock 0.8
filelock 3.4.2
fire 0.4.0
flake8 4.0.1
flake8-polyfill 1.0.2
Flask 2.0.3
fonttools 4.29.1
fsspec 2022.1.0
future 0.18.2
gdown 4.4.0
gitdb 4.0.9
GitPython 3.1.27
glob2 0.7
google-auth 2.6.0
google-auth-oauthlib 0.4.6
graphsurgeon 0.4.5
greenlet 1.1.2
grpcio 1.43.0
gunicorn 20.1.0
HeapDict 1.0.1
huggingface-hub 0.4.0
hypothesis 4.50.8
idna 3.1
imagecodecs 2022.2.22
imageio 2.16.1
imagesize 1.3.0
importlab 0.7
importlib-metadata 4.11.1
importlib-resources 5.4.0
iniconfig 1.1.1
ipykernel 6.9.0
ipython 8.0.1
ipython-genutils 0.2.0
isort 5.10.1
itk 5.2.1.post1
itk-core 5.2.1.post1
itk-filtering 5.2.1.post1
itk-io 5.2.1.post1
itk-numerics 5.2.1.post1
itk-registration 5.2.1.post1
itk-segmentation 5.2.1.post1
itsdangerous 2.0.1
jedi 0.18.1
Jinja2 3.0.3
joblib 1.1.0
json5 0.9.6
jsonschema 4.4.0
jupyter-client 7.1.2
jupyter-core 4.9.1
jupyter-tensorboard 0.2.0
jupyterlab 2.3.2
jupyterlab-pygments 0.1.2
jupyterlab-server 1.2.0
jupytext 1.13.7
kiwisolver 1.3.2
langcodes 3.3.0
lazy-object-proxy 1.7.1
libarchive-c 4.0
libcst 0.4.1
librosa 0.9.0
llvmlite 0.36.0
lmdb 1.3.0
locket 0.2.1
Mako 1.2.0
Markdown 3.3.6
markdown-it-py 1.1.0
MarkupSafe 2.0.1
matplotlib 3.5.1
matplotlib-inline 0.1.3
mccabe 0.6.1
mdit-py-plugins 0.3.0
mistune 0.8.4
mlflow 1.24.0
mock 4.0.3
monai 0.8.1+184.gaf725d75 /opt/monai
msgpack 1.0.3
murmurhash 1.0.6
mypy 0.942
mypy-extensions 0.4.3
nbclient 0.5.11
nbconvert 6.4.2
nbformat 5.1.3
nest-asyncio 1.5.4
networkx 2.6.3
nibabel 3.2.2
ninja 1.10.2.3
nltk 3.7
notebook 6.4.1
numba 0.53.1
numpy 1.22.2
nvidia-dali-cuda110 1.10.0
nvidia-pyindex 1.0.9
nvtx 0.2.4
oauthlib 3.2.0
onnx 1.10.1
openslide-python 1.1.2
packaging 21.3
pandas 1.3.5
pandocfilters 1.5.0
parameterized 0.8.1
parso 0.8.3
partd 1.2.0
pathspec 0.9.0
pathtools 0.1.2
pathy 0.6.1
pep8-naming 0.12.1
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.0.0
pip 22.0.4
pkginfo 1.8.2
platformdirs 2.4.1
pluggy 1.0.0
polygraphy 0.33.0
pooch 1.6.0
preshed 3.0.6
prettytable 3.1.0
prometheus-client 0.13.1
prometheus-flask-exporter 0.19.0
promise 2.3
prompt-toolkit 3.0.26
protobuf 3.19.4
psutil 5.9.0
ptyprocess 0.7.0
pure-eval 0.2.2
py 1.11.0
pyarrow 5.0.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pybind11 2.9.1
pycocotools 2.0+nv0.6.0
pycodestyle 2.8.0
pycosat 0.6.3
pycparser 2.21
pydantic 1.8.2
pydot 1.4.2
pyflakes 2.4.0
Pygments 2.11.2
pylint 2.13.4
pynvml 11.4.1
pyOpenSSL 21.0.0
pyparsing 3.0.7
pyrsistent 0.18.1
PySocks 1.7.1
pytest 6.2.5
pytest-cov 3.0.0
pytest-pythonpath 0.7.4
python-dateutil 2.8.2
python-hostlist 1.21
python-nvd3 0.15.0
python-slugify 5.0.2
pytorch-ignite 0.4.8
pytorch-quantization 2.1.2
pytype 2022.3.29
pytz 2021.3
PyWavelets 1.3.0
PyYAML 6.0
pyzmq 22.3.0
querystring-parser 1.2.4
recommonmark 0.6.0
regex 2022.1.18
requests 2.26.0
requests-oauthlib 1.3.1
resampy 0.2.2
revtok 0.0.3
rmm 21.12.0a0+31.g0acbd51
rsa 4.8
ruamel-yaml-conda 0.15.80
sacremoses 0.0.47
scikit-image 0.19.2
scikit-learn 0.24.0
scipy 1.7.1
Send2Trash 1.8.0
sentry-sdk 1.5.12
setuptools 59.5.0
shellingham 1.4.0
shortuuid 1.0.9
six 1.16.0
smart-open 5.2.1
smmap 5.0.0
snowballstemmer 2.2.0
sortedcontainers 2.4.0
SoundFile 0.10.3.post1
soupsieve 2.3.1
spacy 3.2.1
spacy-legacy 3.0.8
spacy-loggers 1.0.1
Sphinx 3.5.3
sphinx-autodoc-typehints 1.11.1
sphinx-glpi-theme 0.3
sphinx-rtd-theme 0.5.2
sphinxcontrib-applehelp 1.0.2
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 2.0.0
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.5
SQLAlchemy 1.4.32
sqlparse 0.4.2
srsly 2.4.2
stack-data 0.1.4
subprocess32 3.5.4
tabulate 0.8.9
tblib 1.7.0
tensorboard 2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorboardX 2.5
tensorrt 8.2.3.0
termcolor 1.1.0
terminado 0.13.1
testpath 0.5.0
text-unidecode 1.3
thinc 8.0.13
threadpoolctl 3.1.0
tifffile 2022.3.25
tokenizers 0.12.0
toml 0.10.2
tomli 2.0.1
toolz 0.11.2
torch 1.11.0a0+17540c5
torch-tensorrt 1.1.0a0
torchtext 0.12.0a0
torchvision 0.12.0a0
tornado 6.1
tqdm 4.62.3
traitlets 5.1.1
transformers 4.17.0
treelite 2.1.0
treelite-runtime 2.1.0
typed-ast 1.5.2
typer 0.4.0
types-pkg-resources 0.1.3
types-PyYAML 6.0.5
typing_extensions 4.0.1
typing-inspect 0.7.1
ucx-py 0.21.0a0+37.gbfa0450
uff 0.6.9
urllib3 1.26.7
wandb 0.12.1
wasabi 0.9.0
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 1.3.2
Werkzeug 2.0.3
wget 3.2
wheel 0.37.0
wrapt 1.14.0
xgboost 1.5.0
zict 2.0.0
zipp 3.7.0
```
2022/5/13 Yuanchia
1. pip: torch==1.9.1 (torch 1.9.1 is not yet installed on TVGH's IBM AC922)

---
2022/5/12 Peifu (沛甫)
1. For the input text, use only the discharge diagnosis section (出院診斷) for now; other parts of the notes will be used later.
2. The ICD-10-CM code lists should be unioned across the hospitals (a union sketch follows below).
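A minimal sketch of building the union label list across the participating hospitals. The per-hospital file names and the assumption that the first column holds the ICD-10-CM code are hypothetical; TVGH's label file lives at /home/vghtpe/FL/icd-10/label/CM.csv.
```
import pandas as pd

# Hypothetical per-hospital label files (NTU, TVGH, FEMH).
hospital_label_files = ["ntu_labels.csv", "tvgh_labels.csv", "femh_labels.csv"]

codes = set()
for path in hospital_label_files:
    df = pd.read_csv(path)
    codes.update(df.iloc[:, 0].astype(str))  # assume the first column is the ICD-10-CM code

union = sorted(codes)
pd.DataFrame({"code": union}).to_csv("CM_union.csv", index=False)
print(f"{len(union)} ICD-10-CM codes in the union label list")
```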
---
2022/5/12 Yuanchia
1. The NLP source code from National Taiwan University (NTU) has been obtained.
2. The Taipei Veterans General Hospital (TVGH) data has been compiled, including discharge notes, comorbidities, and pathology reports from 2016 to 2020.
3. Model training at TVGH is in progress.
4. NTU FL requirements (TVGH's installed versions in parentheses; a version-check sketch follows the list):
python= (TVGH=3.7.6)
torch=1.9.1 (TVGH=1.3.1)
transformers=4.11.3
numpy=1.20.3
pandas=1.3.3
wandb=0.12.1 (TVGH=0.12.16)
scipy=1.7.1
scikit-learn
tqdm
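A small helper sketch to compare the locally installed versions against NTU's pins listed above (only the pinned packages are checked). It uses pkg_resources from setuptools so it also runs on TVGH's Python 3.7.6.
```
import pkg_resources

# Pins taken from the NTU FL requirements above.
NTU_PINS = {
    "torch": "1.9.1",
    "transformers": "4.11.3",
    "numpy": "1.20.3",
    "pandas": "1.3.3",
    "wandb": "0.12.1",
    "scipy": "1.7.1",
}

for name, wanted in NTU_PINS.items():
    try:
        installed = pkg_resources.get_distribution(name).version
    except pkg_resources.DistributionNotFound:
        installed = "not installed"
    flag = "OK" if installed == wanted else "MISMATCH"
    print(f"{name:12s} wanted {wanted:8s} installed {installed:14s} {flag}")
```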
