# Internship notes
## Preinternship analysis
### Resources
[LUNA16 Challenge](https://luna16.grand-challenge.org/)
### Challenge
- Nodule extraction
- false positive reduction
### TO DO
- [x] Dataset analysis - open volume and mask using ITK-SNAP or Slicer
- [ ] Read the solution papers for Nodule extraction
- [ ] Read the solution papers for false positive reduction
### Tools for 3D image segmentation
- Two tools were proposed: ITK-SNAP and 3D Slicer. A comparison of the two can be viewed [here](https://www.imaios.com/en/resources/blog/itk-snap-vs-3d-slicer).
1. [ITK-SNAP (3.8.0)](http://www.itksnap.org/pmwiki/pmwiki.php): focuses only on segmentation and is therefore faster. Supports DICOM.
1. Download the 3.8.0 tar.gz file from the official [link](https://sourceforge.net/projects/itk-snap/files/itk-snap/3.8.0/).
2. install QT libraries ```sudo apt install qt5-default```
3. install libpng12-0
```
sudo add-apt-repository ppa:linuxuprising/libpng12
sudo apt update
sudo apt install libpng12-0
```
4. Go to the directory bin/ and run ```sudo ./itksnap```
2. [3D Slicer](https://www.slicer.org/): has more features and is better known, but has a steeper learning curve.
1. Download from the official website
2. Extract and run slicer from the root directory of the extracted folder
### Dataset analysis
- Dataset (in DICOM format) extracted from the LIDC-IDRI database. To get the dataset:
- First download the [NBIA data retriever](https://wiki.cancerimagingarchive.net/display/NBIA/NBIA+Data+Retriever+FAQ#:~:text=DEB%C2%A0(tested%20on%20Ubuntu)%0ATo%20run%20this%20file%2C%20type%20the%20following%20at%20the%20command%20prompt%3A%0Asudo%20%2DS%20dpkg%20%2Dr%20nbia%2Ddata%2Dretriever%2D4.3.deb%3Bsudo%20%2DS%20dpkg%20%2Di%20nbia%2Ddata%2Dretriever%2D4.4.deb) since we will be dealing with a TCIA file.
    - To install the NBIA Data Retriever: ```sudo dpkg -i ./nbia*```
    - Go to the *[Cancer Imaging Archive](https://www.cancerimagingarchive.net/nbia-search/?CollectionCriteria=LIDC-IDRI)* website and filter by CT.
    - Download the query (in TCIA format) and open it (by double-clicking it directly); it should open in the NBIA Data Retriever.
- **Analysis**: each folder represents a 3D image. Inside each 3D image folder, we have multiple **DICOM**(.dcm) files representing every single 2D image slice and an **XML** document containing Radiologist Annotations/Segmentations.
- [LUNA16 Dataset](https://luna16.grand-challenge.org/Data)(in mhd format)
- Why this dataset?
        - LIDC offers more scans, but it seems that the scans removed for the LUNA16 subset are mostly those with inconsistencies in things like slice distance, or those that are very coarse (slice thickness >= 3mm).
- On the LUNA16 site there's [code](https://rumc-gcorg-p-public.s3.amazonaws.com/f/challenge/71/2284573c-becc-4582-975f-633409155514/SimpleITKTutorial.pdf) for reading the .mhd files, which kind of removes the .dcm advantage.
- For downloading the dataset, go to the [link](https://luna16.grand-challenge.org/Download/) and download all the files and extract the zip files.
        - For some reason the zip archive shows as corrupted on Lambda. It is not; extract it on another device.
    - **Data overview**: The original LUNA16 data consist of the following:
        - subset0.zip to subset9.zip: 10 zip files which contain all CT images
- annotations.csv: csv file that contains the annotations used as reference standard for the 'nodule detection' track
- sampleSubmission.csv: an example of a submission file in the correct format
- candidates_V2.csv: csv file that contains the candidate locations for the ‘false positive reduction’ track
### Top performers<font size="3"> ([notes](https://docs.google.com/document/d/1T-gYxPhDaPd9TGXwY9P4GHkG1sWoPVqRiwTrUG07p_w/edit?usp=sharing)) </font>
1. [3DCNN for Lung Nodule Detection And False Positive Reduction](https://rumc-gcorg-p-public.s3.amazonaws.com/f/challenge/71/8ac994bc-9951-420d-a7e5-21050c5b4132/20180102_081812_PAtech_NDET.pdf) (1st in both challenges)
- Motivated by [Feature Pyramid Networks](https://jonathan-hui.medium.com/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c)
- use two [3DCNN classifiers](https://keras.io/examples/vision/3D_image_classification/) to filter out false positives among the candidates detected by the detector network
    - uses the [focal loss function](https://www.analyticsvidhya.com/blog/2020/08/a-beginners-guide-to-focal-loss-in-object-detection/#:~:text=comes%20to%20rescue.-,Focal%20loss%20explanation,focusing%20parameter%20%CE%B3%E2%89%A50.) instead of cross-entropy to deal with imbalanced classes (see the sketch after this list)
2. [3D Deep Convolution Neural Network Application in Lung Nodule Detection on CT Images](https://rumc-gcorg-p-public.s3.amazonaws.com/f/challenge/71/bea787d4-5cb3-4669-a48b-caa0a3048d66/20171128_034629_LUNA16FONOVACAD_NDET.pdf) (3rd in Nodule, 2nd in false positive)
3. [Deep Convolution Neural Networks for Pulmonary Nodule Detection in CT imaging](https://rumc-gcorg-p-public.s3.amazonaws.com/f/challenge/71/101b150c-88c5-4f9d-a374-d8d3fc166aff/20171222_073722_JianpeiCAD_NDET.pdf) (2nd in Nodule, 5th in false positive)
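A minimal focal-loss sketch for the binary case, as referenced above; the alpha and gamma values are common defaults, not necessarily the winning team's settings:
```
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits: raw model outputs, targets: 0/1 float labels of the same shape."""
    # standard BCE per sample; p_t = probability assigned to the true class
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)
    # (1 - p_t)^gamma down-weights easy examples, focusing training on hard ones
    return (alpha * (1 - p_t) ** gamma * bce).mean()
```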
---
## 6 March 2023
### Starting up<font size="3"> ([Rough layout](https://imgur.com/a/oKYXzfT)) </font>
- Logged on to the laptop; connected Teams, Outlook, and Fortinet (the VPN to use when not on the local network); installed the IDE and work environments ([setup](https://almdt-my.sharepoint.com/:w:/r/personal/vankhoa_median365_com/_layouts/15/Doc.aspx?sourcedoc=%7B68A32DA1-17D0-42E0-89DD-7D32C80144DB%7D&file=Lambda%20report.docx&action=default&mobileredirect=true)).
- Rough idea about the **schedule**.
- Read papers and documents - 2 weeks
- Replicate solutions and verify - 2 weeks
- Maybe start working with the internal data
    - In June, start writing a report, or a paper if I implement a new idea.
- Agenda:
- Focusing on false positive reduction
- A basic layout of the architecture was also discussed.
### Resources
- Tools for easy collaboration and sharing
    - [Trello](https://trello.com/b/wIEohO1f/ai-data-science): weekly meetings, slides, notes for each member. Sign in using "yash.agarwalla@mediantechnologies.com".
- [MonADP](https://mon.adp.com/static/redbox/login.html?TYPE=33554433&REALMOID=06-000e81a2-1e5a-10db-a395-e14c0b810000&GUID=&SMAUTHREASON=0&METHOD=GET&SMAGENTNAME=-SM-Uz3MmFlMtYXbbs4PyUg%2b8Qpa4qQTgAzkbzALwDsE1viMXj0fBx4BsHy3OVrQYonU&TARGET=-SM-HTTPS%3a%2f%2fmon%2eadp%2ecom%2fredbox%2flogin%2ehtml%3fREASON%3dERROR_TIMEOUT): Login successful
- [Teams](https://teams.microsoft.com/_#/modern-calling/) and [Outlook](https://outlook.office365.com/mail/): Sign in using "yash@median365.com"
- [Timesheet](https://timesheet.mediantechnologies.com/index.php?page=accueil#introduction): Ask Anne for login
- [heh](http://heh.median.cad/): Git hosting website by Median
- [Service portal](https://mediantechnologies.service-now.com/sp): Make service request to the IT dept.
- Related to learning:
- [Medical Image Topics by *The AI Summer*](https://theaisummer.com/medical-image-coordinates/): Browse through the website and try to understand the concepts.
- [MONAI](https://github.com/Project-MONAI/MONAI/tree/66422f805326a1034e040a9eacc2c8220717052e): PyTorch-based, open-source framework for deep learning in healthcare imaging, part of PyTorch Ecosystem. Check out some [tutorials](https://github.com/Project-MONAI/tutorials).
- Try to learn how to use it to train models, load data etc.
- [SimpleITK](https://simpleitk.org/): multi-dimensional image analysis.
        - Try to learn and implement it: load data, preprocess the data, and resample.
- [Preprocessing Library by Tutor](http://heh.median.cad/iBiopsyDS/preprocessing_chain/src/branch/master):
        - Try to understand how it is implemented, then implement it.
- To install
```
pip install -e git+http://heh.median.cad/iBiopsyDS/preprocessing_chain.git#egg=volproc
```
### TODO
- [x] Try to install ITK-SNAP and Slicer to visualize the 3D images
- [ ] Browse [The AI Summer](https://theaisummer.com/medical-image-coordinates/)
- [ ] Finish last week's work
- [ ] Analyse the code from the github repo by the top performer team
- [ ] Try to get familiar with MONAI and SimpleITK
---
## "The AI Summer" Articles<font size="3"> ([notes](https://docs.google.com/document/d/10mhw-6zwg0RIuj3yR9SN2deVc3B-53FEfoD6f8GqDFs/edit?usp=sharing)) </font>
### [Medical Image Coordinates](https://theaisummer.com/medical-image-coordinates/)
- **Coordinate systems**: World, Anatomical (axial, sagittal, and coronal planes), Medical (voxel)
- Moving between worlds (**affine transformations**: a geometric map to voxel space)
- Moving between modalities (by affine and inverse affine transformations)
- **DICOM:** File format and network layer
    - To handle DICOM files and folders, there is a Python library called [pydicom](https://pydicom.github.io/pydicom/stable/)
    - To convert to NIfTI format, we use [dcm2niix](https://github.com/rordenlab/dcm2niix)
    - Converting to NIfTI makes it easier to import the dataset. Once we have the NIfTI (.nii) file, we can use the [nibabel](https://nipy.org/nibabel/) library to load the data.
    - We can transform to the RAS, or canonical, coordinate system using the function ```nib.as_closest_canonical``` (sketch below)
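A quick sketch of loading a NIfTI file and reorienting it to canonical RAS with nibabel; the file name is a placeholder:
```
import nibabel as nib

img = nib.load("scan.nii.gz")
print(nib.aff2axcodes(img.affine))         # current axis codes, e.g. ('L', 'P', 'S')
canonical = nib.as_closest_canonical(img)  # reorder/flip axes to RAS
print(nib.aff2axcodes(canonical.affine))   # ('R', 'A', 'S')
data = canonical.get_fdata()
```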
### [Preprocessing and Augmentations](https://theaisummer.com/medical-image-processing/)
- Medical image resizing (down/up-sampling)
- Medical image rescaling (zoom- in/out)
- Medical image rotation
- Medical image flipping
- Medical image shifting (displacement)
- 3D cropping
- Clip intensity values (outliers)
- Intensity normalization in medical images
- Elastic deformation
## 13 March 2023
### TO DO
- [x] convert into cubes/patches, individually or in groups
- [x] Save into format "serie_uid_detid_SxSxS.nii.gz"
- [x] clip (-600 to some number that makes sense) - done using
```patch_clipped = sitk.IntensityWindowing(patch, -1000, -600)```
- [x] normalize the value to 0-1. No problem visualizing in snapITK or slicer - done using
```
import SimpleITK as sitk

# rescale the clipped intensities into [0, 1]
# (lung_clip is the clipped volume from the step above)
rescaler = sitk.RescaleIntensityImageFilter()
rescaler.SetOutputMaximum(1.0)
rescaler.SetOutputMinimum(0.0)
lung_rescaled = rescaler.Execute(lung_clip)
```
- [x] map world coordinates to index coordinates
- [x] use a different interpolator while resampling as linear is too basic (used cubic spline)
- [x] use multithreading for preprocessing, maybe gpu acceleration as well?
- [ ] for data augmentation, use MONAI or try to understand how the winners did it.
- [ ] try to finish understanding preprocessing this week so that I can focus more on deep learning next week.
- [ ] Try to run the evaluation script
- [x] [resample](http://heh.median.cad/iBiopsyDS/preprocessing_chain/src/branch/master/volproc/preprocessor/resample.py) volumes ([multithreading](http://heh.median.cad/iBiopsyDS/preprocessing_chain/src/branch/master/volproc/preprocessor/preprocessor.py)) -> get index coordinates using the sitk library -> extract patches (sketch below)
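A sketch of the resample -> world-to-index step with SimpleITK; paths and the coordinate are placeholders, and the actual resampling is done by the volproc chain:
```
import SimpleITK as sitk

img = sitk.ReadImage("scan.mhd")
# ... resample to 1x1x1 mm here (volproc), then persist the result ...
sitk.WriteImage(img, "scan_resampled.nii.gz")

# annotations.csv stores world (physical) coordinates in mm;
# TransformPhysicalPointToIndex maps them to voxel indices
world_xyz = (-56.2, 86.3, -115.9)
index_xyz = img.TransformPhysicalPointToIndex(world_xyz)
```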
### Meeting notes:
- Why does annotations.csv have fewer records than the total positive candidates? Overlap: an annotated nodule may have a large diameter, so it can correspond to more than one entry in the candidates file.
- 888 CT scans out of the 1018 in the LIDC-IDRI dataset were used, because scans had to fulfil a certain condition, e.g. a slice-thickness limit.
- Resample first, use WriteImage to save the resampled images, then convert to index coordinates using the sitk functions.
- GPU only needed while training
- Note: when we convert from world to index, x->z and z->x
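That swap comes from the array indexing, not from the image itself; a small sketch (path and coordinate are placeholders):
```
import SimpleITK as sitk

img = sitk.ReadImage("scan_resampled.nii.gz")
arr = sitk.GetArrayFromImage(img)  # numpy array is indexed as [z, y, x]
x, y, z = img.TransformPhysicalPointToIndex((-56.2, 86.3, -115.9))
voxel = arr[z, y, x]               # note the reversed index order
```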
## 20 March 2023
### TO DO
- [x] Kernel error? Use plain Python instead of Jupyter. Kind of fixed for now by optimising the code
- [x] Try/catch for patches out of bounds
- [ ] If time permits, compare the candidate and annotations files
- [x] When a patch is out of bounds, to check whether the candidate mass is important or not, just use the "class" feature: if the value is 1, it is a nodule and should be kept.
- [x] For class = 0, check whether it resides in the [LUNA16 segmented lung file](https://zenodo.org/record/3723295/files/seg-lungs-LUNA16.zip?download=1); if yes, keep it as well (see the sketch after this list).
- [x] First resample the segmented lung files
- [x] check if the sizes of the resampled segmented lung = resampled original scans
- [x] Add padding for the nodules that are out of bounds; save them separately to get them checked
- [ ] Try to build a model using Monai after.
- [x] Also, as soon as we have access to DataLake, share samples with the supervisor.
- **gaia data folder**: /data01/yash
- **gaia code folder**: /nfs/homes/admyag@median.cad
- Use SCP for upload and download.
- [ ] Try to read the research paper: [Multi-Scale Gradual Integration CNN for False Positive Reduction in Pulmonary Nodule Detection](https://arxiv.org/pdf/1807.10581.pdf)
- [ ] How LUNA images are created
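A sketch of the class-0 filter from the list above: keep a candidate only if its voxel falls inside the LUNA16 lung segmentation (file name and coordinate are placeholders):
```
import SimpleITK as sitk

lung = sitk.ReadImage("seg-lungs-LUNA16/some_seriesuid.mhd")
cand_world = (-56.2, 86.3, -115.9)                # candidate world coordinates
x, y, z = lung.TransformPhysicalPointToIndex(cand_world)
keep = sitk.GetArrayFromImage(lung)[z, y, x] > 0  # non-zero label = inside the lung
```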
### Notes:
- In this meeting, I finally got a chance to play with the already segmented file from the LUNA16 dataset.
- A few things I observed:
    - The intensity is normalised to 0-5 (how? maybe each lung section gets a different number)
    - Even though there were no gaps (I did not check every image), the spacing was inconsistent, so it was important to resample. I chose 1mm x 1mm x 1mm as the new spacing so that it matches the resampled original scans, which makes operations on them easy.
    - After resampling, the sizes of the resampled segmentations were consistent with the resampled originals.
    - For the detid gn4VPJRA at point (136, 237, 281), we get different values in ITK-SNAP and 3D Slicer, maybe because the directions are different.
## 23 March 2023
### TO DO
- [x] Change the patch sizes to 40, 30, and 20, because we only care about diameters of 40mm and less
- [x] change the file structure to **seriesuid>det_id>patches** and store the patches in numpy zip (.npz) files for easy access (see the sketch after this list)
- [ ] understand the coordinate systems
- [x] [MONAI Coordinate systems](https://docs.monai.io/en/stable/transforms.html#orientationd)
- [ ] [Nibabel Coordinate systems](https://nipy.org/nibabel/coordinate_systems.html)
- [ ] [Slicer](https://www.slicer.org/wiki/Coordinate_systems)
- [x] make your functions into a library so that we can just pip install it and the notebook stays clean
    - [x] clean the functions so that they take a path as a parameter rather than a dataframe
    - [ ] add a description for each function
- [ ] get py files and upload to github
- [x] **get some guidance**
- [ ] keep, let's say, 10 files from each category for visualisation in a different folder
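A sketch of the seriesuid>det_id>patches layout with compressed numpy files; the ids and patch arrays are stand-ins:
```
import os
import numpy as np

seriesuid, det_id = "some_seriesuid", "gn4VPJRA"
patch20 = np.zeros((20, 20, 20), dtype=np.float32)  # stand-in patches
patch30 = np.zeros((30, 30, 30), dtype=np.float32)
patch40 = np.zeros((40, 40, 40), dtype=np.float32)

out_dir = os.path.join(seriesuid, det_id)
os.makedirs(out_dir, exist_ok=True)
# one .npz per detection holding the three patch sizes
np.savez_compressed(os.path.join(out_dir, "patches.npz"),
                    small=patch20, medium=patch30, large=patch40)

# reading back is a dict-like lookup
patch = np.load(os.path.join(out_dir, "patches.npz"))["medium"]
```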
### Notes:
#### Coordinate systems of ITK-SNAP and 3D Slicer are different
- AOPI and IAS
- For now, I am doing clipping and scaling directly in the getPatches function, as it can be rescaled back during online data augmentation with MONAI.
- For mapping the classes to the patch, I created a combined df with columns *detid, small, medium, large, class*.
- I started to proceed with MONAI for data augmentation. I can deal with either NumPy arrays or sitk images. I chose NumPy because:
    - MONAI provides a set of pre-built transformations, such as RandRotate90, RandFlip, and RandZoom, that are based on NumPy arrays. These transformations are optimised for NumPy arrays and can easily be chained together using MONAI's Compose transformation (see the sketch after these notes). In addition, NumPy arrays can easily be saved and loaded, which makes it easy to save and reuse augmented data.
    - SimpleITK can also be used for data augmentation in MONAI by converting the image to a SimpleITK image and applying transformations with SimpleITK's built-in functions. However, this approach is more complicated than using NumPy and may require additional steps to convert between SimpleITK and NumPy arrays.
    - However, we lose origin information when converting to a NumPy array and back to an image, so we need to set the origin manually. Plus, sitk has its own functions to augment data.
    - But ultimately we are training with tensors, so that information gets lost anyway. And **how do we store the augmented images: locally, or just use online augmentation?**
- Are we still going to use the segmented Lungs later?
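A minimal online-augmentation sketch with MONAI on a NumPy patch, as mentioned in the notes above; the transform choices and probabilities are illustrative, not final settings:
```
import numpy as np
from monai.transforms import Compose, EnsureChannelFirst, RandFlip, RandRotate90, RandZoom

augment = Compose([
    EnsureChannelFirst(channel_dim="no_channel"),  # (x, y, z) -> (1, x, y, z)
    RandRotate90(prob=0.5, spatial_axes=(0, 1)),
    RandFlip(prob=0.5, spatial_axis=0),
    RandZoom(prob=0.5, min_zoom=0.9, max_zoom=1.1),
])

patch = np.random.rand(40, 40, 40).astype(np.float32)  # stand-in patch
augmented = augment(patch)
```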
## 27 March 2023
### TO DO
- [x] Make the library and push it to heh; for the *resample* package, the setup.py already exists, so link it directly in the requirements.txt file with the appropriate syntax
- [x] Understand the IAS and AOPI orientations. MONAI always uses IAS, so try to keep information like spacing and directions for comparison.
- [ ] Also, try to visualise the datalog
- [ ] Always use MONAI dictionary transforms; it also offers list-based ones. For debugging, use PyCharm; it can also show the values stored in different environment variables.
- [ ] Augmentation is going to be completely online, so there is no need to save the data. Plus, we won't need the origin either, so do not worry about saving that.
- [ ] For training, we are going to use a simple 3D CNN model like ResNet. Luckily, my tutor has already done the [work](http://heh.median.cad/vankhoa/3dDetection/src/branch/master/classif3d) and we can continue from that.
- [ ] Build the complete workflow for just one patch size; add more later.
- [ ] No need to save information on the dataset split either.
- [ ] Try to download the tutor's complete package; in Dataset->script, add your own functions and make it a package by adding a setup.py and its own repo.
- [x] **Presentation this week**:
    - [x] What I am currently doing.
    - [x] How I have organised the data.
- [x] If my repo is done, I can share it as well.
### Notes
- **Orientationd**:
    - A spatial transform in the MONAI (Medical Open Network for AI) library that reorients a 2D or 3D image or volume to a specified axis convention. It is a dictionary-based version of the Orientation transform that can operate on multiple data fields at once.
    - The transform assumes that the input data is in channel-first format, so it first transposes the data to channel-last format before applying the transform, and then transposes it back to channel-first format. **Channel-first format** means the array is shaped (channels, x, y, z, ...), e.g. in our case (3, x, y, z).
- The transform should be used before any anisotropic spatial transforms, because these transforms may introduce rotation-dependent artifacts if the orientation of the input data is not normalized first. **Anisotropic spatial transforms** are spatial transformations that change the scale or aspect ratio of the spatial dimensions of an image or volume. In contrast, isotropic spatial transforms preserve the scale and aspect ratio of the spatial dimensions.
- **Image orientation**:
- **IAS** stands for "inferior-anterior-superior" and refers to the order in which the three axes of the image are arranged, with the z-axis pointing inferior (toward the feet), the y-axis pointing anterior (toward the front), and the x-axis pointing left-right. In the IAS convention, the origin of the image coordinate system is located in the upper left corner of the image.
- **AOPI** (or RAS) stands for "right-anterior-superior" and refers to a different orientation convention, in which the x-axis points right, the y-axis points anterior, and the z-axis points superior (upward). In this convention, the origin of the image coordinate system is located in the lower left corner of the image.
- **Monai uses the IAS** convention by default, and it's important to keep this in mind when comparing images with different orientations. The orientation information of an image includes not only the axis order (IAS or AOPI) but also the direction of the axes and the spacing between the voxels.
- When comparing images with different orientations, we should pay attention to the orientation information, including the spacing and the direction of the axes, and make sure to convert the images to a common orientation before performing any analysis or processing. One way to do this is to use a transform that can adjust the orientation of the image, such as the **Orientationd** transform in MONAI.
    - To check the orientation of the image, we can use the nibabel library to extract the orientation information from the affine matrix, and the spacing information from the image header. In our case, we got ```[[-1. 0. 0.][ 0. -1. 0.][ 0. 0. 1.]]```. This **orientation matrix** stores any transformation applied to the original image. Here it translates to a flip of the x and y axes, which is equivalent to a 180-degree rotation around the z axis. This is exactly the transformation that converts between RAS and LPS coordinates; since nibabel affines map voxel indices to RAS world coordinates, the voxel axes here run along L, P, and S, i.e. the image is stored in LPS orientation.
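A quick sanity check of that affine with nibabel confirms the axis codes:
```
import numpy as np
import nibabel as nib

aff = np.diag([-1.0, -1.0, 1.0, 1.0])  # the orientation matrix above, as a 4x4 affine
print(nib.aff2axcodes(aff))            # ('L', 'P', 'S')
```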
## 31 March 2023
### TO DO
- [x] use a different resampling technique for the mask
- [x] Instead of saving inside the Compose, try to save manually after the transform, and play a bit more with MONAI - this worked
- [ ] reorganise the folders better
- [ ] Put a few samples on GAIA as well.
### Notes:
- Use the segmentation option in the Slicer description to show segments on top of the volume for better visualisation
- We will use label-Gaussian interpolation (sitk.sitkLabelGaussian) for the mask instead of bicubic, as bicubic might remove the mask entirely (see the sketch after these notes)
- The mask is just used for understanding for now, since it is harder to obtain in a real-life situation.
- "Forgot to ask": nib vs sitk, which one to use!
- Note: After comparing the mask to the original image, I found an error in calculating the index coordinates: we needed to flip the x coordinate. Afterwards, I extracted the patches again.
- Note again: my patches were okay; it was the mask that was flipped. With the new mask, I can confirm that the patches are correct.
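A sketch of the mask resampling with the label-Gaussian interpolator, as noted above; the path is a placeholder and the target spacing is the same 1x1x1 mm used for the volumes:
```
import SimpleITK as sitk

mask = sitk.ReadImage("mask.mhd")
new_spacing = (1.0, 1.0, 1.0)
# new size so that size * spacing (physical extent) stays the same
new_size = [int(round(sz * sp / ns))
            for sz, sp, ns in zip(mask.GetSize(), mask.GetSpacing(), new_spacing)]
resampled = sitk.Resample(mask, new_size, sitk.Transform(),
                          sitk.sitkLabelGaussian, mask.GetOrigin(), new_spacing,
                          mask.GetDirection(), 0, mask.GetPixelID())
```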
## 4 April 2023
### TO DO
- [x] Confirm if the mask, patch and full volume are consistent
- [x] Do augmentation now for three patches, based on the diameter of the nodules: <5mm, 5-25mm, >25mm.
### Notes
- To check the diameters, I needed to combine the candidates and annotations files. I discovered that the only way to connect them is by comparing coordinates. However, the coordinates are not exact, so a small threshold has to be allowed when comparing.
- Here we can clearly see why the candidates file has more class=1 entries than the annotations file has records: several candidates fall in the same ground-truth region.
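A sketch of that threshold-based matching; the column names follow the LUNA16 csv files, and the tolerance is illustrative:
```
import numpy as np
import pandas as pd

cand = pd.read_csv("candidates_V2.csv")
ann = pd.read_csv("annotations.csv")

def matched_diameter(row, tol=1.0):
    # a candidate matches an annotation if it lies within the annotated
    # radius (plus a small tolerance) in the same scan
    same = ann[ann.seriesuid == row.seriesuid]
    d = np.sqrt((same.coordX - row.coordX) ** 2 +
                (same.coordY - row.coordY) ** 2 +
                (same.coordZ - row.coordZ) ** 2)
    hit = d <= same.diameter_mm / 2 + tol
    return same.loc[hit, "diameter_mm"].max() if hit.any() else np.nan

cand["diameter_mm"] = cand.apply(matched_diameter, axis=1)
```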
## 6 April 2023
### TO DO
- [x] make some examples of augmentations: 2 patches for each size
- [x] make the split based on uid, 10 folds
- [ ] train model with resnet50 or vgg, CE or focal loss
- [ ] k-fold
- [ ] online augmentation
- [ ] initial performance analysis
- [ ] using all the subsets
- [ ] do validation on all candidates, build result table
- [ ] create metrics: AUC and FROC, LUNA16-style (sketch below)
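For the AUC part, scikit-learn is enough; FROC would follow the official LUNA16 evaluation script. The labels and scores below are stand-ins:
```
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # stand-in labels
y_score = [0.1, 0.4, 0.35, 0.8]  # stand-in predicted probabilities
print(roc_auc_score(y_true, y_score))
```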
## 13 April 2023
### TO DO
- [ ] Organise the data into a Python library ASAP; this will help organise the workflow better
- [ ] Update the model every 2-5 epochs rather than every epoch to speed up; play with batch_size; limit the number of inputs for training; don't run all the batches, fix a number of iterations to speed up. But since we would be missing out on our precious positive labels, try to experiment with the sampler. Try to run one epoch and see if it works.
### Meeting notes
- Don't focus on testing right now; create a library with the specified structure; subsets are in fact folds; the json has the paths; we can use a mode option, e.g. train, validation only, inference
- Validation loss is okay, but AUC will work better for selecting the best model for each fold.
## 18 April 2023
### TO DO
- [x] Try the [WeightedRandomSampler](https://towardsdatascience.com/pytorch-basics-sampling-samplers-2a0f29f0bf2a) as our sampler (see the sketch at the end of this list).
- [ ] Install Office for Ubuntu to open csv files with Excel more easily
- [x] Instead of dividing folds to a list use the whole dataset instead.
- [ ] validate in every 10 epochs
- [ ] use the code provided on Teams for more info.
- [x] test if pausing and resuming train works.
- [ ] use an `if __name__ == "__main__":` guard when running code inside the module
- [ ] try for 1 fold first, calculate auc, recall, f1
- [ ] Dataloader patch augmentation works
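A sketch of class-balanced sampling with WeightedRandomSampler, as referenced at the top of this list: each sample is weighted by the inverse frequency of its class, so the rare positives are drawn much more often. The labels are stand-ins:
```
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = np.array([0] * 990 + [1] * 10)      # stand-in for the candidate classes
weights = 1.0 / np.bincount(labels)[labels]  # per-sample inverse class frequency
sampler = WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                num_samples=len(labels), replacement=True)

dataset = TensorDataset(torch.randn(len(labels), 1), torch.as_tensor(labels))
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```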
### TO DO
- [ ] add softmax layer
- [x] better folder organisation - based on folds (each fold will have its own csv files)
- [x] use tensorboard to monitor
- [x] validation dataset must not include augmentations
- [ ] store a yaml config for each training run
- [ ] GOAL: improve recall!
## 9 May 2023
### TO DO
- [x] loguru
- [ ] replace print with logging (see the loguru sketch after this list)
- [ ] investigate what is wrong
- [ ] decrease validation
- [ ] implement all the metrics
- [ ] do not automatically use the last checkpoint - use arguments
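A minimal loguru sketch for replacing print, as referenced above; the file name and rotation are illustrative:
```
from loguru import logger

logger.add("train_{time}.log", rotation="10 MB")  # extra file sink next to stderr
logger.info("epoch {} done, val AUC = {:.3f}", 3, 0.91)
```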
## 22 May 2023
- [ ] hard negative mining to be done later
- [ ] multiscale patches - read the top performer's paper to understand how they use patches of different sizes to train the model.
- [ ] run the evaluation script and get the results
- [ ] mentor shared a folder with his outputs, maybe try to understand them
## 15 June 2023
- [x] rescale using monai convolution
- [x] decrease the number of features
- [x] use densenet121 then
- [ ] use BCELoss from torch
- [x] optimise lr and optimizer
- [x] add a [learning rate scheduler](https://machinelearningmastery.com/using-learning-rate-schedule-in-pytorch-training/) (sketch at the end of this entry)
- full vol:/mnt/datalake/DS-lake/LCS_MVP0/Task116_MVP0_pop1/raw_splitted/imagesTs
- mask: /mnt/datalake/DS-lake/LCS_MVP0/Task116_MVP0_pop1/raw_splitted/labelsTs
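A sketch of wiring a learning-rate scheduler around a torch optimizer, as referenced in the list above; the model, step size, and gamma are placeholders:
```
import torch

model = torch.nn.Linear(10, 2)  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... train one epoch here (forward, loss, backward, optimizer.step()) ...
    scheduler.step()            # halve the lr every 10 epochs
```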