# Segmentation of individual mouse embryonic cell nuclei using CNNs

*Written by: Rael Huizenga 4866886, Martijn Lieftinck 4862856, Aman Singhvi 4772059, Timo Warmenhoven 4958012*

Convolutional Neural Networks (CNNs) have shown great promise in bioimage analysis. These algorithms use deep learning to automatically identify and process regions of interest within complex biological images. One application where CNNs have proved to be very precise is nuclear segmentation in time-series 3D microscopic images.

Over a 4-week period, our group worked towards reproducing the paper _'3D convolutional neural networks-based segmentation to acquire quantitative criteria of the nucleus during mouse embryogenesis'_, published in 2020 by Yuta Tokuoka, Takahiro G. Yamada et al. It proposes a new and supposedly superior way of segmenting nuclei in time-series 3D microscopic images. Our project involved understanding the proposed network (QCANet) and reproducing it from scratch, based on the description in the paper.

But let's take a step back: what is a CNN, and how does it work? A CNN, or Convolutional Neural Network, is a type of deep learning model designed specifically for analyzing visual data such as images. It consists of different types of layers, including convolutional layers, pooling layers, and fully connected layers. The convolutional layers extract features from images, and the network uses these learned features to complete tasks on new data. This makes CNNs very strong at recognizing patterns in images, a capability that is very useful for our task.

Figure 2 in the paper shows the inner workings of QCANet. During training, QCANet takes a 3D microscopic image of a mouse embryo. This image is then fed to two different CNNs that perform different tasks in the segmentation process.
The first is the Nuclear Segmentation Network (NSN), which focuses on segmenting the embryonic cell nuclei from the image background. This network employs convolutional layers to extract features relevant to distinguishing between the foreground (cell nuclei) and the background. The second network is the Nuclear Detection Network (NDN), which aims to identify individual cells in the image. This network utilizes convolutional layers to identify and mark the center of each cell present in the image. In post-processing, the knowledge of the individual cell locations from the NDN is combined with the foreground/background segmentation from the NSN. This gives an output image in which all the cells are segmented and individually labeled (as depicted in Figure 2, 'Output Image').

<center> <img src="https://hackmd.io/_uploads/BkqKZIKkC.png" alt="drawing" width="500"/> </center> <p></p>

The goal of the project was to recreate Figure 3 from the paper, which is displayed below. The figure has six columns. The first column shows the raw data: the 3D images used as training and test set. The second column displays the ground truth, which was generated manually and used to train the model. The third column shows the results of a 3D U-Net after being trained on the same dataset, and the fourth column shows the results of a 3D Mask R-CNN after being trained on the same dataset. The fifth column shows the results of QCANet without nuclear detection, and finally column six shows the results of QCANet as a whole. Using CNNs for bioimage segmentation tasks is clearly not new; the paper claims that QCANet has superior performance in this segmentation task when compared with its precursors.

<center> <img src="https://hackmd.io/_uploads/Bkuh7LtJA.png" alt="drawing" width="550"/> </center> <p></p>

To reproduce this paper we first applied pre-processing to the raw image files. After this, we fed the images into an NSN model and an NDN model.
The NSN model was trained using the ground truths displayed in the image above. A simple Dice loss function was used to measure the accuracy after each model pass. The NDN was trained with ground truths that contained only the centres of the cells; as you can imagine, these centres are like needles in a haystack. The Adam optimizer was used to allow for a varying step size. Lastly, the outputs were post-processed to produce our own versions of columns 5 and 6 in Figure 3. Unfortunately, the NDN didn't yield any useful results, as the required training time was too long for the scope of our project.

The rest of this blog post has been split up into the individual steps, with comments on decisions we made along the way. To jump to a specific section, there are quick links below:

[toc]

## Pre-Processing

We started with four sets of images that needed to be pre-processed: the training inputs, the test inputs, the training labels for NDN, NSN and QCANet, and the labels for testing. The code to pre-process the data consisted of the following steps:

1. Unpack the files using the Python Image Library into a 51x114x112 sized array.
2. Interpolate the image using the PyTorch interpolate function in trilinear mode.
3. Pad the image using the PyTorch 3D reflection pad function to be 128x128x128.
4. Normalize the values in the image to be between 0 and 1.
5. Apply data augmentation by flipping all input images in the _x_ direction, _y_ direction and _xy_ direction.
6. The final result was an array in the format [_batches, channels, x, y, z_] for the training images and [_batches, x, y, z_] for the labels, the difference being one additional unsqueeze statement for the training images.

The paper states that the _z_-axis was interpolated 2.175 times using bicubic interpolation. We decided to use trilinear interpolation instead, because the PyTorch interpolation function only offers a bicubic setting for 2D images.
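Steps 2-4 above can be sketched in PyTorch as follows. The function name and the exact per-side padding amounts (8 and 7 voxels, our own calculation to reach 128x128x128) are our own choices; this is a sketch of the approach, not our exact code.

```python
import torch
import torch.nn.functional as F

def preprocess(volume):
    """Pre-process one raw 3D image of shape (z, y, x) = (51, 114, 112)."""
    x = volume.unsqueeze(0).unsqueeze(0)               # -> [1, 1, 51, 114, 112]
    # Step 2: trilinear interpolation (our stand-in for the paper's
    # bicubic interpolation, which PyTorch only offers for 2D images)
    x = F.interpolate(x, size=(112, 114, 112), mode="trilinear",
                      align_corners=False)
    # Step 3: reflection-pad every spatial dimension up to 128
    # (pad pairs are given last-dimension first: x, then y, then z)
    x = F.pad(x, (8, 8, 7, 7, 8, 8), mode="reflect")   # -> [1, 1, 128, 128, 128]
    # Step 4: normalize intensities to the range [0, 1]
    x = (x - x.min()) / (x.max() - x.min())
    return x

vol = torch.rand(51, 114, 112)
out = preprocess(vol)
print(out.shape)  # torch.Size([1, 1, 128, 128, 128])
```

The flip augmentations of step 5 can then be applied to the padded result, e.g. with `torch.flip`.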
The size of the interpolated output was simply set to 112x114x112 voxels. The paper also states that: _"The patch size of QCANet was 128 voxels, so the size of the mirror-padded region was 64 voxels."_ This made little sense to us, because a padded region of 64 voxels would not fully cover any of the sides of the 3D image, which measured 112x114 or 112x112 voxels. Furthermore, the connection between the patch size of 128 voxels and the mirror-padded region was unclear, as we expected the mirror-padded region to depend on the dimensions of the 3D image as a whole. We decided to pad the images so that the output would be 128x128x128. This was our interpretation of what the paper was trying to say, and it resulted in a reasonable padding size.

## Model

The QCANet model consists of two separately trained CNNs: NSN and NDN. These networks are trained for different tasks, and each network has a distinct architecture, reflecting the complexity of the given task. During post-processing, the output of both networks is combined into a single image. The supplement of the paper provides tables detailing the parameters for each layer of both networks. These tables were used to reproduce the models in Python. Let's dive a bit deeper into the architecture of both networks.

### NSN

The Nuclear Segmentation Network (NSN) aims to distinguish cell nuclei from their background in a given input image. The NSN consists of 17 distinct network layers of different types. In the figure below, a complete description of the network architecture is depicted. After the final layer, the output goes through a [Softmax](https://en.wikipedia.org/wiki/Softmax_function) function to squeeze the values between 0 and 1.

<center> <img src="https://hackmd.io/_uploads/Hy6U3vMeA.png" alt="layers" width="360"/> </center> <center> <em> The model architecture of NSN.
CBR: Convolution + Batch Normalization + ReLU, k: kernel size, s: stride, p: pad size, f: filter.</em> </center>

#### 3D CBR

The most common layer type consists of a combination of the following three operations:

1. 3D Convolution
2. Batch normalization
3. ReLU activation

These three operations are performed sequentially, and together they are abbreviated as _3D CBR_. The CBR layers are defined by their kernel size, stride, padding and number of filters. Batch normalization is used to centre the output of the convolution around the origin with unit variance. It results in a smoother loss landscape, which in turn facilitates faster learning by preventing oscillations during optimization.

#### 3D Max Pooling

Besides convolutional layers, the NSN also contains pooling layers, as is common in CNNs. In a sense, the pooling layers are used to "zoom out" beyond the scope of the convolutional kernel. With the given pooling kernel size _k_ = 2, in combination with stride _s_ = 2, the _3D Max Pooling_ layers summarize a 2x2x2 region of eight voxels from the previous convolution output into one voxel (holding the maximum value of the eight original voxels). This voxel then represents to what extent a certain feature was present in that region.

#### 3D Deconvolution

The third type of network layer is the _3D Deconvolution_ layer. Since each pooling layer reduces the size of the image, we need a way to upscale the image in order to obtain the same dimensions as our input image. This upscaling is done through deconvolution layers: for each pooling layer, we use one deconvolution layer to restore the change in dimensions. An important note is that the term deconvolution is ambiguous. Outside of deep learning, deconvolution is defined as the inverse operation of convolution. This means that performing deconvolution on the output of a convolution results in the original input.
Here, however, deconvolution is used as a synonym for transposed convolution. We do not intend to reverse the effects of previous convolutions; the goal of these layers is simply to restore the original input dimensions. The deconvolution works by padding each individual input pixel before applying a regular convolution. For the NSN, with _k_ = 2, _s_ = 2, this means that in each dimension, the output contains two pixels for every pixel in the input.

![deconvolution-ezgif.com-cut](https://hackmd.io/_uploads/S1mrySBg0.gif) <center><em>Schematic of the process of transposed convolution, a.k.a. deconvolution, with kernel size = 3, &nbsp; stride = 2, padding = 1.</em> [source](https://towardsdatascience.com/what-is-transposed-convolutional-layer-40e5e6e31c11#) </center>

#### Concatenation

Lastly, the NSN contains two _Concatenation_ layers. In these layers, channels from earlier layers are concatenated to the list of output channels from the previous layer. For example, concatenating two sets of feature maps with 64 and 128 channels respectively, at the same spatial dimensions, yields a single output with 192 channels. Although the choice of which layers to concatenate might seem a little arbitrary, it's important to note that concatenation is only possible when the channels of both layers have equal spatial dimensions. Therefore, the options for which layer to concatenate to which are limited. The concatenation operation functions as a skip connection. Skip connections allow certain layers to directly communicate with other layers that lie deeper in the network. Possible benefits of skip connections include improved accuracy and generalization, and they can help in learning more complex patterns. Skip connections are also a way to combat the vanishing gradient problem, by providing a shortcut through the network for the gradient.

### NDN

The main goal of the Nuclear Detection Network (NDN) is to separate individual nuclei in the input image by finding their center points. This is a considerably more complex task than that of the NSN.
This is likely the motivation to use an extended network architecture, as depicted below:

<center> <img src="https://hackmd.io/_uploads/rJk53wGlR.png" alt="layers" width="360"/> </center> <center> <em> The model architecture of NDN. CBR: Convolution + Batch Normalization + ReLU, k: kernel size, s: stride, p: pad size, f: filter. </em> </center> <p></p>

As you can see, the NDN has a very similar structure to the NSN. Again, the figure omits the final [Softmax](https://en.wikipedia.org/wiki/Softmax_function) layer that squeezes the output between 0 and 1. Like the NSN, the NDN contains the same four types of layers:

- [3D CBR](#3D-CBR)
- [3D Max Pooling](#3D-Max-Pooling)
- [3D Deconvolution](#3D-Deconvolution)
- [Concatenation](#Concatenation)

However, despite the similar structure, the NDN contains almost twice as many layers as the NSN. We observe the same pattern of two convolutional layers followed by a pooling layer in the first half of the network; in the second half, the pooling is replaced with deconvolution followed by concatenation. Aside from the number of layers, another difference with the NSN is the convolutional kernel size. For the NDN, this kernel size is equal to 5, as opposed to the kernel size of 3 for the NSN.
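To make the kernel-size difference concrete, here is a minimal sketch of the CBR building block in PyTorch. The class name `CBR3d` and the "same" padding of `kernel_size // 2` are our own; the exact per-layer parameters are given in the paper's supplement.

```python
import torch
import torch.nn as nn

class CBR3d(nn.Module):
    """3D Convolution + Batch Normalization + ReLU (a sketch of the
    building block, not the authors' code)."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.block = nn.Sequential(
            # padding of kernel_size // 2 keeps the spatial dimensions unchanged
            nn.Conv3d(in_ch, out_ch, kernel_size,
                      stride=1, padding=kernel_size // 2),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

x = torch.rand(1, 1, 16, 16, 16)              # [batch, channels, z, y, x]
nsn_cbr = CBR3d(1, 16, kernel_size=3)         # NSN-style block
ndn_cbr = CBR3d(1, 16, kernel_size=5)         # NDN-style block
print(nsn_cbr(x).shape, ndn_cbr(x).shape)     # both [1, 16, 16, 16, 16]
```

With this padding scheme, both a 3x3x3 and a 5x5x5 kernel leave the spatial dimensions untouched, so the larger NDN kernel does not change the layer bookkeeping.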
One possible reason for this choice is that the larger kernel will converge faster towards the center of the nucleus, because it will detect where the edges are sooner than a smaller kernel would.

As a final remark regarding the model architecture and parameters, we want to briefly discuss the last layer of the NSN and NDN. Both networks take one input channel and output two channels. This is an interesting choice by the authors, as both networks are trained on binary labels, meaning that a single output channel would suffice to predict whether a pixel belongs to class 1 (nucleus/nuclear centroid) or class 0 (background/non-centroid, for the NSN and NDN respectively). Moreover, the labels are also provided as a single channel, which meant we had to throw away one output channel when calculating the loss. Unfortunately, we have not managed to figure out the relevance of the second output channel.

## Training

Having implemented both the NSN and NDN, we could start training the networks on the [pre-processed](#Pre-Processing) training data. To train the model, we used a "traditional" training loop, executing the following steps:

```Python
for epoch in range(n_epochs):
    for inputs, labels in train_loader:    # one batch per iteration
        optimizer.zero_grad()
        outputs = model(inputs)            # 1. forward pass through the network
        loss = criterion(outputs, labels)  # 2. calculate the loss
        loss.backward()                    # 3. backpropagate and let the
        optimizer.step()                   #    optimizer update the weights
```

The authors trained their models (both NSN and NDN) for 150 epochs. We, however, restricted by limited resources, managed to train the NSN for 50 epochs and the NDN for 20. The batch size is not mentioned in the paper, so we picked a number that made sense given the available training data. The training data consisted of time series of 3D images, each consisting of 11 time steps.
Since it required too much memory to train our model on a full time series at once, we used a batch size of one time step (so one 3D image). Training was done on a GPU through the use of [Kaggle](https://www.kaggle.com/). In the end, training the NSN took 9 hours and training the NDN took 11 hours.

### Loss function

An important aspect of training is the choice of loss function. The performance of a loss function can be very task-specific, so one must carefully choose the loss function best suited to the task the model is trained for. In the case of nuclear instance segmentation, [Dice Loss](https://arxiv.org/pdf/1707.03237v3.pdf) is an adequate choice, as it is known to suppress the influence of label imbalance. Label imbalance is an issue for the task at hand, since for most slices of the 3D images only a small fraction of pixels shows a nucleus.

<center> <img src="https://hackmd.io/_uploads/rksKNdBgA.png" width="360"/> </center> <center> <em> Generalized Dice Loss.</em> w<sub>l</sub>: <em>weight of class </em>l, n: <em>number of pixels, </em>r<sub>ln</sub>: <em>reference, or label of pixel </em>n, &nbsp;p<sub>ln</sub>: <em>predicted label for pixel </em>n. [source](https://arxiv.org/pdf/1707.03237v3.pdf) </center> <p></p>

The Generalized Dice Loss depends on both the intersection (product) and the union (sum) of the pixel labels and the model output. As can be seen in the equation above, the loss is minimized when we maximize the intersection and minimize the union.
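A Dice loss of this form can be implemented as a small `nn.Module` subclass. The sketch below uses equal class weights, as in our training runs; the epsilon term and variable names are our own additions.

```python
import torch
import torch.nn as nn

class DiceLoss(nn.Module):
    """Dice loss with equal class weights (a sketch, not the authors' code)."""
    def __init__(self, eps=1e-7):
        super().__init__()
        self.eps = eps  # avoids division by zero for empty classes

    def forward(self, pred, target):
        # pred, target: [batch, classes, z, y, x]; pred after softmax
        dims = (0, 2, 3, 4)                            # sum over all but classes
        intersection = torch.sum(pred * target, dims)  # per-class overlap
        union = torch.sum(pred + target, dims)         # per-class "union" (sum)
        dice = (2.0 * intersection + self.eps) / (union + self.eps)
        return 1.0 - dice.mean()                       # 0 = perfect, 1 = no overlap

# perfect prediction -> loss near 0; total mismatch -> loss near 1
t = torch.zeros(1, 2, 4, 4, 4)
t[:, 1, :2] = 1            # "foreground" in the first two z-slices
t[:, 0] = 1 - t[:, 1]      # complementary background channel
print(DiceLoss()(t, t).item())      # ~0.0
print(DiceLoss()(1 - t, t).item())  # ~1.0
```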
The effect of maximizing the Intersection over Union (IoU) is best shown visually, as is done below:

<center> <img src=https://hackmd.io/_uploads/HJemhdHxR.png width="420"/> </center> <center> <em> Intersection over union (IoU) visualization</em> [source](https://idiotdeveloper.com/what-is-intersection-over-union-iou/) </center> <p></p>

The figure above clearly shows how the Dice loss becomes smaller as output and labels match better and better. During training, we used equal weights for the two classes. Both NSN and NDN use Dice loss as their objective function; we created our own implementation of it as a subclass of the `nn.Module` class.

### Optimizer

As with the loss function, the optimizer chosen for updating weights during training can have a great influence on model performance. Different inputs and loss functions can have very different loss surfaces for the gradient to move across. It is therefore important to choose an optimizer that can navigate the loss towards a minimum as efficiently as possible. Due to the limited time frame of the project, we were not able to perform our own optimizer validation tests. However, the authors did, and found that the NSN and NDN performed best with different optimizers. This serves as a good example of how, even though both networks use the same loss function, the difference in labels between the two changes which optimizer is most efficient. For the NSN, [Stochastic Gradient Descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) (SGD) gave the best results, while for the NDN the [Adam optimizer](https://arxiv.org/pdf/1412.6980.pdf) outperformed the others. We adopted these findings in our reproduction and trained the NSN with SGD and the NDN with Adam, both implemented using the `torch.optim` package.

## Post-Processing

With NSN and NDN fully trained on the available training data, we're not quite there yet.
Assuming training went well, we can now create an output similar to the training labels, given an input image. The goal, however, is to successfully perform instance segmentation on the given input. To achieve this, we need to combine the outputs of NSN and NDN and perform a [watershed](https://en.wikipedia.org/wiki/Watershed_(image_processing)) transformation. The watershed takes the output of the NDN as markers and the output of the NSN as boundaries. Each marker is then labeled as an individual nucleus. Next, the volume of each nucleus grows simultaneously, in all directions, until it collides with either a boundary (defined by the NSN) or another nucleus. This process continues until the entire foreground is filled, leaving us with a segmentation of individual nuclei corresponding to the original input image. The watershed transformation was implemented in Python using the `skimage.segmentation` package.

## Results

Our reproduction unfortunately did not give the exact same results as the paper. This was mainly due to the more complex NDN network, which we didn't have sufficient time and computing power to train. The NSN results looked more promising.

### NSN

The NSN showed promising results: training for 50 epochs gave us a loss of almost 0. Since we were not able to combine the NDN and NSN, we plotted the results of only the NSN with the watershed, which was also done in the paper. The model is still able to segment the embryo cells, but the results are worse than those of the fully functional model. The visual results we plotted are shown below. In the test set from the dataset, only the ground truth of the full QCANet (i.e. NSN + NDN) was given, which means we could not compare our NSN result with an NSN-only ground truth. The result shows that the NSN works reasonably well when the cells are separated.
But at later times, when the cells are closer together, the NSN alone cannot distinguish between the different cells.

### NDN

The NDN in the paper was trained for 150 epochs with larger batch sizes. When training our own NDN model, we immediately saw that this was not viable, so we shrank the batch size and trained the model for only 20 epochs. These 20 epochs unfortunately were not enough to produce any worthwhile results: the model had a loss of 1 and did not find the centres of any nuclei. Again, as mentioned previously, finding 1-10 points in an image of 128x128x128 voxels is a difficult task to train a network for. The results from the NDN were unfortunately unusable with the current training.

#### Our results:

<center> <img src=https://hackmd.io/_uploads/rk7N3iKeA.png width="420"/> </center> <center> <em> NSN results of first embryo from test set at t=6</em> </center>

<center> <img src=https://hackmd.io/_uploads/H1AcJnKg0.png width="420"/> </center> <center> <em> NSN results of first embryo from test set at t=11</em> </center>

#### Ground truth:

<center> <img src=https://hackmd.io/_uploads/H1DInsFgA.png width="420"/> </center> <center> <em> Ground truth of first embryo from test set at t=6</em> </center>

<center> <img src=https://hackmd.io/_uploads/BJL6JnKgA.png width="420"/> </center> <center> <em> Ground truth of first embryo from test set at t=11</em> </center>

## Discussion

To conclude, we made a number of discoveries while reproducing this paper. Implementing the different functions of the network as a whole turned out not to be the greatest challenge; the paper and its supplement provided sufficient information to do so, with a few independent decisions from our side along the way. The first surprise was that the ground truth labels also needed to be pre-processed. We discovered this after running into an error that our training images and label images were not of the same size.
This felt counterintuitive, as interpolating and padding the labels meant we were editing the _ground truth_. However, comparing images of different sizes doesn't make sense either, so we did end up padding the labels.

Getting Kaggle to work required much more time and effort than we initially planned for. After running some initial mini training loops locally on the CPU, it was clear that the final model would have to be trained on a GPU. The first challenge in Kaggle was accessing the GPU; the solution was verifying your account via two-factor authentication to receive GPU computing time. Once Kaggle was working, we still had a shortage of computing power: during network training the memory budget would be exceeded, leading to a RuntimeError. To fix this, the pre-processing of the data was adjusted. Initially, batches of 11 images were fed into the model per training loop, causing the memory to run out. The batch size was adjusted to just 1 image, significantly reducing the memory required to run the model. The final issue we had with Kaggle was that it would automatically stop running after 40 minutes, displaying a prompt: "Are You Still There? You've not edited your work in 40 minutes do you want to cancel or continue editing?". Besides the irony of a platform designed to train long, complex networks prompting this during our training, we found that we could use the option "run and save" to prevent this from happening.

The last surprise came from an assumption we made early on in our project, only to realize later that we were mistaken. Initially we thought that the ground truth labels for NDN training would identify each specific cell: label 1 for the first cell in the image, 2 for the second cell, 3 for the third, and so on. Later in the project we realized this was not the case.
It was simply the task of the watershed to assign different colours to the different cells; the labels for the NDN were simply 0 for the background and 1 for the cell centre.

## Contributions

### Rael
Code:
- NSN implementation
- Train & test loop implementation
- Watershed implementation
- Plot the results

Kaggle:
- Run NSN

### Martijn
Code:
- NDN implementation
- Dice Loss implementation

Report:
- Model section
- Training section
- Post-Processing section

### Aman
Code:
- Preprocessing of images

Report:
- Preprocessing section
- Discussion section

### Timo
Code:
- Kaggle model
- Kaggle notebooks

Report:
- Introduction
- Results