# Reproduction efforts of an Automated Pavement Crack Segmentation algorithm that uses a U-Net-Based CNN

#### By Sjoerd van den Bos (4239261), Reinier Koops (4704312) and Karel Scheepstra (4593162)

## Introduction

In June of 2020, Stephen Lau, Edwin Chong, Xu Yang and Xin Wang published a paper [1] describing a U-Net-based convolutional neural network that performs segmentation of pavement crack images. The paper evaluates the network's performance on the CFD [2] and Crack500 [3] datasets and compares it to the architectures of other publications. In this blog we guide the reader through our efforts to replicate, from scratch, the U-Net-based CNN designed and evaluated by Lau et al.; we will refer to the original paper of Lau et al. [1] simply as 'the paper' for the remainder of this blog. At the time of publication, the U-Net-based CNN architecture achieved, to the best knowledge of the authors, the best segmentation performance on the respective datasets.

Computer vision is increasingly used for segmentation tasks that identify cracks in images. Applications range from cracks in pavement (as in this paper) or structural concrete infrastructure [4] to crack detection in nuclear power plant components [5].

In this blog we first visualize the data that was used and explain how we implemented the augmentations described in the paper. Second, we show how we replicated the U-Net-based network structure and how we implemented its additional features, which are:

* a pretrained ResNet-34 encoder
* spatial and channel squeeze and excitation modules
* the dice loss function
* the AdamW optimizer
* the cyclical learning rate schedule
* two-stage training
* training on progressively increasing image sizes
* the evaluation convention that was used

We then present the results of training our replicated algorithm on the datasets and, lastly, we discuss the results and elaborate on hyperparameters that were not described in the paper.

The code used by the authors of the paper was not available, so this reproduction is in fact a replication from scratch. The paper states that PyTorch is used as the deep learning framework, together with the *fastai* library. We use only the PyTorch library in our effort to replicate the code.

## Data

Both datasets (CFD & Crack500) consist of images of cracks with a corresponding ground truth image that subjectively segments the crack by assigning a pixel value of 1 to crack pixels, while all other pixels are 0 (one-hot encoded). Example pairs of both datasets are shown in the pictures below. Clearly, labelling of the ground truth images is subjective with respect to the expert. Additionally, the images show a transition region around a crack, while the ground truth images do not account for this transition because the pixels are only one-hot encoded.

![](https://i.imgur.com/oRMQs2S.jpg =180x)![](https://i.imgur.com/2aU3qXj.png =180x)![](https://i.imgur.com/WNrwGLi.jpg =180x)![](https://i.imgur.com/T1ajKSP.png =180x)
*CFD example images + ground truths*

![](https://i.imgur.com/b0FhxcV.jpg =180x)![](https://i.imgur.com/DFSGwks.png =180x)![](https://i.imgur.com/P8ZPZNl.jpg =180x)![](https://i.imgur.com/y8I6xQA.png =180x)
*Crack500 example images + ground truths*
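Both datasets are essentially folders of crack images with matching ground-truth masks. As an illustration of how such a pair can be loaded, the sketch below wraps the two folders in a PyTorch `Dataset`. It assumes that images and masks are matched by sorted filename order; the `CrackDataset` name is ours, and the actual `CFD` and `C5D` classes used later in `get_data_loaders` follow the same idea but may differ in detail. The two transform arguments are the augmentation pipelines described in the next subsection.

```python
import os

from PIL import Image
from torch.utils.data import Dataset


class CrackDataset(Dataset):
    """Simplified sketch of a paired image / ground-truth dataset."""

    def __init__(self, image_dir, image_transform, gt_dir, gt_transform):
        # Assumes images and ground truths are matched by sorted filename order
        self.image_paths = sorted(os.path.join(image_dir, f) for f in os.listdir(image_dir))
        self.gt_paths = sorted(os.path.join(gt_dir, f) for f in os.listdir(gt_dir))
        self.image_transform = image_transform
        self.gt_transform = gt_transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        gt = Image.open(self.gt_paths[idx]).convert("L")
        return self.image_transform(image), self.gt_transform(gt)
```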
### Augmentation

It is evident from the images that the orientation of a crack is random and that lighting conditions differ between images. To make the algorithm more robust with respect to these attributes, the following transformations are applied in each training iteration:

* random resizing and cropping to form square images of 128x128, 256x256 or 320x320 pixels
* random rotations between 0 and 360 degrees, where the pixels are first reflected over the image borders to avoid empty triangles in the rotated image corners
* random horizontal and vertical flips
* random changes in lighting; this is applied only to the crack images and not to the ground truths, since the latter are one-hot encoded

Since training is done on square images that progressively increase in size, all images are cropped randomly to form square images and the variable `img_factor` is introduced, which specifies the resizing factor with respect to the original image, so that the three mentioned sizes can be created. We will elaborate on this implementation when the training loop is explained.

To resize an image, an interpolation mode has to be chosen. The paper does not mention which interpolation mode is applied. In the `torchvision.transforms` library, only the 'bilinear' and 'nearest' interpolation modes are available for tensors [6]. Since the images contain transition regions but the ground truth labels are one-hot encoded, we chose to apply the 'bilinear' interpolation mode to the images and the 'nearest' interpolation mode to the ground truths, in order to retain the continuity of the transition between pixels. The mismatch in presence and absence of transition regions between the images and labels will come up again when the evaluation scheme is discussed. All transforms are implemented using the following lines of code:

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode


def get_transforms(img_factor):
    resize_image = tuple(int(img_factor * x) for x in (320, 480))
    crop_image = resize_image[0]
    shared_transforms = [
        transforms.RandomCrop(crop_image),
        transforms.Pad(200, padding_mode='reflect'),
        transforms.RandomRotation((0, 360)),
        transforms.CenterCrop(crop_image),
        transforms.RandomHorizontalFlip(),
        transforms.RandomVerticalFlip(),
    ]
    # Bilinear interpolation for the images, since their pixel values are continuous
    tf_compos = transforms.Compose([
        transforms.Resize(resize_image, interpolation=InterpolationMode.BILINEAR),
        *shared_transforms,
        transforms.ColorJitter(brightness=0.05, contrast=0.05),
        transforms.ToTensor()
    ])
    # Nearest-neighbour interpolation for the ground truths, since their values can only be 0 or 1
    tf_compos_gt = transforms.Compose([
        transforms.Resize(resize_image, interpolation=InterpolationMode.NEAREST),
        *shared_transforms,
        transforms.ToTensor()
    ])
    return tf_compos, tf_compos_gt
```

<!-- Show result of augmentation? -->

## Method

The paper describes a U-Net-based neural network architecture [7] that uses a ResNet-34 encoder [8], pretrained on ImageNet, with its last two layers removed. Furthermore, the layers are divided into three layer-groups, which are used to set a separate learning rate per layer-group and will be discussed further when the optimizer is introduced. As highlighted in the architecture overview picture, the first layer-group comprises all layers from the input to the 128-channel residual block. The second layer-group includes all layers from the 256-channel residual block to the end of the ResNet-34 encoder. The final layer-group is formed by the entire decoder. The network is trained on a random split of 70 training and 48 test images for the CFD dataset, and a random split of 2021 training and 1347 test images for the Crack500 dataset.
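We do not list the full encoder code here, but as a minimal sketch (assuming the two removed layers are the final average-pooling and fully-connected layers), an ImageNet-pretrained ResNet-34 encoder can be obtained from `torchvision` as follows:

```python
import torch.nn as nn
import torchvision

# Load an ImageNet-pretrained ResNet-34 and drop its last two layers
# (assumed here to be the average-pooling and fully-connected layers),
# which leaves a fully convolutional encoder producing 512-channel feature maps.
resnet = torchvision.models.resnet34(pretrained=True)
encoder = nn.Sequential(*list(resnet.children())[:-2])

# The intermediate residual stages (resnet.layer1 ... resnet.layer4) are also
# needed individually, since the U-Net decoder uses them as skip connections.
```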
### Architecture

![](https://i.imgur.com/YDupw1l.png =540x)
*Proposed network architecture from Lau et al. [1] with the layer-groups indicated*

The picture above shows the proposed network architecture from the paper with the layer-groups indicated. The architecture features the ResNet-34 encoder shown in the picture below. The ResNet-34 architecture mainly consists of residual blocks, for which the code implementation is also shown below. The parameters of all blocks except the pretrained ResNet-34 blocks are initialized with the method of He et al. [9].

<!-- Show image of architecture: see paper https://github.com/kanybekasanbekov/ResNet_Models_on_CIFAR10 or others -->
<!-- Show image of resnet, see images here: https://github.com/safwankdb/ResNet34-TF2 -->

![](https://i.imgur.com/EMUjGHy.png =110x) ![](https://i.imgur.com/KGNskJW.png =420x)
*ResNet-34 architecture [8] and the construction of one residual block*

Another feature included in the proposed architecture is the spatial and channel squeeze and excitation (SCSE) module, as proposed by Roy et al. [10], placed between the batch normalization layer and the transpose convolution layer. The SCSE module is a combination of a spatial squeeze and excitation module (sSE) and a channel squeeze and excitation module (cSE); the latter two modules are defined individually and then combined into the SCSE, as shown in the code below. In short, the SCSE module concurrently recalibrates the input spatially and channel-wise; for further explanation we refer to the paper of Roy et al. [10]. Usage of the SCSE modules was shown to be beneficial in an ablation study of the paper.

<!-- Show image of cse, sse, scse : look at the papers and reuse their images https://github.com/ai-med/squeeze_and_excitation/blob/master/squeeze_and_excitation/squeeze_and_excitation.py -->

```python
import torch.nn as nn


class SSEBlock(nn.Module):
    def __init__(self, in_channels):
        super(SSEBlock, self).__init__()
        # Output channel = 1, 1x1 convolution
        self.conv = nn.Conv2d(in_channels, 1, 1)
        self.sigmoid = nn.Sigmoid()
        # He et al. [9] initialization for the upsampling decoder
        nn.init.kaiming_uniform_(self.conv.weight, mode='fan_in', nonlinearity='relu')

    def forward(self, x):
        batch_size, num_channels, H, W = x.size()
        y = self.conv(x)
        y = self.sigmoid(y)
        return x * y.view(batch_size, 1, H, W)


class CSEBlock(nn.Module):
    def __init__(self, in_channels, reduction=2):
        super(CSEBlock, self).__init__()
        # Global pooling == AdaptiveAvgPool2d
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Linear(in_channels, in_channels // reduction)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(in_channels // reduction, in_channels)
        self.sigmoid = nn.Sigmoid()
        # He et al. [9] initialization for the upsampling decoder
        nn.init.kaiming_uniform_(self.fc1.weight, mode='fan_in', nonlinearity='relu')
        nn.init.kaiming_uniform_(self.fc2.weight, mode='fan_in', nonlinearity='relu')

    def forward(self, x):
        batch_size, num_channels, _, _ = x.size()
        avg_pool_x = self.avg_pool(x).view(batch_size, num_channels)
        y = self.fc1(avg_pool_x)
        y = self.relu(y)
        y = self.fc2(y)
        y = self.sigmoid(y)
        return x * y.view(batch_size, num_channels, 1, 1)


class SCSEBlock(nn.Module):
    def __init__(self, in_channels, reduction=2):
        super(SCSEBlock, self).__init__()
        self.CSE = CSEBlock(in_channels, reduction)
        self.SSE = SSEBlock(in_channels)

    def forward(self, x):
        return self.CSE(x) + self.SSE(x)
```

For more details on the implementation of the network architecture and its features, we refer to the complete code in our GitHub repository: https://github.com/sjoerdvandenbos/deeplearning_reproducibility

### Loss function

To compute the loss value, with respect to which the network optimizes its parameters, the paper uses the dice coefficient loss function as introduced in [11]. The dice coefficient takes a value between 0 and 1 and is equivalent to the F1 score; the loss is one minus this coefficient. This loss is used to accommodate the relatively large class imbalance, i.e. the ratio of crack pixels to non-crack pixels, as the non-crack pixels far outnumber the crack pixels. The final layer of the network is a sigmoid layer, so the output of the network represents the probability that a pixel is a crack pixel. The loss function is given as follows:

![](https://i.imgur.com/kaAO20j.png =200x)

in which ![](https://i.imgur.com/WJptdZ9.png =11x) represents the prediction and ![](https://i.imgur.com/5jVJPbP.png =11x) represents the ground truth. This loss function is implemented using the following code; a small term epsilon was added to avoid division by zero:

```python
import torch


def batch_dice_loss(true_val, pred_val, epsilon=1e-8):
    # Flattened from [N, 1, H, W] to [N, H*W]
    true_val = true_val.flatten(start_dim=1)
    pred_val = pred_val.flatten(start_dim=1)
    numerator = 2. * (pred_val * true_val).sum(dim=1)
    denominator = pred_val.sum(dim=1) + true_val.sum(dim=1)
    return torch.mean(1 - ((numerator + epsilon) / (denominator + epsilon)))
```
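As a quick sanity check (not taken from the paper), the loss behaves as expected on two extreme cases: a perfect prediction yields a loss of (nearly) 0, while a prediction that misses every crack pixel yields a loss close to 1.

```python
import torch

# Toy 4x4 masks: `perfect` marks every pixel as crack, `missed` marks none.
perfect = torch.ones(1, 1, 4, 4)
missed = torch.zeros(1, 1, 4, 4)

print(batch_dice_loss(perfect, perfect))  # ~0: prediction equals the ground truth
print(batch_dice_loss(perfect, missed))   # ~1: no crack pixel was predicted
```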
### Optimizer

The optimizer updates the parameters of the network in order to achieve a lower overall loss. The paper uses the AdamW [12] optimizer, which is included in the `torch.optim` library [13] and takes the following hyperparameters: $\alpha, \beta, \epsilon, \lambda$ and the learning rate lr. In the paper, all hyperparameters are set to their default values, except for the learning rate. The learning rate is altered in each training iteration as a result of two features of the training procedure: 1) the cyclical learning rate schedule and 2) two-stage training. Additionally, the learning rates of layer-groups 1 and 2 are a fraction of the learning rate of layer-group 3, which uses the actual learning rate provided by the learning rate scheduler. The implementation of the different learning rates across layer-groups will be shown together with the implementation of two-stage training.

#### Cyclical learning rate

The paper applies a cyclical learning rate schedule similar to the one proposed by [14] and [15]. The schedule starts with a minimal learning rate equal to 5% of the maximum learning rate to be reached later. After each training iteration, the learning rate is linearly increased such that it equals the maximum learning rate after 40% of the total number of training iterations have been completed. In the remaining training iterations, the learning rate is linearly decreased until it is close to zero at the final training iteration. Since the maximum learning rate was not provided by the paper, a value of 0.005 was chosen, based on a short hyperparameter search on the CFD dataset. The learning rate scheduler is implemented with the following code:

```python
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.005,
    epochs=EPOCHS,
    steps_per_epoch=len(train_loader),
    anneal_strategy='linear',
    pct_start=0.4,
    div_factor=20,
    final_div_factor=20000,
    three_phase=True
)
```

#### Two stage training

In general, layer-groups 1, 2 and 3 use learning rates in the ratio $1/9 : 1/3 : 1$ with respect to the learning rate provided by the cyclical learning rate scheduler. However, training is additionally done in two stages: in the first stage, comprising 15 epochs, the parameters of layer-group 1 are frozen (learning rate set to 0) while the other layer-groups update using the learning rate that is provided. In the second stage, which makes up the next 30 epochs, all layer-groups use their respective ratio of the learning rate provided by the cyclical learning rate scheduler. The changing learning rates per layer-group and over the iterations are implemented as follows:

```python
from tqdm import tqdm


def layer_split(network):
    layer_1 = []
    layer_2 = []
    layer_3 = []
    layer_1_to_include = ["blueBlock", "resBlock1", "resBlock2"]
    layer_2_to_include = ["resBlock3", "resBlock4"]
    for name, param in network.named_parameters():
        if any(substring in name for substring in layer_1_to_include):
            layer_1.append(param)
        elif any(substring in name for substring in layer_2_to_include):
            layer_2.append(param)
        else:
            layer_3.append(param)
    return layer_1, layer_2, layer_3


layer_1, layer_2, layer_3 = layer_split(network)
optimizer = torch.optim.AdamW([
    {'params': layer_1, 'name': 'layer_1'},
    {'params': layer_2, 'name': 'layer_2'},
    {'params': layer_3, 'name': 'layer_3'}],
    betas=(0.9, 0.999),
    weight_decay=0.01)

EPOCHS = 45
epochs_1 = EPOCHS // 3
epochs_2 = (EPOCHS // 3) * 2

for epoch in tqdm(range(EPOCHS)):
    optimizer.param_groups[0]["lr"] = 0 if epoch < epochs_1 else optimizer.param_groups[2]["lr"] / 9
    optimizer.param_groups[1]["lr"] = optimizer.param_groups[2]["lr"] / 3
```
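To show how the loss, optimizer and scheduler fit together, the following condensed sketch illustrates the inner loop of one epoch. It assumes `network`, `device`, `train_loader`, `batch_dice_loss`, `optimizer` and `scheduler` are defined as in the snippets above, and it omits bookkeeping such as loss logging.

```python
# Sketch of one epoch's inner loop (simplified; not the exact code from our repository).
network.train()
for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)

    optimizer.zero_grad()
    predictions = network(images)                # sigmoid output, values in [0, 1]
    loss = batch_dice_loss(labels, predictions)  # dice loss defined above
    loss.backward()

    optimizer.step()
    scheduler.step()  # the cyclical learning rate advances once per training iteration
```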
### Training

Combining the two stages, training is done for 45 epochs on the training sets after performing the described random splits. The paper states that progressively increasing image sizes of 128x128, 256x256 and 320x320 are used during training. The paper does not mention the number of epochs used per image size, so we assume that they are divided equally over the total number of epochs (15 epochs per size). We change the image size during training by reloading the train_loader in the training loop. When reloading, we specify the desired `img_factor` (the variable introduced in the part on data augmentation). The following code shows this implementation:

```python
import os

import torch
from torch.utils.data import DataLoader, random_split


def get_data_loaders(split_seed, bs_train, bs_test, img_factor=1, dataset="CFD"):
    current_path = os.path.abspath(os.getcwd())
    image = "/CFD/cfd_image/"
    gt = "/CFD/seg_gt/"
    ratio = [70, 48]
    if dataset == "CRACK500":
        image = "/CRACK500/crack_image/"
        gt = "/CRACK500/seg_gt/"
        ratio = [2021, 1347]
    tf_compos, tf_compos_gt = get_transforms(img_factor)
    # CFD and C5D are our dataset classes that pair crack images with ground truths
    if dataset == "CFD":
        dataset = CFD(current_path + image, tf_compos, current_path + gt, tf_compos_gt)
    else:
        dataset = C5D(current_path + image, tf_compos, current_path + gt, tf_compos_gt)
    # Manual seed added such that the same split is kept,
    # even though a new split is made with different sizes
    train_data, test_data = random_split(dataset, ratio, generator=torch.Generator().manual_seed(split_seed))
    train_loader = DataLoader(train_data, batch_size=bs_train)
    test_loader = DataLoader(test_data, batch_size=bs_test)
    return dataset, train_loader, test_loader


# Initial train loader with img_factor of 0.4 -> 128x128
dataset, train_loader, test_loader = get_data_loaders(split_seed, bs_train, bs_test, 0.4, datasetname)

for epoch in tqdm(range(EPOCHS)):
    if epoch == epochs_1:
        # Get dataloaders (256x256)
        dataset, train_loader, test_loader = get_data_loaders(split_seed, bs_train, bs_test, 0.8, datasetname)
    if epoch == epochs_2:
        # Get dataloaders (320x320)
        dataset, train_loader, test_loader = get_data_loaders(split_seed, bs_train, bs_test, 1, datasetname)
```

## Results

After training for about 3 minutes on the CFD dataset and about 1 hour on the Crack500 dataset, we obtain the following prediction results:

![](https://i.imgur.com/bbvknto.png)
*Example 1: Crack segmentation on the CFD dataset: original image, ground truth, and our prediction*

![](https://i.imgur.com/w8b58lB.png)
*Example 2: Crack segmentation on the CFD dataset: original image, ground truth, and our prediction*

![](https://i.imgur.com/dOmf3so.png)
*Example 3: Crack segmentation on the CFD dataset: original image, ground truth, and our prediction*

![](https://i.imgur.com/YYcVUVu.png)
*Example 4: Crack segmentation on the Crack500 dataset: original image, ground truth, and our prediction*

![](https://i.imgur.com/JWmiPPX.png)
*Example 5: Crack segmentation on the Crack500 dataset: original image, ground truth, and our prediction*

![](https://i.imgur.com/iaVU7sJ.png)
*Example 6: Crack segmentation on the Crack500 dataset: original image, ground truth, and our prediction*

Before we can compare our results to the results of the paper, we need to adopt the same evaluation convention. The paper uses the F1-score to compare its results to previously published papers and defines it in the following way:

![](https://i.imgur.com/iQ3p1zZ.png =120x)

where TP, FP and FN denote true positives, false positives and false negatives. Additionally, the paper adopts a modification in which predicted crack pixels within 2 pixels of any labeled crack pixel count as true positives, to account for the crack transition regions in the images discussed before. To achieve this evaluation, we create a new ground truth label in which any pixel within 2 pixels of a crack pixel labeled "1" is also set to "1", but only if the prediction for this pixel was already greater than 0.5.
The following code will make this clearer:

```python
def batch_get_new_label(old_labels, predictions):
    old_with_padding = nn.ReplicationPad2d(2)(old_labels)
    extended_labels = nn.MaxPool2d(5, stride=1)(old_with_padding)
    extension = extended_labels - old_labels
    # Keep only the pixels in the extension area that are also predicted as crack
    ext_isect_pred = torch.logical_and(extension, predictions)
    # Combine this new extension area with the old label
    new_label = torch.logical_or(old_labels, ext_isect_pred).float()
    return new_label
```

Using this newly created label, we calculate the precision, recall and F1-score with the following functions:

```python
def get_prec_recall(data_loader, network, device):
    network.eval()
    isect_sum = torch.tensor([0], dtype=torch.float32, device=device)
    positive_predicts_pixels = 0
    positive_truth_pixels = 0
    for X, y in data_loader:
        X, y = X.to(device), y.to(device)
        with torch.no_grad():
            pred = (network(X) >= 0.5).float()
            y_ext = batch_get_new_label(y, pred)
            isect_sum += torch.sum(torch.logical_and(y_ext, pred))
            positive_predicts_pixels += torch.sum(pred)
            positive_truth_pixels += torch.sum(y_ext)
    precision = (isect_sum + 1e-8) / (positive_predicts_pixels + 1e-8)
    recall = (isect_sum + 1e-8) / (positive_truth_pixels + 1e-8)
    return (precision.item(), recall.item())


def get_f1(precision, recall):
    return 2 * precision * recall / (precision + recall)
```

This yields the following results after training:

Dataset | Precision | Recall | F1 score
--- | --- | --- | ---
CFD | 0.9444 | 0.7404 | 0.8243
Crack500 | 0.8226 | 0.7391 | 0.7786

Since this is a replication blog, we will not draw conclusions from these results here; instead, we continue with the discussion, where we compare them to the results of the original paper.

## Discussion

The performance results achieved by the paper are shown in the following table:

Dataset | Precision | Recall | F1 score
--- | --- | --- | ---
CFD | 0.9702 | 0.9432 | 0.9555
Crack500 | 0.7426 | 0.7285 | 0.7327

Although our algorithm could be regarded as performing well, our results clearly differ from those of the paper, and we will try to reason about this difference. We think the following points might have influenced the performance:

* The paper is unclear on how the Crack500 images were used: it states that 3792 training images and 2248 test images were used, as provided by the authors. Since the dataset consists of 500 images, preprocessing must have been applied to obtain this number of images. We combined the cropped train, test and validation sets, which amounted to 3368 images; there also turned out to be duplicates within the sets.
* Another issue, which applies to both datasets, is the training batch size. This batch size is the number of images on which, in each training iteration, the loss value is calculated and the network parameters are updated by the optimizer. The training batch size was not specified in the paper and was set to 30 in our code.
* The paper did not specify the implementation of the extended evaluation function, where any pixel within 2 pixels of a labeled crack pixel is regarded as ground truth. Our current implementation is based on our interpretation of this sentence in the paper and the reference given for this evaluation scheme.
* Regarding the data augmentation, the paper mentions a random change in image balance, which we interpreted as a random change in brightness.
* As mentioned in the introduction, the paper uses PyTorch and the *fastai* library, whereas we only use PyTorch. Consequently, the *fastai* library might contain additional predefined hyperparameters that are hidden from us.
* Finally, different training runs yield fairly different results. The paper does not specify how many runs were carried out, nor whether the reported results are average scores or 'best run' scores.

<!--
- interpolation
- transition region
- image balance = brightness?
- we use purely pytorch, they use , with all kinds of preconfigured stuff in it
- transformations have possibly been implemented differently, since the paper was unclear about it
- batch sizes are unknown (hardware dependent)
- dice coefficient loss differs from papers it refers to
- could be PRelu / Relu
-->

## References

1. S. L. Lau, E. K. Chong, X. Yang, and X. Wang, "Automated pavement crack segmentation using U-Net-based convolutional neural network," IEEE Access, vol. 8, pp. 114892-114899, 2020.
2. Y. Shi, L. Cui, Z. Qi, F. Meng, and Z. Chen, "Automatic road crack detection using random structured forests," IEEE Trans. Intell. Transp. Syst., vol. 17, no. 12, pp. 3434-3445, Dec. 2016.
3. F. Yang, L. Zhang, S. Yu, D. Prokhorov, X. Mei, and H. Ling, "Feature pyramid and hierarchical boosting network for pavement crack detection," IEEE Trans. Intell. Transp. Syst., vol. 21, no. 4, pp. 1525-1535, Apr. 2020.
4. X. Zhang, D. Rajan, and B. Story, "Concrete crack detection using context-aware deep semantic segmentation network," Computer-Aided Civil and Infrastructure Engineering, vol. 34, no. 11, pp. 951-971, 2019.
5. S. J. Schmugge, L. Rice, J. Lindberg, R. Grizzi, C. Joffe, and M. C. Shin, "Crack segmentation by leveraging multiple frames of varying illumination," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Santa Rosa, CA, USA, 2017, pp. 1045-1053, doi: 10.1109/WACV.2017.121.
6. Torch Contributors (2017). TORCHVISION.TRANSFORMS. pytorch.org. Retrieved 15 April 2021, from https://pytorch.org/vision/stable/transforms.html
7. O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. MICCAI, Munich, Germany, Oct. 2015, pp. 234-241.
8. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778.
9. K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1026-1034.
10. A. G. Roy, N. Navab, and C. Wachinger, "Concurrent spatial and channel squeeze & excitation in fully convolutional networks," in Proc. MICCAI, Granada, Spain, 2018, pp. 421-429.
11. F. Milletari, N. Navab, and S.-A. Ahmadi, "V-Net: Fully convolutional neural networks for volumetric medical image segmentation," 2016, arXiv:1606.04797. [Online]. Available: http://arxiv.org/abs/1606.04797
12. I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," presented at ICLR, New Orleans, LA, USA, 2019. [Online]. Available: https://openreview.net/forum?id=Bkg6RiCqY7
13. Torch Contributors (2019). TORCH.OPTIM. pytorch.org. Retrieved 15 April 2021, from https://pytorch.org/docs/stable/optim.html
14. L. N. Smith, "Cyclical learning rates for training neural networks," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2017, pp. 464-472.
15. L. N. Smith, "A disciplined approach to neural network hyper-parameters: Part 1: Learning rate, batch size, momentum, and weight decay," 2018, arXiv:1803.09820. [Online]. Available: https://arxiv.org/abs/1803.09820