# Segmentation-Assisted U-Net: Enhancing Depth Estimation with SAM
| **Name** | **Student Number** |
|------------------|--------------------|
| Joris Weeda | 5641551 |
| Rami Awad | 5416892 |
| Simon Gebraad | 4840232 |
# Recommended Sources
* Segment Anything Model: https://arxiv.org/pdf/2304.02643.pdf
* High Quality Monocular Depth Estimation via Transfer Learning: https://arxiv.org/pdf/1812.11941.pdf
* NYU Depth V2 dataset: https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
## Introduction
Depth perception is an important task in many robotics applications, such as autonomous driving. Traditionally, it is accomplished using expensive or bulky sensors like LiDAR or depth cameras. However, with the rise of deep learning, depth can be estimated from images of a single camera: monocular depth estimation. This has the potential to reduce costs and allows depth perception to be implemented on many more devices equipped with just a single camera.
Various models have been proposed for monocular depth estimation, the main classes being CNN-based and Transformer-based. CNN-based models rely on sliding kernels to extract local image features, largely ignoring the rich contextual information available in the scene, which limits their accuracy. The more modern Transformer-based models use the self-attention mechanism to increase the receptive field and extract both local and global information, improving depth estimation. However, Transformers are more expensive to train and introduce additional trainable parameters, and thus require large amounts of data. This increases training times and energy consumption in an era where the environmental impact of deep learning is being questioned. Hence, the need arises for a model that is lightweight whilst still incorporating contextual information to improve depth estimation accuracy.
In transfer learning, a trained model is repurposed for another task, which is inherently more efficient through reuse. Significant advancements are being made in computer vision that could be valuable for monocular depth estimation, such as Segment Anything (SAM), a new and powerful segmentation model. Retraining CNN-based networks with additional input from SAM could improve their accuracy by providing them with more spatial information, without significantly increasing the model size or the number of trainable parameters.
In this blog, we investigate whether the input from SAM improves the depth estimation of a CNN-based model compared to the same model without SAM. We build on the work of Alhashim & Wonka (2018), who proposed a U-Net model with a DenseNet encoder. This model was selected because it has been shown to produce acceptable results with limited training time and computational resources. To enable easy collaboration, Google Colab was used.
## Segment Anything Model (SAM)
In this project, the Segment Anything Model (SAM) is used to generate masks that segment the objects in the RGB images. SAM is a pre-trained model that can automatically generate masks for the different objects in an image. It is based on the Vision Transformer (ViT) architecture, specifically the "vit_h" variant. ViT models have shown strong performance in various computer vision tasks, including image classification and object detection, and SAM leverages this to perform segmentation at the pixel level. The image at the bottom of this section gives an impression of the model's output. Note that although the masks are shown in color, the output of SAM can also be transformed to grayscale, i.e. a single channel.
The authors of SAM specifically intend for it to be used in transfer learning, and it can be run fairly easily in online environments like Google Colab. Because computational resources were quite limited, it was decided to pre-process all images from the dataset with SAM beforehand, rather than running it online during training. This is explained further in the next section.

*Figure 1: Segment anything model output*
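To illustrate how such masks can be produced offline, a minimal sketch using the publicly released `segment_anything` package is shown below. The checkpoint filename and the grey-shade flattening scheme are assumptions for illustration, not necessarily our exact pipeline.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load the pre-trained "vit_h" variant of SAM (checkpoint path is an assumption).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def sam_to_grayscale(image_rgb: np.ndarray) -> np.ndarray:
    """Run SAM on an RGB image and flatten all masks into a single-channel
    image, giving each mask its own shade of grey."""
    masks = mask_generator.generate(image_rgb)        # list of dicts: 'segmentation', 'area', ...
    flat = np.zeros(image_rgb.shape[:2], dtype=np.uint8)
    # Draw large masks first so smaller masks remain visible on top.
    for i, m in enumerate(sorted(masks, key=lambda m: m["area"], reverse=True)):
        shade = int(255 * (i + 1) / (len(masks) + 1))  # a distinct grey value per mask
        flat[m["segmentation"]] = shade
    return flat

# Example usage on one image (H x W x 3, uint8, RGB).
image = cv2.cvtColor(cv2.imread("example.png"), cv2.COLOR_BGR2RGB)
sam_gray = sam_to_grayscale(image)
```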
## Dataset
The dataset utilized for this project is the NYU Depth V2 dataset, which contains RGB images and their corresponding depth maps. These samples capture various indoor scenes with different objects and structures. This dataset is widely used for depth estimation tasks in computer vision research, and was also used by Alhashim & Wonka (2018).
For this project, a specific subset of the NYU Depth V2 dataset was used, which includes fully annotated label maps. The decision to work with this subset was driven by two reasons. Firstly, the full NYU Depth V2 dataset is considerably large, requiring approximately 423 GB of disk space for the raw data, whereas the labelled subset is roughly 2.8 GB while containing more information per image. Due to the limited RAM capacity of the Google Colab environment, it was not feasible to load and process the entire dataset, so the labelled subset was chosen to accommodate the memory constraints.
Secondly, the subset was selected because the available labels were expected to be valuable for evaluating the performance of the model. By utilizing the annotated label maps, it becomes possible to assess the model's accuracy and effectiveness in depth estimation tasks.
### Preprocessing
This subset contains 1449 images. First, a set of 199 test images was randomly selected. The remaining 1250 images were then randomly split 80/20 into a training and a validation set.
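A minimal sketch of how such a split could be performed (the random seed is an arbitrary choice for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(seed=42)     # seed is an assumption, for reproducibility
indices = rng.permutation(1449)          # shuffle all image indices

test_idx = indices[:199]                 # 199 held-out test images
remaining = indices[199:]                # 1250 images left
split = int(0.8 * len(remaining))        # 80/20 split -> 1000 train, 250 validation
train_idx, val_idx = remaining[:split], remaining[split:]
```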
Due to the limited computational resources, the dataset was first expanded offline with the output of SAM. The output of SAM is a dictionary containing various pieces of information, such as the masks and their confidence scores. The outputs were processed by flattening all masks, each in a different shade of grey, into a single-channel image. This image was added to the dataset, expanding it with additional information that could be useful later on. An example of the information contained in this dataset is shown in the figure below. Note that the label map is the human-annotated ground truth, whereas the SAM image is generated by a model.

*Figure 2: Example of the images in the dataset*
With this expanded dataset, each RGB and SAM image was resized to 640 x 480, which is a requirement for the U-Net model. As in the paper, each depth map was resized to 320 x 240. Subsequently, the RGB, SAM and depth images were normalized to the range [0, 1]. Finally, the SAM image was stacked onto the RGB image to obtain a 4-channel input. This last step was, of course, not done for the baseline.
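The sketch below shows roughly how this preprocessing could look, assuming the RGB, SAM and depth images are available as NumPy arrays; the depth normalization constant and interpolation settings are assumptions for illustration.

```python
import numpy as np
import cv2

def preprocess(rgb: np.ndarray, sam_gray: np.ndarray, depth: np.ndarray, use_sam: bool = True):
    """Resize, normalize and (optionally) stack the SAM channel onto the RGB image."""
    rgb = cv2.resize(rgb, (640, 480)).astype(np.float32) / 255.0              # (480, 640, 3) in [0, 1]
    sam_gray = cv2.resize(sam_gray, (640, 480)).astype(np.float32) / 255.0    # (480, 640) in [0, 1]
    depth = cv2.resize(depth, (320, 240)).astype(np.float32)
    depth = depth / 10.0          # NYU depths are in metres, ~10 m max (normalization is an assumption)

    if use_sam:
        x = np.concatenate([rgb, sam_gray[..., None]], axis=-1)               # 4-channel RGB + SAM input
    else:
        x = rgb                                                               # baseline: RGB only
    return x, depth
```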
## U-Net model and training
A U-Net model architecture based on Alhashim & Wonka (2018) was used for the depth estimation task in this project. The U-Net is a popular and effective convolutional neural network (CNN) architecture for image segmentation, first proposed by Ronneberger et al. (2015). The model employed in this project follows the traditional U-Net architecture with slight modifications to suit the depth estimation task. In essence, it consists of an encoder pathway that captures the contextual information from the input RGB image and a decoder pathway that recovers the spatial information to generate the predicted depth map.
### Encoder
The encoder pathway utilizes a pre-trained DenseNet-169 as the backbone. DenseNet-169 is a deep CNN that has been trained on the ImageNet dataset for image classification. By using the pre-trained DenseNet-169, the model can leverage its learned features and benefit from transfer learning. In the encoder pathway, the input RGB image is passed through the DenseNet-169 backbone, resulting in 1664 feature maps of size 15 x 20.
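A short sketch of how such an encoder can be obtained from torchvision (the weight specification follows recent torchvision versions and may differ from our exact setup):

```python
import torch
import torchvision

# DenseNet-169 pre-trained on ImageNet; its feature extractor ends with 1664 channels.
backbone = torchvision.models.densenet169(weights="IMAGENET1K_V1")
encoder = backbone.features            # drop the ImageNet classification head

x = torch.randn(1, 3, 480, 640)        # one RGB image (batch, channels, height, width)
features = encoder(x)
print(features.shape)                  # torch.Size([1, 1664, 15, 20])
```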
### Decoder
The extracted features from the encoder are then passed to the decoder pathway. The decoder pathway consists of a series of upsampling blocks that progressively recover the spatial information and refine the predictions. Each upsampling block performs bilinear upsampling to increase the resolution and concatenates the upsampled features with the corresponding features from the encoder pathway.
The decoder pathway gradually reduces the number of filters as the spatial resolution increases. This reduces the model's complexity while retaining the necessary features for accurate depth estimation. The number of filters used in the decoder pathway decreases from 1664 to 832, 416, 208, and finally to 104. The last layer of the decoder pathway is a convolutional layer with a sigmoid activation function. This layer generates the final predicted depth map. The sigmoid activation ensures that the predicted values are within the range of [0, 1], representing the estimated depth values.
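A minimal sketch of one such upsampling block and the final prediction layer is given below; the kernel sizes and the LeakyReLU activation are assumptions based on Alhashim & Wonka's decoder and may differ slightly from our implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """Decoder block: bilinear upsampling, concatenation with the encoder
    skip connection, then refinement with two 3x3 convolutions."""
    def __init__(self, in_channels: int, skip_channels: int, out_channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels + skip_channels, out_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=True)
        x = torch.cat([x, skip], dim=1)     # merge upsampled features with encoder features
        x = self.act(self.conv1(x))
        return self.act(self.conv2(x))

# Final prediction layer: a convolution followed by a sigmoid,
# so that the predicted depth values lie in [0, 1].
head = nn.Sequential(nn.Conv2d(104, 1, kernel_size=3, padding=1), nn.Sigmoid())
```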
### Loss function
In their paper, Alhashim & Wonka (2018) propose a loss function that consists of three parts:
* **L_depth**: Point-wise L1 loss on the depth values
* **L_grad**: L1 loss defined over the image gradient of the depth image
* **L_SSIM**: Structural Similarity loss
They reason that the model should not only learn to predict correct depth values, but also correct object boundaries, as depth maps often have distinct edges at the boundaries of objects rather than smooth gradients. This loss function was also used in this project.
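For completeness, the total loss as formulated by Alhashim & Wonka (2018) is the sum of these three terms, with a weight $\lambda$ (set to 0.1 in their paper) on the depth term:

$$L(y, \hat{y}) = \lambda \, L_{depth}(y, \hat{y}) + L_{grad}(y, \hat{y}) + L_{SSIM}(y, \hat{y})$$

$$L_{depth}(y, \hat{y}) = \frac{1}{n} \sum_{p}^{n} \left| y_p - \hat{y}_p \right|$$

$$L_{grad}(y, \hat{y}) = \frac{1}{n} \sum_{p}^{n} \left| g_x(y_p, \hat{y}_p) \right| + \left| g_y(y_p, \hat{y}_p) \right|$$

$$L_{SSIM}(y, \hat{y}) = \frac{1 - \mathrm{SSIM}(y, \hat{y})}{2}$$

where $y$ is the ground truth depth map, $\hat{y}$ the predicted depth map, $n$ the number of pixels, and $g_x$, $g_y$ the differences of the depth error in the $x$ and $y$ directions.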
To evaluate the model's performance, two metrics are used: accuracy and loss. The accuracy metric quantifies how closely the predicted depth map matches the ground truth depth map, and thus assesses how well the model captures the true depth information. The loss, in turn, is the quantity the model minimizes during training, reducing the discrepancy between the predicted and ground truth depth maps.
### Training
For the training of all models, the AdamW optimizer is used, as in the paper, with a learning rate of 0.0001 and a weight decay of 1e-6. Due to limited GPU memory, the batch size first had to be reduced from 8 to 1. Reducing the batch size may negatively impact training, as the gradient estimates become noisier, and it can also hurt generalization. However, after some optimizations in the dataloader, the batch size could be increased back to 8. We briefly compare the results between batch sizes as well.
During training, the input images were also randomly flipped horizontally to provide data augmentation. In the paper, the authors explain that other common augmentation techniques, like vertical flips and rotations, may not contribute to the learning of useful properties of depth. Hence, only horizontal flips were used.
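A minimal sketch of this training configuration is shown below; the single convolution is only a placeholder for the full U-Net, and the flip is applied jointly to the input and the depth map so that they stay aligned.

```python
import random
import torch
import torch.nn as nn

model = nn.Conv2d(4, 1, kernel_size=3, padding=1)   # placeholder for the full U-Net

# Optimizer settings as described above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-6)

def augment(image: torch.Tensor, depth: torch.Tensor):
    """Only horizontal flips are used, applied consistently to both the
    (RGB or RGB+SAM) input and the depth map."""
    if random.random() < 0.5:
        image = torch.flip(image, dims=[-1])   # flip along the width axis
        depth = torch.flip(depth, dims=[-1])
    return image, depth
```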
For the baseline training, only the RGB image was provided to the model. The weights of the DenseNet encoder were initialized using the pretrained weights on ImageNet, whereas the weights of the decoder were initialized randomly. Still, all layers were trainable. Hence, the encoder was finetuned whilst the decoder was trained from scratch.
To evaluate the input of SAM, some modifications to the network were required. The DenseNet backbone requires an input with 3 channels, whereas adding SAM creates a 4-channel input. Two approaches were considered, illustrated in the figure below and sketched in code after it:
1. Adding a single convolutional layer before the encoder, downsampling the 4 channels to 3 channels.
2. Adding SAM after the encoder, by passing it through a small convolutional network that downsamples the (640, 480, 1) SAM image to (20, 15, C), where C is the number of channels (64 or 256). This is then concatenated with the output of the encoder, and the resulting feature map is fed into the decoder.

*Figure 3: Different model configurations*
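The sketch below outlines the two approaches in code; the layer choices and channel progression of the small convolutional network are approximations for illustration, not our exact implementation.

```python
import torch
import torch.nn as nn

# Approach 1: a single convolution maps the 4-channel (RGB + SAM) input back
# to 3 channels, so the pre-trained DenseNet encoder can be used unchanged.
rgb_sam_to_rgb = nn.Conv2d(4, 3, kernel_size=3, padding=1)
x3 = rgb_sam_to_rgb(torch.randn(1, 4, 480, 640))          # -> (1, 3, 480, 640)

# Approach 2: a small conv net downsamples the (1, 480, 640) SAM image to
# (C, 15, 20), which is then concatenated with the encoder output (1664 channels).
class SamBranch(nn.Module):
    def __init__(self, out_channels: int = 256):           # C = 256 or 64
        super().__init__()
        layers, c_in = [], 1
        for c_out in (16, 32, 64, 128, out_channels):       # five stride-2 stages: /32 in total
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1), nn.ReLU()]
            c_in = c_out
        self.net = nn.Sequential(*layers)

    def forward(self, sam):                                 # sam: (B, 1, 480, 640)
        return self.net(sam)                                # -> (B, out_channels, 15, 20)

sam_branch = SamBranch(256)
encoder_out = torch.randn(1, 1664, 15, 20)                  # dummy DenseNet-169 output
sam_feat = sam_branch(torch.randn(1, 1, 480, 640))
decoder_in = torch.cat([encoder_out, sam_feat], dim=1)      # (B, 1664 + 256, 15, 20)
```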
## Results
Here we present the results of our experiments using SAM with the U-Net model for depth estimation. We evaluated the different model variations described in the previous section and analyzed their performance in terms of accuracy and loss. The following table provides an overview of each configuration and its results. Each model is briefly discussed afterwards.
*Table 1: Model Performance Metrics (batch size 8)*
| Model | Configuration | Accuracy | Loss |
| ----- | ---------------------------------| -------- | ------ |
| 1 | Baseline | 0.8098 | 0.1253 |
| 2 | SAM as Extra Channel | 0.8018 | 0.1296 |
| 3 | SAM after Encoder (256 Channels) | 0.8139 | 0.1247 |
| 4 | SAM after Encoder (64 Channels) | 0.8110 | 0.1253 |
* **Model 1 | Baseline**
The baseline model, trained with a batch size of 8, achieved an accuracy of 0.8098 and a loss of 0.1253. This configuration serves as the reference point for comparing the performance of the other variations.
* **Model 2 | SAM as Extra Channel**
In this variation, the SAM output was added to the RGB image as an extra input channel. Despite the additional information, the accuracy slightly decreased to 0.8018 and the loss increased to 0.1296 compared to the baseline. This suggests that the extra channel did not improve the depth estimation performance.
* **Model 3 | SAM after Encoder (256 Channels)**
Using the SAM output after the encoder with 256 channels yielded a slightly improved accuracy of 0.8139 and a loss of 0.1247, indicating a potential to enhance depth estimation performance compared to the baseline. As this configuration showed the best result of all configurations, we executed another training round, which gave similar results (accuracy 0.8137, loss 0.1258). The increased channel capacity of the SAM input after the encoder potentially enables the model to capture more detailed features and produce more accurate depth predictions.
* **Model 4 | SAM after Encoder (64 Channels)**
Alternatively, we explored the configuration where the output of SAM is added after the encoder with 64 channels. This model achieved an accuracy of 0.8110 and a loss of 0.1253. While the accuracy is slightly lower than that of the 256-channel configuration, it still demonstrates a slight improvement over the baseline. This suggests that even with a reduced channel capacity, adding SAM after the encoder can capture relevant information for depth estimation with the current training configuration.
### Influence of Batch size
In the initial configurations, the batch size was limited to 1 due to the large dataset and model sizes combined with limited training resources. Through improvements in code efficiency, we were able to increase the batch size to 8. By comparing the model's performance for the different batch sizes during a training session, we observed that the model struggled to generalize with a low batch size and was more susceptible to overfitting.
Visually, the image below highlights the notable differences between the two batch sizes. It is evident that the increased batch size not only demonstrates better generalization but also converges faster during training.

*Figure 4: Difference in batch sizes*
### Influence of Encoder complexity
To test the influence of the encoder complexity on the benefit of adding SAM, the best performing model was also tested using a smaller encoder, namely DenseNet-121. It can be hypothesized that models with smaller encoders will benefit more from additional spatial information from SAM, as smaller encoders may be less able to extract useful spatial information on their own. The results of this evaluation are shown in the table below.
*Table 2: Comparing different encoders*
| Encoder | Configuration | Accuracy | Loss |
| -----------------| ---------------------------------| -------- | ------ |
| Densenet-169 | Baseline | 0.8098 | 0.1253 |
| Densenet-169 | SAM after Encoder (256 Channels) | 0.8139 | 0.1247 |
| | *Difference* | *+0.0041* | *-0.0006*|
| Densenet-121 | Baseline | 0.7965 | 0.1270 |
| Densenet-121 | SAM after Encoder (256 Channels) | 0.8041 | 0.1307 |
| | *Difference* | *+0.0076* | *+0.0037*|
As expected, the overall performance drops slightly with a less complex encoder. However, the accuracy gain from adding SAM increases with the simpler encoder (even though its loss also increases slightly), suggesting that the extra spatial information provided by SAM has more of an effect.
Overall, our experimental results show a small but measurable benefit of incorporating SAM into the U-Net depth estimation model. The configuration with SAM after the encoder using 256 channels showed the highest accuracy among the variations, closely followed by the same configuration with 64 channels. These findings indicate that increasing the channel capacity of the after-encoder configuration slightly enhances the model's ability to capture intricate depth features. Additionally, the baseline results and the SAM-as-extra-channel model provide insight into the impact of different architectural choices on depth estimation performance.
## Qualitative analysis
By visually inspecting the model's output, comparing it with the ground truth, and examining the different model configurations, we assess its capability to accurately capture depth information and fine-grained details. We are especially interested in whether the SAM segmentation, by grouping pixels into objects, helps the model distinguish objects and therefore the depth to those objects. We show two examples with the baseline output, the output of the configuration with SAM as an extra channel, and the output with SAM added after the encoder, as explained previously.

*Figure 5: Baseline and RGB-image, image example 1 (the closet)*

*Figure 6: SAM as extra channel, image example 1 (the closet)*

*Figure 7: SAM after encoder, image example 1 (the closet)*
The baseline model exhibits a limited representation of depth details, particularly for the closet visible in the RGB image. Both configurations incorporating the SAM output, however, demonstrate significantly enhanced detail. Notably, the edges of the closet and the bed appear much sharper, indicating that SAM effectively outlines these objects.

*Figure 8: Baseline and RGB-image, image example 2 (the chair)*

*Figure 9: SAM as extra channel, image example 2 (the chair)*

*Figure 10: SAM after encoder, image example 2 (the chair)*
In the second example, our attention is drawn to the chair and the bookshelf on the left-hand side of the image. Once again, the baseline model has difficulty distinguishing the sections of the bookshelf and capturing the lower portion of the chair. Interestingly, the SAM output itself also struggles to represent the lower side of the chair. Despite this, both SAM configurations perform considerably well, as evidenced by the increased level of detail in the bookshelf and chair regions.
## Discussion
Though the results show some improvement with regards to depth estimation, the difference with the baseline is small. There are various potential causes for this.
Firstly, there is the limited size of the dataset. To feed the SAM output into the model, extra layers were added which also require training, and unlike the DenseNet encoder, these layers are trained from scratch. The limited size of the dataset could mean that these layers are unable to extract the most useful features from the extra SAM input, limiting the model's ability to use the additional information.
This is compounded by the limited compute power available, which constrained the size of the extra layers. For example, increasing the number of SAM channels to 512 was not possible due to limited GPU memory. Additionally, training on larger datasets was not feasible as training times would become very long.
Another possibility is that the choice of encoder, DenseNet-169, has limited the added benefit of SAM. DenseNet-169 is a very deep ConvNet, going from (640 x 480 x 3) to (20 x 15 x 1664). Potentially, it already extracts spatial features very well on its own, making the spatial information added by SAM redundant. A simpler encoder, DenseNet-121, was also evaluated, and the added benefit of SAM indeed increased. However, it should be noted that DenseNet-121 is still a large model, which may also have limited the benefit of SAM.
Hence, future research could focus on assessing the benefit of SAM with much smaller encoders. Additionally, given the promise of the second configuration (SAM after the encoder) for increasing accuracy, it would be interesting to test whether further increasing the size of the ConvNet applied to the SAM output improves results further. This would probably require a larger dataset to ensure that this additional network is trained properly.
## Conclusion
In conclusion, incorporating the Segment Anything Model (SAM) into the U-Net model for depth estimation showed small improvements in some of the tested variations. The configuration where SAM was applied after the encoder with 256 channels showed the best overall results, with the highest accuracy and the lowest loss, and thus, despite the minor improvement, emerged as the most accurate model, outperforming the baseline.
Constraints, such as limited dataset size and computational resources, presented significant challenges during our study. A clear example of these challenges was the need to reduce the batch size initially to circumvent GPU memory constraints. However, subsequent optimization allowed for an increase in batch size to 8.
Our results highlight the potential benefit of combining SAM with a U-Net for depth estimation. The versions with SAM after the encoder showed improvements over the baseline, albeit slight ones, which indicates the potential of the combined approach. In particular, adding SAM enhanced the model's capacity to predict depth by supplying useful spatial information.
However, the modest improvements show that the full potential of integrating SAM and U-Net is still unrealized. Future work should investigate the effect of larger channel sizes after the encoder, experiment with other encoder configurations, and further optimize the implementation of SAM within the U-Net model. Combined with more powerful computing capabilities and larger training datasets, this may further improve the model's performance and generalizability.
Looking at the results visually, the qualitative analysis of the model's output across the different configurations also supports incorporating SAM for depth estimation. The baseline model has trouble capturing fine-grained details, particularly for objects closer to the camera. Both SAM configurations, with SAM added after the encoder and especially SAM as an extra channel, show improved visual quality: the SAM-enhanced models outline objects effectively, resulting in sharper edges and more detail compared to the baseline, with the extra-channel variant even showing additional detail in the depth image of the second example. These findings support the hypothesis that SAM aids in better grouping and distinguishing objects, thereby enhancing depth estimation.
Reflecting on these results, it is clear that incorporating SAM into the U-Net model presents opportunities for improving depth estimation. Despite the slight benefits found in this study, the improvements across several variants point towards a promising area for further investigation. The increased benefit of SAM for a smaller encoder suggests a promising direction for combining SAM with small, lightweight models, which could enable accurate depth estimation on simple hardware. With further refinement and more resources, we are optimistic that the integration of SAM and U-Net could yield significant advancements in depth estimation tasks.
## References
Alhashim, I., & Wonka, P. (2018). High Quality Monocular Depth Estimation via Transfer Learning. arXiv:1812.11941.
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. In Navab, N., Hornegger, J., Wells, W., & Frangi, A. (eds), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Lecture Notes in Computer Science, vol. 9351. Springer, Cham. https://doi.org/10.1007/978-3-319-24574-4_28