# Drivable Area and Obstacle Segmentation in Autonomous Driving Scenario
Our Res-U-Net model: https://www.kaggle.com/code/psq0512/res-u-net
Ziang Liu 5978866
Siqi Pei 5964377
Xiaotong Li 5965373
## 1. Introduction and Motivation
Autonomous driving is no longer just a concept: it has become part of everyday life, and people can already hail a Waymo through a ride-hailing app. Arguably the most crucial part of the intricate autonomous driving system is camera-based recognition, which functions as its eyes.
For target recognition in complex environments, the most important performance indicators are speed and accuracy. Although fast detectors such as YOLO have been applied successfully to autonomous driving, bounding boxes have a clear limitation: they cannot capture the contour of the target. An accurate contour is very useful for 2D-3D correspondence, which makes it possible to estimate the target's 3D position and even its geometric properties. Therefore, this project focuses on applying segmentation methods to recognition for autonomous driving.
Research and technology on this topic mainly address two aspects: drivable area detection and obstacle detection. In this blog, we delve into both:
1. **Drivable area detection:** In this section, we train and compare the existing YOLOP and U-Net models. By combining their advantages, we present **our own model, Res-U-Net**. Finally, we evaluate all three models on a new dataset: View of Delft.
2. **Obstacle detection:** In this section, we train and compare two instance segmentation algorithms: YOLACT and Mask R-CNN. We analyze the models to identify the key structures that determine prediction speed and accuracy.
In addition, in high-speed conditions such as autonomous driving, the frame rate (FPS) of the algorithm is very important. In this project we explore the trade-off between speed and accuracy for several popular computer vision methods and evaluate them in the context of autonomous driving. Finally, we also explore how well these methods adapt to environmental changes.
## 3. Segmentation for drivable area detection
In this section, our focus is on employing semantic segmentation for drivable area detection. The segmentation's effectiveness is illustrated in the image below, where the green area denotes the road ahead of the vehicle. Furthermore, we've evaluated three neural network architectures, U-net, YOLOP and **our own Res-U-Net**, comparing their structures and performance in this chapter.
<div style="display: flex; justify-content: center;">
<div style="flex: 1; margin-right: 10px;">
<img src="https://hackmd.io/_uploads/ryYuzT2VR.jpg" alt="Original Image" style="width: 100%;">
<p style="text-align: center;">Original Image</p>
</div>
<div style="flex: 1; margin-left: 10px;">
<img src="https://hackmd.io/_uploads/r1NwfanNA.jpg" alt="Image after segmentation" style="width: 100%;">
<p style="text-align: center;">Image after segmentation</p>
</div>
</div>
### 3.1 Implementation
In this chapter, we've employed three types of neural networks, namely U-net, YOLOP and **our own Res-U-Net**, for segmentation purposes. All networks are designed for semantic segmentation. U-net, characterized by its simple fully convolutional structure, is particularly suited for high-frequency detection, aligning well with the demands of autonomous driving.
Compared to U-net, YOLOP features a relatively complex structure, with two main distinctions:
1. YOLOP uses CSPDarknet as its backbone for feature extraction (red boxes in the diagram below), which adopts the structure of ResNet. This helps the model generalize better to some extent. Additionally, by using convolutional layers instead of pooling layers for downsampling, it reduces information loss and strengthens the network's non-linear capacity.
2. YOLOP adopts the FPN architecture. Unlike U-net, it merges feature maps of different scales for feature propagation (blue boxes in the diagram below).
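The multi-scale merging in point 2 can be sketched in a few lines. This is only a minimal illustration of an FPN-style top-down pathway, assuming NumPy, nearest-neighbour upsampling and equal channel counts at every level; the function names and toy shapes are hypothetical and are not taken from the YOLOP code.

```python
import numpy as np

def upsample_nearest_2x(feature):
    """Double the spatial size of a (C, H, W) map by repeating each pixel."""
    return feature.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(features):
    """Merge a fine-to-coarse list of (C, H, W) maps, FPN style.

    Each level is assumed to have half the spatial size of the previous one
    and the same channel count (real FPNs equalize channels with 1x1
    convolutions, which this sketch omits).
    """
    merged = [features[-1]]                      # start from the coarsest map
    for lateral in reversed(features[:-1]):
        top_down = upsample_nearest_2x(merged[0])
        merged.insert(0, lateral + top_down)     # element-wise fusion
    return merged

# Toy pyramid: 4 channels at 8x8, 4x4 and 2x2 resolution.
pyramid = [np.ones((4, 8, 8)), np.ones((4, 4, 4)), np.ones((4, 2, 2))]
outputs = fpn_top_down(pyramid)
print([o.shape for o in outputs])
```

Each output level keeps its own resolution but now also carries information from every coarser level, which is the property the blue boxes in the YOLOP diagram refer to.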
**Network architecture**
<div style="text-align:center">
<img src="https://hackmd.io/_uploads/H11FGhaEC.jpg" alt="unet_structure" style="width:50%;">
<p style="text-align:center;">Structure of the U-net</p>
</div>
<div style="text-align:center">
<img src="https://hackmd.io/_uploads/S1ZLo264C.jpg" alt="YOLOP" style="margin-left:30px;">
<p style="text-align:center;">Structure of the YOLOP</p>
</div>
Weighing the advantages and disadvantages of the two models above, we modified U-net by adding the ResNet structure to improve its generalization ability.
**Our Res-U-Net**

<p style="text-align:center;">U-net(left) and our model(right)</p>
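As a rough illustration of the ResNet idea (not the exact layers of Res-U-Net, which are in the linked notebook), a residual block adds its input back onto the output of a small stack of convolutions. The sketch below assumes NumPy, two 3x3 convolutions without bias or normalization, and equal input and output channel counts; all names are hypothetical.

```python
import numpy as np

def conv3x3(x, weight):
    """Naive 'same' 3x3 convolution. x: (C_in, H, W), weight: (C_out, C_in, 3, 3)."""
    c_out = weight.shape[0]
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))          # zero padding
    out = np.zeros((c_out, h, w))
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + w]       # (C_in, H, W)
            # Contract the input-channel axis for this kernel offset.
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], patch)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Two 3x3 convolutions plus an identity shortcut: relu(F(x) + x)."""
    y = relu(conv3x3(x, w1))
    y = conv3x3(y, w2)
    return relu(y + x)

rng = np.random.default_rng(0)
x = rng.random((8, 16, 16))                     # non-negative toy feature map
w1 = rng.normal(scale=0.1, size=(8, 8, 3, 3))
w2 = np.zeros((8, 8, 3, 3))                     # residual branch switched off
out = residual_block(x, w1, w2)
print(np.allclose(out, x))
```

With the residual branch set to zero the block reduces to the identity, which shows why the shortcut lets early features pass through the network unchanged.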
**Training process**
We trained these three models on the BDD100K dataset [^1], which collects data on roads in different regions, times, seasons, and types in the United States. We analyzed the performance of the model using the View of Delft dataset [^2]. Finally, we also conducted an analysis of the model's robustness by manually modifying image blur and adding occlusions.
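The manual perturbations mentioned above could look roughly like the following. This is a sketch under assumptions: a simple box (mean) filter stands in for whatever blur was actually used, the occlusion is a filled rectangle, and the function names, radius and rectangle position are all hypothetical.

```python
import numpy as np

def box_blur(image, radius=2):
    """Blur an (H, W, C) image with a (2r+1) x (2r+1) mean filter."""
    k = 2 * radius + 1
    padded = np.pad(image.astype(np.float64),
                    ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    h, w, c = image.shape
    acc = np.zeros((h, w, c))
    for dy in range(k):
        for dx in range(k):
            acc += padded[dy:dy + h, dx:dx + w]   # sum of shifted copies
    return (acc / (k * k)).round().astype(image.dtype)

def occlude(image, top, left, height, width, value=0):
    """Return a copy with a filled rectangle covering part of the image."""
    out = image.copy()
    out[top:top + height, left:left + width] = value
    return out

# Toy stand-in for a camera frame.
image = np.full((32, 48, 3), 120, dtype=np.uint8)
blurred = box_blur(image, radius=3)
blocked = occlude(image, top=8, left=12, height=10, width=20)
print(blurred.shape, blocked.shape)
```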
### 3.2 Results and comparison
All models are tested on the View of Delft dataset. The outputs of the three models are shown below:
<div style="display: flex; justify-content: center;">
<div style="flex: 1; margin-right: 10px;">
<img src="https://hackmd.io/_uploads/rynNix7rR.jpg" alt="blurred_image" style="width: 100%;">
<center>YOLOP</center>
</div>
<div style="flex: 1; margin-left: 10px;">
<img src="https://hackmd.io/_uploads/BJCiMOFBR.png" alt="masked_image" style="width: 100%;">
<center>U-net</center>
</div>
<div style="flex: 1; margin-left: 10px;">
<img src="https://hackmd.io/_uploads/Skoif_tH0.png" alt="masked_image" style="width: 100%;">
<center>Res-Unet(ours)</center>
</div>
</div>
We calculated the evaluation metrics of the three models on the BDD100K test set, and the results are shown in the table below.
<center>
| Model | ACC | IOU | Frequency/Hz (RTX 4060) |
| :------:| :-----:| :-----:| :-------: |
| YOLOP | 0.974 | 0.861 | 6 |
| U-net | 0.825 | 0.708 | 125 |
| Res-U-Net (our model) | 0.837 | 0.724 | 109 |
</center>
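For readers unfamiliar with the two metrics, here is a minimal sketch of how they can be computed for binary drivable-area masks. It assumes ACC means pixel accuracy and IOU means intersection over union of the drivable class; the function names and the toy masks are hypothetical.

```python
import numpy as np

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return float((pred == target).mean())

def iou(pred, target):
    """Intersection over union of the positive (drivable) class."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0                      # both masks empty: treat as perfect
    return float(np.logical_and(pred, target).sum() / union)

# Toy 4x4 example: ground truth is the bottom two rows,
# the prediction covers the bottom three rows.
target = np.zeros((4, 4), dtype=np.uint8)
target[2:, :] = 1
pred = np.zeros((4, 4), dtype=np.uint8)
pred[1:, :] = 1

print(pixel_accuracy(pred, target))     # 12 of 16 pixels agree
print(iou(pred, target))                # 8 shared pixels over 12 in the union
```

The example also shows why IoU is usually lower than accuracy: correctly labelled background pixels raise accuracy but do not enter the IoU at all.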
In terms of accuracy, YOLOP is the top-performing model; however, its low frequency limits its applicability to real-world autonomous driving. U-net runs at a significantly higher frequency, but its generalization performance falls short.
To enhance its generalization capability, we **integrate the ResNet structure into the model**, which slightly improves generalization at a small cost in speed. The rationale behind incorporating the ResNet structure is explained further in the ablation study section.
## 4. Segmentation for obstacle detection
In this section we compare two popular segmentation methods: Mask R-CNN[^4] and YOLACT[^3].
### 4.1 Mask R-CNN
**Network architecture**
<p style="text-align: center;">Mask R-CNN architecture ([image credit](https://zhuanlan.zhihu.com/p/61202658))</p>
Mask R-CNN is based on Faster R-CNN. It adopts the same two-stage procedure, with an identical first stage, the Region Proposal Network (RPN). The input image first passes through the backbone CNN to produce a feature map, which is then sent to the RPN to generate region proposals. RoIAlign is then applied to address the misalignment caused by quantization in the original RoI Pooling. It ensures that the RoI regions are well aligned with the input feature maps, preserving spatial information more accurately.
In the second stage, in parallel with predicting the class and box offset using fully connected layers, Mask R-CNN also has a mask head that uses an FCN to output a binary mask for each RoI. This contrasts with most earlier systems, where classification depends on mask predictions.
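To make the RoIAlign step more concrete, the sketch below samples a feature map at fractional positions with bilinear interpolation instead of rounding the box to integer cells. It is a simplified illustration assuming NumPy, a single-channel feature map and one sample at the centre of each bin (Mask R-CNN aggregates several samples per bin); the names and the toy box are hypothetical.

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Sample a 2D feature map at a fractional (y, x) location."""
    h, w = feature.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature[y0, x0] + wx * feature[y0, x1]
    bottom = (1 - wx) * feature[y1, x0] + wx * feature[y1, x1]
    return (1 - wy) * top + wy * bottom

def roi_align(feature, box, out_size=2):
    """Pool a (y1, x1, y2, x2) box into out_size x out_size bins.

    The box coordinates stay fractional; nothing is rounded to the grid.
    """
    y1, x1, y2, x2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = bilinear_sample(feature,
                                        y1 + (i + 0.5) * bin_h,
                                        x1 + (j + 0.5) * bin_w)
    return out

# Toy feature map that is linear in position: f[y, x] = 6 * y + x.
feature = np.arange(36, dtype=np.float64).reshape(6, 6)
aligned = roi_align(feature, box=(0.5, 0.5, 4.5, 4.5))
print(aligned)
```

Because the toy feature map is linear, each output equals the exact value at the bin centre, whereas rounding the box to whole cells first would shift every sample.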
### 4.2 YOLACT
**Network architecture**
<p style="text-align: center;">YOLACT architecture([image credit](https://arxiv.org/pdf/1904.02689))</p>
YOLACT stands for "You Only Look At CoefficienTs". Its idea is to add a mask branch to an existing one-stage object detection model, in the same vein as Mask R-CNN does for Faster R-CNN, but without an explicit feature localization step.
The image is first processed by the feature backbone and feature pyramid to create multi-scale feature maps. The network then splits into two branches. The first branch, Protonet, produces a set of image-sized “prototype masks” (so no localization is needed). The second adds a prediction head to the object detection branch that predicts a vector of “mask coefficients” for each anchor, encoding an instance’s representation in the prototype space. Finally, for each instance that survives NMS, a mask is constructed by linearly combining the outputs of these two branches.
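The final assembly step can be written as a single matrix product followed by a sigmoid, as in the YOLACT paper. The sketch below assumes NumPy; the prototype resolution, the number of prototypes and the random inputs are hypothetical placeholders for real network outputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def assemble_masks(prototypes, coefficients):
    """Combine prototypes (H, W, k) with coefficients (n, k) into masks (H, W, n)."""
    return sigmoid(prototypes @ coefficients.T)

def crop_to_box(mask, box):
    """Zero out everything outside the (y1, x1, y2, x2) box of one instance."""
    y1, x1, y2, x2 = box
    cropped = np.zeros_like(mask)
    cropped[y1:y2, x1:x2] = mask[y1:y2, x1:x2]
    return cropped

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(34, 34, 8))            # k = 8 prototype masks
coefficients = np.tanh(rng.normal(size=(3, 8)))      # 3 instances kept after NMS
masks = assemble_masks(prototypes, coefficients)

# Binary mask of the first instance, restricted to its (toy) bounding box.
instance = crop_to_box(masks[..., 0], (5, 5, 20, 20)) > 0.5
print(masks.shape, instance.sum())
```

All instances share the same prototypes, so the per-instance work is only one small vector of coefficients and one slice of the matrix product.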
### 4.3 Comparison between Mask R-CNN and YOLACT
We tested the two models, both with ResNet50-FPN as the backbone, on the View of Delft dataset. In terms of accuracy, Mask R-CNN shows a clear advantage over YOLACT, while the latter wins in terms of FPS.
| Model | mAP for mask | Frequency/Hz (Mac M1 CPU) |
| :------:| :-----:| :-------: |
| Mask R-CNN | 34.7 | 0.6 |
| YOLACT | 28.5 | 1.8 |
<div style="display: flex; justify-content: center;">
<div style="flex: 1; margin-right: 10px;">
<img src="https://hackmd.io/_uploads/H1DpH5irR.jpg" alt="blurred_image" style="width: 100%;">
<center>Mask R-CNN</center>
</div>
<div style="flex: 1; margin-left: 10px;">
<img src="https://hackmd.io/_uploads/HymJ8qsHA.jpg" alt="masked_image" style="width: 100%;">
<center>YOLACT</center>
</div>
</div>
<center>YOLACT struggles on small and occluded instances</center>
\
\
Due to our hardware limitations, we could not exploit the full speed of YOLACT. For reference, the figure below illustrates how fast this one-stage instance segmentation method can be.
<p style="text-align: center;">Speed-performance trade-off for various instance segmentation methods on COCO([image credit](https://arxiv.org/pdf/1904.02689))</p>
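Frequencies like the ones reported in the tables above could be measured with a simple timing loop. The sketch below is an assumption about a reasonable procedure, not a record of how our numbers were obtained: `measure_frequency` and `dummy_model` are hypothetical names, and the sleeping function merely stands in for a real model's forward pass.

```python
import time

def measure_frequency(predict, sample, warmup=3, runs=20):
    """Return the average number of predict(sample) calls per second."""
    for _ in range(warmup):              # warm-up calls are not timed
        predict(sample)
    start = time.perf_counter()
    for _ in range(runs):
        predict(sample)
    elapsed = time.perf_counter() - start
    return runs / elapsed

def dummy_model(image):
    """Stand-in for a forward pass that takes roughly 2 ms."""
    time.sleep(0.002)
    return image

hz = measure_frequency(dummy_model, sample=None)
print(f"{hz:.1f} Hz")
```

On a GPU, execution is usually asynchronous, so the device has to be synchronized before the clock is read; otherwise the measured frequency will look higher than it really is.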
## 5. Further experiments and ablation study
### 5.1 Robustness to environmental change
Implementing recognition functionality is just the first step in automated driving. Considering the complexity of driving environments, algorithms should be able to adapt to various special circumstances such as **extreme weather, night driving, camera blurring**, etc. In this chapter, we tested detection under the four environments illustrated in the figure below:
<div style="display: flex; flex-wrap: wrap; justify-content: center;">
<div style="flex: 1 1 45%; margin: 5px;">
<img src="https://hackmd.io/_uploads/BJ0ex7C4R.jpg" alt="blurred_image" style="width: 100%;">
<center>Blur image</center>
</div>
<div style="flex: 1 1 45%; margin: 5px;">
<img src="https://hackmd.io/_uploads/Skizxm0ER.jpg" alt="masked_image" style="width: 100%;">
<center>Partially blocked</center>
</div>
<div style="flex: 1 1 45%; margin: 5px;">
<img src="https://hackmd.io/_uploads/HJOcg70NA.jpg" alt="rainy" style="width: 100%;">
<center>Rainy condition</center>
</div>
<div style="flex: 1 1 45%; margin: 5px;">
<img src="https://hackmd.io/_uploads/rJgRlmCV0.jpg" alt="night_time" style="width: 100%;">
<center>Night time driving</center>
</div>
</div>
The prediction of YOLOP is shown in the figure below:
<div style="display: flex; justify-content: center; gap: 20px;">
<div style="flex: 1;">
<img src="https://hackmd.io/_uploads/ryIREmCNC.jpg" alt="blurred_image" style="width: 100%; display: block;">
<center>Blur image</center>
</div>
<div style="flex: 1;">
<img src="https://hackmd.io/_uploads/BJL1r7RER.jpg" alt="masked_image" style="width: 100%; display: block;">
<center>Partially blocked</center>
</div>
<div style="flex: 1;">
<img src="https://hackmd.io/_uploads/S1RJH70VA.jpg" alt="rainy" style="width: 100%; display: block;">
<center>Rainy condition</center>
</div>
<div style="flex: 1;">
<img src="https://hackmd.io/_uploads/ByneBXR4C.jpg" alt="night_time" style="width: 100%; display: block;">
<center>Night time driving</center>
</div>
</div>
The prediction of U-Net is shown in the figure below:
<div style="display: flex; justify-content: center; gap: 20px;">
<div style="flex: 1;">
<div style="margin-right: 10px;">
<img src="https://hackmd.io/_uploads/B19EijdrC.jpg" alt="blurred_image" style="width: 100%; display: block;">
<center>Blur image</center>
</div>
</div>
<div style="flex: 1;">
<div style="margin-left: 10px; margin-right: 10px;">
<img src="https://hackmd.io/_uploads/rk02pidHA.jpg" alt="masked_image" style="width: 100%; display: block;">
<center>Partially blocked</center>
</div>
</div>
<div style="flex: 1;">
<div style="margin-left: 10px; margin-right: 10px;">
<img src="https://hackmd.io/_uploads/Byiuoi_BA.jpg" alt="rainy" style="width: 100%; display: block;">
<center>Rainy condition</center>
</div>
</div>
<div style="flex: 1;">
<div style="margin-left: 10px;">
<img src="https://hackmd.io/_uploads/r1kIsouH0.jpg" alt="night_time" style="width: 100%; display: block;">
<center>Night time driving</center>
</div>
</div>
</div>
The prediction of Res-U-Net (ours) is shown in the figure below:
<div style="display: flex; justify-content: center; gap: 20px;">
<div style="flex: 1;">
<div style="max-width: 300px; margin: 0 10px;">
<img src="https://hackmd.io/_uploads/r1JzojuBR.png" alt="blurred_image" style="width: 100%; display: block;">
<center>Blur image</center>
</div>
</div>
<div style="flex: 1;">
<div style="max-width: 300px; margin: 0 10px;">
<img src="https://hackmd.io/_uploads/ByJ36o_HR.jpg" alt="masked_image" style="width: 100%; display: block;">
<center>Partially blocked</center>
</div>
</div>
<div style="flex: 1;">
<div style="max-width: 300px; margin: 0 10px;">
<img src="https://hackmd.io/_uploads/S17JiiuSC.jpg" alt="rainy" style="width: 100%; display: block;">
<center>Rainy condition</center>
</div>
</div>
<div style="flex: 1;">
<div style="max-width: 300px; margin: 0 10px;">
<img src="https://hackmd.io/_uploads/BJic6ouBC.jpg" alt="night_time" style="width: 100%; display: block;">
<center>Night time driving</center>
</div>
</div>
</div>
Overall, all models demonstrate strong generalization capabilities for handling blurred images and night-time environments effectively.
However, YOLOP struggled notably under rainy conditions, despite its reputation for robustness. This unexpected outcome suggests that YOLOP's complexity may have led to overfitting on the current dataset, particularly since rainy conditions introduce global changes across the entire image.
Conversely, on partially blocked images, U-net performed worst among the three models. **This is likely due to the ResNet-based models' ability to integrate both global and local information, which helps mitigate the impact of partially blocked areas**.
### 5.2 Ablation study
#### 5.2.1 Comparison of our own Res-U-Net with U-net
In chapter 3, we compared U-net and YOLOP and found that, although the difference in accuracy between them was not significant, YOLOP exhibited noticeably stronger generalization (although not in rainy weather). We found that YOLOP employs CSPDarknet as its backbone, which uses the ResNet structure. **The primary function of this architecture is to fuse more global features with local features by skipping certain convolutional layers**. Metaphorically, this fusion of two types of information enables the network not only to rely on object features for identification but also to leverage relationships between objects and other parts of the image. This combination of global and local features gives the network stronger generalization capabilities.
However, considering the **excessive number of parameters** in YOLOP and the model's complexity relative to the task, we opted to **start with U-net and introduce the ResNet structure**, building our own neural network, Res-U-Net. In our experiments, the model's generalization capability was indeed enhanced, which supports our earlier conjecture.
Additionally, as demonstrated in section 5.1, our model generalizes well and handles extreme conditions such as rainy and partially blocked environments better than both YOLOP and the standard U-net.
#### 5.2.2 Comparison between Mask R-CNN and YOLACT
In chapter 4 we tested both Mask R-CNN and YOLACT on the View of Delft dataset and showed their respective strengths and weaknesses. Here we explain this performance based on our understanding of their architectures.
First, **Mask R-CNN is a two-stage method**. It has an RPN that provides region proposals, so the mask head receives a more compact input, whereas in YOLACT the Protonet receives the entire feature map as input.
**Second, Mask R-CNN separates the mask head and the bounding box regressor into two branches, each dedicated to its own task**. In contrast, in YOLACT the two jobs are partly integrated. These two differences lead to the higher accuracy of Mask R-CNN, but also to the higher efficiency of YOLACT: because it has no region proposal stage, its mask branch (Protonet) needs only one forward pass per image, while in Mask R-CNN the mask head runs once for every region proposal.
## 6. Conclusion
In our blog, we tackled two key tasks in autonomous driving: drivable area and obstacle segmentation.
For drivable area detection, we began by evaluating YOLOP and U-Net. Through detailed analysis, we integrated ResNet into U-Net to create our own Res-U-Net. This model performed well on the test set, achieving consistently high accuracy. Moreover, Res-U-Net demonstrated superior generalization, particularly in extreme weather conditions.
In the context of obstacle segmentation, we did a comparative analysis of Mask R-CNN and YOLACT which provided valuable insights. While Mask R-CNN showcased higher accuracy, YOLACT outperformed in terms of processing speed, highlighting the trade-offs between these models. Overall, segmentation plays a pivotal role in enhancing the reliability and safety of autonomous driving systems, ensuring accurate detection and adaptation to diverse driving environments.
## Reference
[^1]: F. Yu et al., "BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 2633-2642, doi: 10.1109/CVPR42600.2020.00271.
[^2]: A. Palffy, E. Pool, S. Baratam, J. F. P. Kooij and D. M. Gavrila, "Multi-Class Road User Detection With 3+1D Radar in the View-of-Delft Dataset," in IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 4961-4968, April 2022, doi: 10.1109/LRA.2022.3147324.
[^3]: D. Bolya, C. Zhou, F. Xiao, et al., "YOLACT: Real-time instance segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9157-9166.
[^4]: K. He, G. Gkioxari, P. Dollár, et al., "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961-2969.