# PointPillars-Based Object Detection Using 3D Point Clouds from Stereo Disparity

**Delft University of Technology - DSAIT4125 - Computer Vision**

**Simina Dragotă - Fedde Jorritsma - Marcin Popławski**

**[Github Repository](https://github.com/siminadragota/StereoDepthEstimation)**

## Introduction

Accurate 3D object detection is essential for safe navigation in autonomous driving, enabling vehicles to understand their surrounding environment in three dimensions. LiDAR sensors have traditionally been the primary tool for this task, producing dense 3D point clouds that support high detection accuracy. In 2017, a high-end Velodyne 64-line LiDAR sensor cost around $75,000 [1] and was used almost exclusively in test vehicles. Although advances in manufacturing have since reduced costs, LiDAR sensors remain expensive, especially for large-scale deployment.

As a more affordable alternative, stereo camera systems have gained popularity. These setups estimate depth by computing the disparity between images captured by two slightly offset cameras. Recent advancements have made stereo cameras more accessible, with prices now ranging from $500 to $1,000 [2].

This work investigates whether stereo depth estimation can serve as an effective substitute for LiDAR in 3D object detection. Specifically, we propose generating point clouds from stereo disparity maps and using them as input to the PointPillars architecture, a deep learning model originally designed for LiDAR data. By doing so, we aim to evaluate whether stereo-derived point clouds can maintain detection performance comparable to LiDAR-based methods.

## Work that we have built upon

### Disparity images

A disparity image represents the differences between corresponding pixels in two stereo images captured from different viewpoints. These differences are called disparities, and they are used to calculate the depth of objects in a scene. Larger disparities indicate objects that are closer, while smaller disparities indicate objects that are farther away.

<div style="display: flex; justify-content: center; align-items: center; gap: 1px;">
  <figure style="text-align: center;">
    <img src="https://hackmd.io/_uploads/Byf8-iLpJe.png" style="max-width: 100%; height: auto;">
    <figcaption>Figure 1a: Left image</figcaption>
  </figure>
  <figure style="text-align: center;">
    <img src="https://hackmd.io/_uploads/H1Jd-i861g.png" style="max-width: 100%; height: auto;">
    <figcaption>Figure 1b: Right image</figcaption>
  </figure>
</div>

To visualize this process, see Figures 1a and 1b, which show the same scene captured from two different viewpoints: the left and right cameras. By calculating the distances between corresponding pixels in these two images, the disparity image shown in Figure 2 is generated. This grayscale image stores the disparity value of each pixel and effectively acts as a depth map, where lighter areas represent objects that are closer.

<div style="display: flex; justify-content: center;">
  <figure style="text-align: center;">
    <img src="https://hackmd.io/_uploads/H1dnWjIpyg.png" style="max-width: 80%; height: auto;">
    <figcaption>Figure 2: Corresponding disparity image</figcaption>
  </figure>
</div>

The disparity image plays a crucial role in 3D reconstruction and depth estimation in stereo vision: from a disparity map, a 3D point cloud can be generated.
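To make the idea concrete, the snippet below sketches how a dense disparity map can be computed classically with OpenCV's semi-global block matcher. This is only an illustration of the concept: in our pipeline the disparity maps are produced by PSMNet (described below), and the file names and matcher parameters here are placeholder assumptions rather than settings we actually used.

```python
import cv2

# Load a rectified stereo pair in grayscale (file names are placeholders).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Classical semi-global block matching; the parameter values are illustrative.
matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,  # search range in pixels, must be a multiple of 16
    blockSize=5,         # matching window size
)

# OpenCV returns disparities as fixed-point integers scaled by 16,
# so divide to obtain disparities in pixels; invalid matches are negative.
disparity = matcher.compute(left, right).astype("float32") / 16.0
```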
The depth of each point in the scene can be computed using the following formula, where *d* represents the disparity, *f* is the camera's focal length, and *B* is the distance between the two cameras, called the baseline:

<div style="text-align: center; font-size: 24px;">
<math xmlns="http://www.w3.org/1998/Math/MathML"> <mi>z</mi> <mo>=</mo> <mfrac> <mrow> <mi>f</mi><mo>&#8290;</mo><mi>B</mi> </mrow> <mi>d</mi> </mfrac> </math>
<span style="font-size: 18px; margin-left: 10px;">(1)</span>
</div>

It is important to understand the limitations of disparity maps for depth perception. Equation 2 [3] shows that the depth error grows quadratically with the distance from the camera. The error also depends on the baseline and the focal length, which are constrained by the size of the robot or vehicle and by the camera characteristics, and on the disparity error, which is determined by the quality of the disparity map, in our case produced by a neural network.

<div style="text-align: center; font-size: 24px;">
<math xmlns="http://www.w3.org/1998/Math/MathML"> <msub> <mi>ε</mi> <mi>z</mi> </msub> <mo>=</mo> <mfrac> <msup> <mi>z</mi> <mn>2</mn> </msup> <mrow> <mi>B</mi><mo>&#8290;</mo><mi>f</mi> </mrow> </mfrac> <mo>&#8290;</mo> <msub> <mi>ε</mi> <mtext>disp</mtext> </msub> </math>
<span style="font-size: 18px; margin-left: 10px;">(2)</span>
</div>

### Pyramid Stereo Matching Network

Traditional stereo matching identifies corresponding points using local appearance and then applies post-processing. Early CNN-based methods such as MC-CNN improved matching accuracy by comparing small patches, but they still struggle with occlusions, repeated patterns, and textureless surfaces, where global context is essential.

PSMNet (Pyramid Stereo Matching Network) [4] addresses this need by capturing both local and global image features. As shown in Figure 3, it uses spatial pyramid pooling and dilated convolutions to enlarge the network's receptive field, allowing it to reason about image regions at multiple scales. This helps the network form more reliable cost volumes for disparity estimation. Additionally, PSMNet includes a stacked hourglass 3D convolutional network, which refines these cost volumes by repeatedly processing them with both high-level and detailed information. This design makes it better at resolving ambiguous or difficult matching cases.

<div style="display: flex; justify-content: center;">
  <figure style="text-align: center;">
    <img src="https://hackmd.io/_uploads/HkKvRTFTkx.png" style="max-width: 80%; height: auto;">
    <figcaption>Figure 3: PSMNet Architecture</figcaption>
  </figure>
</div>

In summary, PSMNet is a fully end-to-end stereo matching framework that eliminates the need for post-processing by incorporating global context through pyramid pooling and refining disparity estimation with a robust 3D convolutional architecture. This approach achieved state-of-the-art performance when it was published in 2018, especially on challenging benchmarks such as KITTI. Since our work is also based on the KITTI dataset, we selected PSMNet ([GitHub Repository](https://github.com/JiaRenChang/PSMNet)) to run inference on the stereo image pairs, generating the disparity maps required for constructing the 3D point clouds used in our pipeline.
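The sketch below shows how Equation 1 turns such a disparity map into a 3D point cloud by back-projecting every valid pixel through an ideal pinhole model. The focal length, baseline, and principal point in the example call are illustrative values only; in practice they are read from the KITTI calibration files for each frame.

```python
import numpy as np

def disparity_to_point_cloud(disparity, f, B, cx, cy, min_disp=1.0):
    """Back-project a disparity map (in pixels) to 3D camera coordinates.

    f      : focal length in pixels
    B      : stereo baseline in metres
    cx, cy : principal point in pixels
    """
    h, w = disparity.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    valid = disparity > min_disp        # discard tiny or invalid disparities
    z = f * B / disparity[valid]        # Equation 1: z = f * B / d
    x = (u[valid] - cx) * z / f         # pinhole model, x coordinate
    y = (v[valid] - cy) * z / f         # pinhole model, y coordinate

    return np.stack([x, y, z], axis=1)  # (N, 3) point cloud

# Example call with illustrative calibration values (not the real ones):
# points = disparity_to_point_cloud(disparity, f=720.0, B=0.54, cx=620.0, cy=180.0)
```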
### PointPillars

Processing point clouds for 3D object detection is challenging because point clouds are sparse, irregular, and unordered. Unlike images, which are dense grids of pixels with a fixed size and structure, point clouds consist of scattered 3D points that vary in number from one frame to the next. This makes it difficult to directly apply standard deep learning models such as convolutional neural networks, which expect structured input data.

PointPillars [5] solves this by converting the 3D point cloud into a structured format that is easier to process. First, the 3D space is divided into a grid of vertical columns, or pillars, across the x-y plane. Each pillar collects the points that fall within its boundaries. The points in each pillar are encoded by a small neural network (similar to PointNet) into a fixed-length feature vector, capturing both the geometry and the relative position of the points. These features are then assembled into a 2D pseudo-image, where each cell corresponds to a pillar's features. This allows efficient 2D CNNs to process the pseudo-image and predict 3D bounding boxes and object classes. By transforming an unstructured 3D input into a structured 2D format, PointPillars achieves fast and accurate 3D detection, making it particularly suitable for real-time applications such as autonomous driving.

<div style="display: flex; justify-content: center;">
  <figure style="text-align: center;">
    <img src="https://hackmd.io/_uploads/rJgDEQapJg.png" style="max-width: 80%; height: auto;">
    <figcaption>Figure 4: PointPillars Architecture</figcaption>
  </figure>
</div>
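To make the pillar idea concrete, the sketch below implements a heavily simplified pillarization step: points are binned into an x-y grid and each pillar is summarized by a fixed-length vector. In the real PointPillars encoder each pillar is processed by a learned PointNet-style network and the points are augmented with offsets to the pillar center; here we simply average the points per pillar to show how the pseudo-image is formed. The grid extent and pillar size are illustrative choices, not our training configuration.

```python
import numpy as np

def pillarize(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
              pillar_size=0.16):
    """Scatter a point cloud of shape (N, 3) into a dense pseudo-image.

    Simplified sketch: each pillar is summarized by the mean of its points
    instead of a learned PointNet-style encoder.
    """
    nx = int(round((x_range[1] - x_range[0]) / pillar_size))
    ny = int(round((y_range[1] - y_range[0]) / pillar_size))

    # Keep only points that fall inside the grid.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]

    # Pillar index of every point on the x-y plane.
    ix = np.minimum(((pts[:, 0] - x_range[0]) / pillar_size).astype(int), nx - 1)
    iy = np.minimum(((pts[:, 1] - y_range[0]) / pillar_size).astype(int), ny - 1)
    flat = ix * ny + iy

    # Accumulate a mean feature vector per pillar.
    sums = np.zeros((nx * ny, 3))
    counts = np.zeros(nx * ny)
    np.add.at(sums, flat, pts)
    np.add.at(counts, flat, 1)
    means = sums / np.maximum(counts, 1)[:, None]

    # Reshape to a (channels, height, width) pseudo-image for a 2D CNN.
    return means.reshape(nx, ny, 3).transpose(2, 0, 1)
```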
## Dataset Overview: KITTI 2017

To train a network for 3D object detection using stereo vision, a dataset containing both 3D labels and stereo images is essential. However, datasets that provide both 3D annotations and stereo image pairs are scarce. We therefore chose the KITTI 2017 [6] dataset, which offers both.

<div style="display: flex; justify-content: center;">
  <figure style="text-align: center;">
    <img src="https://hackmd.io/_uploads/rksW0UgCJg.png" style="max-width: 80%; height: auto;">
    <figcaption>Figure 5: LiDAR Point Cloud (blue) and Disparity Point Cloud (red)</figcaption>
  </figure>
</div>

During qualitative analysis of the two point clouds, we observed a significant depth shift between the LiDAR and disparity point clouds, as visualized in Figure 5. This discrepancy is primarily caused by errors in the depth estimated from stereo disparity. As discussed above, this error grows quadratically as objects are placed farther from the camera. We therefore restricted our focus to objects within 20 meters of the car.

Following this filtering step, the dataset distribution was analyzed; the results are presented in Table 1. The analysis reveals that cars dominate the dataset, comprising approximately 75% of all objects, while pedestrians are less frequent and cyclists are even scarcer.

<center>
<table border="1">
  <tr> <th>Class</th> <th>Count</th> <th>Percentage</th> </tr>
  <tr> <td>Car</td> <td>11291</td> <td>74.75%</td> </tr>
  <tr> <td>Pedestrian</td> <td>3067</td> <td>20.30%</td> </tr>
  <tr> <td>Cyclist</td> <td>747</td> <td>4.95%</td> </tr>
</table>
<p>Table 1: Class balance after filtering to objects within 20 m</p>
</center>

Furthermore, Figure 6 illustrates the distribution of objects based on their z-distance (depth) from the vehicle. Up to 10 meters, the number of objects increases roughly linearly, likely reflecting safety-driven vehicle spacing, while between 10 and 20 meters the distribution evens out, so the network sees a fairly balanced number of training examples across this distance range.

The dataset was partitioned into 70% for training and 15% each for validation and testing. Additionally, the disparity point clouds were downsampled, as training on the full-sized point clouds would have required approximately 40 hours.

<div style="display: flex; justify-content: center;">
  <figure style="text-align: center;">
    <img src="https://hackmd.io/_uploads/Hk9sDMDTJl.png" style="max-width: 100%; height: auto;">
    <figcaption>Figure 6: Distribution of Object Distances in the Dataset</figcaption>
  </figure>
</div>

## Results

To evaluate the effectiveness of disparity-based point clouds for 3D object detection, we trained the PointPillars network separately on LiDAR and on disparity-generated point clouds. Both models were evaluated on the test set using Average Precision (AP) across three tasks: 2D bounding boxes, bird's-eye-view (BEV) boxes, and 3D bounding boxes, with metrics reported for cars, pedestrians, and cyclists. In line with the standard KITTI evaluation, we used a higher IoU threshold (0.7) for cars and a lower one (0.5) for the smaller, harder-to-localize pedestrians and cyclists. Unsurprisingly, cars showed the best detection performance in both the LiDAR and the disparity-based model, due to their larger size and dominant presence in the dataset.

As shown in Tables 2 and 3, the model trained on LiDAR point clouds clearly outperformed the one trained on disparity-based point clouds, especially in the 3D and BEV evaluations. Notably, 2D bounding boxes still performed relatively well, indicating that detections are visually aligned in the image plane but inaccurate in depth, a direct consequence of stereo depth errors. 3D bounding box performance was consistently the lowest, as it requires precise localization across all dimensions, making it highly sensitive to spatial misalignments.

<center>
<table border="1">
  <tr> <th>Class</th> <th>2D BBOX</th> <th>BEV BBOX</th> <th>3D BBOX</th> </tr>
  <tr> <td>Pedestrian AP@0.5</td> <td>31.79</td> <td>4.99</td> <td>4.20</td> </tr>
  <tr> <td>Cyclist AP@0.5</td> <td>12.92</td> <td>6.46</td> <td>6.00</td> </tr>
  <tr> <td>Car AP@0.7</td> <td>83.86</td> <td>62.46</td> <td>42.80</td> </tr>
</table>
<p>Table 2: PointPillars performance on disparity point clouds</p>
</center>
<br>
<center>
<table border="1">
  <tr> <th>Class</th> <th>2D BBOX</th> <th>BEV BBOX</th> <th>3D BBOX</th> </tr>
  <tr> <td>Pedestrian AP@0.5</td> <td>69.99</td> <td>67.90</td> <td>65.89</td> </tr>
  <tr> <td>Cyclist AP@0.5</td> <td>87.31</td> <td>80.45</td> <td>77.38</td> </tr>
  <tr> <td>Car AP@0.7</td> <td>97.40</td> <td>97.89</td> <td>90.05</td> </tr>
</table>
<p>Table 3: PointPillars performance on LiDAR point clouds</p>
</center>

Qualitative results further revealed a high number of false positives in the disparity-based predictions. While the actual objects were frequently detected, many background elements were incorrectly classified as pedestrians or cyclists. This suggests that the model learned meaningful features from disparity point clouds, but that the inherent depth noise and lack of semantic context led to incorrect predictions and reduced quantitative performance.

<div style="display: flex; justify-content: center;">
  <figure style="text-align: center;">
    <img src="https://hackmd.io/_uploads/S1UiVIeA1e.png" style="max-width: 100%; height: auto;">
    <figcaption>Figure 7: 3D Object Detections Example Frame</figcaption>
  </figure>
</div>
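For reference on how the AP values in Tables 2 and 3 are obtained: a detection only counts as a true positive if its overlap with a ground-truth box reaches the IoU threshold (0.7 for cars, 0.5 for pedestrians and cyclists). The sketch below shows the IoU computation for axis-aligned BEV boxes only; the official KITTI evaluation additionally handles box rotation and interpolates precision over recall, which is omitted here.

```python
def bev_iou_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned BEV boxes given as (x_min, y_min, x_max, y_max).

    Simplified sketch: the real KITTI metric uses rotated BEV boxes and full
    3D volumes for the 3D AP, but the thresholding idea is the same.
    """
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A car detection is a true positive only if it overlaps a ground-truth box
# with IoU >= 0.7, e.g.: bev_iou_axis_aligned(pred_box, gt_box) >= 0.7
```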
## Improvements

Based on the challenges identified, several changes could improve performance with disparity-based point clouds:

*1. Improve Disparity Accuracy*
A major source of error is the depth inaccuracy of stereo disparity. Upgrading from PSMNet (2018) to more recent stereo networks such as GA-Net, RAFT-Stereo, or other task-specific disparity models could yield more reliable depth maps. Additionally, ensuring accurate stereo calibration and image rectification is crucial to minimize systematic alignment errors.

*2. Incorporate Radar for Object Localization*
Currently, our model relies solely on a stereo camera setup for object localization. However, modern vehicles are typically equipped with radar sensors, which are capable of accurately measuring distances to objects. The CenterFusion network [7] demonstrated improved object localization by fusing radar and camera data. Adopting a similar approach could improve 3D depth estimation instead of relying on stereo vision alone.

*3. RGB-Depth Feature Fusion*
Instead of depending exclusively on depth data, integrating features from the RGB images, such as textures, colors, and edge details, with the point cloud information could enhance detection performance. This fusion would help the network distinguish ambiguous structures and offer extra cues for more accurate object classification and localization. Moreover, training the model on full-sized disparity point clouds, rather than on a reduced version, could further boost detection accuracy.

*4. Class Balancing*
To improve performance on underrepresented classes such as pedestrians and cyclists, class balancing should be applied during training. This can be achieved by oversampling minority classes, applying class-specific augmentation, or using loss functions that emphasize harder examples, such as the focal loss sketched below. Another option would be to train the network on a dataset with more pedestrians and cyclists, such as the View of Delft dataset.
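As an example of the loss-based option in point 4, below is a minimal focal loss sketch in PyTorch. PointPillars already uses a focal classification loss, so in practice this would mean re-weighting it (for instance per class) rather than adding a new loss; the alpha and gamma values shown are the commonly used defaults, not tuned settings.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples so that rare, hard ones
    (e.g. cyclists) contribute more to the gradient.

    logits, targets: float tensors of the same shape, targets containing 0/1.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    prob = torch.sigmoid(logits)
    p_t = prob * targets + (1 - prob) * (1 - targets)        # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class balancing weight
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```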
## Conclusion

The objective of this research was to evaluate whether stereo-derived point clouds can achieve detection performance comparable to LiDAR-based methods. The results show that PointPillars trained on LiDAR point clouds consistently outperforms the model trained on disparity-derived point clouds. This performance gap arises mainly from the depth estimation errors that grow with distance, which the current PointPillars network cannot adequately correct. To reduce this error, future work could focus on improving disparity map generation with more advanced networks or on integrating complementary depth-sensing technologies, such as radar. Additionally, the current model uses only depth information and does not incorporate RGB data; modifying the network architecture to include RGB channels could further improve object detection. Finally, addressing the class imbalance by oversampling minority classes or by selecting datasets richer in pedestrians and cyclists would likely enhance overall detection performance.

However, a qualitative assessment of the results shows that PointPillars does learn meaningful features from disparity-based point clouds and frequently detects the actual objects in a scene. With the proposed improvements, sufficient computational resources, and training time on the full, dense disparity point clouds, we believe this approach holds strong potential for use in autonomous navigation applications.

## References

[1] R. Amadeo, "Google's Waymo invests in LiDAR technology, cuts costs by 90 percent," Ars Technica, Jan. 10, 2017. [Online]. Available: https://arstechnica.com/cars/2017/01/googles-waymo-invests-in-lidar-technology-cuts-costs-by-90-percent/

[2] Teledyne FLIR, "Blackfly S USB3," Teledyne Vision Solutions. [Online]. Available: https://www.teledynevisionsolutions.com/products/blackfly-s-usb3/ (accessed Apr. 7, 2025).

[3] D. Gallup, J.-M. Frahm, P. Mordohai, and M. Pollefeys, "Variable baseline/resolution stereo," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2008, doi: 10.1109/CVPR.2008.4587671.

[4] J.-R. Chang and Y.-S. Chen, "Pyramid Stereo Matching Network," arXiv preprint, 2018, doi: 10.48550/arXiv.1803.08669.

[5] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast Encoders for Object Detection from Point Clouds," arXiv preprint, 2018, doi: 10.48550/arXiv.1812.05784.

[6] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "3D Object Detection Evaluation 2017," KITTI Vision Benchmark Suite. [Online]. Available: https://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d

[7] R. Nabati and H. Qi, "CenterFusion: Center-based Radar and Camera Fusion for 3D Object Detection," in Proc. IEEE Winter Conference on Applications of Computer Vision (WACV), 2021, doi: 10.1109/WACV48630.2021.00157.