# PolarStream: state-of-the-art simultaneous detection and segmentation algorithm for streaming lidar

## News

Our new open-source streaming-based algorithm is now released [here](https://github.com/motional/polarstream)! This repository is not only the official implementation of [1] but also includes reimplementations of [2] and [3].

Our PolarStream matches batch-based processing in accuracy while offering a dramatic reduction in latency. This is of critical importance for real-time systems such as autonomous vehicles (AVs).

## What is streaming?

Lidar data is inherently streaming: it arrives sequentially and incrementally. Streaming object detection processes each sector and detects objects as soon as that sector arrives.

![](https://i.imgur.com/LMyETkY.gif)

**Streaming Lidar Object Detection vs Traditional (full-sweep) Lidar Object Detection**

Let's compare a traditional and a streaming object detection approach. An example of a traditional lidar object detection algorithm is shown in the figure below. Here, we wait for the lidar sensor to finish a complete scan and then detect objects all around the ego-vehicle in a single shot. This introduces additional data-capture latency: the overall latency is the sum of the data-capture latency (the pink part of the bottom plot) and the processing latency (the gray part).

In streaming lidar object detection, we don't wait for the entire scan to finish; instead, we process lidar data as soon as it arrives. The animation on the right shows this clearly: as the lidar rotates, we detect objects incrementally in wedge-shaped regions, without waiting for the entire scan to complete. This substantially reduces the end-to-end latency because the data-capture time is significantly reduced.

![](https://i.imgur.com/K2AM3Y2.gif)

## Why is streaming important?

Since streaming-based approaches minimize end-to-end latency, they correctly represent the current state of the world. With a non-streaming approach, when we detect an object we are detecting its past position: during the time it takes the lidar to finish its scan, fast-moving objects have already moved to a very different location. The image below shows this clearly. The lidar operates at 10 Hz and packets are captured every 10 ms. A streaming-based approach detects the object as soon as the second packet is captured, after 20 ms. A traditional system waits the full 100 ms for the lidar scan to complete, so the detection (the red box) reflects the outdated state of the agent, which has already moved to a new position.

![](https://i.imgur.com/921dl0n.png)

Figure from [3]

## Challenges for Streaming

Streaming object detection faces several challenges.

### Inefficient Input Representation

The actual input region for each streaming lidar sector is wedge-shaped, like a slice of a pizza, as shown by the gray area in the figure below. However, efficient object detectors usually require rectangular regions as input, as shown by the red boxes. Using rectangular regions causes several problems. First, it requires extra work to hand-design the input rectangles, because the rectangle sizes change from sector to sector. Second, it wastes at least half of each rectangle: the half lying outside the gray wedge is empty.

![](https://i.imgur.com/0PxbeMw.png)

### Limited Spatial View

The second challenge for streaming object detection is the limited spatial view from each lidar sector. In the figure below, the object shown in the green box may not be fully captured within one sector, and detecting such an object from a partial observation is difficult.

![](https://i.imgur.com/1rMUePg.png)

## PolarStream Overview

### Polar Representation

Previous methods from Uber [3] and Waymo [2] both used a Cartesian coordinate system with rectangular regions, so they waste memory and computation. In contrast, we propose a polar representation with wedge-shaped grids that cover the input region exactly; this is more compact and efficient than rectangles.

![](https://i.imgur.com/EtmduRO.gif)
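To make the polar pillar idea concrete, here is a minimal NumPy sketch of how points might be binned into wedge-shaped pillars over (r, theta). This is an illustration, not the repository's implementation; the function names, grid resolution, and range bound are assumptions.

```python
import numpy as np

def cartesian_to_polar(points):
    """Convert lidar points from Cartesian (x, y, z, ...) to polar (r, theta, z, ...).

    points: (N, 3+) array; columns beyond x, y (e.g. z, intensity) pass through.
    """
    r = np.linalg.norm(points[:, :2], axis=1)       # range in the xy-plane
    theta = np.arctan2(points[:, 1], points[:, 0])  # azimuth in [-pi, pi]
    return np.concatenate([r[:, None], theta[:, None], points[:, 2:]], axis=1)

def polar_pillar_indices(polar_points, r_max=50.0, n_r=512, n_theta=512):
    """Assign each point to a wedge-shaped pillar on a regular (r, theta) grid.

    r_max and the grid size are illustrative, not the repository's settings.
    """
    r_idx = np.minimum((polar_points[:, 0] / r_max * n_r).astype(np.int64), n_r - 1)
    theta_idx = ((polar_points[:, 1] + np.pi) / (2 * np.pi) * n_theta).astype(np.int64) % n_theta
    return r_idx, theta_idx
```

Because the (r, theta) grid is dense and rectangular once the wedge is unfolded, standard 2D convolutions apply directly, with no empty cells outside the sector.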
### Multi-scale Context Padding

Previous methods both use an additional memory module to aggregate features from each sector and detect from the aggregated information, which lets them draw information from multiple sectors.

#### Trailing-Edge Padding

![](https://i.imgur.com/pVHSeBq.gif)

Our method, in contrast, does not require any extra module. We found that, by virtue of using polar coordinates, when we unfold the input sector into a rectangular feature map along the range and azimuth dimensions (i.e., the r and theta dimensions), neighboring sectors are spatially connected along the theta dimension. In the input, we want to pad the pink wedge onto the trailing edge of the second sector. At the feature-map level, we can do this by padding the pink column to the left of the second feature map. In other words, we replace the zero-padding commonly used before convolution with features from the previous sector. We call padding from the previous sector trailing-edge padding, and we repeat it until one full scan is completed. The advantage is that we introduce no extra modules and thus no extra latency or computation. We apply context padding before every convolution, which allows the network to see multi-scale context from neighboring sectors. We have discussed padding from the previous sector; is it possible to pad from future sectors?

#### Bidirectional Padding

![](https://i.imgur.com/CSaz5lN.gif)

The answer is no, but there is a workaround: we cannot access the future, but we can approximate the future with history. This involves different time frames. In addition to trailing-edge padding, we keep the full-sweep feature maps from the past time frame and warp them to the current time frame using ego-motion compensation. We can then pad from the region corresponding to the following sector. We call padding from both the preceding and the following sector bidirectional padding.
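As a sketch of the mechanism, the module below replaces the zero-padding of a standard convolution with theta columns cached from the previous sector. It is a minimal illustration under an assumed (B, C, R, Theta) feature layout, not the repository's implementation; the class and argument names are made up.

```python
import torch
import torch.nn as nn

class TrailingEdgeConv2d(nn.Module):
    """Convolution whose azimuth (theta) padding comes from cached features
    of the previously processed sector rather than zeros.

    Feature maps are assumed to be (B, C, R, Theta), so the trailing edge of
    the previous sector is the last few theta columns of its feature map.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.pad = kernel_size // 2
        # Pad the range (r) dimension normally; theta is padded manually below.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=(self.pad, 0))
        self.cache = None  # trailing theta columns of the previous sector

    def forward(self, x, first_sector=False):
        b, c, r, _ = x.shape
        if first_sector or self.cache is None:
            left = x.new_zeros(b, c, r, self.pad)   # no history yet: zeros
        else:
            left = self.cache                       # trailing-edge padding
        right = x.new_zeros(b, c, r, self.pad)      # future side: zeros
        # Save the trailing edge for the next sector (detached here for
        # simplicity; a training implementation might keep the graph).
        self.cache = x[..., -self.pad:].detach()
        return self.conv(torch.cat([left, x, right], dim=-1))
```

For bidirectional padding, the zero columns on the leading-edge (`right`) side would instead come from the corresponding region of the previous sweep's full feature map after ego-motion compensation.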
## Comparison with Previous Streaming Methods

![](https://i.imgur.com/GGtvFTS.png)

The graph above compares our PolarStream with Waymo's and Uber's methods on the nuScenes dataset. The horizontal axis is end-to-end latency (lower is better). The vertical axis is panoptic quality, a metric for perception performance (higher is better). Comparing the cases where the scene is sliced into 1, 2, 4, 8, 16, and 32 sectors, our PolarStream is always both more accurate and faster than the previous methods. Interestingly, as the number of sectors increases, each sector becomes smaller, so the limited-spatial-view challenge mentioned earlier becomes more severe. Detecting from smaller sectors becomes more and more challenging for the previous methods (shown in blue and orange), whose accuracy clearly drops. Our PolarStream, however, maintains nearly the same or even better accuracy with smaller sectors. This is surprising: the community used to believe that streaming improves speed by compromising accuracy, but we show for the first time in the literature that streaming can be both faster and more accurate.

## Comparison with SOTA Full-sweep Methods

3D Object Detection on nuScenes val

| Model | det mAP |
|--------------------------|---------|
| CenterPoint-VoxelNet [4] | 58.4 |
| PolarStream-VoxelNet | 57.7 |

Semantic Segmentation on nuScenes val

| Model | seg mIoU |
|--------------------------|----------|
| Cylinder3D [5] | 76.1 |
| PolarStream-VoxelNet | 77.7 |

Note: these are our updated numbers from the code repository. CenterPoint-VoxelNet and PolarStream-VoxelNet are implemented with single-group detection heads.

## References

[1] Chen, Q., Vora, S. and Beijbom, O., 2021. PolarStream: Streaming Object Detection and Segmentation with Polar Pillars. Advances in Neural Information Processing Systems, 34.

[2] Han, W., et al., 2020. Streaming Object Detection for 3-D Point Clouds. European Conference on Computer Vision. Springer, Cham.

[3] Frossard, D., et al., 2020. StrObe: Streaming Object Detection from LiDAR Packets. arXiv preprint arXiv:2011.06425.

[4] Yin, T., Zhou, X. and Krahenbuhl, P., 2021. Center-Based 3D Object Detection and Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11784-11793.

[5] Zhou, H., Zhu, X., Song, X., Ma, Y., Wang, Z., Li, H. and Lin, D., 2020. Cylinder3D: An Effective 3D Framework for Driving-Scene LiDAR Semantic Segmentation. arXiv preprint arXiv:2008.01550.