# Reviewer ijm6 [second set, rating 6]
- the large accuracy gap between non-streaming and streaming methods. Please comment on the accuracy of polarstream vs. the state-of-the-art for accuracy metrics.
The gap comes from the fact that we use the lighter PointPillars backbone, which operates on a pillar feature encoder, while the other methods use a much slower ResNet-like backbone on top of a 3D voxel encoder. It is not a gap between streaming and non-streaming methods.
We also trained our full-sweep PolarStream with the same 3D ResNet-like encoder as CenterPoint (named PLS1 heavy in the following table) and were able to match or beat the SOTA models for detection (CenterPoint) and segmentation (Cylinder3D) on the nuScenes validation set. Our work focuses on onboard applications such as streaming, and those heavy backbones cannot run onboard, so we chose the encoder and backbone from the PointPillars [15] model.
| Method | det mAP | seg mIoU | runtime | model size |
| -------- | -------- | -------- | -------- | -------- |
| CenterPoint | 56.4 | | 11 Hz | 149 MB |
| Cylinder3D | | 76.1 | 11 Hz (reported), 2 Hz (reproduced) | 215 MB |
| Ours-PLS1 | 51.2 | | 26 Hz | 65 MB |
| Ours-PLS1 | | 73.8 | 34 Hz | 65 MB |
| Ours-PLS1 heavy | 56.2 | | 11 Hz | 149 MB |
| Ours-PLS1 heavy | | 76.8 | 15 Hz | 149 MB |
- the not entirely certain reproduction of the baselines. Please gauge the fidelity/quality of the reproduced baselines in the context of their originally reported results, to ensure the new results are good indicators w.r.t. existing methods.
Previous methods did not release their code, and they either worked on a private dataset or did not release the details of their dataset. Since no streaming dataset was available to us, we built a streaming benchmark out of nuScenes and put all methods under the same conditions for comparison: the same dataset, backbone, augmentation, and optimization, so that the comparison focuses on how to address the limited spatial view of streaming sectors. A benefit of nuScenes is that it contains ten classes of different sizes, so we can observe how the limited spatial view affects objects of different sizes (Supplementary, Tables 7 & 8).
We tried our best to get the best results out of the reimplemented methods: we received feedback from the authors of STROBE on implementation details, and we also emailed the authors of Han et al. but did not get a reply. We will release the code for the streaming dataset, our method, and our reimplementations so that everyone can build on our work. Previous methods not releasing their code should not block further research; if it were a blocker for our paper, it would also block any subsequent paper on streaming. Instead, we will make everything public to facilitate further research, which is also one of our contributions.
- Please explain how polarstream with bidirectional padding improves in mAP with more sectors, at least in the range 2-16? This seems surprising, and diverges from results for the other tasks. (Sec. 5.3 gives a hypothesis, but can this not be measured to check?)
Streaming-based object detection (2-16 sectors) consistently does better than full-sweep detection, while semantic segmentation does not follow this trend. The difference between anchor-free detection and semantic segmentation is that detection also requires localizing the bounding boxes. We find that the Average Orientation Error (AOE) is consistently lower for models operating on 2-16 sectors than for the full sweep (0.41 rad vs. 0.44 rad). This supports our hypothesis in Section 5.3 that bounding box regression becomes easier in streaming perception.
# Reviewer 88yB [second set, rating 5]
- contribution and novelty
Thanks for raising this point. This paper focuses on improving object detection from a stream of lidar data, a key component of a low-latency real-time detection system for AVs. Our starting point is the insight that a polar coordinate system is ideally suited to this problem domain. Based on this we develop the first polar-coordinate-based streaming architecture. Contributions include addressing the limited-context issue inherent in streaming architectures; establishing baselines (with open-source code) on a public benchmark; and a thorough exploration of how other techniques, such as range-stratified convolutions, benefit the design.
- The baseline methods in Table 1 have not been validated on the original datasets
The reviewer raises a great question here. We were unable to validate our reimplementations against the originally reported results, for the reasons detailed below, and we consider it a key contribution of our work to share an open-source implementation of the previous methods as well as the full protocol for building a streaming version of the nuScenes dataset.
So why were we not able to validate against the original results? The first reason is straightforward: previous work did not release code, so we could not run the original implementations. Second, previous work did not evaluate on open-source data. This is clearest for STROBE, which published on a private dataset, but it is also true of Han et al.: although they evaluated on the Waymo Open Dataset, it was in fact a *simulated streaming version* of it. Critically, there are many ways to simulate streaming, and we were unable to reach the authors to obtain the required details. Re-running our implementation of Han et al. on our own streaming version of the Waymo Open Dataset would therefore not have sufficed to establish implementation correctness.
- In the comparison on the nuScenes dataset in Fig. 5: while the method is faster it lags behind existing methods in performance. It could be that the proposed method has traded-off accuracy with latency, e.g., by using a smaller backbone. The exact details of the backbones and number of parameters in the baseline methods report are not made available – hence, it is again difficult to place these results.
The reviewer correctly points out that there is a latency vs. accuracy tradeoff, and we concede that Figure 5 is a bit misleading. To recap: our main claim is that our method is better than other streaming methods, as shown in Table 1. Since there are so few streaming baselines, and, as mentioned in the previous reply, they are neither open source nor reproducible, we also compare our results to non-streaming methods in Figure 5. The non-streaming methods use backbones and encoders of different sizes. Below we show that we can recover state-of-the-art detection accuracy by using a larger backbone in our method.
We also trained our full-sweep PolarStream (PLS1) with the same backbone as CenterPoint (named PLS1 heavy in the following table); the table below shows detection and semantic segmentation results on the nuScenes validation set. CenterPoint and Cylinder3D are the SOTA methods for detection and semantic segmentation, respectively. Ours-PLS1 heavy outperforms Cylinder3D with lower latency on the nuScenes validation set. Our work focuses on onboard applications such as streaming, and those heavy backbones cannot run onboard, so we chose the encoder and backbone from the PointPillars [15] model. We provide the details of the backbone in supplementary Figure 1.
| Method | det mAP | seg mIoU | runtime | model size |
| -------- | -------- | -------- | -------- | -------- |
| CenterPoint | 56.4 | | 11 Hz | 149 MB |
| Cylinder3D | | 76.1 | 11 Hz (reported), 2 Hz (reproduced) | 215 MB |
| Ours-PLS1 | 51.2 | | 26 Hz | 65 MB |
| Ours-PLS1 | | 73.8 | 34 Hz | 65 MB |
| Ours-PLS1 heavy | 56.2 | | 11 Hz | 149 MB |
| Ours-PLS1 heavy | | 76.8 | 15 Hz | 149 MB |
- Fig. 2: Unclear where “unfold in BEV” is given as input? To “Pillar Feature Encoder”? Please clarify.
Thanks for the feedback; we will make this clearer in the revision. The input is a wedge. In more detail: in the BEV, the polar pillars of a sector form a wedge-shaped region on the x-y plane, but convolution requires a rectangular grid structure. We therefore unfold the wedge-shaped input region on the x-y plane into a rectangular input region on the r-theta plane, where one dimension is r (range) and the other is theta (azimuth).
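To make the unfolding concrete, here is a minimal numpy sketch (not our released code; the grid resolution, maximum range, and sector bounds are hypothetical) of how points inside a wedge are assigned to a rectangular r-theta pillar grid before the pillar feature encoder:

```python
import numpy as np

def cart_to_polar_grid(points, sector, r_max=51.2, n_r=512, n_theta=512):
    """Assign lidar points (N, >=2) with leading columns x, y to a rectangular
    r-theta pillar grid covering one sector [theta_min, theta_max)."""
    theta_min, theta_max = sector
    r = np.hypot(points[:, 0], points[:, 1])
    theta = np.arctan2(points[:, 1], points[:, 0])
    keep = (r < r_max) & (theta >= theta_min) & (theta < theta_max)
    r, theta, pts = r[keep], theta[keep], points[keep]
    # the wedge on the x-y plane unfolds into a rectangle indexed by (range bin, azimuth bin)
    ri = np.minimum((r / r_max * n_r).astype(int), n_r - 1)
    ti = np.minimum(((theta - theta_min) / (theta_max - theta_min) * n_theta).astype(int),
                    n_theta - 1)
    return pts, np.stack([ri, ti], axis=1)  # per-point pillar indices on the r-theta plane
```

The pillar feature encoder then aggregates the points falling into each (r, theta) cell, just as Cartesian pillars aggregate points per (x, y) cell, but on this unfolded grid.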
- How is the ego-motion determined for bi-directional context padding
For a corresponding input point P_t = [x_t, y_t, z_t, 1] at time t and P_t' = [x_t', y_t', z_t', 1] at time t', the ego-motion from t to t' is a known 4x4 matrix M from the driving logs, with P_t' = M P_t. For context padding we pad features in the BEV, so the warp is only 2D: the corresponding features are F_t at [x_t, y_t, 1] and F_t' at [x_t', y_t', 1]. We obtain the 2D ego-motion M_2d from M by dropping its z row and column, i.e., keeping the x-y rotation block and the x-y translation, so M_2d is also a known 3x3 matrix. We then use M_2d to warp the feature map from time t to t'.
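A minimal numpy sketch of this warp, following the description above (a nearest-neighbour lookup on a Cartesian BEV grid with hypothetical voxel size and origin; the actual implementation may differ, e.g., operating on the polar grid or using bilinear sampling):

```python
import numpy as np

def ego_motion_2d(M):
    """3x3 BEV ego-motion from the 4x4 pose M (P_t' = M @ P_t):
    keep the x-y rotation block and the x-y translation, drop z."""
    return np.array([[M[0, 0], M[0, 1], M[0, 3]],
                     [M[1, 0], M[1, 1], M[1, 3]],
                     [0.0,     0.0,     1.0]])

def warp_bev(feat_t, M, voxel=0.2, origin=(-51.2, -51.2)):
    """Warp BEV features (H, W, C) from frame t into frame t'.
    Rows index y, columns index x; `voxel`/`origin` are in metres."""
    H, W, _ = feat_t.shape
    M2_inv = np.linalg.inv(ego_motion_2d(M))             # maps t' coordinates back to t
    xs = origin[0] + (np.arange(W) + 0.5) * voxel        # metric centre of every t' cell
    ys = origin[1] + (np.arange(H) + 0.5) * voxel
    gx, gy = np.meshgrid(xs, ys)
    dst = np.stack([gx, gy, np.ones_like(gx)], axis=-1)  # homogeneous (H, W, 3)
    src = dst @ M2_inv.T                                 # corresponding coordinates at time t
    cx = np.round((src[..., 0] - origin[0]) / voxel - 0.5).astype(int)
    cy = np.round((src[..., 1] - origin[1]) / voxel - 0.5).astype(int)
    valid = (cx >= 0) & (cx < W) & (cy >= 0) & (cy < H)
    warped = np.zeros_like(feat_t)
    warped[valid] = feat_t[cy[valid], cx[valid]]
    return warped
```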
- Is the pillar-size on L207 dependent on ‘n’? Or is it held fixed, as ‘n’ changes in the experiments?
It is held fixed; the pillar size does not change as n changes in the experiments.
- In Table 1, why is there a significant drop in performance in the detection metrics when going from 16 to 32 sectors?
As the reviewer points out, performance drops across the board when going from 16 to 32 sectors. We believe this has to do with the receptive field. With 16 sectors, each sector spans pi/8 rad, which at 10 m range corresponds to an arc of (pi/8)*10 ≈ 3.9 m, large enough to cover a vehicle. With 32 sectors the span halves to about 2 m at 10 m range (and only about 1 m at 5 m), often not enough to capture a vehicle.
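For reference, the per-sector azimuthal span at a few ranges follows from simple arc-length arithmetic (a sketch, not a measurement):

```python
import math

def sector_span_m(n_sectors, range_m):
    """Azimuthal arc length covered by one sector at a given range."""
    return 2 * math.pi / n_sectors * range_m

for n in (16, 32):
    print(n, [round(sector_span_m(n, r), 1) for r in (5.0, 10.0, 20.0)])
# 16 sectors -> [2.0, 3.9, 7.9] m;  32 sectors -> [1.0, 2.0, 3.9] m
```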
For our method, det mAP for 32 sectors is 52.8 rather than 51.0 (the training job had not finished by the time of submission). So our drop from 16 to 32 sectors is smaller than for the other methods (ours -1.4, Han et al. -2.1, STROBE -3.7), which again shows the advantage of context padding.
# Reviewer 2ez1 [second set, rating 6]
- batch size = 1 latency of PolarPillars at different #sectors in a table.
| #sectors | lidar spin latency (ms) | PolarStream latency (ms) | end-to-end latency (ms) |
| -------- | -------- | -------- | -------- |
| 1 |50 | 44.9 | 94.9 |
| 2| 25|27.3|52.3|
| 4| 12.5 | 22.8 | 35.3 |
| 8 | 6.3 | 19.2 | 25.5 |
| 16 | 3.1 | 16.1 | 19.3 |
| 32 | 1.6 | 12.6 | 14.2 |
Please also refer to Figure 1. Note that PolarStream latency does not scale down linearly as #sectors increases: the per-sector computation decreases linearly, but latency is more complicated, as it also depends on how parallel the computation is, which in turn depends on whether the GPU is saturated. The same behavior is observed in Han et al. [13], Figure 5.
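For clarity, the end-to-end column above is just the per-sector acquisition time plus the network latency; a minimal sketch of that composition (assuming the 20 Hz nuScenes lidar, i.e., a 50 ms spin) reproduces the table up to rounding:

```python
def end_to_end_latency_ms(n_sectors, stream_latency_ms, spin_period_ms=50.0):
    """Wait for one sector of the spin (spin_period_ms / n_sectors),
    then run the network on it."""
    return spin_period_ms / n_sectors + stream_latency_ms

measured = {1: 44.9, 2: 27.3, 4: 22.8, 8: 19.2, 16: 16.1, 32: 12.6}  # from the table above
for n, t_net in measured.items():
    print(n, round(end_to_end_latency_ms(n, t_net), 1))
```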
- whether there's some chance that we can integrate SparseConv-based feature extractors into PolarPillars in the streaming mode.
The reviewer raises a good point. Our model is compatible with all voxel-based methods, and it is absolutely feasible to integrate SparseConv, especially for low-level features, to save more memory.
- Potential experiment results on detection only will be very helpful.
We tried detection-only training with 32 sectors: our method reaches 50.0 det mAP, compared to 48.7 for Han et al., which is consistent with the results in Table 1.
- minor concern on the end-to-end latency calculation for PolarPillars
Using a 10-sweep point cloud is common practice, as in CenterPoint [35], HotSpotNet [6], CVCNet [5], PointPainting [29], and PointPillars [15]. A 1-sweep point cloud is sparser and leads to inferior detection results compared to 10 sweeps (mAP 46.7 vs. 50.6).
We did not count point cloud loading time because we assume the points are immediately accessible onboard from the sensor; moreover, since all methods use 10-sweep data, loading time is not a variable we care about. We do, however, account for data warping time, i.e., the time to transform the points of the previous nine sweeps into the current frame. STROBE does not need data warping, while all other methods, including ours, do. The data warping time for a full sweep is 2 ms, approximately 4.6% of our full-sweep PolarStream latency.
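For concreteness, the data warping we time is just a rigid transform of each earlier sweep into the current frame; a minimal numpy sketch (the function name is ours, not from the released code):

```python
import numpy as np

def warp_sweep(points, M_prev_to_cur):
    """Transform one previous sweep (N, >=3; leading columns x, y, z) into the
    current frame given its 4x4 relative ego-pose from the driving logs."""
    xyz1 = np.concatenate([points[:, :3], np.ones((len(points), 1))], axis=1)
    warped = points.copy()
    warped[:, :3] = (xyz1 @ M_prev_to_cur.T)[:, :3]
    return warped
```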