# Reviewer dNFe [first set, rating 5]
- justification for using polar coordinates
Cuboid-shaped voxels waste computation and memory because they use larger feature maps than ours. Feature map sizes are shown below. For 8 or more sectors, Cartesian pillars use twice the feature map size of ours because of the way they partition the input region (Figure 4 shows an example with 8 sectors, where half of the Cartesian pillars' input region is empty).
| #sectors | 1 | 2| 4| 8| 16| 32|
| ----------- | ----------- |----------- |----------- |----------- |----------- |----------- |
| Cartesian | 512x512 |512x256|512x128|512x128|512x64|512x32|
|Polar | 512x512 |512x256|512x128|512x64|512x32|512x16|
Here is the memory usage of the feature map ‘canvas’ as referred to in PointPillars. The memory is per sector in MB:
| #sectors | 1 | 2| 4| 8| 16| 32|
| ----------- | ----------- |----------- |----------- |----------- |----------- |----------- |
| Cartesian | 33.6 |16.8|8.4|8.4|4.2|2.1|
|Polar | 33.6 |16.8|8.4|4.2|2.1|1.3|
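For concreteness, here is a minimal sketch of how these canvas numbers arise. We assume C = 32 feature channels and float32 (4 bytes per element), which approximately reproduces the table; the exact channel count may differ:
```python
# Minimal sketch: per-sector feature-map 'canvas' memory, assuming
# C = 32 channels and float32 (4 bytes per element); approximate.
def canvas_mb(h, w, channels=32, bytes_per_elem=4):
    """Memory of an h x w x channels canvas in MB."""
    return h * w * channels * bytes_per_elem / 1e6

for n_sectors in [1, 2, 4, 8, 16, 32]:
    # Polar sectors split the azimuth axis evenly, so the canvas width
    # halves each time the number of sectors doubles.
    polar_w = 512 // n_sectors
    # A Cartesian canvas must cover the axis-aligned bounding box of the
    # wedge-shaped sector, so it shrinks more slowly (e.g., 512x128 for
    # both 4 and 8 sectors in the table above).
    print(f"{n_sectors} sectors: polar canvas 512x{polar_w} "
          f"= {canvas_mb(512, polar_w):.1f} MB")
```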
We do not see a noticeable improvement in runtime because we measure runtime on a powerful V100 GPU, where feature-map computation can run in parallel as long as GPU memory is not saturated. However, installing a V100 in an onboard pipeline may not be feasible due to its power consumption, which would limit the battery range of an autonomous electric vehicle. The networks will need to be deployed on efficient embedded platforms like FPGAs, where the increased feature map size of Cartesian sectors will result in increased memory usage and latency.
On the other hand, the polar representation enables multi-scale context padding, an effective and efficient fix for the major limitation of streaming: the limited spatial context of each sector. Both previous streaming papers focus on solving this issue, but neither works as well as context padding.
- it is unclear whether their proposed solution is still effective when applied to these models with better performance
We also tried our full-sweep PolarStream with the same 3D ResNet-like backbone as in CenterPoint (named PLS1-heavy below); the table shows our detection and semantic segmentation results on the nuScenes validation set. Ours-PLS1-heavy outperformed Cylinder3D with lower latency. In our work we focus on onboard applications like streaming, and such heavy backbones cannot run onboard, so we chose the encoder and backbone from the PointPillars [15] model.
| Methods | det mAP | seg mIoU | runtime | #parameters |
| -------- | -------- | -------- | -------- | -------- |
| CenterPoint | 56.4 | | 11Hz | 149MB |
| Cylinder3D | | 76.1 | 11Hz (reported), 2Hz (reproduced) | 215MB |
| Ours-PLS1 | 51.2 | | 26Hz | 65MB |
| Ours-PLS1 | | 73.8 | 34Hz | 65MB |
| Ours-PLS1-heavy | 56.2 | | 11Hz | 149MB |
| Ours-PLS1-heavy | | 76.8 | 15Hz | 149MB |
- justify why they choose to perform these tasks jointly and whether this joint detection-segmentation task is a valid setting.
Autonomous driving is still a largely unsolved problem, and it is an open research question whether online perception via detection or via semantic segmentation is more favorable to the downstream tracking and planning modules of an AV.
Besides detecting 3D boxes directly, an equally plausible perception pipeline is lidar segmentation: differentiating foreground lidar points from the drivable surface, clustering those points, and tracking the clusters. Such a pipeline might also be better suited to detecting irregularly shaped generic objects, like tree branches fallen on the road, for which a 3D box may not be the best representation.
Our work also shows that detection accuracy improves when jointly trained with semantic segmentation. A combination of both can produce more reliable perception results for downstream tasks.
- typos
Thank you for having a careful look and finding these typos. We will make sure to correct them in the camera-ready version. We will also take another careful pass over the paper to make sure no other such typos remain.
# Reviewer 1vpo [first set, rating 5]
- Novelty
Our major contribution is not the polar representation. The first major contribution is solving the limited-spatial-context issue for streaming: we proposed multi-scale context padding, which must be built on top of a polar representation. Second, we dig into the limitations of the polar representation for detection and address them with range-stratified convolution & normalization and feature undistortion. The distortion problem we address is similar to that of omnidirectional cameras in [a, b]: [a] adjusts sampling locations by heuristics, and [b] adjusts kernel shapes at each row (also by heuristics). Their motivation is similar to ours, namely to undistort features by some adaptive sampling strategy. Sharing the same motivation is natural because we are addressing similar issues, but the approaches are substantially different: in feature undistortion we identify the connection between convolution and bilinear sampling and automate the sampling process via convolution. Thanks for pointing us to [a, b]; we were not familiar with 360° images. This shows that dealing with distorted data is a common issue, and we will add these works to the related work for discussion.
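To illustrate the convolution–bilinear-sampling connection, here is a simplified sketch of the fixed-grid version of the idea: resampling a polar feature map at Cartesian locations with `grid_sample`. Our feature undistortion replaces this fixed grid with sampling automated by convolution, so the code below is illustrative, not our exact method:
```python
import math
import torch
import torch.nn.functional as F

# Illustrative only: fixed-grid bilinear undistortion of a polar BEV
# feature map; our method automates the sampling via convolution instead.
# Assumes the theta axis spans [-pi, pi] and the r axis spans [0, r_max].
def polar_to_cartesian_features(feat, r_max):
    """feat: (B, C, R, Theta) polar features -> (B, C, H, W) Cartesian."""
    B, C, R, T = feat.shape
    H = W = R
    ys, xs = torch.meshgrid(
        torch.linspace(-r_max, r_max, H),
        torch.linspace(-r_max, r_max, W),
        indexing="ij",
    )
    r = torch.sqrt(xs ** 2 + ys ** 2)
    theta = torch.atan2(ys, xs)
    # Normalize sampling locations to [-1, 1] as grid_sample expects;
    # the grid's last dim is (x, y) = (theta index, range index).
    grid = torch.stack([theta / math.pi, 2 * r / r_max - 1], dim=-1)
    grid = grid.expand(B, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
```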
- Clarity
Thank you for having a careful look and giving us your feedback. We will take another pass at the paper before our camera-ready submission to improve the readability.
- points are accumulated from 10 successive frames: how did the authors pick this specific parameter, and what is the weight of this value on the performance? The Range Stratified Normalization normalizes over individual regions within a certain range rather than on the entire spatial domain: how are these regions selected? It seems to me that the choice/number of regions should be triggered by the distance of objects and scene components with respect to the sensor. As these regions seem to be obtained by discretizing the spatial range, what happens to an object lying in between two regions? Would it receive two different normalizations?
Using 10 successive frames is common practice on nuScenes, as in CenterPoint [35], HotSpotNet [6], CVCNet [5], PointPainting [29], and PointPillars [15], because a single frame results in very sparse point clouds, poor detection performance, and high velocity error (single-frame det mAP 46.7 vs. 50.6 for 10 frames). In Fig. 2 we show an example with 3 stratums. The feature map has spatial size 64x64 on the r-theta plane: the first dimension (64) is range (r) and the second is azimuth (theta). We divide the range dimension into 8 stratums, each of size 8x64. Below is an ablation study over #stratum, following Table 2; we see a trend that increasing #stratum helps detection. We chose 8 stratums so that each stratum is moderately larger than the convolution kernel size.
| #stratum | 1 | 2 |4|8|16|
| ----------- | ----------- |----------- |----------- |----------- |----------- |
| det mAP | 48.2 | 48.1| 48.8 | 49.1| 49.2|
If an object lies between two regions, it receives two different normalizations. This is desirable because even within one object, the end near the sensor looks bigger and the far end looks smaller, so we need different normalizations even within the same object. Also, we did not apply range-stratified convolution & normalization to the final layer, so the final regular 3x3 conv layer can leverage information across the different normalizations.
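A minimal sketch of the per-stratum normalization (module name hypothetical; simplified from the paper, e.g., we assume BatchNorm and an evenly divisible range dimension):
```python
import torch
import torch.nn as nn

# Hypothetical, simplified module: each stratum of the range axis keeps
# its own normalization statistics, so near and far regions (and the two
# ends of an object straddling a boundary) are normalized independently.
class RangeStratifiedNorm(nn.Module):
    def __init__(self, channels, n_stratums=8):
        super().__init__()
        self.norms = nn.ModuleList(
            nn.BatchNorm2d(channels) for _ in range(n_stratums)
        )

    def forward(self, x):  # x: (B, C, R, Theta)
        chunks = x.chunk(len(self.norms), dim=2)  # split along range axis
        return torch.cat(
            [norm(c) for norm, c in zip(self.norms, chunks)], dim=2
        )
```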
# Reviewer X4zp [first set, rating 5]
- comparison between polar and cartesian in memory and computation
Cuboid-shaped voxels waste computation and memory because they use larger feature maps than ours. A feature map size comparison is shown below. For 8 or more sectors, Cartesian pillars use twice the feature map size of ours because of the way they partition the input region (Figure 4 shows an example with 8 sectors, where half of the Cartesian pillars' input region is empty).
| #sectors | 1 | 2| 4| 8| 16| 32|
| ----------- | ----------- |----------- |----------- |----------- |----------- |----------- |
| Cartesian | 512x512 |512x256|512x128|512x128|512x64|512x32|
|Polar | 512x512 |512x256|512x128|512x64|512x32|512x16|
Here is the memory usage of the feature map ‘canvas’ as referred to in PointPillars. The memory is per sector in MB:
| #sectors | 1 | 2| 4| 8| 16| 32|
| ----------- | ----------- |----------- |----------- |----------- |----------- |----------- |
| Cartesian | 33.6 |16.8|8.4|8.4|4.2|2.1|
|Polar | 33.6 |16.8|8.4|4.2|2.1|1.3|
- visualization or case analyses
Our visualization shows that baseline methods produce many false-positive detection bboxes at sector boundaries for 32 sectors (possibly because the empty regions or ‘noise’ introduced by previous methods act similarly to adversarial examples), while our PolarStream with bidirectional padding has fewer false positives because we pad with valid features. Here is an anonymous link to the visualization: https://i.imgur.com/iYSmF2L.png

We will add it to the supplementary material in the revision.
- For Table 1, the detection results for <=4 sectors are worse than Cartesian, even with CP. Why is this the case? This seems to contradict the claim that the polar representation is better. And why do segmentation tasks not show such an effect? Insights are needed.
Our finding is that the Cartesian representation is better for full-sweep detection, while the polar representation is better for streaming and for semantic segmentation. Detection results are worse for n <= 4 because full-sweep polar detection is worse and, with n <= 4 sectors, context padding has little effect since each sector still has enough context (previous streaming methods show little effect there either). Streaming is an important onboard perception application because of its reduced latency (in Figure 1 we report 95ms for a full sweep vs. 14ms for 32 sectors), and latency is extremely important because AVs must respond to the dynamic environment immediately. A polar coordinate system is ideally suited to streaming and enables context padding.
Why are Cartesian coordinates better for full-sweep detection and polar coordinates better for semantic segmentation? In the following table we show that Cartesian coordinates have a higher performance upper bound in detection, while polar coordinates have a higher upper bound in semantic segmentation.
| | pillar size | input size | det mAP upper bound | seg mIoU upper bound |
| -------- | -------- | -------- | -------- |-------- |
| Cartesian | 0.2m x 0.2m | 512x512 |98.9|92.4|
| Polar | 0.098m x 0.0123 rad| 512x512 | 96.7|95.1|
These upper bounds are what the models could achieve if learning were 100% correct. They are obtained by replacing predictions with ground-truth labels during inference.
The upper bound is not 100 for semantic segmentation because the network performs pillar-level semantic segmentation, and pillar labels and point labels may not agree. There is less disagreement between point and pillar semantic labels with polar pillars, so polar pillar segmentation mIoU has a higher upper bound (also reported in the PolarNet paper).
For detection, the upper bounds are not 100 because multiple bboxes can cluster around one pillar, and one pillar can only represent one bbox, so all the other boxes are ignored or suppressed by NMS. This is more severe with polar pillars because pillars far from the sensor are large and collect more clustered boxes (results also show that polar pillars are less accurate at detecting distant objects). In addition, the distortion discussed in Sec. 3.3 makes learning harder for detection with polar pillars.
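For clarity, here is a sketch of how the segmentation upper bound can be computed (our assumed procedure, with hypothetical names): a perfect pillar-level classifier assigns every point the majority ground-truth label of its pillar, so the remaining mIoU gap is purely discretization error:
```python
import numpy as np

# Sketch of the segmentation upper bound: simulate a perfect pillar-level
# classifier by giving each point its pillar's majority ground-truth label.
def pillar_upperbound_predictions(point_labels, pillar_ids):
    """point_labels: (N,) class per point; pillar_ids: (N,) pillar index."""
    preds = np.empty_like(point_labels)
    for pid in np.unique(pillar_ids):
        mask = pillar_ids == pid
        labels, counts = np.unique(point_labels[mask], return_counts=True)
        preds[mask] = labels[np.argmax(counts)]  # majority vote per pillar
    # Evaluating preds against point_labels with mIoU gives the upper bound;
    # polar pillars disagree less with point labels, hence the higher bound.
    return preds
```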
- Range Stratified Convolution: it is extremely unclear how the kernels are allocated for each grid. Also, relevant ablation studies are needed.
In Fig. 2 we show an example with 3 stratums. The feature map has spatial size 64x64 on the r-theta plane: the first dimension (64) is range (r) and the second is azimuth (theta). We divide the range dimension into 8 stratums, each of size 8x64, and apply convolution independently within each stratum, with a separate kernel per stratum.
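A minimal sketch of the kernel allocation (hypothetical module name; simplified in that each stratum is convolved in isolation and border handling across stratums is omitted):
```python
import torch
import torch.nn as nn

# Hypothetical, simplified sketch: one independent 3x3 conv kernel per
# stratum, applied only to that stratum's rows of the feature map.
class RangeStratifiedConv(nn.Module):
    def __init__(self, in_ch, out_ch, n_stratums=8):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=1) for _ in range(n_stratums)
        )

    def forward(self, x):  # x: (B, C, R, Theta), e.g., R = Theta = 64
        chunks = x.chunk(len(self.convs), dim=2)  # 8 stratums of 8 rows each
        return torch.cat(
            [conv(c) for conv, c in zip(self.convs, chunks)], dim=2
        )
```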
We will add the following ablation study to Table 2.
| method | baseline | +range stratified conv | +range stratified conv&norm |
| -------- | -------- | -------- |-------- |
| det mAP | 48.2 | 48.9 |49.1|
We also have the following ablation study for #stratum.
| #stratum | 1 | 2 |4|8|16|
| ----------- | ----------- |----------- |----------- |----------- |----------- |
| det mAP | 48.2 | 48.1| 48.8 | 49.1| 49.2|
- Section 5.2, multi-scale padding: it is still unclear why detection performance improves when the number of sectors increases. It might be because there are more overlapping bbox proposals, or because of the NMS strategy. More detailed analysis is needed.
Is the reviewer asking why detection performance improves with more sectors for streaming in general, or specifically with multi-scale padding?
For streaming in general, our hypothesis is that a smaller sector results in smaller variation in point-cloud coordinates. This acts like a form of normalization and makes learning easier.
We believe that more overlapping box proposals do not help. We tried three NMS strategies:
1) local NMS: apply NMS within the current sector and gather boxes from all sectors after NMS
2) stateful-NMS: gather boxes from the current sector and all previous sectors, then apply NMS to this pooled set
3) global-NMS: gather boxes from all sectors, then apply NMS to all of them
Local NMS results in more overlapping boxes, because stateful-NMS and global-NMS suppress overlapping bboxes at sector boundaries while local NMS does not. Yet local NMS is 0.5 mAP worse than stateful-NMS and global-NMS. So the number of overlapping boxes does not matter; what matters is the quality of the bboxes, i.e., whether the model has learned powerful features.
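For reference, the three strategies differ only in which boxes are pooled before NMS; here is a schematic sketch (assuming an existing `nms` routine over scored boxes):
```python
# Schematic sketch of the three strategies; `nms` is any standard NMS
# routine over scored boxes, and each element of sector_boxes is the list
# of raw box proposals for one sector, in sweep order.
def local_nms(sector_boxes, nms):
    # NMS within each sector only; boundary duplicates survive.
    return [b for boxes in sector_boxes for b in nms(boxes)]

def stateful_nms(sector_boxes, nms):
    # Pool each new sector with the boxes kept so far, suppressing
    # boundary duplicates while still emitting results as sectors stream.
    kept = []
    for boxes in sector_boxes:
        kept = nms(kept + boxes)
    return kept

def global_nms(sector_boxes, nms):
    # Pool all sectors before NMS; requires waiting for the full sweep.
    return nms([b for boxes in sector_boxes for b in boxes])
```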
- directly accumulating 10 frames may incur localization error, which may interfere with the detection results. A single-frame baseline is also needed.
We accumulate 10 frames because (see the sketch after this list):
1) using 10 frames is common practice on the nuScenes benchmark, as in the nuScenes and CenterPoint papers;
2) a single frame results in very sparse point clouds and poor detection performance, especially in velocity estimation (single-frame det mAP 46.7 vs. 50.6 for 10 frames);
3) point clouds from previous frames are motion-compensated, which reduces localization error.
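A minimal sketch of the accumulation step (hypothetical names; we assume per-sweep poses are available as 4x4 transforms into the current lidar frame, as nuScenes provides):
```python
import numpy as np

# Sketch: map each past sweep into the current lidar frame via ego poses
# and append a relative-time channel (hypothetical helper, simplified).
def accumulate_sweeps(sweeps, transforms_to_current, time_offsets):
    """sweeps: list of (N_i, 3) xyz arrays; transforms_to_current: list of
    4x4 transforms; time_offsets: seconds relative to the current sweep."""
    out = []
    for pts, T, dt in zip(sweeps, transforms_to_current, time_offsets):
        homo = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous xyz1
        xyz = (homo @ T.T)[:, :3]  # motion-compensated coordinates
        out.append(np.hstack([xyz, np.full((len(pts), 1), dt)]))
    return np.vstack(out)
```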
- Not sure why the authors put emphasis on the point pillar backbone. Any other backbone can do the job.
We completely agree that any other backbone can do the job. In fact, any other encoder (voxel-based vs. pillar-based) would also work. Our main reason for choosing the PointPillars encoder and backbone was its low latency and high performance, which make it very attractive for onboard applications.
We did not experiment with other encoders/backbones because that falls outside the scope of this work.
- For feature undistortion, why only apply this method on the classification head? The more appropriate way is to apply it to the backbone. This experiment is needed.
The reviewer raises a good question. We did try feature undistortion in the backbone; the comparison is shown below.
| | det mAP | seg mIoU |latency|
| -------- | -------- | -------- |-------- |
| feature undistortion in backbone | 49.9 | 70.6 | 55ms|
|feature undistortion in head|51.2|73.4|45ms|
Feature undistortion in the backbone led to worse performance, especially in semantic segmentation. The motivation for feature undistortion is to mimic the Cartesian representation, because the Cartesian representation is better for detection (as discussed earlier). But the backbone is shared by the detection and segmentation heads, and the Cartesian representation is worse for semantic segmentation, so applying undistortion in the backbone is not optimal.
Another reason we do not want feature undistortion in the backbone is that the backbone's feature maps are large, so adding feature undistortion there adds a lot of computation and thus significantly higher latency.
- Section 5.3, diagnosis part: this part of analyzing two previous methods is not relevant to the main body.
While we greatly appreciate the reviewer's comments on the paper, we respectfully disagree on this point. We strongly believe the analysis in Section 5.3 will be very useful to the research community, as it gives a complete perspective on how streaming methods work and what their advantages and limitations are. Analysis sections like 5.3 broaden the reader's perspective, which will help the community devise even better solutions than ours in the future.