# Reviewer dNFe [first set, rating 5]

- Justification for using polar coordinates

Cuboid-shaped voxels waste computation and memory because they use larger feature maps than ours. The feature map sizes are shown below. For more than 8 sectors, Cartesian pillars use twice the feature map size of ours because of the way they partition the input region (Figure 4 shows an example with 8 sectors, where half of the input region is empty for Cartesian pillars).

| #sectors | 1 | 2 | 4 | 8 | 16 | 32 |
| -------- | --- | --- | --- | --- | --- | --- |
| Cartesian | 512x512 | 512x256 | 512x128 | 512x128 | 512x64 | 512x32 |
| Polar | 512x512 | 512x256 | 512x128 | 512x64 | 512x32 | 512x16 |

Here is the memory usage of the feature map "canvas" (as it is referred to in PointPillars), per sector in MB:

| #sectors | 1 | 2 | 4 | 8 | 16 | 32 |
| -------- | --- | --- | --- | --- | --- | --- |
| Cartesian | 33.6 | 16.8 | 8.4 | 8.4 | 4.2 | 2.1 |
| Polar | 33.6 | 16.8 | 8.4 | 4.2 | 2.1 | 1.3 |
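The per-sector numbers above follow directly from the grid sizes. A back-of-the-envelope check, assuming a 32-channel float32 canvas as in PointPillars (the channel count is our assumption here; the plain product matches the table to within rounding, except the last polar entry, which appears to include some extra overhead):

```python
# Canvas memory per sector: rows x cols x channels x 4 bytes (float32).
def canvas_mb(rows, cols, channels=32, bytes_per_value=4):
    return rows * cols * channels * bytes_per_value / 1e6

sectors        = [1, 2, 4, 8, 16, 32]
cartesian_cols = [512, 256, 128, 128, 64, 32]  # width stops halving at 8 sectors
polar_cols     = [512, 256, 128, 64, 32, 16]   # width keeps halving with #sectors

for n, cc, pc in zip(sectors, cartesian_cols, polar_cols):
    print(f"{n:2d} sectors: cartesian {canvas_mb(512, cc):4.1f} MB, "
          f"polar {canvas_mb(512, pc):4.1f} MB")
```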
We do not see a noticeable improvement in runtime because we measure runtime on a powerful V100 GPU, where feature map computation runs in parallel as long as GPU memory is not saturated. However, installing a V100 in an onboard pipeline may not be feasible: its power consumption would limit the battery range of an autonomous electric vehicle. The networks will instead need to be deployed on efficient embedded platforms such as FPGAs, where the larger feature maps of Cartesian sectors will result in increased memory and latency. On the other hand, the polar representation enables multi-scale context padding, an effective and efficient fix for the major limitation of streaming: the limited spatial context of each sector. Both previous streaming papers focus on this issue, but neither works as well as context padding.

- It is unclear whether their proposed solution is still effective when applied to these models with better performance

We also tried our full-sweep PolarStream with the same 3D ResNet-like backbone as in CenterPoint (named "PLS1 heavy" below); the following table reports detection and semantic segmentation on the nuScenes validation set. Ours-PLS1 heavy outperforms Cylinder3D with lower latency. In this work we focus on onboard applications like streaming, where such heavy backbones cannot run, so we chose the encoder and backbone from the PointPillars [15] model.

| Methods | det mAP | seg mIoU | runtime | #parameters |
| -------- | -------- | -------- | -------- | -------- |
| CenterPoint | 56.4 | | 11 Hz | 149 MB |
| Cylinder3D | | 76.1 | 11 Hz (reported), 2 Hz (reproduced) | 215 MB |
| Ours-PLS1 | 51.2 | | 26 Hz | 65 MB |
| Ours-PLS1 | | 73.8 | 34 Hz | 65 MB |
| Ours-PLS1 heavy | 56.2 | | 11 Hz | 149 MB |
| Ours-PLS1 heavy | | 76.8 | 15 Hz | 149 MB |

- Justify why they choose to perform these tasks jointly and whether this joint detection-segmentation task is a valid setting

Autonomous driving is still largely unsolved, and it is an open research question whether online perception via detection or via semantic segmentation is more useful to the downstream tracking and planning modules of an AV. Besides detecting 3D boxes directly, an equally plausible perception pipeline performs lidar segmentation, differentiates foreground lidar points from the drivable surface, clusters those points, and tracks the clusters. Such a pipeline may also be better suited to detecting irregularly shaped generic objects, such as tree branches fallen on the road, for which a 3D box may not be the best representation. Our work also shows that detection accuracy improves when jointly trained with semantic segmentation. A combination of both can produce more reliable perception results for downstream tasks.

- Typos

Thank you for the careful reading and for finding these typos. We will correct them in the camera-ready version and will take another careful pass over the paper to make sure no others remain.

# Reviewer 1vpo [first set, rating 5]

- Novelty

Our major contribution is not the polar representation itself. The first major contribution is solving the limited-spatial-context issue of streaming: we propose multi-scale context padding, which must be built on top of a polar representation. Second, we dig into the limitations of the polar representation for detection and address them with range-stratified convolution/normalization and feature undistortion. The distortion problem we address is similar to that of the omnidirectional cameras in [a, b]: [a] adjusts sampling locations by heuristics, and [b] adjusts kernel shapes at each row (also by heuristics). Their motivation is similar to ours, namely undistorting features via an adaptive sampling strategy. Sharing the motivation is natural because we address similar issues, but the approaches are substantially different: in feature undistortion we establish a connection between convolution and bilinear sampling and automate the sampling process with convolution. Thank you for pointing us to [a, b]; we were not familiar with 360° images. This shows that dealing with distorted data is a common issue, and we will add these works to the related work for discussion.

- Clarity

Thank you for the careful reading and the feedback. We will take another pass at the paper before the camera-ready submission to improve readability.

- Points are accumulated from 10 successive frames: how did the authors pick this specific parameter, and what is the weight of this value on the performance? Range Stratified Normalization normalizes over individual regions within a certain range rather than over the entire spatial domain: how are these regions selected? It seems the choice/number of regions should depend on the distance of objects and scene components from the sensor. As these regions seem to be obtained by discretizing the spatial range, what happens to an object lying between two regions? Would it receive two different normalizations?

Using 10 successive frames is common practice on nuScenes, as in CenterPoint [35], HotSpotNet [6], CVCNet [5], PointPainting [29], and PointPillars [15], because a single frame yields a very sparse point cloud, poor detection performance, and high velocity error (single-frame det mAP 46.7 vs. 50.6 for 10 frames).

In Fig. 2 we show an example with 3 strata. The feature map has spatial size 64x64 on the r-theta plane; the first dimension (64) is range (r) and the second is azimuth (theta). We divide the range dimension into 8 strata, each of size 8x64. Below is an ablation over the number of strata, following Table 2: adding strata helps detection. We choose 8 strata so that each stratum is moderately larger than the convolution kernel (a sketch of the per-stratum normalization follows the table).

| #strata | 1 | 2 | 4 | 8 | 16 |
| -------- | --- | --- | --- | --- | --- |
| det mAP | 48.2 | 48.1 | 48.8 | 49.1 | 49.2 |
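For concreteness, here is a minimal PyTorch sketch of the per-stratum normalization as described above. This is our illustrative reconstruction, not the actual implementation: the range axis of an (N, C, R, Theta) map is split into equal bands, each normalized with its own BatchNorm statistics.

```python
import torch
import torch.nn as nn

class RangeStratifiedNorm(nn.Module):
    """Per-stratum BatchNorm along the range axis of an (N, C, R, Theta) map."""
    def __init__(self, channels, num_strata=8):
        super().__init__()
        self.num_strata = num_strata
        self.norms = nn.ModuleList([nn.BatchNorm2d(channels) for _ in range(num_strata)])

    def forward(self, x):
        # Split the range dimension (dim=2) into equal bands, normalize each
        # band with its own statistics, then reassemble along range.
        bands = x.chunk(self.num_strata, dim=2)
        return torch.cat([norm(band) for norm, band in zip(self.norms, bands)], dim=2)

# Example matching the setup above: 8 strata of size 8x64 on a 64x64 r-theta map.
feat = torch.randn(2, 32, 64, 64)
print(RangeStratifiedNorm(32)(feat).shape)  # torch.Size([2, 32, 64, 64])
```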
If an object lies between two regions, it receives two different normalizations. This is desired: even within a single object, the end near the sensor looks larger and the far end looks smaller, so different normalizations are needed even within the same object. Also, we do not apply range-stratified convolution & normalization to the final layer, and the final regular 3x3 conv layer is able to combine information across the different normalizations.

# Reviewer X4zp [first set, rating 5]

- Comparison between polar and Cartesian in memory and computation

Cuboid-shaped voxels waste computation and memory because they use larger feature maps than ours. The feature map size comparison is shown below. For more than 8 sectors, Cartesian pillars use twice the feature map size of ours because of the way they partition the input region (Figure 4 shows an example with 8 sectors, where half of the input region is empty for Cartesian pillars).

| #sectors | 1 | 2 | 4 | 8 | 16 | 32 |
| -------- | --- | --- | --- | --- | --- | --- |
| Cartesian | 512x512 | 512x256 | 512x128 | 512x128 | 512x64 | 512x32 |
| Polar | 512x512 | 512x256 | 512x128 | 512x64 | 512x32 | 512x16 |

Here is the memory usage of the feature map "canvas" (as it is referred to in PointPillars), per sector in MB:

| #sectors | 1 | 2 | 4 | 8 | 16 | 32 |
| -------- | --- | --- | --- | --- | --- | --- |
| Cartesian | 33.6 | 16.8 | 8.4 | 8.4 | 4.2 | 2.1 |
| Polar | 33.6 | 16.8 | 8.4 | 4.2 | 2.1 | 1.3 |

- Visualization or case analyses

Our visualization shows that the baseline methods produce many false-positive detection boxes at sector boundaries for 32 sectors (possibly because the empty regions or "noise" introduced by previous methods act like adversarial examples), while our PolarStream with bidirectional padding produces fewer false positives because we pad with valid features. Here is an anonymous link to the visualization, which we will add to the supplementary material in the revision: https://i.imgur.com/iYSmF2L.png

![](https://i.imgur.com/iYSmF2L.png)

- For Table 1, the detection results for <= 4 sectors are worse than Cartesian, even with CP. Why is this the case? This seems to contradict the claim that the polar representation is better. And why do segmentation tasks not show such an effect? Insights are needed.

Our finding is that the Cartesian representation is better for full-sweep detection, while the polar representation is better for streaming and for semantic segmentation. Detection results are worse for n <= 4 because full-sweep detection is worse, and for n <= 4 sectors context padding has little effect because the spatial context is still sufficient (previous streaming methods have little effect either). Streaming is an important onboard perception application because of its reduced latency (in Figure 1 we report 95 ms for a full sweep vs. 14 ms for 32 sectors), and latency is critical because AVs must respond to the dynamic environment immediately. A polar coordinate system is ideally suited to streaming and enables context padding.

Why is the Cartesian representation better for full-sweep detection while the polar representation is better for semantic segmentation? The following table shows that Cartesian coordinates have a higher performance upper bound for detection, while polar coordinates have a higher upper bound for semantic segmentation.

| | pillar size | input size | det mAP upper bound | seg mIoU upper bound |
| -------- | -------- | -------- | -------- | -------- |
| Cartesian | 0.2 m x 0.2 m | 512x512 | 98.9 | 92.4 |
| Polar | 0.098 m x 0.0123 rad | 512x512 | 96.7 | 95.1 |

These upper bounds are what the models would achieve if learning were 100% correct; they are obtained by replacing predictions with ground-truth labels during inference. The upper bounds are below 100 for semantic segmentation because the network performs pillar-level segmentation, and pillar labels may disagree with point labels. There is less disagreement between point and pillar semantic labels with polar pillars, so polar pillar segmentation has a higher mIoU upper bound (as also reported in the PolarNet paper). For detection, the upper bounds are below 100 because several boxes can cluster around one pillar, and one pillar can only represent one box, so all other boxes are ignored or suppressed by NMS. With polar pillars this is more severe because pillars far from the sensor are large and collect more clustered boxes (our results also show that polar pillars are less accurate at detecting distant objects). In addition, the distortion discussed in Sec. 3.3 makes learning harder for detection with polar pillars.
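To illustrate the segmentation upper bound under our reading of "replace predictions with ground truth": rasterize point labels to pillars by majority vote, then score every point with its pillar's label. The sketch below reports point accuracy (per-class mIoU is computed analogously); function and variable names are ours.

```python
import numpy as np

def pillar_upper_bound_accuracy(pillar_ids, point_labels):
    """pillar_ids[i]: flat pillar index of point i; point_labels[i]: its class."""
    majority = {}
    for pid in np.unique(pillar_ids):
        labels = point_labels[pillar_ids == pid]
        majority[pid] = np.bincount(labels).argmax()  # perfect pillar-level prediction
    pred = np.array([majority[pid] for pid in pillar_ids])
    return float((pred == point_labels).mean())

# Toy example: two points sharing a pillar but not a label cap the bound below 1.
ids    = np.array([0, 0, 1, 1])
labels = np.array([2, 3, 1, 1])
print(pillar_upper_bound_accuracy(ids, labels))  # 0.75
```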
- Range Stratified Convolution: it is extremely unclear how the kernels are allocated for each grid. Also, relevant ablation studies are needed.

In Fig. 2 we show an example with 3 strata. The feature map has spatial size 64x64 on the r-theta plane; the first dimension (64) is range (r) and the second is azimuth (theta). We divide the range dimension into 8 strata, each of size 8x64, and apply convolution independently within each stratum. We will add the following ablation study to Table 2.

| method | baseline | +range-stratified conv | +range-stratified conv & norm |
| -------- | -------- | -------- | -------- |
| det mAP | 48.2 | 48.9 | 49.1 |

We also have the following ablation over the number of strata.

| #strata | 1 | 2 | 4 | 8 | 16 |
| -------- | --- | --- | --- | --- | --- |
| det mAP | 48.2 | 48.1 | 48.8 | 49.1 | 49.2 |

- Section 5.2, multi-scale padding: it is still unclear why detection performance improves when the number of sectors increases. It might be because there are more overlapping bbox proposals, or the NMS strategy. More detailed analysis is needed.

Is the reviewer asking why detection performance improves with more sectors for streaming in general, or specifically for multi-scale padding? For streaming, our hypothesis is that a smaller sector yields smaller variation in point cloud coordinates; this acts similarly to normalization and makes learning easier. We do not believe more overlapping box proposals help. We tried three NMS strategies (sketched after this list):

1. Local NMS: run NMS within the current sector, then gather the surviving boxes from all sectors.
2. Stateful NMS: gather the boxes from the current sector and the previous sectors, then apply NMS to them.
3. Global NMS: gather the boxes from all sectors, then apply NMS to all of them.

Local NMS results in more overlapping boxes because stateful and global NMS suppress overlapping boxes at sector boundaries, yet local NMS is 0.5 mAP worse than the other two. So the number of overlapping boxes does not matter; what matters is the quality of the boxes, i.e., whether the model has learned powerful features.
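Here is a sketch of the three strategies for clarity. This is our reconstruction; `nms` stands for any standard box-NMS routine (e.g., a wrapper around torchvision's) and is a placeholder, not an API from the paper.

```python
from typing import Callable, List

Box = tuple  # placeholder box type; any representation works
NMSFn = Callable[[List[Box]], List[Box]]  # suppresses overlapping boxes

def local_nms(sector_boxes: List[List[Box]], nms: NMSFn) -> List[Box]:
    # 1) Run NMS inside each sector independently, then concatenate survivors;
    #    duplicates straddling sector boundaries are never compared.
    out: List[Box] = []
    for boxes in sector_boxes:
        out.extend(nms(boxes))
    return out

def stateful_nms(sector_boxes: List[List[Box]], nms: NMSFn) -> List[Box]:
    # 2) As each sector streams in, run NMS on its boxes together with the
    #    boxes kept so far, suppressing boundary duplicates online.
    kept: List[Box] = []
    for boxes in sector_boxes:
        kept = nms(kept + boxes)
    return kept

def global_nms(sector_boxes: List[List[Box]], nms: NMSFn) -> List[Box]:
    # 3) Wait for the full sweep, then run a single NMS over all boxes.
    return nms([b for boxes in sector_boxes for b in boxes])
```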
- Directly accumulating 10 frames may incur localization error, which may interfere with the detection results. A single-frame baseline is also needed.

We accumulate 10 frames because: (1) using 10 frames is common practice on the nuScenes benchmark, as in the nuScenes and CenterPoint papers; (2) a single frame yields a very sparse point cloud and poor detection performance, especially for velocity estimation (single-frame det mAP 46.7 vs. 50.6 for 10 frames); (3) the point clouds from previous frames are motion compensated, which reduces localization error.

- Not sure why the authors put emphasis on the PointPillars backbone. Any other backbone can do the job.

We completely agree that any other backbone can do the job; in fact, any other encoder (voxel-based vs. pillar-based) would also work. Our main reason for choosing the PointPillars encoder and backbone is its low latency and high performance, which make it very attractive for onboard applications. We did not experiment with other encoders/backbones because that falls outside the scope of this work.

- For feature undistortion, why apply this method only on the classification head? The more appropriate way is to apply it on the backbone. This experiment is needed.

The reviewer raises a good question. We did try feature undistortion in the backbone; the comparison is shown below (a sketch of the bilinear-sampling view of undistortion appears at the end of this response).

| | det mAP | seg mIoU | latency |
| -------- | -------- | -------- | -------- |
| feature undistortion in backbone | 49.9 | 70.6 | 55 ms |
| feature undistortion in head | 51.2 | 73.4 | 45 ms |

Feature undistortion in the backbone led to worse performance, especially for semantic segmentation. The motivation for feature undistortion is to mimic the Cartesian representation, because the Cartesian representation is better for detection (as discussed earlier). But the backbone is shared by the detection and segmentation heads, and the Cartesian representation is worse for semantic segmentation, so applying undistortion in the backbone is not optimal. Another reason not to apply undistortion in the backbone is that its feature maps are large, so adding feature undistortion there would add a lot of computation and thus significantly higher latency.

- Section 5.3, diagnosis part: this part analyzing two previous methods is not relevant to the main body.

While we greatly appreciate the reviewer's comments on the paper, we respectfully disagree. We strongly believe the analysis in Section 5.3 will be very useful for the research community, as it gives a complete perspective on how streaming methods work, what their advantages are, and what their limitations are. Analysis sections like 5.3 broaden readers' perspective, which will help them devise even better solutions in the future.
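For reference, the feature undistortion discussed above rests on viewing the mapping between polar and Cartesian layouts as bilinear sampling; the paper automates that sampling with convolution, while the explicit version below only illustrates the resampling it mimics. A minimal sketch using PyTorch's `grid_sample` (sizes and names are illustrative, not the paper's implementation):

```python
import math
import torch
import torch.nn.functional as F

def polar_to_cartesian(feat_polar, r_max, out_size):
    """feat_polar: (N, C, R, Theta), range in [0, r_max], azimuth in [-pi, pi)."""
    n = feat_polar.shape[0]
    # Cartesian target coordinates in meters.
    xs = torch.linspace(-r_max, r_max, out_size)
    y, x = torch.meshgrid(xs, xs, indexing="ij")
    r = (x ** 2 + y ** 2).sqrt()
    theta = torch.atan2(y, x)
    # grid_sample expects coordinates in [-1, 1]; grid[..., 0] indexes the last
    # input dim (azimuth) and grid[..., 1] indexes the range dim.
    grid = torch.stack([theta / math.pi, 2.0 * r / r_max - 1.0], dim=-1)
    grid = grid.unsqueeze(0).expand(n, -1, -1, -1)
    # Bilinear sampling; cells with r > r_max fall outside and are zero-padded.
    return F.grid_sample(feat_polar, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=False)

# Example: a 64x64 polar map resampled onto a 128x128 Cartesian canvas.
out = polar_to_cartesian(torch.randn(1, 32, 64, 64), r_max=50.0, out_size=128)
print(out.shape)  # torch.Size([1, 32, 128, 128])
```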
