# (7/24) Computer Vision Recent Paper: Joint Monocular 3D Vehicle Detection and Tracking
###### tags:`paper`
[toc]
---
## Before Meeting
:::success
### Author
- Hou-Ning Hu
    - https://scholar.google.com/citations?user=VNSzxhUAAAAJ&hl=en
- Qi-Zhi Cai
    - https://scholar.google.com/citations?user=oyh-YNwAAAAJ&hl=en
- Min Sun
    - https://scholar.google.com/citations?user=1Rf6sGcAAAAJ&hl=zh-TW
- Dequan Wang
    - https://scholar.google.com/citations?user=kFvxQ7YAAAAJ&hl=en
- Ji Lin
    - https://scholar.google.com/citations?user=dVtzVVAAAAAJ&hl=en
- Trevor Darrell
    - https://scholar.google.com/citations?user=bh-uRFMAAAAJ&hl=en
- Fisher Yu
    - https://scholar.google.com/citations?user=-XCiamcAAAAJ&hl=en
:::
[refer](https://arxiv.org/pdf/1811.10742.pdf)
---
## Recent Paper
---
### Joint Monocular 3D Vehicle Detection and Tracking
:::success
#### Abstract
- 3D vehicle detection and tracking from a monocular camera requires detecting and associating vehicles, and estimating their locations and extents together
- Our approach leverages 3D pose estimation to learn 2D patch association over time and uses temporal information from tracking to obtain stable 3D estimation
- Our method also leverages 3D box depth ordering and motion to link together the tracks of occluded objects
:::
:::info
#### Detail
- Introduction
    - perceive the 3D world in both space and time from simple sequences of 2D images rather than 3D point clouds
- Good tracking helps 3D detection, as information along consecutive frames is integrated. Good 3D detection helps tracking, as ego-motion can be factored out.
- deep network architecture to track and detect vehicles jointly in 3D from a series of monocular color images
    - After detecting 2D bounding boxes of targets, we utilize both world coordinates and re-projected camera coordinates to associate instances across frames
- Related Works
- Object tracking
- Object detection
- Driving datasets
- Joint 3D Detection and Tracking
- Our goal is to track objects and infer their precise 3D location, orientation, and dimension from a single monocular video stream
- We model 3D information with a layer-aggregating network on the object proposals.
    - We leverage the estimated 3D information of current trajectories to track them through time, using 3D re-projection to generate a similarity metric between all trajectories and detected boxes
- Problem Formulation
        - a convolutional network pipeline trained on a very large amount of ground-truth supervision
- Candidate Box Detection
- Faster R-CNN [35] trained on our dataset to provide bounding boxes of object proposals
- 3D box center projection
            - To estimate the 3D layout from a single image more accurately, we extend the design of the Region Proposal Network (RPN) to hypothesize a projected 2D point from the 3D bounding box center
            - Raw images are fed into a deep layer-aggregated ConvNet to generate global convolutional feature maps
- 3D Box Estimation
- Data Association and Tracking
        - Occlusion-aware Data Association
- Depth-Ordering Matching
- Motion Model
- Deep Motion Estimation and Update
- 3D Vehicle Tracking Simulation Dataset
- Experiments
- 3D Estimation
- Object Tracking
- Overall Evaluation
- Implementation
- Training
- Dataset
- Results
- 3D for Tracking
- Tracking for 3D
- Real-world Evaluation
- Amount of Data Matters
:::
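The re-projection step in the association above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the pinhole intrinsic matrix `K`, the helper names, and the use of plain pixel distance as the cost are all assumptions.

```python
import numpy as np

def project_to_image(center_3d, K):
    """Project a 3D point (camera coordinates) to pixel coordinates
    with a pinhole model; K is the 3x3 intrinsic matrix."""
    p = K @ np.asarray(center_3d, dtype=float)
    return p[:2] / p[2]

def box_center(box):
    """Center of a 2D detection box given as (x1, y1, x2, y2)."""
    return np.array([(box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0])

def association_cost(track_center_3d, det_box_2d, K):
    """Pixel distance between a trajectory's re-projected 3D center and a
    detected 2D box center; lower cost means a more likely match."""
    return float(np.linalg.norm(
        project_to_image(track_center_3d, K) - box_center(det_box_2d)))
```

Costs like this, computed between every trajectory and every detection, would form the similarity matrix that a matcher (e.g. the Hungarian algorithm) consumes.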
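One plausible reading of the depth-ordering matching idea is sketched below, assuming each track carries a 2D box and a camera-frame depth; the IoU threshold and the greedy nearest-first scan are my own choices, not the paper's.

```python
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def occluded_tracks(tracks, iou_thresh=0.3):
    """tracks: list of dicts with 'box' (2D) and 'depth' (camera-frame z).
    Scan nearest-first; a track is flagged occluded when a nearer track's
    box overlaps it beyond iou_thresh. Returns the set of occluded indices."""
    order = sorted(range(len(tracks)), key=lambda i: tracks[i]['depth'])
    occluded = set()
    for pos, i in enumerate(order):
        for j in order[:pos]:
            if iou(tracks[i]['box'], tracks[j]['box']) > iou_thresh:
                occluded.add(i)
                break
    return occluded
```

Tracks flagged as occluded can be kept alive rather than terminated, so their identities can be re-linked once the occluder passes.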
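The motion model can be approximated by a constant-velocity predict/update loop over each track's 3D world position. The paper learns its update deep (as noted under "Deep Motion Estimation and Update"), so the exponential blend below is only a stand-in, and `alpha` and the state layout are assumptions.

```python
import numpy as np

def predict(position, velocity, dt=1.0):
    """Constant-velocity prediction of a track's 3D world position."""
    return position + dt * velocity, velocity

def update(position, velocity, observed_position, alpha=0.5, dt=1.0):
    """Blend the prediction with a new 3D observation (exponential
    smoothing here; a learned update would replace this blend)."""
    residual = observed_position - position
    new_position = position + alpha * residual
    new_velocity = velocity + (alpha / dt) * residual
    return new_position, new_velocity
```

Keeping the state in world coordinates is what lets ego-motion be factored out of the association, as the introduction argues.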
:::warning
#### Conclusion
- In this paper, we learn 3D vehicle dynamics from monocular videos. We propose a novel framework that combines spatial visual feature learning and global 3D state estimation to track moving vehicles in the 3D world.
:::
---