:::info
# <center><i class="fa fa-edit"></i> Week 4: Truncated Signed Distance Function </center>
:::
# Paper Summary ~ KinectFusion: Real-Time Dense Surface Mapping and Tracking (2011)
**Abstract**
Newcombe et al. (2011) describe a novel approach in which a low-cost depth camera (the Microsoft Kinect) is used for dense volumetric reconstruction as well as tracking of complex indoor scenes. The depth camera system is affordable to many (probably between 15 and 30 euros), making it ideal for DIY projects.
*How does the system work?*
The system fuses the depth data obtained from the Kinect into a single global implicit surface model while simultaneously tracking the pose of the sensor by aligning the live depth data with the continuously growing global model. To realise this frame-to-model tracking in real time, parallel processing on the GPU comes in handy: it ensures that high accuracy is achieved, drift is very limited, and a level of detail suitable for AR applications is reached.
___
## **Introduction**
Imagine finding yourself in a room filled with objects, your goal being to reconstruct a 3D model of the room with all objects correctly positioned. Assume you possess eyes capable of continuously computing depth, exact position, and orientation as you scan the room. This introduces the concept of sensor pose. To achieve a dense volumetric reconstruction of the scene using our depth camera, we must remember that the sensor has 6 degrees of freedom (DOF), encompassing translation in 3D space and rotation about the three axes. Therefore, to correctly position the captured scene within the implicit 3D global coordinate system (the implicit 3D surface model), we need a matrix that projects the captured scenes from the camera coordinate system to this global frame.
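As an illustrative sketch (my own, not code from the paper), the 6-DOF pose can be packed into a 4 × 4 homogeneous transform that maps points from the camera coordinate system into the global frame:

```python
import numpy as np

# Hypothetical 6-DOF sensor pose: a 3x3 rotation R and a translation t,
# combined into a 4x4 homogeneous transform T_g_k that maps homogeneous
# points from the camera frame at time k into the global frame g.
def make_pose(R, t):
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Example: camera rotated 90 degrees about the vertical (y) axis and
# translated 1 m along the global x axis.
theta = np.pi / 2
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])
T_g_k = make_pose(R, np.array([1.0, 0.0, 0.0]))

p_camera = np.array([0.0, 0.0, 2.0, 1.0])  # point 2 m in front of the camera
p_global = T_g_k @ p_camera                # the same point in the global frame
print(p_global[:3])                        # ≈ [3, 0, 0]
```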

**Localisation/tracking** is the real-time attempt to estimate the sensor pose.
**Why is localisation/tracking so important?**
Recall that the Kinect depth camera system fuses the depth data into an implicit 3D model, and this makes an accurate pose crucial: without it, the incoming depth measurements cannot be aligned with the existing model, which eventually leads to errors and a distorted reconstruction.
Unlike conventional methods, where the current frame is aligned with the previous frame (frame-to-frame tracking), aligning frames with the global 3D model minimises drift (the gradual accumulation of errors).
Equally, with the estimated sensor pose it is possible to make a surface prediction. The system uses the estimated pose to generate a dense surface prediction with a technique called raycasting, where the global 3D model is raycast into a virtual camera located at the estimated pose. Ideally, the predicted surface is what we expect the camera to see if the pose estimate is correct and the model is accurate. The incoming data from the camera is then aligned against the predicted surface using an algorithm known as Iterative Closest Point (ICP).
The system has checks for assessing the validity as well as the stability of each pose update. When tracking is seen to be inaccurate, for example when the scene geometry provides too few constraints, the system can detect the potential failure and subsequently prevent the model from updating, avoiding corruption of the global model.
## **Background**
**Kinect Sensor**
The Kinect sensor was used to generate 11-bit, 640 × 480 depth maps, which it achieves with a structured-light depth sensor. The main associated challenge is the numerous holes where there is no depth reading, caused by certain surface materials, glancing angles, or thin structures, as well as potential motion blur when the camera is moved too fast.
**Drift-Free SLAM for AR**
Systems like MonoSLAM and PTAM enabled flexible, infrastructure-free Augmented Reality (AR) applications with real-time mapping. However, these systems were optimised for efficient camera tracking and produced sparse point cloud models, enabling only rudimentary scene reconstruction.
**Dense Scene Representation**
Popular approaches for fusing dense depth measurements from sensors include occupancy mapping, where a grid of cells encodes information about free space and surfaces. Another approach is the signed distance function (SDF), which represents a surface as the zero crossing of a distance field; a more recent refinement of it is the truncated signed distance function (TSDF).
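To make the truncation concrete, here is a tiny sketch of my own (the truncation distance `mu` is an assumed value, not from the paper): the signed distance is simply clamped to a narrow band around the surface, since values far from the surface carry little extra shape information.

```python
def truncate_sdf(sdf, mu=0.03):
    """Clamp a signed distance (metres) to the truncation band [-mu, mu].

    mu is an assumed truncation distance, not a value from the paper.
    """
    return max(-mu, min(mu, sdf))

print(truncate_sdf(0.004))  # 0.004  (near the surface: kept as-is)
print(truncate_sdf(0.5))    # 0.03   (deep in free space: saturated)
print(truncate_sdf(-0.2))   # -0.03  (far behind the surface: saturated)
```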
**Dense SLAM with active depth sensing**
Commodity depth cameras like the Kinect have spurred real-time 3D reconstruction; examples include the RGB-D approaches.
### Method
* **Surface Measurement** - Dense vertex map and normal map pyramids are generated from the Kinect depth data.
* **Surface Reconstruction** - Surface measurements are integrated into the scene model, which is maintained with a volumetric truncated signed distance function representation.
* **Surface Prediction** - Raycasting the signed distance function into the estimated frame provides a dense surface prediction against which the live depth map is aligned.
* **Sensor Tracking** - Achieved using multi-scale ICP alignment between the current sensor measurement and the predicted surface.
### Surface Measurement (Pre-processing Stage)
The received data, as mentioned, contains holes and noise, so we first improve the data quality by applying a bilateral filter, which reduces noise while preserving sharp discontinuities in depth. The filtered depth values are back-projected into the sensor's coordinate frame to create a dense vertex map. Recall that the frames from the depth sensor are surface measurements on a regular grid, so we can use the cross product between neighbouring vertices to obtain a normal map. A mask is used to assess the validity of the depth values. Finally, a multi-scale surface representation in the form of a vertex and normal map pyramid is computed.
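A minimal sketch of the back-projection and normal computation (assumed Kinect-like intrinsics; the bilateral filtering and pyramid steps are omitted for brevity):

```python
import numpy as np

def back_project(D, fx, fy, cx, cy):
    """Back-project a depth map D (H x W, metres) into a dense vertex map
    using the pinhole model with intrinsics (fx, fy, cx, cy)."""
    H, W = D.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    X = (u - cx) / fx * D
    Y = (v - cy) / fy * D
    return np.dstack([X, Y, D])                 # vertex map, H x W x 3

def normals_from_vertices(V):
    """Estimate normals from cross products of neighbouring vertices."""
    dx = V[:, 1:, :] - V[:, :-1, :]             # differences along the rows
    dy = V[1:, :, :] - V[:-1, :, :]             # differences along the columns
    n = np.cross(dx[:-1, :, :], dy[:, :-1, :])  # (H-1) x (W-1) x 3
    return n / (np.linalg.norm(n, axis=2, keepdims=True) + 1e-12)

# Toy usage: a flat wall 2 m away gives constant normals along the z axis
# (the sign convention may need flipping to face the camera).
D = np.full((4, 4), 2.0)
V = back_project(D, fx=525.0, fy=525.0, cx=1.5, cy=1.5)
print(normals_from_vertices(V)[0, 0])           # [0. 0. 1.]
```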
### Mapping as Surface Reconstruction
Consecutive depth frames, together with their camera poses, are fused into a 3D reconstruction using a volumetric truncated signed distance function. In this representation the surface lies at the zero crossing of the stored signed distance: values grow increasingly positive when moving from the visible surface into free space, and negative when moving from the surface into the non-visible region behind it; the result is a global surface fusion. Each frame is fused with the existing TSDF using a weighted running average, which combines data from multiple noisy measurements; only regions that correspond to valid measurements are updated.
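A minimal per-voxel sketch of that weighted running average (the names and the weight cap are illustrative, not the paper's implementation):

```python
def fuse_voxel(tsdf, weight, d, w=1.0, w_max=100.0):
    """Fold a new truncated signed distance d (with weight w) into the
    voxel's stored running average (tsdf, weight)."""
    new_tsdf = (weight * tsdf + w * d) / (weight + w)
    new_weight = min(weight + w, w_max)  # cap so the model can still adapt
    return new_tsdf, new_weight

# Three noisy TSDF samples for one voxel average out over time.
v, w = 0.0, 0.0
for meas in (0.02, 0.01, -0.01):
    v, w = fuse_voxel(v, w, meas)
print(v)  # ≈ 0.0067, the running average of the three measurements
```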
### Surface Prediction from Ray Casting
Recall that the system tracks frames against a globally fused 3D model. To achieve this, the system needs to know what the global model looks like from the perspective of the sensor, and this brings in the idea of surface prediction. The system performs a raycast of the global TSDF model into a virtual camera view: for each pixel in the virtual view, a ray is cast from the camera position into the scene, and the system marches along it until there is a zero crossing in the TSDF. The process yields a predicted vertex map and normal map representative of what the sensor should see.
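A per-pixel sketch of this ray march (my own simplification: a fixed step size and a generic `tsdf_lookup` callable standing in for the voxel volume):

```python
import numpy as np

def raycast(tsdf_lookup, origin, direction, t_min, t_max, step):
    """March along origin + t * direction until the TSDF changes sign from
    positive (free space) to non-positive (at/behind the surface), then
    refine the crossing by linear interpolation between the two samples."""
    prev_t = t_min
    prev_val = tsdf_lookup(origin + prev_t * direction)
    t = t_min
    while t < t_max:
        t += step
        val = tsdf_lookup(origin + t * direction)
        if prev_val > 0.0 >= val:                # zero crossing found
            t_star = prev_t + step * prev_val / (prev_val - val)
            return origin + t_star * direction   # predicted surface vertex
        prev_t, prev_val = t, val
    return None                                  # ray left the volume

# Usage with an analytic stand-in for the TSDF: a sphere of radius 1 m.
sphere = lambda p: np.linalg.norm(p) - 1.0
hit = raycast(sphere, origin=np.array([0.0, 0.0, -3.0]),
              direction=np.array([0.0, 0.0, 1.0]),
              t_min=0.0, t_max=5.0, step=0.01)
print(hit)  # ≈ [0, 0, -1], the front of the sphere
```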
### Sensor Pose estimation
The system tracks the current sensor frame against the globally fused 3D model by performing a dense Iterative Closest Point (ICP) alignment, where the measured surface is aligned with the predicted surface and all of the data is used for pose estimation.
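A sketch of one dense point-to-plane ICP iteration under the usual small-angle linearisation (the per-pixel data association and the coarse-to-fine pyramid are assumed to have happened already; names are illustrative):

```python
import numpy as np

def icp_step(src, dst, dst_n):
    """Solve one linearised point-to-plane ICP step.

    src:   measured vertices (N x 3), already in the global frame
    dst:   predicted vertices (N x 3) associated with them
    dst_n: predicted normals (N x 3)
    Returns the 6-vector [wx, wy, wz, tx, ty, tz] of incremental rotation
    (small angles) and translation that best aligns src onto dst.
    """
    A = np.hstack([np.cross(src, dst_n), dst_n])  # N x 6 Jacobian
    b = np.sum(dst_n * (dst - src), axis=1)       # point-to-plane residuals
    x, *_ = np.linalg.lstsq(A, b, rcond=None)     # min-norm least squares
    return x

# Toy example: three points on a plane shifted 1 cm along z.
src = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
dst = src + np.array([0.0, 0.0, 0.01])
n = np.tile(np.array([0.0, 0.0, 1.0]), (3, 1))
print(icp_step(src, dst, n))  # recovers tz ≈ 0.01, other components ≈ 0
```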
### Experiment
* Frame to Frame Tracking - Frames fused based on frame-to-frame ICP to show rapid error accumulation
* Partial loop closure - Used L frames out of N to show system's convergence properties
* Complete loop closure (frame-to-model) - Excellent tracking and scene reconstruction quality
* Repeated loop closures - N frames M times led to better alignment and reconstruction
The system was found to show graceful degradation in quality when fewer input frames were used or when the voxel resolution was reduced. Equally, the system was robust to varying lighting conditions.
### Challenges
+ Requires significant memory at large scales
+ Long trajectories lead to misalignment and inevitable drift