# Notes on: [Past, Present, and Future of Simultaneous Localization And Mapping: Towards the Robust-Perception Age](https://ieeexplore.ieee.org/document/7747236)

#### Authors: Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, José Neira, Ian Reid, John J. Leonard
#### Published in: IEEE Transactions on Robotics (Volume: 32, Issue: 6, Dec. 2016)
#### Notes by: [Aniket Gujarathi](https://www.linkedin.com/in/aniket-gujarathi/)

### Introduction

**WHAT is SLAM?**
* SLAM comprises two parts - simultaneous state estimation and construction of a model (map) of the environment using the information from the sensors.
* There are two main uses of a map:
    * It is required to support other tasks, like planning or providing a visual representation to a human operator.
    * It limits the error committed in state estimation.
* SLAM is used where a prior map is not available and needs to be built.
* Sometimes the exact positions of landmarks are known a priori - beacons, GPS (which can be considered as beacons), etc.
* However, indoor applications rule out the use of GPS to limit the estimation error. In such scenarios, SLAM comes into the picture.
* Lower-level SLAM - intersection of other research fields like computer vision, signal processing, etc.
* Higher-level SLAM - mix of geometry, graph theory, optimization, etc.
* The paper presents an overview of the current state of SLAM, the open problems, and the future scope.

**Do autonomous robots need SLAM?**
* Earlier, robots used just odometry information to estimate their pose. However, these readings, taken from wheel encoders, drifted with time.
* This led to the introduction of SLAM, which uses observations of external landmarks to reduce the drift and increase accuracy. With current visual and inertial information, this problem has been sufficiently solved.

**So do we actually need SLAM?**
1. The SLAM research done in the last decade has led to the emergence of research topics such as visual-inertial navigation (VIN), which can be seen as a reduced SLAM problem. SLAM has led to the study of sensor fusion under more challenging setups.
2. The second point is related to estimating the true topology of the environment. Without loop closure ("loop closure" is very important - without it, SLAM reduces to odometry), a robot performing odometry perceives the world as an "infinite corridor" (left figure), whereas with loop closure the robot learns that the corridor keeps intersecting itself (right figure). This concept can be observed in the figure below.

![](https://i.imgur.com/hRa1uhK.png =300x100)

**So why can't we give up the metric information and go ahead with place recognition?**
* The metric information makes place recognition more robust and discards false loop closures.
* SLAM provides a natural defense against wrong data association and perceptual aliasing, where similar-looking scenes may deceive place recognition algorithms.

3. Also, SLAM is needed because many applications require a globally consistent map for operation.

**So is SLAM solved?**
* As SLAM has become such a broad field, to answer this question certain aspects of the system must be considered first - the robot (motion type, sensors, etc.), the environment (dimension, presence of landmarks, etc.), and the performance requirements.
* Mapping a 2D indoor environment with wheel encoders and a laser scanner can be considered largely solved.
* Vision-based navigation with slow-moving robots (e.g. Mars rovers), visual-inertial odometry, etc. can be considered mature research fields.
* Other combinations of the aspects (robot, environment, performance) still require a large amount of fundamental research.
* For this third era of SLAM, the key requirements are: robust performance, high-level understanding, resource awareness, and task-driven perception.

### Anatomy of a modern SLAM system
* The architecture of a SLAM system comprises two parts - the front-end and the back-end.
* Front-end - abstracts sensor data into models amenable for estimation.
* Back-end - performs inference on the abstracted data.

![](https://i.imgur.com/0gs6yQM.png =350x100)

#### Maximum a posteriori (MAP) estimation and SLAM back-end
* Many modern approaches formulate SLAM as a maximum a posteriori estimation problem and use factor graphs to reason about the interdependence among variables.
* Unknown variable - $X$ (it contains the trajectory of the robot as a discrete set of poses, and possibly the positions of landmarks).
* Measurements - $Z$ such that $z_k = h_k(X_k) + \epsilon_k$, where $h_k(\cdot)$ is a known function (the measurement or observation model) and $\epsilon_k$ is random measurement noise.
* In MAP estimation, $X$ is estimated by computing the $X^*$ that attains the maximum of the posterior $p(X|Z)$:
$$X^* = \underset{X}{\mathrm{argmax}}\ p(X|Z) = \underset{X}{\mathrm{argmax}}\ p(Z|X)\,p(X) = \underset{X}{\mathrm{argmax}}\ p(X)\prod_{k=1}^{m}p(z_k|X_k)$$
* $p(Z|X)$ is the likelihood of the measurements given the assignment $X$, and $p(X)$ is the prior knowledge about $X$.
* Unlike the Kalman filter, MAP estimation does not require a distinction between motion and observation models (both are treated as factors).
* As maximizing the posterior is the same as minimizing the negative log posterior, the MAP estimate reduces to a non-linear least-squares problem (under Gaussian noise).
* This minimization problem is generally solved by successive linearizations, e.g. Gauss-Newton (a toy sketch is given at the end of this section).
* The factor graph representing the above SLAM problem can be structured as follows:

![](https://i.imgur.com/p4w0wpz.png =300x150)
###### Blue circles - robot poses at consecutive time steps, green circles - landmark positions, red circle - variable associated with intrinsic calibration parameters, "u" - odometry constraints, "v" - camera observations, "c" - loop closures, "p" - prior factors
* The SLAM formulation described so far is commonly referred to as maximum a posteriori estimation, factor graph optimization, graph-SLAM, full smoothing, or smoothing and mapping (SAM).
* MAP estimation has proven to be more accurate and efficient than the original approaches to SLAM based on non-linear filtering.
* The performance mismatch between MAP estimation and the EKF is minimized if the linearization point of the EKF is accurate.

#### Sensor-dependent SLAM front-end
* In practical robotics applications, it is difficult to write the sensor measurements as an analytical function of the state, as required by MAP estimation.
* Hence, a front-end that extracts relevant features from the sensor data is required before the back-end.
* For example, in vision-based SLAM the front-end extracts the pixel locations of a few distinguishable points in the environment, which are easy to model in the back-end.
* The front-end is also responsible for data association - associating each measurement with a specific landmark.
* Finally, the front-end may also take care of landmark initialization, i.e. providing an initial guess for the variables in the non-linear optimization.
* Short-term data association - responsible for associating corresponding features in consecutive sensor measurements.
* Long-term data association - associating new measurements to older landmarks (loop closure).
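To make the MAP back-end concrete, here is a minimal, self-contained toy sketch (my own example, not from the paper) of Gauss-Newton on a tiny 1D pose graph: odometry factors between consecutive poses, one loop-closure factor, and a prior to anchor the gauge. Real back-ends such as g2o, GTSAM, or Ceres apply the same normal-equation machinery over manifolds with sparse solvers.

```python
import numpy as np

# Toy 1D pose graph: 4 poses on a line, odometry factors between
# consecutive poses, one "loop closure" between x0 and x3, and a prior
# anchoring x0 (otherwise the problem has a gauge freedom).
# MAP under Gaussian noise = minimize sum_k ||z_k - h_k(X)||^2
# (unit covariances assumed here for simplicity).

# factors: (i, j, measured relative displacement z = x_j - x_i)
factors = [(0, 1, 1.0), (1, 2, 1.1), (2, 3, 0.9),  # odometry
           (0, 3, 3.05)]                           # loop closure
prior = (0, 0.0)                                   # anchor pose 0 at the origin

x = np.array([0.0, 1.2, 2.5, 3.6])  # initial guess (e.g. raw odometry)

for _ in range(10):  # Gauss-Newton iterations
    H = np.zeros((4, 4))                 # approximate Hessian J^T J
    b = np.zeros(4)                      # gradient term J^T r
    for i, j, z in factors:
        r = z - (x[j] - x[i])            # residual of this factor
        J = np.zeros(4)
        J[i], J[j] = 1.0, -1.0           # dr/dx for this factor
        H += np.outer(J, J)
        b += J * r
    H[prior[0], prior[0]] += 1.0         # prior factor on x0
    b[prior[0]] += prior[1] - x[prior[0]]
    dx = np.linalg.solve(H, -b)          # solve the normal equations
    x = x + dx
    if np.linalg.norm(dx) < 1e-9:
        break

print(x)  # MAP estimate: the loop closure reconciles the drifting odometry
```

Because this toy problem is linear, a single iteration converges; with real (nonlinear) measurement models the relinearize-and-solve loop is repeated.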
### Long-Term Autonomy 1 - Robustness
* A SLAM system can be fragile due to many factors, such as hardware or software failures, limitations of existing SLAM algorithms (e.g. performance in harsh and dynamic environments), sensor degradation, etc.
* One of the main causes of algorithmic failure is data association.
* Different sensory inputs can produce the same sensor signature, leading to false positives and hence wrong data association. This phenomenon is called **perceptual aliasing**.
* If such measurements are taken into account, the result is erroneous; if they are neglected, fewer measurements are used for estimation.
* This is made worse by unmodelled dynamics of the environment, including both short-term and seasonal changes.
* A fairly common assumption in SLAM is that the world remains unchanged as the robot moves through it (i.e. the landmarks are static).
* However, this is valid only for smaller environments without short-term dynamics (e.g. people and objects moving). In large-scale scenarios and over longer time scales, change is inevitable.
* There are two main issues in dynamic environments:
    * First, the SLAM system has to detect, discard, or track changes. Many mainstream approaches discard the dynamic portion of the scene, but some works include the dynamic elements as part of the model.
    * Second, the SLAM system has to understand when and how to update the map.
* Current SLAM systems that deal with dynamics either maintain multiple maps of the same location or use a single representation parametrized by some time-varying feature.

#### **Open Problems**
* *Failsafe SLAM and recovery*:
    * SLAM is still vulnerable in the presence of outliers, mainly because SLAM techniques are based on the optimization of non-convex costs.
    * This has two consequences:
        * The outlier-rejection outcome depends on the quality of the initial guess fed to the optimizer.
        * The system is inherently fragile: even a single outlier can greatly affect the estimate (a standard mitigation, robust cost functions, is sketched at the end of this section).
    * An ideal SLAM system should be **fail-safe and failure-aware (i.e. it should be aware of imminent failure and act so as to avoid it)**. None of the existing SLAM approaches provide these capabilities.
    * A possible approach is a tighter integration between the front-end and the back-end of SLAM, but how to achieve it is still an open question.
* *Robustness to HW failure*:
    * Sensor degradation may greatly affect the SLAM estimate.
    * Some research questions are:
        * How to detect degraded sensor operation?
        * How to adjust sensor noise statistics (covariances and biases)?
        * How to resolve conflicting information from different sensors?
* *Time-varying and deformable maps*:
    * Mainstream SLAM approaches rest on the static-world and rigid-world assumptions. However, both assumptions are flawed in the real world.
    * An ideal SLAM solution should be able to reason about the dynamics in the environment, including non-rigidity, work over long time periods generating "all terrain" maps, and be able to do so in real time.
    * The problem of non-rigid maps has been explored before, but a large-scale implementation is still largely unexplored.
* *Automatic parameter tuning*:
    * For SLAM to work out of the box, automatic tuning of the involved parameters needs to be considered.
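To illustrate why a single outlier wrecks a quadratic cost, and the robust-cost mitigation referenced above, here is a toy sketch (not the paper's specific method) of a Huber loss solved by iteratively reweighted least squares (IRLS); the threshold `delta` and the measurement values are made up for illustration.

```python
import numpy as np

# Estimate a scalar x from noisy measurements, one of which is a gross
# outlier (think: a false loop closure). The Huber loss is quadratic for
# small residuals and linear for large ones, bounding outlier influence.

z = np.array([1.0, 1.05, 0.95, 1.02, 8.0])  # last measurement is an outlier
delta = 0.5                                 # Huber threshold

x = z.mean()  # plain least squares: 2.40, badly pulled toward the outlier
for _ in range(20):
    r = z - x                                    # residuals
    # Huber weights: 1 in the quadratic region, delta/|r| beyond it
    w = delta / np.maximum(np.abs(r), delta)
    x_new = np.sum(w * z) / np.sum(w)            # weighted LS update
    if abs(x_new - x) < 1e-12:
        break
    x = x_new

print(x)  # ~1.13: the outlier's influence is bounded (the plain mean is 2.40)
```

The same reweighting idea carries over to the factor-graph setting: each factor's quadratic cost is replaced by a robust kernel, which down-weights suspicious constraints instead of letting them dominate the estimate.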
### Long-Term Autonomy 2 - Scalability
* SLAM algorithms work quite well in indoor environments, but the focus is now on increasing the period of operation of the robots and the size of the operating areas (e.g. ocean exploration, non-stop cleaning robots in cities, etc.).
* For such applications, the factor graph can grow unbounded due to the continuous exploration of new places and the increasing operation time.
* Therefore, it is important that the computation and memory complexity stay bounded.
* Two ways to reduce the complexity of factor graph optimization are discussed:
    * Sparsification methods, which trade off information loss for memory and computational efficiency.
    * **Multi-robot methods**, which split the computation among different robots. The key idea is to split the factor graph into different subgraphs, optimize them locally, and then optimize the overall graph; this family is known as submapping algorithms.
* One way is to deploy multiple robots doing SLAM and divide the large area into smaller areas, each mapped by a different robot. This can be done in a centralized fashion, where the robots send their local maps to a central station that performs inference, or in a decentralized fashion, where the robots use local communication to collaborate on a common map. (**Note**: A brief survey of the techniques used to achieve these is given in the paper.)

#### **Open Problems**
* *Map representation*:
    * How to store the map during long-term operation?
    * Even if memory is not an issue (with cloud storage), raw representations such as point clouds or volumetric maps are wasteful in terms of memory; similarly, storing feature descriptors for vision-based SLAM quickly becomes cumbersome.
* *Learning, Forgetting, Remembering*:
    * How often should the information contained in the map be updated, and how does one decide when this information becomes outdated and can be discarded?
    * When is it fine, if ever, to forget?
    * What can be forgotten and what has to be maintained?
    * Can parts of the map be "offloaded" and recalled when needed?
* *Robust distributed mapping*:
    * Outlier rejection methods have been proposed for the single-robot case, but the literature barely deals with outlier rejection in multi-robot SLAM.
    * Outlier rejection for multiple robots is difficult because the robots may not share a common reference frame, making it harder to detect and reject wrong loop closures; moreover, the robots have to detect outliers from very partial and local information.
    * One early approach had the robots verify location hypotheses using a rendezvous strategy before any information fusion.
* *Resource-constrained platforms*:
    * How to adapt existing SLAM algorithms to the case in which the robotic platforms have severe computational constraints?

### Representation 1 - Metric Map Models
* This section discusses how to model geometry in SLAM.
* Geometric modelling is easier in 2D; the two main paradigms are:
    * Landmark-based maps - model the environment as a sparse set of landmarks.
    * Occupancy grid maps - discretize the environment into cells and assign a probability of occupation to each cell (see the sketch below).
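The occupancy-grid paradigm above has a standard implementation trick: store the log-odds of occupancy so that Bayesian updates become additions. A minimal toy sketch (my own illustration; the inverse-sensor-model values 0.7/0.3 are assumed, not from the paper):

```python
import numpy as np

# Log-odds occupancy grid: each cell stores log(p/(1-p)); fusing an
# observation is then a simple addition, and probabilities are
# recovered on demand with the logistic function.

GRID = np.zeros((10, 10))       # log-odds 0 means p(occupied) = 0.5
L_OCC = np.log(0.7 / 0.3)       # update for a cell observed occupied
L_FREE = np.log(0.3 / 0.7)      # update for a cell observed free

def update_cell(grid, i, j, hit):
    """Integrate one observation of cell (i, j); hit=True if the
    sensor (e.g. a laser endpoint) reported the cell as occupied."""
    grid[i, j] += L_OCC if hit else L_FREE

def probability(grid):
    """Recover occupancy probabilities from log-odds."""
    return 1.0 - 1.0 / (1.0 + np.exp(grid))

# Simulate a few scans: cell (5, 5) repeatedly seen occupied,
# cell (2, 2) repeatedly seen free.
for _ in range(5):
    update_cell(GRID, 5, 5, hit=True)
    update_cell(GRID, 2, 2, hit=False)

print(probability(GRID)[5, 5])  # ~0.99, confidently occupied
print(probability(GRID)[2, 2])  # ~0.01, confidently free
```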
* However, modelling 3D geometry is still in its early stages. Some of the representations in use are:
* *Landmark-based sparse representations*:
    * Most SLAM methods represent the scene as a set of sparse 3D landmarks corresponding to discriminative features in the environment (e.g. lines, corners).
    * A common assumption here is that the landmarks are distinguishable, i.e. the sensor data measures some geometric aspect of a landmark and also provides a descriptor which establishes data association.
* *Low-level raw dense representations*:
    * Contrary to landmark-based representations, these focus on providing high-resolution models of the 3D geometry (more suitable for obstacle avoidance, visualization, etc.).
    * These representations are visually pleasant, but they require storing a large amount of data and only give a low-level description of the geometry.
* *Boundary and spatial-partitioning dense representations*:
    * These representations go beyond low-level primitives (e.g. points) and attempt to explicitly represent surfaces and volumes.
    * They are used in tasks like motion or footstep planning, obstacle avoidance, manipulation, and other physics-related reasoning.
    * Spatial-partitioning representations define 3D objects as a collection of non-intersecting primitives.
    * The most popular is spatial-occupancy enumeration, which decomposes the 3D space into identical cubes (voxels).
* Let us first understand how sparse (feature-based) representations compare to dense ones in SLAM.
* **Which one is best: feature-based or direct methods?**
    * Feature-based pipelines allow building accurate and robust SLAM systems with loop closure and automatic relocalization. But they depend on the availability of features in the environment, on the reliance of detection and matching thresholds, and most feature detectors are optimized for speed rather than precision.
    * On the other hand, direct methods work on the raw pixel information and exploit all the information in the image, even where gradients are small; thus they can outperform feature-based methods. However, this makes them computationally very expensive.
    * There are two alternatives to overcome these limitations:
        * Semi-dense methods - overcome the high computation requirements of direct methods by exploiting only the pixels with strong gradients.
        * Semi-direct methods - leverage both sparse features and direct methods.
* *High-level object representations*:
    * Solid representations are used here. They encode the fact that real objects are three-dimensional, not 1D (points) or 2D (surfaces), which allows associating physical notions with the objects.
    * The existing CAD and computer graphics literature helps to strengthen this kind of representation.
    * A few examples of solid representations (a toy sketch is given at the end of this section):
        * *Parametrized primitive instancing*:
            1. Relies on families of objects (cylinders, spheres, etc.). For each family, a set of parameters is defined (e.g. radius, height) that uniquely identifies a member (instance) of the family.
        * *Sweep representations*:
            1. Define a solid as the sweep of a 2D or 3D object along a trajectory through space.
            2. For example, a cylinder can be interpreted as a translational sweep of a circle along an axis orthogonal to its plane.
        * *Constructive solid geometry*:
            1. An object is stored as a tree whose leaves are primitives and whose internal nodes represent set operations.

#### **Open Problems**
* *High-level, expressive representations in SLAM*:
    * Most of the community focuses on point clouds or TSDFs to model 3D geometry. These have two main drawbacks:
        * First, they are wasteful of memory: they use many parameters to represent even a simple environment like an empty room.
        * Second, they do not provide any high-level understanding of the 3D geometry. For example, a robot will not be able to distinguish whether it is moving in a room or in a hallway from the low-level information alone.
    * Higher-level representations are therefore important: they are more compact and provide a high-level description of objects.
    * **No SLAM technique can currently build higher-level representations beyond point clouds, mesh models, surfel models, and TSDFs.**
* *Optimal representations*:
    * The optimal representation is the one that enables performing a given task while being concise and easy to create.
    * Finding a general yet tractable framework to choose the best representation for a task remains an open problem.
* *Automatic, adaptive representations*:
    * Representations that automatically adapt to the task, rather than being hand-picked by the system designer, remain an open direction.
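To illustrate parametrized primitive instancing from the list above, here is a toy Python sketch (my own illustration, not from the paper): each family of solids is a small class whose parameter vector uniquely identifies an instance, and physical notions such as volume come for free.

```python
from dataclasses import dataclass
import math

# Each family of solids (cylinder, sphere, ...) is a class; a handful of
# parameters picks out one instance. Compare this with the thousands of
# points a raw dense map would need to describe the same shape.

@dataclass
class Cylinder:
    radius: float
    height: float

    def volume(self) -> float:
        return math.pi * self.radius ** 2 * self.height

@dataclass
class Sphere:
    radius: float

    def volume(self) -> float:
        return 4.0 / 3.0 * math.pi * self.radius ** 3

# A pillar described by 2 parameters, a ball by 1.
pillar = Cylinder(radius=0.2, height=2.5)
ball = Sphere(radius=0.11)
print(pillar.volume(), ball.volume())
```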
### Representation 2 - Semantic Map Models
* This deals with associating semantic concepts with the geometric entities in a robot's surroundings.
* Semantic mapping enhances the robot's autonomy and robustness and overcomes the drawbacks of purely geometric maps.

#### **Open Problems**
* The problem of including semantic information in SLAM is still in its infancy.
* *Consistent semantic-metric fusion*:
    * The problem of consistently fusing several sources of semantic information with metric information coming at different points in time is still open.
* *Semantic mapping is much more than a categorization problem*:
    * Semantic concepts are task- and context-dependent, so assigning a fixed set of category labels is not enough.
* *Ignorance, awareness, and adaptation*:
    * Given some prior knowledge, the robot should reason about new concepts and their semantic representations in the environment.
    * For example, if a robot classified a road as drivable earlier, and there is now mud on the road, the robot should recognize that the road is no longer drivable, inform the planner about the grade of difficulty of driving on it, or adjust its classifier if it perceives another vehicle stuck in the mud.
* *Semantic-based reasoning*:
    * Robots are currently unable to efficiently and effectively localize and map continuously using the semantic concepts in the environment.

### Active SLAM
* Passive SLAM means the robot performs SLAM given the sensor data, but without acting deliberately to collect it.
* The problem of controlling the robot's motion in order to minimize the uncertainty of its map representation and localization is usually named active SLAM.
* A popular framework for active SLAM consists of selecting the best future action among a finite set of alternatives.
* It comprises three steps in general (a greedy-selection sketch is given at the end of this section):
    * The robot identifies possible locations to explore or exploit (vantage points).
    * The robot computes the utility of visiting each vantage point and selects the action with the highest utility.
    * Once the action is carried out, the robot decides whether to continue the process or terminate it.

#### Open Problems
* *Fast and accurate predictions of future states*:
    * In active SLAM, each action should reduce the uncertainty and improve the localization accuracy.
    * For this, the robot should be able to forecast the effect of future actions on the map and on the robot's localization.
    * Efficient methods for this are yet to be devised; candidates include machine learning, spectral techniques, and deep learning.
* *Enough is enough: when do you stop doing active SLAM?*:
    * Active SLAM is computationally expensive, so an obvious question is when to stop it.
    * Balancing active SLAM decisions and exogenous tasks is critical, since in most real-world tasks active SLAM is only a means to achieve an intended goal.
* *Performance guarantees*:
    * It is desirable to look for mathematical guarantees for active SLAM and for near-optimal policies.
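Here is a toy sketch of the three-step active-SLAM loop above (illustrative only: the vantage points, the gain values, and the utility `gain - alpha * distance` are hypothetical stand-ins for a proper expected-uncertainty-reduction computation).

```python
import numpy as np

# Greedy active SLAM: enumerate candidate vantage points, score each
# with a utility trading exploration gain against travel cost, execute
# the best action, then decide whether to continue.

robot = np.array([0.0, 0.0])
vantage_points = {                         # name: (position, expected info gain)
    "doorway":  (np.array([4.0, 1.0]), 0.9),
    "corner":   (np.array([1.0, 1.0]), 0.3),
    "corridor": (np.array([8.0, 0.0]), 0.8),
}

def utility(pos, gain, robot_pos, alpha=0.1):
    """Expected information gain minus a travel-cost penalty."""
    return gain - alpha * np.linalg.norm(pos - robot_pos)

while vantage_points:
    # Steps 1-2: score all candidates and pick the highest-utility one.
    best = max(vantage_points,
               key=lambda k: utility(*vantage_points[k], robot))
    u = utility(*vantage_points[best], robot)
    pos, _ = vantage_points.pop(best)
    robot = pos                            # execute the action (move there)
    print(f"visiting {best} (utility {u:.2f})")
    # Step 3: stop once no remaining candidate is worth the trip.
    if all(utility(*v, robot) < 0.2 for v in vantage_points.values()):
        break
```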
### New Frontiers - Sensors and Learning
* SLAM builds on various types of sensors, so any advancement in sensing can be leveraged to make SLAM systems more robust.

#### **New and Unconventional Sensors for SLAM**
* Sensing in robotics has mainly been dominated by lidars and conventional vision sensors.
* However, there are many alternative sensors, such as depth cameras, light-field cameras, event-based cameras, and magnetic, olfaction, and thermal sensors.
* *Range cameras*:
    * These are light-emitting depth cameras.
    * Since range cameras carry their own light source, they also work in dark and untextured scenes, which has enabled remarkable SLAM results.
* *Light-field cameras*:
    * Contrary to standard cameras, which only record the light intensity hitting each pixel, a light-field camera (also known as a plenoptic camera) records both the intensity and the direction of the light rays.
    * Light-field cameras offer several advantages over standard cameras, such as depth estimation, noise reduction, video stabilization, isolation of distractors, and specularity removal. Their optics also offer a wide aperture and a wide depth of field compared with conventional cameras.
* *Event-driven cameras*:
    * Contrary to conventional cameras, which capture the scene in frames, event-driven cameras capture the individual pixel-level intensity changes caused by movement in the environment.
    * The pixels in an event-driven camera work asynchronously and are triggered independently by the intensity changes in their surroundings.
    * As a result, they provide a continuous stream of information, unlike the discrete frames of a conventional camera.
    * They have five key advantages compared to conventional frame-based cameras: **a temporal latency of 1 ms, an update rate of up to 1 MHz, a dynamic range of up to 140 dB (vs 60-70 dB of standard cameras), a power consumption of 20 mW (vs 1.5 W of standard cameras), and very low bandwidth and storage requirements** (because only intensity changes are transmitted).
    * However, as the output is in the form of asynchronous events, traditional computer vision algorithms based on synchronous frame-based outputs are not applicable, so a paradigm shift is needed.

#### Deep Learning
* It is possible to localize the 6-DoF pose of a camera with a regression forest or a deep convolutional neural network, and to estimate the depth of a scene (in effect, the map) from a single view solely as a function of the input image (a regression sketch is given at the end of these notes).
* However, this does not mean an end to traditional SLAM systems.

#### Open Problems
* *Perceptual tool*:
    * Some research problems that have remained unsolved in the computer vision literature can now be addressed with deep learning.
    * Deep networks show more promise for connecting raw sensor data to understanding, or connecting raw sensor data to actions, than anything that has preceded them.
* *Practical deployment*:
    * Deep learning success has mostly relied on long training times on GPUs or supercomputers.
    * The challenge is to provide sufficient computing power on an embedded system.
* *Online and life-long learning*:
    * SLAM systems typically operate in an open-world scenario with continuous observation.
    * Deep networks, however, are trained on closed-world scenarios with a fixed number of classes.
    * The challenge is to harness the power of deep networks in one-shot or zero-shot scenarios for life-long learning.
* *Bootstrapping*:
    * Prior information has always enhanced the performance of SLAM systems.
    * Deep learning is capable of distilling such prior knowledge for specific tasks.
    * How best to extract and use this information is a significant open problem.
    * One particular challenge that must be solved is to characterize the uncertainty of the estimates derived from a deep network.
* **Perhaps it might one day be possible to create an end-to-end SLAM system using a deep architecture, without explicit feature modeling, data association, etc.**
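As a concrete illustration of the camera-pose regression idea from the Deep Learning subsection, here is a minimal PoseNet-style sketch (an assumption-laden toy: it uses an untrained torchvision ResNet-18 backbone and is not one of the specific networks the paper cites).

```python
import torch
import torch.nn as nn
import torchvision.models as models

# 6-DoF relocalization as regression: a CNN backbone maps one RGB image
# to a 7-vector (3D translation + unit quaternion for rotation).

class PoseRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18()        # untrained backbone for the sketch
        backbone.fc = nn.Linear(backbone.fc.in_features, 7)
        self.net = backbone

    def forward(self, img):
        out = self.net(img)
        t, q = out[:, :3], out[:, 3:]
        q = q / q.norm(dim=1, keepdim=True)  # normalize to a valid quaternion
        return t, q

model = PoseRegressor()
image = torch.randn(1, 3, 224, 224)          # dummy RGB input
translation, rotation = model(image)
print(translation.shape, rotation.shape)     # (1, 3), (1, 4)

# Training would minimize e.g. ||t - t_gt|| + beta * ||q - q_gt|| on
# images with known ground-truth poses from an existing map.
```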