AAAI Presentation
====

## P2
Navigation is at the core of modern robotic systems and is applied in many fields, such as intelligent manufacturing and self-driving. The classic visual navigation framework can be divided into two parts. First, in the perception part, vision sensors capture images that serve as the input of a Simultaneous Localization And Mapping (SLAM) algorithm, which reconstructs a structural map and estimates the camera pose. Then, the pose and map are sent to the motion control part to extract a feasible path and compute the control signal.

## P3
However, the structural maps built by SLAM lack abstract semantic information and are not suitable for advanced navigation tasks without additional information. Consider, for example, the task of collecting items with a specific appearance, as in the demonstration video.

## P4
With the progress of deep reinforcement learning, an intelligent agent can learn control directly from observations. This lets the agent skip the perception part and extract the control signal straight from what it sees. However, the lack of surrounding information makes it hard for the agent to learn a good policy.

## P5
To solve this problem, some research explores applying memory architectures to deep reinforcement learning models, for example, maintaining a memory and adopting an attention mechanism to fuse the memory features. Although these models improve the performance of RL agents on spatial tasks, they still have the following problems. First, the learned memory does not model spatial relations. Second, the learned memory is tied to a specific task, so it requires retraining for each new task even if the scene is the same.

## P6
It is more appropriate to construct the memory from features that are independent of the task. This concept is related to representation learning and unsupervised learning in 3D space. In recent years, neural scene representation models have been proposed to extract implicit representations for novel view rendering. The figure illustrates the concept of one such model, the Generative Query Network (GQN). The representation network takes observation images and the corresponding camera poses as input and extracts implicit scene codes. The scene codes are then taken as the input of a generation network to render images from novel views. The experimental results of GQN show that scene codes are beneficial for RL models in spatially related tasks.

## P7
In this work, our basic idea is to combine a neural scene representation model with deep reinforcement learning to achieve advanced navigation tasks. We design a two-step learning framework. First, we train a neural scene representation model to learn prior knowledge of scene structures and textures. Then, the implicit scene codes extracted by the representation model are taken as the input state of deep reinforcement learning. The neural scene representation model corresponds to the perception part of the classic navigation framework, and the DRL model to the motion control part. We can train several DRL models for different tasks while reusing the same representation network.

## P8
To support advanced navigation tasks, a neural scene representation model should satisfy the following properties. First, generalization ability: the model should adapt to scenes with structural and texture variations rather than only memorizing a single scene. Second, memory expansion ability: the representation should be expandable to store information about new areas rather than only memorizing fixed-size scenes. Third, real-time inference: the model should infer the scene representation from observations immediately rather than through offline optimization.

## P9
Our previous work, STR-GQN, is one of the state-of-the-art extensions of the Generative Query Network. Based on the GQN architecture, STR-GQN adopts a spatial transformation routing mechanism to model the message passing between image features and memory cells. The model can adapt to scenes with a certain degree of variation in structure and texture, and can directly infer the scene representation through a forward pass of the encoder network. However, the memory size of STR-GQN is fixed, which means it lacks memory expansion ability.

## P10
The other state-of-the-art neural scene representation model is the Neural Radiance Field (NeRF). NeRF learns the mapping from a position and viewing direction to a color and density. By integrating the color along each ray-traced line, we obtain the rendered result for each pixel. The model can easily absorb information about a new area by adding new training samples and re-training the network, but the re-training process is hard to run in real time. Furthermore, NeRF can only memorize a single scene and fails to generalize to different scenes.
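For reference, NeRF renders each pixel by the standard volume rendering integral along the camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$:

$$
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
$$

Because the density $\sigma$ and color $\mathbf{c}$ live in the weights of a scene-specific network, storing a new area means re-optimizing those weights, which is exactly what rules out real-time memory updates.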
## P11
In this work, we propose spatially aware memory control to handle the information transformation between an expandable memory and 2D image features.

## P12
Each memory cell is composed of a key and a value: the key represents a 3D location, and the value stores the local abstract features of that location. The memory control process simulates camera projection and back-projection, yet it requires neither camera intrinsic parameters nor depth supervision.

## P13
We name our proposed network architecture the Scene Memory Network (SMN). The computation of the proposed SMN has two parts: memory saving and memory loading. In the saving part, the model takes an observation image and its pose as input and updates the memory cells. In the loading part, the model takes a query pose and the memory cells as input and renders the query image.

## P14
Let us take a closer look at the memory saving process. The encoder first extracts the 2D image features f_o. The memory controller takes the memory keys as input and computes the relation matrix R and the memory mask m. The relation matrix represents the relation between each pixel feature and each memory cell, and the memory mask represents the update scale of each memory cell. The relation matrix is normalized by a softmax operation to obtain the attention matrix. Then, the image features are multiplied by the attention matrix and scaled by the memory mask to obtain the writing feature f_write. The updated memory values are computed by adding the writing feature to the current memory values.

## P15
This figure shows the details of the memory controller. Each memory key is first transformed into the camera space and then sent to an embedding network, net_key, which extracts the embedding of each 3D location together with the memory mask. Each 2D pixel position of the image feature is likewise sent to an embedding network, net_pos, to extract the embedding of each 2D position. The relation matrix is then computed as the inner product of the 3D location embeddings and the 2D pixel position embeddings.
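To make P14 and P15 concrete, here is a minimal PyTorch sketch of the controller and the saving step. The names net_key and net_pos come from the slides; the tensor shapes, layer widths, the sigmoid on the mask, and the softmax axis are assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialMemoryController(nn.Module):
    """Sketch of the memory controller (P15); details are illustrative."""

    def __init__(self, emb_dim=32):
        super().__init__()
        # net_key: embeds each 3D memory key (already transformed into
        # camera space) and predicts the scalar memory mask per cell.
        self.net_key = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, emb_dim + 1))
        # net_pos: embeds each 2D pixel position of the feature map.
        self.net_pos = nn.Sequential(
            nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, emb_dim))

    def forward(self, keys_cam, pix_pos):
        # keys_cam: (M, 3) memory keys in camera space
        # pix_pos:  (P, 2) normalized 2D positions of the feature map
        out = self.net_key(keys_cam)
        key_emb, mask = out[:, :-1], torch.sigmoid(out[:, -1:])  # (M, E), (M, 1)
        pos_emb = self.net_pos(pix_pos)                          # (P, E)
        R = key_emb @ pos_emb.t()  # relation matrix (M, P) via inner product
        return R, mask


def memory_save(values, f_o, R, mask):
    # values: (M, C) current memory values; f_o: (P, C) image features.
    # We assume the saving softmax runs over pixels, so each memory cell
    # gathers the image features that project onto it.
    A = F.softmax(R, dim=1)     # attention matrix (M, P)
    f_write = mask * (A @ f_o)  # writing feature (M, C), scaled by the mask
    return values + f_write     # additive update of the memory values
```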
## P16
The memory loading process is the inverse of the memory saving process and shares the same memory controller. The attention matrix for loading is computed by normalizing the relation matrix along a different dimension than in the saving process. A sigmoid function is applied to the memory values to obtain the reading feature f_read. The reading feature is first scaled by the memory mask and then multiplied by the attention matrix to obtain the image features of the query view. These image features are taken as the input of a generative decoder to render the image of the novel view.
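A matching sketch of the loading pass, reusing the controller output from the saving sketch above; here the softmax is assumed to run over memory cells instead of pixels:

```python
import torch
import torch.nn.functional as F

def memory_load(values, R, mask):
    # values: (M, C) memory values; R: (M, P) relation matrix;
    # mask: (M, 1) memory mask, all produced by the shared controller.
    A = F.softmax(R, dim=0)                # normalize over memory cells
    f_read = torch.sigmoid(values) * mask  # reading feature, scaled by mask
    f_img = A.t() @ f_read                 # query-view image features (P, C)
    return f_img                           # input to the generative decoder
```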
## P17
After reaching a new area, the agent can easily achieve memory expansion by uniformly sampling new memory keys in that area and initializing their memory values to zero.

## P18
We evaluate the proposed Scene Memory Network on two datasets. The first is the Rooms Free Camera (RFC) dataset proposed with the Generative Query Network. The scenes in RFC contain one to three objects in a square room of fixed size, and the camera is placed on the xy-plane with 3 degrees of freedom. The other is a maze dataset generated with our proposed 3D maze generation framework, Orario3D. Three sub-datasets are used to evaluate generalization across spatial scales. The first is maze-local, in which the training and testing data are collected in a 6x6 area of 11x11 mazes. The second is maze-base, in which the training and testing data are collected over the entire 11x11 mazes. The third is maze-large, in which the training data are collected in 11x11 mazes and the testing data in 17x17 mazes.

## P19
This page shows the rendering evaluation of the proposed Scene Memory Network. The top table reports the root mean square rendering error of each method; only our Scene Memory Network achieves good performance in both small and large scenes. The figure shows the rendering results of the different methods. Images with blue frames are observation images, and the blue circles on the map mark the observation poses. Images with green frames and the green circles denote the rendered images of the different methods and the rendering poses, respectively. We show two rendering results for each maze. On maze-base, STR-GQN and our Scene Memory Network perform similarly, while on maze-large, STR-GQN fails to render views that exceed the scene size of the training data. Only the proposed Scene Memory Network renders reasonable results.

## P20
This page illustrates the visualization of memory control. We randomly sample memory cells on the xy-plane and observe the distribution of the relation matrix and the memory mask. The memory mask is the update scale of each memory cell; the left figure shows that the magnitude of each memory mask is related to the viewing frustum of the camera. The relation matrix represents the relation between 2D pixel positions and 3D locations. For each memory cell, we find the pixel position with the largest relation value and color the cell according to the x-coordinate of that pixel. We can observe that dots of the same color line up along ray-traced lines. These visualization results show that the proposed spatially aware memory controller does learn the concept of the camera model.

## P21
We demonstrate real-time memory construction and the visualization of the memory. We take the first 3 channels of each memory value as the RGB color of a dot and plot the dot at the xy-coordinates of its memory key. The images from left to right are the rendering result, the observation image, the map, and the memory, respectively.

## P22
We can observe that the distribution of the constructed memory reveals the scene structure.

## P23
To combine the proposed Scene Memory Network with a deep reinforcement learning model, we apply a multi-head attention module to fuse the information in the scene memory. The memory keys, memory values, and a learnable vector are taken as the input of the multi-head attention module to produce the attention feature f_att. The attention feature is concatenated with the observation image feature and taken as the input of the Q-network to estimate the action values. We apply double DQN with a dueling network as the core reinforcement learning algorithm.

## P24
However, we find that the agent easily gets trapped in a local region and learns a poor policy. The reason is that the targets are often invisible due to occlusion; without any hint of a target, it is difficult for the agent to obtain reward, which leads to the failure of learning. To solve this problem, we design a curiosity-based reward that guides the agent to explore the scene. We apply a sigmoid operation to transform the memory values into a probability form and then compute the entropy difference between two timesteps to evaluate the degree of exploration (see the sketch at the end of this script).

## P25
We evaluate the proposed RL agent on an item-collection task. The agent receives 1 point for each red ball it collects, and fifteen red balls are placed in each scene. We compare the proposed method with the following RL agents. DQN is the original RL algorithm without the memory model and the curiosity reward. DQN+Epi utilizes the state-of-the-art episodic exploration reward proposed in Never Give Up. DQN+Ent utilizes our proposed entropy-based curiosity reward but no memory model. SMN-DQN utilizes the memory model but no curiosity reward. SMN-DQN+Ent is our proposed method with both the memory model and the curiosity reward. We can observe that both the memory model and the curiosity reward improve the performance of the agent, which outperforms the state-of-the-art episodic memory agents.

## P26
We demonstrate the behavior of the agents with and without the different elements of the proposed method. We find that without the memory model or the curiosity reward, the agent easily gets trapped in a local region and obtains a worse score.
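Below is the sketch referenced in P24. It is a minimal reading of the slide, not the exact implementation: treating each sigmoid-transformed memory channel as a Bernoulli probability, and rewarding the drop in total entropy between timesteps, are our assumptions.

```python
import torch

def memory_entropy(values, eps=1e-8):
    # Transform memory values into probability form (as in P24) and
    # sum the Bernoulli entropy over all cells and channels.
    p = torch.sigmoid(values)
    h = -(p * (p + eps).log() + (1 - p) * (1 - p + eps).log())
    return h.sum()

def curiosity_reward(values_prev, values_curr):
    # Zero-initialized cells sit at p = 0.5, i.e. maximum entropy.
    # Writing new observations pushes their values away from zero, so
    # exploration appears as an entropy drop between two timesteps
    # (the sign convention here is an assumption).
    return memory_entropy(values_prev) - memory_entropy(values_curr)
```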