# Dynamics Learning with Cascaded Variational Inference for Multi-Step Manipulation

## Abstract Idea

The authors of this paper aim to find effective and plausible action sequences (with states in the latent space) that lead to the task goal. They introduce a two-level process to take advantage of the hierarchical structure of the action space. This two-level process includes:
a) Generation of a high-level plan that proposes subgoals in the latent space.
b) Generation of a low-level plan (actions) to reach the intermediate subgoal from the given state.

## Earlier work and their demerits

- This problem has traditionally been addressed under task and motion planning. However, the uncertainty in visual data has hindered its application. For solving multiple tasks in such environments in a data-efficient manner, methods combining deep neural networks (DNNs) and model-based planning have been proposed. However, they suffer from the curse of dimensionality of the large sampling space.
- In hierarchical RL planning, learning diverse skills for hierarchical policies and controllers is commonly done by using a different parameterization for each policy: the policy is trained on a pre-specified set of modes, with one policy learned per mode. In contrast, the authors claim that their approach uses self-supervised interactions to learn the latent representations, namely effects (subgoals in latent space) and motions (detailed motion of actions in latent space).
- For skill learning, many multi-modal imitation learning algorithms have been used to learn a diverse set of skillful actions from demonstrations. In contrast, the authors claim their approach learns the structure of the action space by determining the effect an action causes in the environment.
- Among generative models for planning, many methods model the probability distribution corresponding to a single action mode. In contrast, the proposed algorithm decouples the learning process into effect and motion codes, each of which can guide the action sampling.

## Authors' Contributions

1) Introduce Model Predictive Control (MPC) at two temporal resolutions. The authors claim that this reduces the computational burden of planning over long horizons in continuous action spaces.
2) Use a cascaded variational inference framework to learn latent space representations for the proposed planner.

## Method

The two-level process developed by the authors is:
a) Generation of a high-level plan to obtain the desired effects. Here the term "effect" is used (by abuse of terminology) for "subgoal" from hierarchical RL.
b) Generation of a low-level plan of motions to obtain actions that reach the intermediate subgoal.

### Algorithm Idea

They use two latent space models to facilitate planning. The first latent variable, namely the **effect code**, encodes the desired effect at the higher level of the planning hierarchy. To facilitate motion (low-level action generation), the authors define another latent variable, namely the **motion code**, which captures the notion of motion in an action sequence. In simple words, they sample N sequences of effect codes and search for the best sequence of subgoals to reach the goal. To compare these sequences of effect codes, the authors use an approximate reward function which returns an immediate reward value $R(s_t)$. Given the best sequence of subgoals, ranked by the approximate reward model, the authors use the action generator model to return a set of actions to reach the given subgoal.
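Below is a minimal NumPy sketch of this two-level sampling procedure. The model interfaces (`meta_dynamics`, `reward_fn`, `action_generator`, `dynamics`), the latent dimensions, and the standard-normal priors over the codes are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def plan_high_level(s0, meta_dynamics, reward_fn, num_samples=1024,
                    plan_horizon=10, code_dim=16, rng=None):
    """Sample candidate sequences of effect codes, roll them through the
    meta-dynamics model h, and keep the sequence with the best cumulative
    approximate reward R(s)."""
    rng = np.random.default_rng() if rng is None else rng
    # Effect codes drawn from a standard-normal prior p(c) (an assumption here).
    codes = rng.standard_normal((num_samples, plan_horizon, code_dim))
    returns = np.zeros(num_samples)
    subgoals = np.zeros((num_samples, plan_horizon) + s0.shape)

    states = np.repeat(s0[None], num_samples, axis=0)
    for k in range(plan_horizon):
        # h predicts the subgoal state T low-level steps ahead: s' ~ h(.|s, c).
        states = meta_dynamics(states, codes[:, k])
        subgoals[:, k] = states
        returns += reward_fn(states)   # approximate reward, shape (num_samples,)

    best = np.argmax(returns)
    return codes[best], subgoals[best]


def plan_low_level(s0, effect_code, subgoal, action_generator, dynamics,
                   num_samples=1024, motion_dim=16, T=3, rng=None):
    """Sample motion codes z, decode them (paired with the chosen effect code)
    into T-step action sequences with g, roll the dynamics model f, and return
    the action sequence whose final state is closest to the subgoal."""
    rng = np.random.default_rng() if rng is None else rng
    motions = rng.standard_normal((num_samples, motion_dim))
    # g decodes (s, c, z) into a sequence of T actions: a_{t:t+T-1} ~ g(.|s, c, z).
    actions = action_generator(s0, effect_code, motions)  # (num_samples, T, action_dim)

    states = np.repeat(s0[None], num_samples, axis=0)
    for k in range(T):
        states = dynamics(states, actions[:, k])           # s_{t+1} ~ f(.|s_t, a_t)
    distances = np.linalg.norm(states - subgoal, axis=-1)
    return actions[np.argmin(distances)]
```

In an MPC loop, `plan_high_level` would be re-run after observing the new state, and only the first T actions returned by `plan_low_level` would be executed before replanning.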
For action generation, sequences of actions that realize a desired effect code ($c_t$) are generated by choosing the optimal sequence of motion codes. These terms, effect code and motion code, are (by abuse of terminology) the latent-space counterparts of a subgoal and of a distinct T-step action sequence, respectively. The sampled motion codes are paired with the effect code and decoded into action sequences by the action generator model. The best set of actions is then determined by how close the resulting state after T timesteps is to the subgoal state.

### Terms used in paper

**Effect Code ($c \in C$)** : C is the latent effect space, where an effect code $c$ describes the desired effect of an action (i.e., a subgoal in the latent space).

**Motion Code ($z \in Z$)** : Z is the latent motion space, where $z$ defines the detailed motion of an action sequence over T time steps.

**Dynamics Model (f)** : Estimates the transition probability in the original state and action spaces: $s_{t+1} \sim f(\cdot|s_{t}, a_{t}; \theta_{f})$

**Meta-Dynamics Model (h)** : Characterizes the distribution of the subgoal state $s_{t+T}$, T time steps ahead of t. The model is conditioned on the effect code $c_{t}$ chosen at the given state $s_{t}$: $s_{t+T} \sim h(\cdot|s_{t}, c_{t}; \theta_{h})$

**Action Generator (g)** : Returns a low-level sequence of actions to reach the intermediate subgoal within T timesteps. The generator is conditioned on the motion code $z_{t}$, which defines the detailed motion of the actions: $a_{t:t+T-1} \sim g(\cdot|s_{t}, c_{t}, z_{t}; \theta_{g})$

### Terms abused in paper

- $s^{'} := s_{t+T}$
- $s^{''} := s_{t+1}$
- $a := a_{t:t+T-1}$

### Cascaded Variational Inference

They construct two inference models for the approximate posterior distributions $q_{h}(c|s, s^{'}; \phi_{h})$ and $q_{g}(z|s, a, c; \phi_{g})$, both as inference neural networks with trainable parameters. Using the ELBO, they turn the whole problem into a minimization problem over the parameters $\theta_{h}, \theta_{g}, \phi_{h}, \phi_{g}$, where:

#### Variational bound ($J_{h}$) on the meta-dynamics likelihood $p(s^{'}|s; \theta_{h})$

$\log p(s^{'}|s; \theta_{h}) \geq \mathbb{E}_{q_{h}(c|s, s^{'}; \phi_{h})}[\log h(s^{'}|s, c; \theta_{h})] - D_{KL}(q_{h}(c|s, s^{'};\phi_{h}) \,\|\, p(c)) = -J_{h}$

#### Variational bound on $p(a|s, s^{'}; \theta_{g})$

It is difficult to directly maximize the marginal probability $p(a|s, s^{'}; \theta_{g})$, because the latent variable c is intractable (c is being learned during the process). Instead, they first assume that c is given and derive a variational bound $J_{g|c}$ on the conditional likelihood $p(a|s, c; \theta_{g})$:

$\log p(a|s, c; \theta_{g}) \geq \mathbb{E}_{q_{g}(z|s, a, c; \phi_{g})}[\log g(a|s, c, z; \theta_{g})] - D_{KL}(q_{g}(z|s, a, c;\phi_{g}) \,\|\, p(z)) = -J_{g|c}$

Marginalizing over c with the inference model $q_{h}$:

$p(a|s, s^{'}; \theta_{g}) = \mathbb{E}_{q_{h}(c|s, s^{'}; \phi_{h})}[p(a|s, c; \theta_{g})]$

Using the ELBO for $p(a|s, c; \theta_{g})$, this gives:

$p(a|s, s^{'}; \theta_{g}) > - \mathbb{E}_{q_{h}(c|s, s^{'}; \phi_{h})}[J_{g|c}] = -J_{g}$ (the authors have $\geq$ in equation 7 of the paper)

The overall objective to be minimized is $J_{g} + J_{h}$.

---

#### New Additions

**Note** : $x > \log x$ for $x > 0$, hence $p(a|s, c; \theta_{g}) > \log p(a|s, c; \theta_{g}) \geq -J_{g|c}$. Shouldn't there be **$>$** instead of **$\geq$** in equation 7 (between the two expectations) of the paper?
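The following is a minimal PyTorch-style sketch of how the combined objective $J_{h} + J_{g}$ could be computed on one batch, assuming Gaussian inference networks and unit-variance Gaussian decoders (so the reconstruction terms reduce to squared errors up to constants). All module names and shapes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def gaussian_kl(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions.
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=-1)

def cavin_losses(s, s_next_T, a_seq, q_h, h, q_g, g):
    """Compute the two variational bounds J_h and J_g on a batch.

    Assumed interfaces (hypothetical):
      q_h(s, s')   -> (mu_c, logvar_c)   inference net for the effect code
      h(s, c)      -> predicted subgoal state s' (meta-dynamics decoder)
      q_g(s, a, c) -> (mu_z, logvar_z)   inference net for the motion code
      g(s, c, z)   -> predicted T-step action sequence (action generator)
    Unit-variance Gaussian decoders are assumed, so the log-likelihood
    terms become mean-squared errors.
    """
    # --- J_h: bound on log p(s'|s) --------------------------------------
    mu_c, logvar_c = q_h(s, s_next_T)
    c = mu_c + torch.randn_like(mu_c) * (0.5 * logvar_c).exp()   # reparameterize
    recon_h = F.mse_loss(h(s, c), s_next_T, reduction="none").sum(dim=-1)
    J_h = recon_h + gaussian_kl(mu_c, logvar_c)

    # --- J_g: bound on log p(a|s, s'), with c sampled from q_h ----------
    mu_z, logvar_z = q_g(s, a_seq, c)
    z = mu_z + torch.randn_like(mu_z) * (0.5 * logvar_z).exp()
    recon_g = F.mse_loss(g(s, c, z), a_seq, reduction="none").flatten(1).sum(dim=-1)
    J_g = recon_g + gaussian_kl(mu_z, logvar_z)

    # Overall objective to minimize: J_h + J_g, averaged over the batch.
    return (J_h + J_g).mean()
```

Note how the cascade appears in the code: the effect code c used in the action-generator bound is sampled from $q_{h}$, which matches $J_{g} = \mathbb{E}_{q_{h}}[J_{g|c}]$.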
## Experiments

The authors concentrate on evaluating their algorithm on three specific points:
a) Performance on different multi-step tasks
b) Performance across varying scene complexity
c) The kind of robot behaviour the algorithm produces

### Experimental Setup

The setup includes:
a) A 7-DOF Sawyer robot arm
b) A table surface
c) A Kinect2 depth sensor

Objects (at most 5) for the environment are drawn from a subset of the YCB dataset.

![image alt](https://i.imgur.com/4isH90N.png "Experimental Setup (Ref : Discussed Paper)" =250x250)

Figure 1 : Experimental Setup

### State Space

The observation for the environment is a segmented point cloud of dimension m x n x 3 in 3D space. Here m is the number of movable objects and n = 256 is the fixed number of points sampled on each object. The state is composed of m x 3 object centers and m x 64 geometric features, both computed from the input segmented point cloud; the geometric features are extracted with [PointNet](https://arxiv.org/abs/1612.00593). Each push is a straight-line motion with a maximum moving distance of 0.1 meters in the 2D plane. The planning horizon is 30 steps and each episode terminates after 60 steps. For replanning, the robot arm moves out of the camera view to take an image every T = 3 timesteps.

### Tasks

Tasks are divided into three categories:
a) Clearing
b) Insertion
c) Crossing

For each task, dense and sparse reward variants are defined. The dense reward poses an easier challenge by providing intermediate reward signals, such as the Euclidean distance to the goal. The sparse reward setting, comparatively tougher, requires interacting with different objects in a strategic manner.
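To make the distinction concrete, here is a minimal sketch of what dense and sparse rewards for the Clearing task might look like. The paper does not spell out these exact formulas, so the workspace bounds, goal location, and function names are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical workspace rectangle for the Clearing task (min_xy, max_xy), in meters.
WORKSPACE = np.array([[0.0, 0.0], [0.6, 0.6]])

def dense_reward(object_centers, goal_xy):
    """Intermediate signal: negative mean Euclidean distance of the
    m object centers (an m x 3 array) to a goal location in the plane."""
    dists = np.linalg.norm(object_centers[:, :2] - goal_xy, axis=-1)
    return -float(dists.mean())

def sparse_reward(object_centers):
    """Signal only on task completion: 1 once every object has been
    pushed outside the workspace rectangle, 0 otherwise."""
    xy = object_centers[:, :2]
    inside = np.all((xy >= WORKSPACE[0]) & (xy <= WORKSPACE[1]), axis=-1)
    return float(not inside.any())
```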
### Quantitative Comparisons

#### Task performance in dense and sparse reward settings

Each method is evaluated over 1000 episodes (with 1024 samples in every planning step) under the same random seeds. The authors discuss performance for both the dense and the sparse reward settings.

##### Dense Reward Setting

Quantitatively, the authors claim a significant increase in performance: a margin of 10% in Clearing, 11.7% in Insertion, and 13.9% in Crossing over the second-best available method. It should be noted that the algorithm is capable of longer-term planning, given the magnitude of the performance increase on the Insertion and Crossing tasks, which require long-term planning for success.

##### Sparse Reward Setting

Quantitatively, in the sparse reward setting the performance increase is 20.1% in Clearing, 14.6% in Insertion, and 11.1% in Crossing over the second-best available method. These results show that the algorithm performs well under sparse rewards by exploiting its hierarchical structure. The authors claim that the algorithm effectively rules out unsuccessful effects (subgoals in latent space) in the high-level planning process and finds the sequences of states that lead to the task goal; the quantitative results they present support this argument. The authors also point out that only a small fraction of the actions in the large search space is useful for the task; having to search this huge action space directly has been a major flaw of earlier methods.

#### Performance across number of objects

The authors note that increasing the number of objects increases task complexity and degrades the performance of all methods. Their method drops by 9.4% on the Clearing task when the number of objects is increased from 1 to 5, compared to drops of 51.1%, 53.3%, and 70.2% for the other baselines.

#### Performance in the real world

The inputs to the function approximators are point clouds, which have a smaller reality gap. The authors evaluate their algorithm in a real-world robotic setting, reporting success rates of 93.3% in Clearing, 73.3% in Insertion, and 80.0% in Crossing.

### Qualitative Results

The authors studied the strategies planned by the algorithm and observed that it produces a diverse set of strategies for the given tasks.

![](https://i.imgur.com/qXF7ctc.png "Experimental Results (Ref : Discussed Paper)")

The results presented in the paper are compared against MPC, [CVAE-MPC](https://arxiv.org/abs/1703.02018), and [SeCTAR](https://arxiv.org/abs/1806.02813).