# [2-1] "Embodied Intelligence via Learning and Evolution" ###### tags: `publish` ### Information Authors: Agrim Gupta, Silvio Savarese, Surya Ganguli, Li Fei-Fei (2021) Review date: 02.17.2021 Category: "The reinforcement learning and active learning" ### Summary Evolution in the past billion years produces various morphologies of lifes. Authors try to answer three open questions in the processes of learning and evolution: (1) is there a computational framework which can learn evolved morphology given different environments and tasks efficiently, (2) does Baldwin effect exist, and (3) what is the mechanistic basis of Baldwin effect and the emergence of morphological intelligence? I do not agree with the biological interpretation, but the result of simulation is fascinating. ### Previous research There are two difficulties in computatioanl evolutionary research: (1) large number of possible combinatorial morphologies, and (2) long computational time. Therefore, limited morphological space, limited learning parameter, limited environmental senor data, and Lamarckian theory were applied. ### UNIMAL: A UNIversal aniMAL morphological design space ![](https://i.imgur.com/gMq3qbW.jpg) - Agent design: kinematic tree. (1) a sphere represents the head and the root of the tree, (2) cylinders represent the limbs of the agent. - Evolutionary strategies: three classes of mutation operations. (1) adding or deleting limbs, (2) changing the length and densities of limbs and head, and (3) changing the properties of joints between limbs. - *only the mutation keeps the center of mass lies on the sagittal plane is valid. This reduce need of learning left-right balancing.* - Learning environments: three types of environments. (1) flat terrain (FT), (2) variable terrain (VT), and (3) non prehensile manipulation in varialbe terrain (MVT). VT contains stochastically generated fields, on top of that, MVT contains an object and the agent has to carry the object to the destination. ### DERL: Computational framework for createing embodied agents - The asyncronous tournament based evolution separates the mutation process and the learning process, this avoids computatioanl burden, but keeps the diversity of morphologies. - Initialization: population of P=576 random generated agents with unique topologies. The population remains the same by keeping the top fitness descendents. - Reinforcement learning algorithm: At each step, agent receives an observation $o_t$ that does not fully specicfy the state $s_t$ of the environment, takes an action $a_t$ and it given a reward $r_t$. A policy $\pi_\theta(a_t|o_t)$ models the conditional distribution over action $a_t \in A$ given an observation $o_t \in O(s_t)$. The goal is to fina a policy which maximizes the expected cumulative reward $R=\sum^{H}_{t=0}\gamma^tr_t$ under a discount factor $\gamma \in [0,1)$ where $H$ is the horizon length. - Observations: low level egocentric proprioceptive and exteroceptive observations, such as "agent morphology, joint angles, angular velocities, readings of a velocimeter, accelerometer, gyroscope positioned at the head and touch sensors attched to the limbs and head as provided in the MuJoCo simulator" and "task specific information like terrain profile, goal location and the position of objects and obstacles", respectively. - Rewards: different to either morphology dependent reward functions or limit the design space to morphology, authors keep the reward design simple. 
- For FT and VT, at each time step $t$: - $r_t = w_xv_x-w_c ||a||^2$, $v$ is velocity in the $+x$ direction, $a$ is the input to the actuators, $w_x$ and $w_c$ are weights, which is $1$ and $0.001$ respectively. - For MVT: - $r_t = w_{ao}d_{ao}+w_{og}d_{og}-w_c||a||^2$, $d_{ao}$ is geodesic distance between the agent and the object, $d_{og}$ is geodesic distance between the object and the goal. The weight $w_{ao}=w_{og}=100$, and $w_c=0.001$. - Policy architecture: a stochastic policy $\pi_{\theta}$ where $\theta$ are the parameters of a pair of DNN. Each type of observation is encoded via a two layer MNLP with hidden dimesnions [64, 64]. The encoded observations across all types are then concatenated and futher encoded into a 64 dimensional vector, which is finally passed into a linear layer to generate the parameters of a Gaussian action policy for the policy network and discounted future returns for the critic netwrok. The size of the output layer for the policy network depends on the number of actuated joint. We use $tanh$ non-linearities everywhere, execept for the output layers. The pararmeters of the networks are optimized using Proximal Policy Optimization (PPO). - Optimization: ![](https://i.imgur.com/u8LEL7N.png) ### Successful evolution of diverse morphologies in complex environments ![](https://i.imgur.com/wQcmZKZ.jpg) - Previously, often only 1 solution and its nearby variations dominate. But DERL let the lower inital fitness ancestor still contribute a high fit descendants. - a. the fitness of top 100 agents in each of 3 evolutonary runs. - b. Even the inital rank is low, the final relative abundance can still be high, which indicates the evolved fitness, especially in MVT. - c. - e. darker means better fitness. This tells multiple lineages with high fitness descendants can originate from lower fitness ancestors. ### Environmental complexity engenders morphological intelligence ![](https://i.imgur.com/VYk9696.png) ![](https://i.imgur.com/TkJDyTT.png) - Morphological intelligence: rapidly adapt to any new tasks. - The controllers for each task are learned from scratch, so no transfer learning. - These results suggest morphologies evolved in more complex environments are more intelligent in the sense that they facilitate learning many new tasks both better and faster. ### Demonstration of a stronger form of the conjectured morphological Baldwin effect ![](https://i.imgur.com/SOWRgeB.png) - Baldwin conjectured behaviors that are intially learned over a lifetime in early generations of evolution will gradually become instinctual and potential even genetically transmitted in later generations. - This simulation is the first evidence for the existence of a morphological Baldwin effect by showing evolution selects for faster learners without any direct selection pressure for doing so. - **I am not an expert in evolutionary theory, however, I cannot totally agree with the conclusion here. From my point of view, if there is no selection, there is no evolution. Selecting parents with highest fitness score is clearly a direct selection. This is like the student with highest college entrance exam score, is likely to perform well in harder courses/tasks in college, especially the evaluation standard are the same in different tasks, such as the reward, cost of work, etc. 
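To make the "faster learners" claim concrete, here is a small proxy one could compute from logged learning curves. This is my own illustrative construction, not the paper's analysis, and the input format (a dict mapping generation to per-agent reward curves) is assumed.

```python
import numpy as np

def iterations_to_fraction(curve: np.ndarray, frac: float = 0.9) -> int:
    """First training iteration at which the curve reaches `frac` of its final reward;
    a crude proxy for how quickly an agent learns."""
    return int(np.argmax(curve >= frac * curve[-1]))

def learning_speed_by_generation(curves_by_gen: dict) -> dict:
    """curves_by_gen: {generation: [per-agent reward curves as 1-D arrays]}.
    Under a morphological Baldwin effect, the mean proxy should shrink with generation."""
    return {
        gen: float(np.mean([iterations_to_fraction(c) for c in curves]))
        for gen, curves in curves_by_gen.items()
    }
```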
Those high fitness agents, does possess some talent (features) that facilitate the learning ability and efficiency in new tasks.** ### A mechanistic underpinning for morphological intelligence and the strong Baldwin effect ![](https://i.imgur.com/mdncqWK.png) - These correlations (fig5 and fig6) suggest that energy efficiency and stability may be key physical principles that partially underpin both the evolution of morphological intelligence and the Baldwin effect. - VT/MVT agents are also more energy efficient compared to FT agents
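For completeness, a toy proxy for the energy-efficiency comparison could look like the following; the metric (actuation cost per unit distance traveled) and the input arrays are my own assumptions, not the paper's definition.

```python
import numpy as np

def actuation_cost_per_distance(actions: np.ndarray, velocities_x: np.ndarray,
                                dt: float = 0.01) -> float:
    """Sum of squared actuator inputs divided by distance covered in +x;
    a lower value suggests a more energy-efficient morphology/gait."""
    distance = float(np.sum(velocities_x) * dt)
    cost = float(np.sum(actions ** 2))
    return cost / max(distance, 1e-8)
```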