# Humanoid Stepper Paper Review

### Paper List

1. Schulman, John, et al. "High-dimensional continuous control using generalized advantage estimation." arXiv preprint arXiv:1506.02438 (2015).
2. Pratt, Jerry, et al. "Capture point: A step toward humanoid push recovery." 2006 6th IEEE-RAS International Conference on Humanoid Robots. IEEE, 2006.
3. Meduri, Avadesh, Majid Khadiv, and Ludovic Righetti. "DeepQ stepper: A framework for reactive dynamic walking on uneven terrain." 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021.
4. Koolen, Twan, et al. "Capturability-based analysis and control of legged locomotion, Part 1: Theory and application to three simple gait models." The International Journal of Robotics Research 31.9 (2012): 1094-1113.
5. Pratt, Jerry, et al. "Capturability-based analysis and control of legged locomotion, Part 2: Application to M2V2, a lower-body humanoid." The International Journal of Robotics Research 31.10 (2012): 1117-1133.
6. Duan, Helei, et al. "Sim-to-real learning of footstep-constrained bipedal dynamic walking." 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022.
7. Duan, Helei, et al. "Learning Dynamic Bipedal Walking Across Stepping Stones." 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022.
8. Boussema, Chiheb, et al. "Online gait transitions and disturbance recovery for legged robots via the feasible impulse set." IEEE Robotics and Automation Letters 4.2 (2019): 1611-1618.
9. Blickhan, Reinhard. "The spring-mass model for running and hopping." Journal of Biomechanics 22.11-12 (1989): 1217-1227.
10. Vaughan, Christopher L., and Mark J. O'Malley. "Froude and the contribution of naval architecture to our understanding of bipedal locomotion." Gait & Posture 21.3 (2005): 350-362.
11. Englsberger, Johannes, et al. "Bipedal walking control based on capture point dynamics." 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2011.
12. Wu, Albert, and Hartmut Geyer. "The 3-D spring–mass model reveals a time-based deadbeat control for highly robust running and steering in uncertain environments." IEEE Transactions on Robotics 29.5 (2013): 1114-1124.
13. Wieber, Pierre-Brice. "Viability and predictive control for safe locomotion." 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2008.
14. Wensing, Patrick M., and David E. Orin. "High-speed humanoid running through control with a 3D-SLIP model." 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2013.
15. Geyer, Hartmut, Andre Seyfarth, and Reinhard Blickhan. "Compliant leg behaviour explains basic dynamics of walking and running." Proceedings of the Royal Society B: Biological Sciences 273.1603 (2006): 2861-2867.
16. Seok, Sangok, et al. "Design principles for energy-efficient legged locomotion and implementation on the MIT Cheetah robot." IEEE/ASME Transactions on Mechatronics 20.3 (2014): 1117-1129.
17. Park, Hae-Won, Patrick M. Wensing, and Sangbae Kim. "High-speed bounding with the MIT Cheetah 2: Control design and experiments." The International Journal of Robotics Research 36.2 (2017): 167-192.
18. Bledt, Gerardo, et al. "MIT Cheetah 3: Design and control of a robust, dynamic quadruped robot." 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018.
19. Katz, Benjamin, Jared Di Carlo, and Sangbae Kim. "Mini Cheetah: A platform for pushing the limits of dynamic quadruped control." 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019.

<br/>
### Schulman, John, et al. "High-dimensional continuous control using generalized advantage estimation." arXiv preprint arXiv:1506.02438 (2015).

**Problem Statement**

Policy gradient methods face two challenges:
1) The large number of samples typically required &rarr; Use value functions to substantially reduce the variance of policy gradient estimates, at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(λ).
2) The difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data &rarr; Use a trust region optimization procedure for both the policy and the value function, which are represented by neural networks.

**Preliminaries**

Actor-critic: A class of policy gradient algorithms that uses a value function rather than the empirical returns, obtaining an estimator with lower variance at the cost of introducing bias.

Policy gradient step: A step in the policy gradient direction should increase the probability of better-than-average actions and decrease the probability of worse-than-average actions.

Advantage function: By definition, A^π(s, a) = Q^π(s, a) − V^π(s); it measures whether an action is better or worse than the policy's default behavior.

TD(λ): A learning algorithm for estimating the value function.

Variance of the policy gradient estimate: The goal is to estimate the _on-policy value function_. If the policy is stochastic, the sampled action differs on every visit to a state, and therefore so do the rewards; estimating the value function from empirical rewards thus has high variance. An approximate value function, by contrast, has low variance.

Bias of the policy gradient estimate: An approximate value function gives a biased estimate (e.g., biased toward its initialization), whereas the empirical return has low bias, since it comes from on-policy rollouts themselves.

In a nutshell, empirical rewards have low bias but high variance, whereas an approximate value function has high bias but low variance.

**Contributions**

1) Generalized Advantage Estimator (GAE): A family of policy gradient estimators that significantly reduce variance while maintaining a tolerable level of bias. The generalized advantage estimator has two parameters, γ and λ, which adjust the bias-variance tradeoff.
2) Use of a trust region optimization method for the value function. This is a robust and efficient way to train a neural network value function.

**Advantage Function Estimation**

Parameter γ (gamma): Originally the discount factor, but here treated as a "_variance reduction parameter_" in an undiscounted problem. In exchange, it introduces _bias_.

Define
![](https://i.imgur.com/RoLLYVr.png)

Policy gradient estimator:
![](https://i.imgur.com/zQiAvxB.png)

Estimator of the advantage function:
![](https://i.imgur.com/tFOwyoB.png)

The generalized advantage estimator GAE(γ, λ) is defined as the exponentially-weighted average of these k-step estimators.
![](https://i.imgur.com/Q4pLnjG.png)

The generalized advantage estimator for 0 < λ < 1 makes a compromise between bias and variance, controlled by the parameter λ.
![](https://i.imgur.com/Fd0D2ay.png)
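In practice, GAE(γ, λ) reduces to a single backward pass over the TD residuals δ_t = r_t + γV(s_{t+1}) − V(s_t), using the recursion Â_t = δ_t + γλÂ_{t+1}. A minimal NumPy sketch of that recursion (the function name and episode layout are my own, not the paper's reference code):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda) for a single episode.

    rewards: length-T array of rewards r_0 .. r_{T-1}
    values:  length-(T+1) array of estimates V(s_0) .. V(s_T)
             (use V(s_T) = 0 when s_T is terminal)
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    # Backward recursion over TD residuals:
    #   delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    #   A_t     = delta_t + gamma * lambda * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

Setting λ = 0 recovers the one-step TD advantage estimate (low variance, high bias), while λ = 1 recovers the discounted empirical return minus the value baseline (high variance, low bias).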
**Value Function Estimation**

Solve the nonlinear regression problem:
![](https://i.imgur.com/5aufJIS.png)
where
![](https://i.imgur.com/y5fxyrm.png)

<br/>

### Pratt, Jerry, et al. "Capture point: A step toward humanoid push recovery." 2006 6th IEEE-RAS International Conference on Humanoid Robots. IEEE, 2006.

**Capture Point**: A point on the ground where the robot can step in order to bring itself to a complete stop.

**Capture Region**: The collection (set) of all Capture Points.

**Base of Support**: The convex hull of the foot support area.

**Capture State**: A state in which the kinetic energy of the biped is zero and can remain zero with suitable joint torques. Note that the Center of Mass must lie above the Center of Pressure in a Capture State. The vertical upright "home position" [1] is an example of a Capture State.

**Safe Feasible Trajectory**: A trajectory through state space that is consistent with the robot's dynamics, is achievable by the robot's actuators, and does not contain any states in which the robot has fallen.

**Capture Point** (formal): For a biped in state x, a Capture Point, P, is a point on the ground such that if the biped covers P (makes its Base of Support include P), either with its stance foot or by stepping to P in a single step, and then maintains its Center of Pressure on P, then there exists a Safe Feasible Trajectory leading to a Capture State.

<br/>

### Meduri, Avadesh, Majid Khadiv, and Ludovic Righetti. "DeepQ stepper: A framework for reactive dynamic walking on uneven terrain." 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021.

**Problem Statement**

Reactive stepping and push recovery for biped robots is often restricted to flat terrain because of the difficulty of computing capture regions for nonlinear dynamic models.

**Contributions**

1. Learn the 3D capture region for nonlinear dynamic models.
2. The DeepQ stepper computes optimal step locations for walking at different velocities using the 3D capture regions approximated by the action-value function.
3. The DeepQ stepper can handle non-convex terrain with obstacles, walk on restricted surfaces such as stepping stones, and recover from external disturbances, all at a constant computational cost.

<br/>

### Koolen, Twan, et al. "Capturability-based analysis and control of legged locomotion, Part 1: Theory and application to three simple gait models." The International Journal of Robotics Research 31.9 (2012): 1094-1113.

N-step capturability: the ability of a legged system to come to a stop without falling by taking N or fewer steps.
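For the simplest of the gait models analyzed in this line of work, the Linear Inverted Pendulum (LIP), the 0-step capture point has the standard closed form x + ẋ/ω₀ with ω₀ = √(g/z₀). A minimal Python sketch of that result (the function name and signature are my own, shown for one horizontal axis):

```python
import math

def lip_capture_point(x, xd, z0, g=9.81):
    """0-step (instantaneous) capture point of the Linear Inverted
    Pendulum: stepping here and keeping the Center of Pressure on
    this point brings the CoM asymptotically to rest above the foot.

    x, xd : horizontal CoM position and velocity (one axis)
    z0    : constant CoM height assumed by the LIP model
    """
    omega0 = math.sqrt(g / z0)  # LIP time constant (1/s)
    return x + xd / omega0
```

This closed form is what makes flat-ground capture regions cheap to compute, and its absence for nonlinear models is exactly the gap the DeepQ stepper above targets.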
<br/>

### Duan, Helei, et al. "Sim-to-real learning of footstep-constrained bipedal dynamic walking." 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022.

**Problem Statement**

In the real world, however, the environment often imposes constraints on the feasible footstep locations, typically identified by perception systems. Unfortunately, most RL controllers demonstrated on bipedal robots do not allow for specifying and responding to such constraints. This missing control interface greatly limits the real-world application of current RL controllers. &rarr;
1) We aim to maintain the robust and dynamic nature of learned gaits while also respecting externally imposed footstep constraints. We develop an RL formulation for training dynamic gait controllers that can respond to specified touchdown locations.
2) We use supervised learning to induce a transition model that accurately predicts the next touchdown locations the controller can achieve, given the robot's proprioceptive observations.

**Contributions**

The control policy receives command inputs that constrain relative footstep locations.

<br/>

### Duan, Helei, et al. "Learning Dynamic Bipedal Walking Across Stepping Stones." 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022.

**Problem Statement**

**Preliminaries**

**Contributions**

<br/>

### Blickhan, Reinhard. "The spring-mass model for running and hopping." Journal of Biomechanics 22.11-12 (1989): 1217-1227.

**Problem Statement**

*Spring-mass model*: a massless spring attached to a point mass. It describes the interdependency of the mechanical parameters characterizing human running and hopping as a function of speed, e.g., bouncing frequency and vertical displacement (a simulation sketch follows at the end of this section).

*Hopping at zero speed*:

**Preliminaries**

**Contributions**
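To make the model concrete, below is a minimal Python sketch of hopping in place (zero forward speed) under the spring-mass model, estimating the bouncing frequency from successive touchdowns. The parameter values, integrator, and event-detection logic are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Spring-mass hopping in place: point mass m on a massless linear
# spring of rest length L0 and stiffness k. Stance when z < L0
# (spring loaded), ballistic flight otherwise.
m, k, L0, g = 80.0, 20e3, 1.0, 9.81   # kg, N/m, m, m/s^2 (illustrative)
dt = 1e-4                             # integration step (s)

def step(z, zd):
    """One semi-implicit Euler step of the vertical dynamics."""
    spring = k * (L0 - z) if z < L0 else 0.0  # spring force only in stance
    zd = zd + (spring / m - g) * dt
    z = z + zd * dt
    return z, zd

# Drop from 5 cm above the spring rest length and record touchdowns
# (z crossing L0 from above) to estimate the hop frequency.
z, zd, t, touchdowns = L0 + 0.05, 0.0, 0.0, []
for _ in range(200_000):              # 20 s of simulated time
    z_prev = z
    z, zd = step(z, zd)
    t += dt
    if z_prev >= L0 > z:
        touchdowns.append(t)

periods = np.diff(touchdowns)
print(f"hop frequency ~ {1.0 / periods.mean():.2f} Hz")
```

With these illustrative values the stance oscillation is set by √(k/m) and the flight time by the lift-off velocity, yielding a hop frequency on the order of a few hertz, the kind of speed-dependent relationship between stiffness, frequency, and vertical displacement the paper characterizes.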