# Subject
Survey the different types of control methods for humanoid robots and build a complete baseline reinforcement-learning-based control architecture; use this architecture to study different combinations of reward-function weights; and, on this basis, add a control framework that integrates model-based trajectory planning with reinforcement learning to reduce the energy consumption of humanoid walking.
# Paper Review
## Model-based
Optimization-Based Control for Dynamic Legged Robots
: IEEE Transactions on Robotics, 2024
### ++Model Predictive Control++ (MPC)
Terrain-adaptive, ALIP-based bipedal locomotion controller via model predictive control and virtual constraints
: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
MPC for Humanoid Gait Generation: Stability and Feasibility
: IEEE Transactions on Robotics, vol. 36, no. 4, pp. 1171-1188, Aug. 2020
Tailoring solution accuracy for fast whole-body model predictive control of legged robots.
: IEEE Robotics and Automation Letters (2024)
Inverse-dynamics MPC via nullspace resolution
: IEEE Transactions on Robotics (2023)
Animal Motions on Legged Robots Using Nonlinear Model Predictive Control
: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
### ++Whole Body Control++ (WBC)
Dynamic Locomotion For Passive-Ankle Biped Robots And Humanoids Using Whole-Body Locomotion Control
: The International Journal of Robotics Research, 2020
Control and evaluation of a humanoid robot with rolling contact joints on its lower body
: Frontiers in Robotics and AI, 2023
Versatile locomotion planning and control for humanoid robots
: Frontiers in Robotics and AI, 2021, 8
### ++Trajectory Optimization++
Beyond inverted pendulums: Task-optimal simple models of legged locomotion
: IEEE Transactions on Robotics, 2024
### ++Multi-Contact Motion Planning++
A multicontact motion planning and control strategy for physical interaction tasks using a humanoid robot
: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
### ++MPPI++
Full-Order Sampling-Based MPC for Torque-Level Locomotion Control via Diffusion-Style Annealing
: arXiv preprint arXiv:2409.15610, 2024.
### Optimization Method Comparison
| | Trajectory Optimization | Model Predictive Control |
| -------- | -------- | -------- |
| **Time horizon** | Motion trajectory over the entire task | Short-horizon control outputs (e.g., within a few seconds or a few steps) |
| **Adaptation to environment changes** | Does not adapt; requires re-planning | Adapts; control outputs are updated online |
| **Computation time** | Computed offline; no online time cost | High-dimensional optimization solved online; larger time cost |
### Model Comparison
| | Linear Inverted Pendulum Model | Single Rigid Body Dynamics | Centroidal Dynamics | Whole-Body Dynamics |
|---------------------|-----------------------------|------------------------------|-------------------------------|-------------------------------|
| **Model complexity** | Simple; CoM motion only | Moderate; CoM plus approximate angular momentum | Medium; CoM plus total angular momentum | Very high; full multi-rigid-body model |
| **Model fidelity** | Low; ignores height variation and torques | Medium to high; captures part of the dynamics | High; captures multi-contact and angular momentum effects | Very high; accurate whole-body dynamics |
| **Adaptability** | Low; flat-ground walking only | Medium; some complex motions | High; multi-contact and rough terrain | Very high; almost unrestricted |
| **Hardware requirements** | Low; simple sensing | Medium; requires IMU and contact-force signals | High; requires accurate sensing | Very high; whole-body sensing and high-performance computing |
| **Multi-contact support** | Poor; single/double support only | Limited multi-contact | Good; suitable for multi-point contact | Excellent; handles multiple simultaneous contacts of different types |
| **Typical applications** | Gait generation, simple balancing | Running, jumping, fast motions | Multi-contact motion, complex gaits | Whole-body MPC, highly dynamic motion planning |
## RL
### ++Motion Planning++
Learning bipedal walking on planned footsteps for humanoid robots
: 2022 IEEE-RAS 21st International Conference on Humanoid Robots (Humanoids)
Real-world humanoid locomotion with reinforcement learning
: Science Robotics, 2024
Reinforcement learning for versatile, dynamic, and robust bipedal locomotion control.
: The International Journal of Robotics Research (2024)
Reinforcement Learning for Robust Parameterized Locomotion Control of Bipedal Robots
: 2021 IEEE International Conference on Robotics and Automation (ICRA)
### ++Mimic/Trajectory Tracking++
Expressive whole-body control for humanoid robots.
: arXiv preprint arXiv:2402.16796, 2024.
Learning locomotion skills for cassie: Iterative design and sim-to-real
: Conference on Robot Learning. PMLR, 2020.
ASAP: Aligning simulation and real-world physics for learning agile humanoid whole-body skills
: arXiv preprint arXiv:2502.01143, 2025.
Adversarial motion priors make good substitutes for complex reward functions
: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
HumanMimic: Learning natural locomotion and transitions for humanoid robot via Wasserstein adversarial imitation
: 2024 IEEE International Conference on Robotics and Automation (ICRA).
Learning agile robotic locomotion skills by imitating animals
: arXiv preprint arXiv:2004.00784 (2020)
## Model-based + RL
Survey of Model-Based Reinforcement Learning: Applications on Robotics
: Journal of Intelligent & Robotic Systems, 2017
Optimizing bipedal maneuvers of single rigid-body models for reinforcement learning
: 2022 IEEE-RAS 21st International Conference on Humanoid Robots.
Reinforcement learning for robust parameterized locomotion control of bipedal robots
: 2021 IEEE International Conference on Robotics and Automation (ICRA).
### ++RL-MPC++
RL-augmented MPC Framework for Agile and Robust Bipedal Footstep Locomotion Planning and Control
: 2024 IEEE-RAS 23rd International Conference on Humanoid Robots (Humanoids)
Learning agile locomotion and adaptive behaviors via rl-augmented mpc
: 2024 IEEE International Conference on Robotics and Automation (ICRA).
RL + Model-Based Control: Using On-Demand Optimal Control to Learn Versatile Legged Locomotion
: IEEE Robotics and Automation Letters, Oct. 2023.
### ++RL + Instantaneous Capture Point++ (ICP)
Integrating Model-Based Footstep Planning with Model-Free Reinforcement Learning for Dynamic Legged Locomotion
: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
### ++RL-WBC++
Run Like a Dog: Learning Based Whole-Body Control Framework for Quadruped Gait Style Transfer
: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
### ++RL + Reduced-Order Model++
Reinforcement Learning for Reduced-order Models of Legged Robots
: 2024 IEEE International Conference on Robotics and Automation (ICRA)
## Passive Dynamic Walking
Passive-Based Walking Robot
: IEEE Robotics & Automation Magazine, June 2007
Controlled symmetries and passive walking
: IEEE Transactions on Automatic Control, July 2005
Adaptive Passive Biped Dynamic Walking on Unknown Uneven Terrain
: 2024 IEEE International Conference on Robotics and Automation (ICRA)
Virtual Bipedal Passive Walking Control Based on Target Dynamics
: 2024 IEEE International Conference on Mechatronics and Automation (ICMA)
## Future Works
### LLM with humanoid
- "Prompt, Plan, Perform: LLM-based Humanoid Control via Quantized Imitation Learning," 2024 ICRA
### Sim2real
- "Sim-to-Real Policy and Reward Transfer with Adaptive Forward Dynamics Model, 2023 ICRA
- Sim-to-real learning of all common bipedal gaits via periodic reward composition, 2021 ICRA
- "Sim-to-real learning of footstep-constrained bipedal dynamic walking." 2022 ICRA
# Method Comparison
## Model-based
- **Advantage**
1. The cost function can be tuned intuitively to shape the overall system response, enabling more precise control objectives and motion plans.
2. When the model is accurate and the operating conditions are well defined, model-based methods usually outperform RL in stability and responsiveness, making them well suited to applications that demand high predictability and safety.
- **Disadvantage**
1. They rely on simplified or idealized physical models and assumptions, so any mismatch between the model and the real system degrades accuracy.
2. In complex environments or highly dynamic scenarios (e.g., friction parameters that are hard to model accurately), model-based methods may not be reliably applicable.
3. Contact forces and nonlinear dynamics are computationally demanding, so computation latency can compromise real-time performance and stability in real deployments.
---
## Model-Free (RL)
- **Advantage**
1. Does not require an accurate physical model, so it can handle systems that are hard to model, unknown, or changing over time.
2. Highly adaptive: it learns complex behaviors through repeated interaction with the environment and can handle stochastic, nonlinear systems.
3. Greatly reduces the need for detailed dynamics modeling and system design; end-to-end learning simplifies controller design.
- **Disadvantage**
1. Requires a large number of training samples, which is usually impractical to collect directly in the real world, so sim-to-real transfer techniques are typically relied upon.
2. Simplified actuator models in simulation make RL policies prone to behaviors that are hard to reproduce on hardware (e.g., jitter, excessive output).
3. Performance can be unstable in situations not seen during training, and there are no safety guarantees.
4. Policies learned from scratch tend to produce unnatural motions and depend heavily on reward design.
5. Training is tedious and time-consuming, and the resulting policy depends strongly on the training setup, limiting generalization.
---
## Model-based + Model-Free
- **Advantage**
1. Combines the stability of model-based methods with the adaptability of reinforcement learning, bridging the gap between simplified models and the real, complex system.
2. Compared with purely model-based methods, it achieves better stability and robustness on difficult terrain such as slippery or deformable ground.
3. Reference trajectories generated by simplified models can serve as the core of the reward design, accelerating learning and generalizing to the full robot model.
- **Disadvantage**
1. The complexity of integrating the model with the learned policy places higher demands on the system architecture, control strategy, and reward design.
2. Performance depends on the model, on how it is combined with the policy, and on the overall architecture, which can increase implementation difficulty and development cost.
# Control Framework
## Basic RL Control Loop

A hierarchical control loop is used. In the high-level loop, the reinforcement learning policy outputs actions, namely joint target positions, at 50 Hz. The low-level loop receives the joint target positions from the policy, and a PD controller converts them into torque commands that are applied to the robot model.
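A minimal sketch of this hierarchical loop, assuming NumPy; the gain values, the number of low-level substeps, and the `robot` interface are illustrative assumptions, not the actual implementation.

```python
import numpy as np

KP = 80.0   # proportional gain (assumed value)
KD = 2.0    # derivative gain (assumed value)

def pd_torque(q_target: np.ndarray, q: np.ndarray, qd: np.ndarray) -> np.ndarray:
    """Low-level step: convert joint position targets into torque commands (zero target velocity)."""
    return KP * (q_target - q) - KD * qd

# One high-level policy step at 50 Hz spans several low-level PD steps (e.g., 4 at 200 Hz):
# q_target = policy(obs)                      # high-level action: joint target positions
# for _ in range(4):
#     tau = pd_torque(q_target, robot.q, robot.qd)
#     robot.apply_torque(tau)                 # hypothetical simulator/robot interface
```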
#### Observation Space $\mathbf{o}_t$
$$ \mathbf{o}_t = [\, \mathbf{v}_{base},\ \boldsymbol{\omega}_{base},\ \mathbf{g}_{proj},\ \mathbf{v}_{cmd},\ \mathbf{q},\ \dot{\mathbf{q}},\ \mathbf{a}_{prev}\,]. $$
where $\mathbf{v}_{base}$ and $\boldsymbol{\omega}_{base}$ are the linear and angular velocities of the base in the world frame, $\mathbf{g}_{proj}$ is the gravity vector projected into the base frame, $\mathbf{v}_{cmd}$ is the external velocity command, $\mathbf{q}$ and $\dot{\mathbf{q}}$ are the joint positions and velocities, and $\mathbf{a}_{prev}$ is the policy's action output at the previous time step.
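For concreteness, a minimal sketch of how this observation vector could be assembled; the example dimensions (a robot with 12 actuated joints) are an assumption for illustration only.

```python
import numpy as np

def build_observation(v_base, w_base, g_proj, v_cmd, q, qd, a_prev):
    """Concatenate the observation terms defined above into a single vector o_t."""
    return np.concatenate([v_base, w_base, g_proj, v_cmd, q, qd, a_prev])

# Assumed dimensions for a 12-joint robot:
# v_base (3), w_base (3), g_proj (3), v_cmd (3), q (12), qd (12), a_prev (12) -> o_t has 48 entries.
```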
---
#### Action Space $\mathbf{a}_t$
$$ \mathbf{a}_t = [\, q_{1,t}^{target},\ q_{2,t}^{target},\ \dots,\ q_{n,t}^{target}\,] $$
where $q_{i,t}^{target}$ is the target position of the $i$-th joint at time step $t$, and $n$ is the total number of actuated joints.
---
#### Reward Function
- **Linear velocity tracking reward**
$$
r_{lin\ vel} = \exp\left(-\frac{ \| \mathbf{v}_{cmd} - \mathbf{v}_{base} \|^2 }{\sigma^2}\right), \quad \sigma = 0.5
$$
- **Angular velocity tracking reward**
$$
r_{ang\ vel} = \exp\left(-\frac{ \| \mathbf{\omega}_{cmd} - \mathbf{\omega}_{base} \|^2 }{\sigma^2}\right), \quad \sigma = 0.5
$$
- **Joint deviation penalty**
$$
r_{pos\ pen} = - \sum_{i \in J} \left| q_i - q_i^{default} \right|
$$
- **Base tilt penalty**
$$
r_{tilt} = -\left(g_{proj,x}^2 + g_{proj,y}^2\right)
$$
- **Joint acceleration penalty**
$$
r_{acc\ pen} = - \sum_{i \in J} \left( \ddot{q}_i \right)^2
$$
- **Action rate penalty**
$$
r_{action\ rate} = - \sum_{i \in J} \left( a_i - a_i^{prev} \right)^2
$$
- **Joint position limit penalty**
$$
r_{pos\ limits} = -\sum_{i \in J} \left[ \max(q_i^{min} - q_i, 0) + \max(q_i - q_i^{max}, 0) \right]
$$
- **Joint torque penalty**
$$
r_{torque} = - \sum_{i \in J} \tau_i^2
$$
- **Termination penalty**
$$
r_{termination\ pen} =
\begin{cases}
-200, & \text{if terminal condition met} \\
0, & \text{otherwise}
\end{cases}
$$
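A minimal sketch of how these reward terms could be combined into a single scalar per time step; the function signature and the weight dictionary are assumptions, since studying different weight combinations is part of the thesis.

```python
import numpy as np

def total_reward(v_cmd, v_base, w_cmd, w_base, g_proj, q, q_default,
                 qdd, a, a_prev, tau, q_min, q_max, terminated,
                 sigma=0.5, weights=None):
    """Evaluate all reward terms defined above and return their weighted sum plus a breakdown."""
    r = {}
    r["lin_vel"]  = np.exp(-np.sum((v_cmd - v_base) ** 2) / sigma ** 2)
    r["ang_vel"]  = np.exp(-np.sum((w_cmd - w_base) ** 2) / sigma ** 2)
    r["pos_pen"]  = -np.sum(np.abs(q - q_default))
    r["tilt"]     = -(g_proj[0] ** 2 + g_proj[1] ** 2)
    r["acc_pen"]  = -np.sum(qdd ** 2)
    r["act_rate"] = -np.sum((a - a_prev) ** 2)
    r["pos_lim"]  = -np.sum(np.maximum(q_min - q, 0) + np.maximum(q - q_max, 0))
    r["torque"]   = -np.sum(tau ** 2)
    r["term"]     = -200.0 if terminated else 0.0
    weights = weights or {}
    return sum(weights.get(k, 1.0) * v for k, v in r.items()), r
```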
---
#### Policy Architecture

- **Actor-Critic architecture**
- **Actor network**: $[n_{input},\ 256,\ 128,\ 128,\ n_{output}]$
- **Critic network**: $[n_{input},\ 256,\ 128,\ 128,\ 1]$
- **Activation**: ELU (Exponential Linear Unit)
- **Learning algorithm**: Proximal Policy Optimization (PPO)
- **Learning rate**: 0.001
- **Discount factor ($\gamma$)**: 0.99
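A minimal PyTorch sketch of the actor and critic MLPs with the hidden sizes and activation listed above; the input/output dimensions are assumed for illustration, and the PPO update itself is omitted.

```python
import torch
import torch.nn as nn

def mlp(n_input: int, n_output: int) -> nn.Sequential:
    """Three hidden layers (256, 128, 128) with ELU activations, as in the architecture above."""
    return nn.Sequential(
        nn.Linear(n_input, 256), nn.ELU(),
        nn.Linear(256, 128),     nn.ELU(),
        nn.Linear(128, 128),     nn.ELU(),
        nn.Linear(128, n_output),
    )

n_obs, n_act = 48, 12          # assumed dimensions for illustration
actor  = mlp(n_obs, n_act)     # outputs joint target positions (policy mean)
critic = mlp(n_obs, 1)         # outputs the state-value estimate

optimizer = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=1e-3)
# Training would use PPO with discount factor gamma = 0.99, per the hyperparameters above.
```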
## Passive Mimic Control Framework

#### Tracking Observation Space $\mathbf{o}_{passive,t}$
$$
\mathbf{o}_{\text{passive}, t} = \left[\, \mathbf{e}_{\text{com}},\ \mathbf{e}_{\text{ee}},\ \mathbf{e}_{q},\ \mathbf{e}_{\dot{q}}\, \right]
$$
where
$$
\begin{aligned}
&\mathbf{e}_{com} = \mathbf{p}_c - \mathbf{p}_c^{ref}, \quad
\mathbf{e}_{ee} = \mathbf{p}_{ee} - \mathbf{p}_{ee}^{ref}, \\
&\mathbf{e}_q = \mathbf{q} - \mathbf{q}^{ref}, \quad
\mathbf{e}_{\dot{q}} = \dot{\mathbf{q}} - \dot{\mathbf{q}}^{ref}.
\end{aligned}
$$
---
#### Tracking Reward Function $r_{passive}$
- **CoM trajectory tracking reward**:
$$
r_{com} = \exp\left(-\| \mathbf{p}_c - \mathbf{p}_c^{ref} \| \right)
$$
- **Foot (end-effector) trajectory tracking reward**:
$$
r_{ee} = \exp\left(-\| \mathbf{p}_{ee} - \mathbf{p}_{ee}^{ref} \| \right)
$$
- **Joint position tracking reward**:
$$
r_{q} = \exp\left(-\| \mathbf{q} - \mathbf{q}^{ref} \| \right)
$$
- **Joint velocity tracking reward**:
$$
r_{\dot{q}} = \exp\left(-\| \dot{\mathbf{q}} - \dot{\mathbf{q}}^{ref} \| \right)
$$
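A minimal sketch of the tracking errors and tracking rewards defined above, assuming the reference quantities come from the passive-walking trajectory; all names are illustrative.

```python
import numpy as np

def tracking_terms(p_c, p_c_ref, p_ee, p_ee_ref, q, q_ref, qd, qd_ref):
    """Compute the tracking errors e_* and the corresponding exponential tracking rewards r_*."""
    errors = {
        "e_com": p_c - p_c_ref,
        "e_ee":  p_ee - p_ee_ref,
        "e_q":   q - q_ref,
        "e_qd":  qd - qd_ref,
    }
    rewards = {
        "r_com": np.exp(-np.linalg.norm(errors["e_com"])),
        "r_ee":  np.exp(-np.linalg.norm(errors["e_ee"])),
        "r_q":   np.exp(-np.linalg.norm(errors["e_q"])),
        "r_qd":  np.exp(-np.linalg.norm(errors["e_qd"])),
    }
    return errors, rewards
```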
# Metrics
- **Energy efficiency**:
1. Cost of Transport (CoT) (a computation sketch follows this list)
$$
CoT_{Leg} = \frac{
\int_{0}^{T} \left( \sum_{i \in Leg} |\tau_i(t)\,\dot{\theta}_i(t)| \right)\,dt
}{
m \cdot g \cdot d
}
$$
2. Average power consumption: $P_{avg}$
- **Stability**:
1. Ground contact time
<div align="center">
<img src="https://hackmd.io/_uploads/SJpufwvLxe.png" width="70%"/>
</div>
2. Ground reaction force
<div align="center">
<img src="https://hackmd.io/_uploads/SyFFXwwUex.png" width="70%"/>
</div>
- **Agility and adaptability**:
1. Turning radius
2. Disturbance recovery capability
- **Task evaluation**:
1. Velocity tracking error
2. Trajectory error
- **Joint stability**:
1. Average power output
2. Torque distribution within one gait cycle
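A minimal sketch of computing the leg Cost of Transport from logged joint torques and velocities, following the CoT formula above; the logging arrays and argument names are assumptions.

```python
import numpy as np

def cost_of_transport(tau_log, qd_log, dt, mass, distance, g=9.81):
    """tau_log, qd_log: arrays of shape (T, n_leg_joints); dt: control period [s];
    mass [kg], distance travelled [m]. Returns the dimensionless leg CoT."""
    mech_power = np.sum(np.abs(tau_log * qd_log), axis=1)  # sum of |tau_i * qdot_i| over leg joints
    energy = np.sum(mech_power) * dt                       # integrate mechanical power over the episode
    return energy / (mass * g * distance)
```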