# Hindsight Experience Replay (HER)

[TOC]

Recommended background: [**DDPG**](https://hackmd.io/@bGCXESmGSgeAArScMaBxLA/SyVyCVxos)

**Paper** : https://arxiv.org/abs/1707.01495

## Introduction

**HER** addresses the **Sparse Reward** problem in **Reinforcement Learning**, which is common in robotics with three-dimensional, continuous **Environments**. The traditional workaround is a **Shaped Reward**, i.e. hand-crafting an elaborate reward scheme for the **Agent**.

The core idea of **HER (Hindsight Experience Replay)** is to let the **Agent** learn from failure: failed trajectories are reused as training data, which raises sample efficiency and removes the need for a complicated **Reward Function**.

## Multi-goal condition

HER can be combined with any **off-policy**, **Model-free** algorithm, and it requires the **Environment** to return its **State** in a **Multi-goal** form: alongside the **State** $s_t$, the **Environment** provides a **Goal** $g$. The **Goal** stays the same for every **Step** within an **Episode**, and the task is to achieve that **Goal**.

## Hindsight Experience Replay

Under the **Multi-goal** setting, if the **Agent** finishes an **Episode** without achieving the assigned **Goal**, we can replace that **Goal** with one it actually reached and pretend the **Episode** was a success.

![image](https://hackmd.io/_uploads/rkSg7Du-We.png)

Among the goal-generation strategies evaluated in the **Paper**, the **future strategy** performs best, with `k=4` giving the strongest results.

![image](https://hackmd.io/_uploads/H1fOXDOW-x.png)

How does the future `k=4` **Strategy** generate data? During the **Rollout** phase the agent interacts with the environment for **N** **Episodes** and stores them in the **Buffer**. After the rollout, we walk through every **Step** of each **Episode**. At the current **Step i**, we randomly pick four **Achieved Goals** from steps **i** onward in the same **Episode** and use each of them as a new **Goal (Desired Goal)**: the **Goal** of **Step i** is replaced, the **Reward** is recomputed, and the resulting transition is stored back into the **Replay Buffer**. Once every **Episode** and **Step** has been processed, the **Network Update** can begin.

## Code

#### Rollout

In the training loop, the Agent interacts with the environment and collects experience. This shows how `observation`, `achieved_goal`, and `desired_goal` are gathered.

```python
for i in range(self.env._max_episode_steps):
    a = self.choose_action(s)
    s_, r, done, truncated, _ = self.venv.step(a)
    # Fetch episodes have a fixed horizon, so the env's done signal is ignored
    done = np.zeros_like(r)
    self.total_steps += self.num_envs

    mb_state[i] = s["observation"].copy()
    mb_ach_goal[i] = s["achieved_goal"].copy()
    mb_des_goal[i] = s["desired_goal"].copy()
    mb_action[i] = a.copy()
    mb_reward[i] = r.reshape(-1, 1).copy()
    mb_next_state[i] = s_["observation"].copy()
    mb_next_ach_goal[i] = s_["achieved_goal"].copy()
    mb_next_des_goal[i] = s_["desired_goal"].copy()
    mb_done[i] = done.reshape(-1, 1).copy()

    # update state
    s = s_
```

#### HER Sample

This is the **HER** sampling implementation. It iterates over every Episode and every time step, and uses the `future` strategy to randomly pick `k` `achieved_goal`s from later time steps of the same **Episode** as new `desired_goal`s.

```python
def her_sample(self, mb_state, mb_action, mb_next_state,
               mb_ach_goal, mb_des_goal, mb_next_ach_goal, mb_next_des_goal):
    # HER sampling with 'future' strategy and k=4
    k = 4
    num_episode = mb_state.shape[0]
    num_steps = mb_state.shape[1]

    for ep in range(num_episode):
        for t in range(num_steps):
            # Get original transition data
            obs = mb_state[ep, t]
            actions = mb_action[ep, t]
            obs_next = mb_next_state[ep, t]
            ag = mb_ach_goal[ep, t]
            ag_next = mb_next_ach_goal[ep, t]

            # Sample k future indices from the same episode
            future_indices = np.random.randint(t, num_steps, size=k)
            # Get future achieved goals as new goals
            future_goals = mb_next_ach_goal[ep, future_indices]

            # Iterate through each future goal to compute reward and store transition
            for i in range(k):
                new_goal = future_goals[i]
                # Recompute reward for each new goal individually
                new_reward = self.env.unwrapped.compute_reward(ag_next, new_goal, None)
                done = (new_reward == 0).astype(np.float32)

                self.replay_buffer.store(obs, ag, new_goal, actions, new_reward,
                                         obs_next, ag_next, new_goal, done)
```
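The relabeled transitions above get their reward from the environment's `compute_reward`, which in the code is taken from the environment itself via `self.env.unwrapped.compute_reward`. As a reference, here is a minimal sketch of what such a sparse, goal-conditioned reward typically looks like for the Fetch tasks; the function name mirrors the call above, and the `distance_threshold` value is an assumption for illustration:

```python
import numpy as np

def compute_reward(achieved_goal, desired_goal, info, distance_threshold=0.05):
    """Sparse binary reward: 0 if the achieved goal lies within
    distance_threshold of the desired goal, -1 otherwise."""
    d = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
    return -(d > distance_threshold).astype(np.float32)
```

Under this convention, `new_reward == 0` marks a relabeled transition as successful, which is exactly how `done` is set in `her_sample` above.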
#### Choose Action

The action-selection strategy combines **Gaussian Noise** (for exploration) with **Epsilon-Greedy** random actions, which is the standard setup from the HER paper.

```python
def choose_action(self, state):
    observation = state["observation"]
    desired_goal = state["desired_goal"]
    achieved_goal = state["achieved_goal"]

    # Apply normalization if enabled
    if self.use_normalization:
        observation = self.state_normalizer(observation, clip_range=self.clip_range)
        desired_goal = self.goal_normalizer(desired_goal, clip_range=self.clip_range)
        achieved_goal = self.goal_normalizer(achieved_goal, clip_range=self.clip_range)

    state = np.concatenate([observation, desired_goal, achieved_goal], axis=-1)
    state = torch.tensor(state, dtype=torch.float, device=self.device)
    s = torch.unsqueeze(state, 0)

    # 1. Get deterministic action
    with torch.no_grad():
        action = self.actor(s).cpu().numpy().squeeze()

    # 2. Add Gaussian noise
    # action += self.args.noise_eps * self.env_params['action_max'] * np.random.randn(*action.shape)
    # Adapted: self.var corresponds to noise_eps
    action += self.var * self.action_max * np.random.randn(*action.shape)

    # 3. Clip action
    action = np.clip(action, -self.action_max, self.action_max)

    # 4. Random actions (Epsilon-Greedy)
    random_actions = np.random.uniform(low=-self.action_max, high=self.action_max,
                                       size=self.action_dim)

    # 5. With probability random_eps, replace the action with the random one
    # action += np.random.binomial(1, self.args.random_eps, 1)[0] * (random_actions - action)
    # Adapted: random_eps = 0.3
    random_eps = 0.3
    action += np.random.binomial(1, random_eps, 1)[0] * (random_actions - action)

    return action
```

## Result

**FetchReach** :
![FetchReach-v4_training_curve](https://hackmd.io/_uploads/HJadCwYWbl.png =60%x)

**FetchPush** :
![FetchPush-v4_training_curve](https://hackmd.io/_uploads/BJa_0vF--x.png =60%x)

**FetchPickAndPlace** :
![FetchPickAndPlace-v4_training_curve](https://hackmd.io/_uploads/ryTOAPYZZx.png =60%x)

**FetchSlide** :
![FetchSlide-v4_training_curve](https://hackmd.io/_uploads/ry6dRDt-Wx.png =60%x)

### Conclusion

**Hindsight Experience Replay (HER)** is a highly influential technique in robotic reinforcement learning that elegantly solves the **Sparse Reward** problem.

* **Compatible Algorithms**
    * HER can be combined with any **Off-policy** RL algorithm, such as **DDPG**, **TD3**, **SAC**, or **DQN**.
    * This is because **HER** only modifies the data in the **Replay Buffer** (it changes the **Goal**), and **Off-policy** algorithms are by nature able to train on data produced by a different policy or under a different goal.
* **Limitations**
    * **Multi-goal Environment**: HER relies on being able to substitute the **Goal**, so the environment must be **Multi-goal** (i.e. the **State** contains a **Desired Goal**).
    * **Goal Representation**: there must be a mapping from an **Achieved Goal** into the **Desired Goal** space.
* **Pros & Cons**
    * **Pros**:
        * **Solves Sparse Reward**: even if the **Agent** never succeeds, **HER** still provides a reward signal, so it can learn from failure.
        * **Sample Efficiency**: data utilization is greatly improved.
    * **Cons**:
        * **Replay Buffer growth**: the extra **HER** transitions increase memory requirements.
        * **Training complexity**: goal bookkeeping and extra **Reward** computation are required.
* **Benefit for Robotics**
    * In real-world robotic tasks, designing a perfect **Reward Function** is very hard (**Reward Engineering**). **HER** makes it possible to train complex manipulation policies (pushing, pick-and-place, etc.) with nothing more than a simple **Binary Reward** (success/failure), which makes robot learning far more practical and general.