
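# Policy Gradient

[TOC]

Category :
* `Model-Free`
* `Policy Based`
* `on Policy`
* `Discrete Action Space`
* `Discrete/Continuous State Space`

## Introduction

**Policy Gradient** is a **Policy-Based** algorithm. Unlike the **Value-Based** **DQN**, its network outputs action probabilities instead of expected rewards. We build a **Model** with a **Neural Network** whose **Input** is the **State** $s_t$ and whose last-layer **Activation Function** is **Softmax**, so the outputs are probabilities that sum to 1. The **Model** parameters are $\theta$, and the **Neural Network** that uses $\theta$ to decide the **Action** is called $\pi$.

![image](https://hackmd.io/_uploads/BJpMlY3Mee.png)

With $s_t$ as **Input**, the probability that $\pi$ with parameters $\theta$ takes action $A_0$ is $p_\theta(A_0|s_t)$. Because of the **Softmax**, the probabilities sum to one: $p_\theta(A_0|s_t)+p_\theta(A_1|s_t)=1$.

![image](https://hackmd.io/_uploads/r1pOgY2zxe.png)

Take CartPole as an example. With $s_t$ as Input, the probability that the **Neural Network** $\pi$ with parameters $\theta$ moves left ($A_0$) is $p_\theta(A_0|s_t)$, e.g. left **0.7** and right **0.3**. Here $s_t$ consists of the cart position and velocity plus the pole angle and angular velocity. The whole interaction with the **Environment** can be recorded as a trajectory, which the next section defines.

## Trajectory

The Agent interacts with the environment as follows :

![image](https://hackmd.io/_uploads/By2NtuLQll.png)

One pass through the loop above is one **Step**. The **Agent** keeps looping until the environment or game ends, and one full game is called an **Episode**. The **Total Reward** of an **Episode** can be written as $R=\sum_{t=1}^Tr_t$, and the Agent's goal is to maximize this **Episode Reward**. Putting all the information of one **Episode** together gives a **Trajectory**, which can be written as :

$\tau^n=\{(s^n_1,a^n_1,r^n_1),...,(s^n_{T_n},a^n_{T_n},r^n_{T_n})\}$

* $n$ : **Episode**
* $t$ : **Step**
* $T_n$ : number of steps in **Episode** $n$

Interacting with the environment for one **Episode** yields one **Trajectory**. If any **state, action, reward** differs, or the order differs, it counts as a different **Trajectory**.

## Expected Value Overview

Let us describe the expected value in the notation commonly used in machine learning, with a die as the example :

$E_{x\sim p(x)}[g(x)]=\int^\infty_{-\infty}p(x)g(x)dx$

* $x$ : the random variable; think of it as the die faces 1 ~ 6
* $p(x)$ : the probability density function from which $x$ is sampled. For a fair die the output is $\frac{1}{6}$ whatever the input $x$ is (in other examples it may depend on $x$).
* $E_{x\sim p(x)}$ : the expectation taken over $x$ **sampled** from the **Probability Density Function** $p(x)$.
* $g(x)$ : the value actually obtained for the outcome $x$. For the die $g(1)=1$, i.e. $g(x)=x$.
* $E_{x\sim p(x)}[g(x)]$ : the value we expect to obtain when $x$ is **sampled** from $p(x)$.
* $\int^\infty_{-\infty}p(x)g(x)dx$ : the expression above is notation, while this is the actual computation. An integral is an area, which in practice is a continuous summation. The good news is that a die is discrete: $x$ takes only the 6 values 1 ~ 6, so it can also be written as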
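$\sum_{x=1}^6 g(x)p(x)$.

With $g(x)=x$ and $p(x)=\frac{1}{6}$, the expected value of the die is :

$E_{x\sim p(x)}[g(x)]=\sum_{x=1}^6 g(x)p(x)=\sum_{x=1}^6 x\frac{1}{6}$

$=1\cdot\frac{1}{6}+2\cdot\frac{1}{6}+3\cdot\frac{1}{6}+4\cdot\frac{1}{6}+5\cdot\frac{1}{6}+6\cdot\frac{1}{6}=3.5$

What if we cannot enumerate every $x$, or cannot obtain the probability density function $p(x)$? We can approximate the expectation by sampling :

$\frac{1}{N}\sum_{n=1}^N V_n$

$V_n$ is the value of the $n$-th **sampled** die roll. Outcomes with higher probability simply show up more often among the samples, so the weighting by $p(x)$ is already built into the sampling. We can therefore drop $p(x)$ and just take the average.

---

Code test

```python!
import numpy as np

# Exact expectation: enumerate all six faces, each with probability 1/6
exact = 0.0
for face in range(1, 7):
    exact += (1 / 6) * face
print("Real Expected :", exact)

# Monte Carlo estimate: average of num uniformly sampled faces
num = 300000
sample = np.random.choice(6, num) + 1  # draw num values from 1 to 6
estimate = sample.mean()
print("Sample Expected :", estimate)
```

![image](https://hackmd.io/_uploads/rk9WsqLXeg.png)

## Expected Reward

Our **Agent** is a **Neural Network** that outputs a probability for each action. You can read this as how likely the **Agent** thinks each action is to earn reward: with left 0.7 and right 0.3, the **Agent** believes moving left earns more Reward. The **Total Reward** of an **Episode**, i.e. of a **Trajectory**, can be written as :

$R(\tau)=\sum_{t=1}^Tr_t$

To compute the expected reward the **Neural Network** can obtain, we again follow the **MDP** formulation, but the transition probabilities have to be included, so we write :

$\hat{R}_\theta=E_{\tau\sim p_\theta(\tau)}[R(\tau)]=\sum_\tau R(\tau)p_\theta(\tau)$

* $\hat{R}_\theta$ : the expected reward the **Neural Network** obtains with parameters $\theta$
* $\tau$ : a **Trajectory**, the record of one **Episode** of interaction with the environment, containing **state, action, reward** in time order.
* $p_\theta(\tau)$ : the **Probability Density Function**, i.e. the probability that the **Neural Network** with parameters $\theta$ samples the random variable $\tau$.
* $E_{\tau\sim p_\theta(\tau)}[R(\tau)]$ : the expected value of $R(\tau)$ when $\tau$ is **sampled** from $p_\theta(\tau)$, i.e. the expected reward.
* $\sum_\tau R(\tau)p_\theta(\tau)$ : enumerate every possible $\tau$ and multiply its probability $p_\theta(\tau)$ by its total reward $R(\tau)$

We cannot enumerate every $\tau$, so we need the sampling approximation. But if the probability is simply dropped, $\theta$ no longer appears in the expression and the **Neural Network** cannot take part in the computation, so the derivation has to continue.

## Trajectory Probability

As stated above, the Expected Reward is :

$\hat{R}_\theta=\sum_\tau R(\tau)p_\theta(\tau)$

Expanding $p_\theta(\tau)$ :

$p_\theta(\tau)=p(s_1)p_\theta(a_1|s_1)p(s_2|s_1,a_1)p_\theta(a_2|s_2)p(s_3|s_2,a_2)...$

The expansion of $p_\theta(\tau)$ is built from three kinds of factors, the last two repeating in a cycle :

* $p(s_1)$ : the probability that the environment **samples** the initial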
**state** $s_1$. It contains no $\theta$: this probability is controlled by the environment and the **Neural Network** has no influence on it.
* $p_\theta(a_1|s_1)$ : the probability that the **Neural Network** with parameters $\theta$ **samples** action $a_1$ given $s_1$ as **Input**
* $p(s_2|s_1,a_1)$ : the probability that the environment **samples** $s_2$ given the previous $s_1$ and the **Agent**'s action $a_1$

Because the factors repeat, the expansion can be written as a product.

$p_\theta(\tau)=p(s_1) \prod_{t=1}^Tp_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)$

## Policy Gradient

**Deep Learning** optimizes parameters with **Gradient Descent**, which lowers a **Loss**. To maximize the **Reward** we use **Gradient Ascent** instead, so the update rule changes from $\theta_{t+1}=\theta_t-\eta\nabla f(\theta_t)$ to $\theta_{t+1}=\theta_t+\eta\nabla f(\theta_t)$, where $\eta$ is the learning rate.

To update the **Neural Network** parameters $\theta$, we take the gradient of the **Expected Reward Function** with respect to $\theta$ :

$\hat{R}_\theta=\sum_\tau R(\tau)p_\theta(\tau)$

Differentiating :

$\nabla\hat{R}_\theta=\sum_\tau R(\tau)\nabla p_\theta(\tau)$

Only $p_\theta(\tau)$ depends on $\theta$, so only $p_\theta$ needs to be differentiated. As mentioned earlier, we cannot enumerate $\tau$, so we keep going and multiply numerator and denominator by $p_\theta(\tau)$ :

$\large \nabla\hat{R}_\theta=\sum_\tau R(\tau)p_\theta(\tau)\frac{\nabla p_\theta(\tau)}{p_\theta(\tau)}$

Substitute the identity $\nabla f(x)=f(x)\nabla \log f(x)$ :

$\large \nabla\hat{R}_\theta=\sum_\tau R(\tau)p_\theta(\tau)\frac{p_\theta(\tau)\nabla \log p_\theta(\tau)}{p_\theta(\tau)}$

which simplifies to :

$\nabla\hat{R}_\theta=\sum_\tau R(\tau)p_\theta(\tau)\nabla\log p_\theta(\tau)$

Written back as an expectation :

$\nabla\hat{R}_\theta=E_{\tau\sim p_\theta(\tau)}[R(\tau)\nabla\log p_\theta(\tau)]$

The enumerated form can never be computed in practice, so we switch to the sampling approximation (which lets us drop the probability density function $p_\theta(\tau)$ in front) and average over **sampled** data :

$E_{\tau\sim p_\theta(\tau)}[R(\tau)\nabla\log p_\theta(\tau)]\approx\frac{1}{N}\sum_{n=1}^N R(\tau^n)\nabla\log p_\theta(\tau^n)$

$p_\theta(\tau)$ contains both the $\theta$-dependent probabilities and the **Environment** probabilities. We do not know the **Environment** terms $p(s_1)$ and $p(s_{t+1}|s_t,a_t)$, but they do not depend on the Agent or on $\theta$, so their gradient is zero and they drop out :

$=\frac{1}{N}\sum_{n=1}^N R(\tau^n)\nabla\log\prod_{t=1}^{T_n}p_\theta(a^n_t|s^n_t)$

Because
$\log(xy)=\log(x)+\log(y)$, the product inside the log can be written as a sum :

$\large=\frac{1}{N}\sum_{n=1}^N R(\tau^n)[\sum_{t=1}^{T_n} \nabla\log p_\theta(a^n_t|s^n_t)]$

Rearranging, $R(\tau^n)$ can be moved inside the inner sum :

$\large=\frac{1}{N}\sum_{n=1}^N\sum_{t=1}^{T_n} R(\tau^n)\nabla\log p_\theta(a^n_t|s^n_t)$

* $n$ : **Episode**
* $t$ : **Step**

This is intuitive as well: if the **Reward** of a trajectory is positive, say 100, then raising the probability $p_\theta(a^n_t|s^n_t)$ of the actions taken in it increases the **Expected Reward**.

$\log(x)$ :

![image](https://hackmd.io/_uploads/H1oULW_mle.png)

## Update model

As mentioned above, we perform Gradient Ascent :

* $\theta_{t+1} = \theta_t + \eta \nabla \hat{R}_\theta$
* $\nabla\hat{R}_\theta=\frac{1}{N}\sum_{n=1}^N\sum_{t=1}^{T_n} R(\tau^n)\nabla\log p_\theta(a^n_t|s^n_t)$

The implementation below uses **Pytorch**. Given a **Loss Function** it differentiates automatically, and an optimizer (e.g. **Adam**) trains the **Neural Network**. Since the optimizer minimizes, the final **Loss Function** just needs a minus sign to turn the minimization into **Gradient Ascent** :

$Loss=-\frac{1}{N}\sum_{n=1}^N\sum_{t=1}^{T_n} R(\tau^n)\log p_\theta(a^n_t|s^n_t)$

### Advantage

The $R(\tau)$ used so far can be replaced with other **Functions**. In the extensions of **Policy Gradient** this term is called the **Advantage Function**. The figure below writes it as $\Psi$ and lists forms 1 ~ 6 that it can take :

![image](https://hackmd.io/_uploads/H1TAPG_Xle.png)

Form 1 is the one we derived, the **Episode total reward**. Form 2 is the **reward** accumulated from **step** $t$ to the end of the episode, similar to the return used in **DQN**'s **TD** target but without the discount factor $\gamma$. Form 3 is the **baseline** version of form 2. It gives the **Agent** something to compare against, so the result can be positive or negative. The simplest choice is the average over **Episodes**, subtracted from the **reward** of every **step**. Forms 4 ~ 6 are the ones used by **Actor-Critic**.

## CartPole Example

![1748694012588](https://hackmd.io/_uploads/Hy2utddzxe.gif)

In this **Environment** a pole is attached by a free joint to a cart that moves left and right along a frictionless track. The goal is to keep the pole balanced by moving the cart.

### Action Space

This **Environment** has a **Discrete Action Space**, and each step takes a single action value :

* 0 : **Move Left**
* 1 : **Move Right**

### Observation Space

The **State** is returned as four **floats**: **Cart Position**, **Cart Velocity**, **Pole Angle** and **Pole Angular Velocity**.

![image](https://hackmd.io/_uploads/HkIEtjdzxx.png)

---

### Rewards

The goal is to keep the pole upright for as long as possible. Every **Step** gives a **Reward** of 1, including the step on which the pole falls and the episode terminates. If
`sutton_barto_reward=True` is set, every **Step** gives a **Reward** of 0 and the terminating step gives -1.

---

### Starting State

Every value of the starting **State** is drawn at random from the range (-0.05, 0.05).

---

### Episode End

Termination :
* The pole angle exceeds ±12°
* The cart moves past the edge of the track

**Truncation** (when `TimeLimit` is used)
* The **Episode** is longer than 500 Steps (200 for v0)

## CartPole-v1 Policy Gradient

**Github Code** : https://github.com/jason19990305/PolicyGradient.git

### Network

```python=
import torch.nn as nn

class Network(nn.Module):
    def __init__(self, args, hidden_layers=(64, 64)):
        super(Network, self).__init__()
        self.num_states = args.num_states
        self.num_actions = args.num_actions
        # Prepend the input size and append the output size.
        # A new list is built so neither the default nor the caller's list is mutated.
        hidden_layers = [self.num_states, *hidden_layers, self.num_actions]

        # Create fully connected layers
        fc_list = []
        for i in range(len(hidden_layers) - 1):
            num_input = hidden_layers[i]
            num_output = hidden_layers[i + 1]
            layer = nn.Linear(num_input, num_output)
            fc_list.append(layer)

        # Convert list to ModuleList for proper registration
        self.layers = nn.ModuleList(fc_list)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)  # Softmax for action probabilities

    def forward(self, x):
        # Pass input through all layers except the last, applying ReLU activation
        for i in range(len(self.layers) - 1):
            x = self.relu(self.layers[i](x))
        # The final layer goes through Softmax and outputs action probabilities
        action_probability = self.softmax(self.layers[-1](x))
        return action_probability
```

* There are many tools for building a **Neural Network**. Here we use **Pytorch**: `class Network(nn.Module)` inherits from `torch.nn.Module` to build a basic **Neural Network**.
* `args` is passed in from the **Hyperparameters** of the main program. `num_states` and `num_actions` are the **state** and **action** sizes of the **Environment**. `num_actions` is the number of action types, while the action actually output is a single scalar. Note that some environments require several actions at once, such as a six-axis robot arm.
* `hidden_layers` is a **List** that defines the width of each **Hidden layer**. You can also read it as the **Output shape**, i.e. how large a **Vector** each layer outputs. The **Input/Output layer** sizes are determined by `num_states` and `num_actions`.

```python=
# Create fully connected layers
fc_list = []
for i in range(len(hidden_layers) - 1):
    num_input = hidden_layers[i]
    num_output = hidden_layers[i + 1]
    layer = nn.Linear(num_input, num_output)
    fc_list.append(layer)

# Convert list to ModuleList for proper registration
self.layers = nn.ModuleList(fc_list)
self.relu = nn.ReLU()
```

* The code above creates the **Linear Layer** instances (**DNN**, fully connected) and stores them in a `ModuleList`.
* The **Activation Function** is the common **ReLU**.

```python=
def forward(self, x):
    # Pass input through all layers except the last, applying ReLU activation
    for i in range(len(self.layers) - 1):
        x = self.relu(self.layers[i](x))
    # The final layer goes through Softmax and outputs action probabilities
    action_probability = self.softmax(self.layers[-1](x))
    return action_probability
```

* We **override** `forward()` to define the forward pass of the policy network. `i` is the index of the current layer. The output of the last layer goes through **Softmax** so that the network outputs probabilities: **Policy Gradient** is a **Policy-Based** algorithm, the **loss** must be computed from probabilities, and the probabilities of all actions must sum to 1.
* `self.relu(self.layers[i](x))` applies the fully connected layer followed by the **ReLU Function**.

---

### Replay Buffer

```python=
import numpy as np
import torch

class ReplayBuffer:
    def __init__(self, args):
        self.clear_batch()
        self.episode_count = 0
        self.episode_batch = []

    def clear_batch(self):
        # Per-step storage for the episode currently being collected
        self.s = []
        self.a = []
        self.r = []
        self.s_ = []
        self.done = []
        self.count = 0

    def store(self, s, a, r, s_, done):
        self.s.append(s)
        self.a.append(a)
        self.r.append(r)
        self.s_.append(s_)
        self.done.append(done)

    def to_episode_batch(self):
        # Convert the finished episode to tensors and keep it as one trajectory
        s = torch.tensor(np.array(self.s), dtype=torch.float)
        a = torch.tensor(np.array(self.a), dtype=torch.int64)
        r = torch.tensor(np.array(self.r), dtype=torch.float)
        s_ = torch.tensor(np.array(self.s_), dtype=torch.float)
        done = torch.tensor(np.array(self.done), dtype=torch.float)
        self.episode_batch.append((s, a, r, s_, done))
        self.episode_count += 1
        self.clear_batch()

    def clear_episode_batch(self):
        self.episode_batch = []
        self.episode_count = 0
```

* `__init__` : initializes the **Batch** `(s,a,r,s_,done)` and the list of **Episodes** (**Trajectory** $\tau$)
* `clear_batch` : clears the data in the **Batch**. It is **called** after each **Episode** ends.
* `store` : called after every `step` to store **State, Action, Reward, Next State** and **Done** in the Batch
* `to_episode_batch` : moves the **Batch** of the current **Episode** into the larger **Buffer**, which stores one **Trajectory** $\tau$ per **Episode**
* `clear_episode_batch` : clears all **Trajectories** $\tau$. This is normally done after the **Neural Network Update**: the probabilities have changed, so new **Trajectories** would look different and the old data can no longer evaluate the current **Neural Network**.

### Initialize

```python
def __init__(self, args, env, hidden_layer_list=[64, 64]):
    # Hyperparameter
    self.training_episodes = args.training_episodes
    self.advantage = args.advantage
    self.num_states = args.num_states
    self.num_actions = args.num_actions
    self.epochs = args.epochs
    self.lr = args.lr

    # Variable
    self.episode_count = 0

    # other
    self.env = env
    self.replay_buffer = ReplayBuffer(args)

    # The model interacts with the environment and gets updated
    self.actor = Network(args, hidden_layer_list.copy())
    print(self.actor)
    self.optimizer = torch.optim.Adam(self.actor.parameters(), lr=self.lr, eps=1e-5)
```

The code above is the initialization of the class `Agent`. Its variables are :

* `training_episodes` : the number of **Trajectories** to **sample**. Once this much data is collected, **Training** runs.
* `advantage` : selects the **Advantage function**
    * `0` : **Total Reward**
    * `1` : **Reward Following**
    * `2` : **Baseline**
* `num_states` : the number of values in a **State**
* `num_actions` : the number of **Action** types that can be output. Note that the agent still outputs only a single action per step.
* `epochs` : the number of **Agent Rollouts**, i.e. how many **Episodes** to interact with the environment
* `lr` : **Learning Rate**

---

* `actor` : the **Neural Network**
* `optimizer` : the **Adam** optimizer, set to optimize the parameters of the `actor` **Neural Network**

---

### Choose action

```python=
def choose_action(self, state):
    with torch.no_grad():
        # Add a batch dimension: shape (4,) -> (1, 4)
        state = torch.unsqueeze(torch.tensor(state, dtype=torch.float), dim=0)
        action_probability = self.actor(state).numpy().flatten()
        # Sample an action index according to the policy probabilities
        action = np.random.choice(self.num_actions, p=action_probability)
    return action
```

This Function samples an Action during Training. `with torch.no_grad()` keeps the computation out of the Backward graph, because at this point we are only interacting with the environment and are not computing a loss yet.

* `unsqueeze` : adds a dimension of size 1 to the Tensor, e.g. shape [4] -> [1,4], so that a single state becomes a batch of one
* `action_probability` : given the **Input State**, the **Agent** outputs the probability of each Action, $p_\theta(a_t|s_t)$
* `np.random.choice` : takes a probability distribution and returns a value from 0 to `num_actions` - 1 according to it, e.g. with prob=[0.9,0.1] the output is 0 with probability 90%

---

### Evaluate action

```python=
def evaluate_action(self, state):
    with torch.no_grad():
        # choose the action with the highest probability for the current state
        state = torch.unsqueeze(torch.tensor(state, dtype=torch.float), dim=0)
        action_probability = self.actor(state)
        action = torch.argmax(action_probability).item()
    return action
```

This is similar to `choose_action`, except that it always picks the **Action** with the highest probability. `torch.argmax` returns the **Index** of the largest value.

### Train

```python=
def train(self):
    episode_reward_list = []
    episode_count_list = []
    episode_count = 0

    # Training loop
    for epoch in range(self.epochs):
        # reset environment
        state, info = self.env.reset()
        done = False
        while not done:
            action = self.choose_action(state)
            # interact with environment
            next_state, reward, terminated, truncated, _ = self.env.step(action)
            done = terminated or truncated
            self.replay_buffer.store(state, action, [reward], next_state, [done])
            state = next_state

        self.replay_buffer.to_episode_batch()  # Convert to episode batch

        if (epoch + 1) % self.training_episodes == 0 and epoch != 0:
            # Update the model
            self.update()
            self.replay_buffer.clear_episode_batch()  # Clear the episode batch after updating

        if epoch % 10 == 0:
            # evaluate() is defined elsewhere in the repository and is not shown here
            evaluate_reward = self.evaluate(self.env)
            print("Epoch : %d / %d\t Reward : %0.2f" % (epoch, self.epochs, evaluate_reward))
            episode_reward_list.append(evaluate_reward)
            episode_count_list.append(episode_count)
        episode_count += 1

    # Plot the training curve (assumes `import matplotlib.pyplot as plt` at module level)
    plt.plot(episode_count_list, episode_reward_list)
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.title("Training Curve")
    plt.show()
```

* `replay_buffer.store` : adds the $(s_t,a_t,r_t,s_{t+1})$ of this Step and the end flag done to the Buffer
* `replay_buffer.to_episode_batch()` : adds the data of this Trajectory to the Episode Buffer
* `update` : once the number of collected Episodes reaches `training_episodes`, the Neural Network parameters are updated
* `clear_episode_batch` : clears all data in the Buffer

### Update

```python=
def update(self):
    loss = 0
    base_line = 0
    # TotalReward() and RewardFollowing() are helper methods from the repository, not shown here
    for batch in self.replay_buffer.episode_batch:
        s, a, r, s_, done = batch
        base_line += self.TotalReward(r)
    base_line /= self.replay_buffer.episode_count  # Normalize baseline by number of episodes

    for batch in self.replay_buffer.episode_batch:
        s, a, r, s_, done = batch
        a = a.view(-1, 1)  # Reshape action from (N) -> (N, 1) for gathering

        if self.advantage == 0:
            adv = self.TotalReward(r)
        elif self.advantage == 1:
            adv = self.RewardFollowing(r)
        elif self.advantage == 2:
            adv = self.RewardFollowing(r)
            adv = adv - base_line

        prob = self.actor(s).gather(dim=1, index=a)  # Get action probability from the model
        log_prob = torch.log(prob + 1e-10)  # Add small value to avoid log(0)
        loss += (adv * log_prob).sum()

    loss = - loss / self.replay_buffer.episode_count  # Normalize loss by number of episodes
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
```

---

The first for loop computes the baseline, which is the average of the Episode total rewards in the batch.

The second for loop first reshapes `a` from [N] to [N,1] and then computes the advantage $\Psi^n_t$, for which there are three options :

1. **Total Reward** : $\Psi^n_t=\sum_{t'=1}^{T_n}r^n_{t'}$
2. **Reward Following** : $\Psi^n_t=\sum_{t'=t}^{T_n}r^n_{t'}$
3. **Baseline** : $\Psi^n_t=\sum_{t'=t}^{T_n}r^n_{t'}-b$, where in this code $b$ is the mean Episode total reward from the first loop

After the **Advantage** we compute the **log Probability**, then the **loss**, and finally the average over Episodes, which gives the formula from earlier :

$Loss=-\frac{1}{N}\sum_{n=1}^N\sum_{t=1}^{T_n} \Psi^n_t\log p_\theta(a^n_t|s^n_t)$

$\Psi$ is the **Advantage Function** and the minus sign is there for **Gradient Ascent**. After that, `zero_grad()`, `backward()` and `step()` optimize the parameters according to the Loss.

## Result

**Hyperparameter** :

```
training_episodes = 3
advantage = 0
epochs = 5000
lr = 0.0001
num_states = 4
num_actions = 2
---------------
Network(
  (layers): ModuleList(
    (0): Linear(in_features=4, out_features=128, bias=True)
    (1): Linear(in_features=128, out_features=128, bias=True)
    (2): Linear(in_features=128, out_features=2, bias=True)
  )
  (relu): ReLU()
  (softmax): Softmax(dim=1)
)
```

Result :

![image](https://hackmd.io/_uploads/BJbbUOcNgg.png)

![1750922797409](https://hackmd.io/_uploads/S1RSId94xg.gif)

---

```
training_episodes = 3
advantage = 1
epochs = 5000
lr = 0.0001
num_states = 4
num_actions = 2
---------------
Network(
  (layers): ModuleList(
    (0): Linear(in_features=4, out_features=128, bias=True)
    (1): Linear(in_features=128, out_features=128, bias=True)
    (2): Linear(in_features=128, out_features=2, bias=True)
  )
  (relu): ReLU()
  (softmax): Softmax(dim=1)
)
```

![image](https://hackmd.io/_uploads/B1QBDu5Ngl.png)

![1750923092348](https://hackmd.io/_uploads/SyWtPO5Vgl.gif)

---

## Conclusion

This note implemented three **Advantage Functions**. The **Baseline** I wrote is a simple one, and it still needs to be combined with **Actor-Critic** to work well. **Policy Gradient** has plenty of drawbacks, and it works here mainly because **CartPole** is a fairly easy environment.

This version of **Policy Gradient**, with a **Softmax** output, only works for a **Discrete Action Space**. For a continuous action space, **on-policy** reinforcement learning algorithms usually use a **Normal Distribution** to generate continuous actions, which still gives the probability of an Action. See **Vanilla Policy Gradient**, and **PPO** offers a similar approach.

Also note that the method implemented here only outputs a single action. If the environment requires several actions at once, for example a robot arm that needs the angles of several **Joints** at the same time, the **Agent** has to produce several actions, and their probabilities must be multiplied to get the probability of that combination before the rest of the computation can proceed.

To make **Policy Gradient** work better you can add **policy entropy**. Entropy is a familiar quantity from Deep Learning classification problems, but here we compute the **Entropy** of the **Agent**'s **Action** probabilities. The more uniform the probabilities, the larger the **Entropy**.
We then use **Loss - Entropy** as the objective, so the optimization also favors a more even spread over actions, which helps keep the policy from getting stuck in a local optimum.
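
The entropy term from the conclusion can be made concrete with a small sketch. It is not part of the original repository: `policy_entropy`, `loss_with_entropy_bonus` and the coefficient `beta` are hypothetical names chosen for illustration, and plain Python floats are used instead of tensors, so it only shows the arithmetic, not the gradient flow.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of one discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def loss_with_entropy_bonus(pg_loss, step_probs, beta=0.01):
    """Return pg_loss - beta * mean entropy.

    pg_loss    : policy gradient loss (already negated for minimization)
    step_probs : list of action distributions, one per step
    beta       : hypothetical entropy coefficient (an assumption, not from the source)
    """
    mean_entropy = sum(policy_entropy(p) for p in step_probs) / len(step_probs)
    return pg_loss - beta * mean_entropy

uniform = [0.5, 0.5]  # the agent is undecided between left and right
peaked = [0.9, 0.1]   # the agent strongly prefers left

print("entropy uniform :", policy_entropy(uniform))  # equals ln(2), the maximum for two actions
print("entropy peaked  :", policy_entropy(peaked))   # smaller than the uniform case
print("loss with bonus :", loss_with_entropy_bonus(1.0, [uniform, peaked], beta=0.1))
```

Because the entropy is subtracted, a more uniform policy yields a lower loss, which is what pushes the optimizer toward exploration. Inside the PyTorch `update()` the same quantity would have to be computed on the tensor of action probabilities, for example as `-(p * torch.log(p + 1e-10)).sum(dim=1)`, so that it stays differentiable.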