# Proximal Policy Optimization

[TOC]

Categories :
* `Model-Free`
* `Policy Based`
* `on-Policy`
* `Discrete Action Space`
* `Discrete/Continuous State Space`

## Introduction

As covered earlier, [A2C](https://hackmd.io/@bGCXESmGSgeAArScMaBxLA/S1kDKWNSlg) is the **Actor-Critic** version of **Policy Gradient**, but **A2C** still does not fix the low data efficiency: every batch can only be **Trained** on once and is then thrown away. **PPO (Proximal Policy Optimization)** aims to use the same data several times. **PPO** uses **Importance Sampling** so that, even after optimization has changed the probabilities the **Actor** **Outputs**, the **Expected Reward** of this **Neural Network** can still be estimated correctly.

> **PPO** can also be **Trained** without a **Critic (Value Function)**, but the results are worse.

The reference material is Prof. Hung-yi Lee's course : https://www.youtube.com/watch?v=OAKAZhFmYoI&list=PLJV_el3uVTsODxQFgzMzPLa16h6B8kWM_&index=3

## Expected Value Review

How to compute the expected value of a die :

![image](https://hackmd.io/_uploads/BkgqbeMuel.png)

Suppose we have two **Probability Distributions** $p(x)$ and $q(x)$, a random variable $x$ and a **Function** $f(x)$. The expected value of $f(x)$ under the distribution (probability density) $p(x)$ can be written as :

$\mathbb{E}_{x\sim p(x)}[f(x)]=\int_{-\infty}^\infty p(x)f(x) dx$

This enumerates every $x$, takes its value $f(x)$, multiplies it by its probability and adds everything up (when $x$ is discrete, like a die, the integral becomes a sum).

It can also be written in **sample** form, where each $x_i$ is drawn from $p(x)$ :

$\mathbb{E}_{x\sim p(x)}[f(x)]\approx \frac{1}{N}\sum^N_{i=1} f(x_i)$

## Importance Sampling

The expected value is computed as :

$\mathbb{E}_{x\sim p(x)}[f(x)]=\int_{-\infty}^\infty p(x)f(x) dx\approx \frac{1}{N}\sum^N_{i=1} f(x_i)$

The notation $x\sim p(x)$ means that $x$ is **Sampled** from the **Distribution** $p(x)$. In practice we **Sample** data and estimate the expected value from it. Now suppose that for any **Sampled** $x_i$ we know its probability under both **Distributions**, $p(x)$ and $q(x)$, but we are not allowed to **Sample** $x$ from $p(x)$ and can only **Sample** from $q(x)$. This is when **Importance Sampling** is needed.

Derivation :

$\mathbb{E}_{x\sim p(x)}[f(x)]=\int_{-\infty}^\infty p(x)f(x) dx$

Multiply the integrand by $\frac{q(x)}{q(x)}$ :

$\mathbb{E}_{x\sim p(x)}[f(x)]=\int_{-\infty}^\infty q(x)[\frac{p(x)}{q(x)}]f(x) dx$

which can then be written as :

$\mathbb{E}_{x\sim q(x)}[\frac{p(x)}{q(x)}f(x)]=\int_{-\infty}^\infty q(x)[\frac{p(x)}{q(x)}]f(x) dx\approx \frac{1}{N}\sum^N_{i=1} \frac{p(x_i)}{q(x_i)}f(x_i)$

So we only need to **Sample** $x_i$ from $q(x)$ and know the ratio of the two probabilities $\frac{p(x)}{q(x)}$ to estimate the expected value under $p(x)$, approximated by a sample mean. Written out in full :

$\mathbb{E}_{x\sim p(x)}[f(x)]=\mathbb{E}_{x\sim q(x)}[\frac{p(x)}{q(x)}f(x)]=\int_{-\infty}^\infty q(x)[\frac{p(x)}{q(x)}]f(x) dx\approx \frac{1}{N}\sum^N_{i=1} \frac{p(x_i)}{q(x_i)}f(x_i)$

**Example** :

I use **Normal Distributions** as an example
* $p(x)$ : $\mu=3.5$, $\sigma=0.9$
* $q(x)$ : $\mu=1$, $\sigma=1$

The **PDF (Probability Density Function)** of the two **Distributions** looks like this :

![image](https://hackmd.io/_uploads/r121GNG_xx.png)

The **Function** used for the example is $f(x)=\frac{1}{1+e^{-x}}$

![image](https://hackmd.io/_uploads/HyVez4zuel.png)

In the plot below, the red line is the expected value estimated from $x_i$ sampled from $p(x)$, and the blue line is the one computed with **Importance** **Sampling**. The x axis is the number of **Samples**; the more data, the more accurate the estimate.

![image](https://hackmd.io/_uploads/Hk2gG4G_gg.png)

## Policy Optimization

Recall how **Policy Gradient** computes the gradient of the **Expected Reward** :

$\large E_{(s_t,a_t)\sim p_{\theta}}[A^\theta(s_t,a_t)\nabla\mathrm{log}p_\theta(a_t|s_t)]$

Earlier we assumed that $p(x)$ cannot be used to **Sample** data, that only $q(x)$ can, and that Importance Sampling then estimates the expected value under $p(x)$. Here $p(x)$ is the **Actor** that keeps being optimized, and $q(x)$ is the **Actor** that originally interacted with the environment. Once the parameters of $p(x)$ have been updated, interacting with the environment again would certainly not produce the same **Trajectory**. **Importance Sampling** is used so that the Actor can be updated several times without **Sampling** new data. From here on, the parameters (weights) that keep being updated are written $\theta$, and the parameters of the **Actor** that interacted with the environment are written $\theta_{old}$.

With **Importance Sampling** added :

$\large \nabla\hat{R_\theta}=E_{(s_t,a_t)\sim p_{\theta_{old}}}[\frac{p_\theta(s_t,a_t)}{p_{\theta_{old}}(s_t,a_t)}A^{\theta_{old}}(s_t,a_t)\nabla\mathrm{log}p_{\theta}(a_t|s_t)]$

$E_{(s_t,a_t)\sim p_{\theta_{old}}}$ means that $s_t,a_t$ are **Sampled** from $p_{\theta_{old}}$, so the **Advantage Function** also carries the superscript $\theta_{old}$, because it is computed from data collected with the parameters that originally interacted with the environment. $p_\theta(s_t,a_t)$ is the probability that the **Neural Network Samples** $s_t$ and $a_t$, but the probability of $s_t$ is not something we control, so we split it into $p_\theta(a_t|s_t)p_\theta(s_t)$ :

$\large \nabla\hat{R_\theta}=E_{(s_t,a_t)\sim p_{\theta_{old}}}[\frac{p_\theta(a_t|s_t)p_\theta(s_t)}{p_{\theta_{old}}(a_t|s_t)p_{\theta_{old}}(s_t)}A^{\theta_{old}}(s_t,a_t)\nabla\mathrm{log}p_{\theta}(a_t|s_t)]$

$p_\theta(s_t)$ is the probability of encountering $s_t$ under the parameters $\theta$. The Actor essentially cannot control it and it is hard to compute, so we assume that $p_\theta(s_t)$ and $p_{\theta_{old}}(s_t)$ are roughly equal and drop their ratio.

$\large \nabla\hat{R_\theta}=E_{(s_t,a_t)\sim p_{\theta_{old}}}[\frac{p_\theta(a_t|s_t)}{p_{\theta_{old}}(a_t|s_t)}A^{\theta_{old}}(s_t,a_t)\nabla\mathrm{log}p_{\theta}(a_t|s_t)]$

Using $\nabla f(x)=f(x)\nabla \mathrm{log}f(x)$, the product $p_\theta(a_t|s_t)\nabla\mathrm{log}p_\theta(a_t|s_t)$ becomes $\nabla p_\theta(a_t|s_t)$ :

$\large \nabla\hat{R_\theta}=E_{(s_t,a_t)\sim p_{\theta_{old}}}[\frac{\nabla p_\theta(a_t|s_t)}{p_{\theta_{old}}(a_t|s_t)}A^{\theta_{old}}(s_t,a_t)]$

This is the gradient of the following objective function :

$\large \hat{R_\theta}=E_{(s_t,a_t)\sim p_{\theta_{old}}}[\frac{p_\theta(a_t|s_t)}{p_{\theta_{old}}(a_t|s_t)}A^{\theta_{old}}(s_t,a_t)]$

The $\nabla$ in front of $p_\theta(a_t|s_t)$ can be left out at this point: we know the **Gradient** is taken at the end, and **Pytorch** handles that for us.

## Clipped Surrogate Objective

The **PPO** paper notes that updating the **Neural Network** several times exactly as described above performs very poorly. It proposes two ways to solve this, the **Clip** method and a **KL** penalty coefficient. This note implements **Clip**, which is also the simpler one. The paper compares the result without **clipping or penalty** against the other methods :

![image](https://hackmd.io/_uploads/SJWoA7IOxx.png =60%x)

We write the ratio between the two sets of parameters as :

$\large r_t(\theta)=\frac{p_\theta(a_t|s_t)}{p_{\theta_{old}}(a_t|s_t)}$

Written as a **Loss Function**, it needs a minus sign :

$\large Loss=-\sum^T_{t=1} r_t(\theta)A^{\theta_{old}}(s_t,a_t)$

To keep $r_t(\theta)$ from becoming too large or too small and destabilizing the update, **PPO2** uses **Clipping** :

$\large Loss=-\sum^T_{t=1} \mathrm{min}[r_t(\theta)A^{\theta_{old}}(s_t,a_t) , \mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)A^{\theta_{old}}(s_t,a_t)]$

It looks long, but it simply takes the smaller of the term with an un-**Clipped** $r_t(\theta)$ and the term with a **Clipped** one. $\epsilon$ is a hyperparameter; the paper reports that $0.2$ worked best among the values it tested.

1. $\large r_t(\theta)A^{\theta_{old}}(s_t,a_t)$
2. $\large \mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)A^{\theta_{old}}(s_t,a_t)$

Original figure from the paper :

![image](https://hackmd.io/_uploads/HyaXpQLOex.png)

Before any update has happened, $r_t(\theta) = 1$ because both sets of parameters are identical. After updates, $r_t(\theta)$ grows or shrinks. The cases below explain how $r_t(\theta)A$ changes :

1. If the **Advantage** is positive : the probability $p_\theta(a_t|s_t)$ is increased. Over several updates the value moves up and to the right. Once $r_t(\theta)$ exceeds $1+\epsilon$ it is **Clipped**, so the probability is not pushed any further toward an extreme value.
2. If the **Advantage** is negative : the probability $p_\theta(a_t|s_t)$ is decreased. Over several updates the value moves up and to the left. Once $r_t(\theta)$ falls below $1-\epsilon$ it is **Clipped**.

Prof. Hung-yi Lee also drew the **Clipped** and un-**Clipped** lines, which makes this easier to understand :

![image](https://hackmd.io/_uploads/H1IWmEIdel.png)
![image](https://hackmd.io/_uploads/SynZmEI_lg.png)

He drew the un-**Clipped** value (green line), but because we take the smaller of the two, the result is the red line shown above.

## Update Model

Here are all the formulas involved in computing the **Loss Function**.

**Update Actor** :

$\large r_t(\theta)=\frac{p_\theta(a_t|s_t)}{p_{\theta_{old}}(a_t|s_t)}$

$\large Loss=-\sum^T_{t=1} \mathrm{min}[r_t(\theta)A^{\theta_{old}}(s_t,a_t) , \mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)A^{\theta_{old}}(s_t,a_t)]$

$A^{\theta_{old}}(s_t,a_t)=r_t+\gamma V^\pi(s_{t+1})-V^\pi(s_t)$

Note that $r_t$ without an argument is the reward at step $t$, while $r_t(\theta)$ is the probability ratio.

**Update Critic** :

$Loss = MSE(V^\pi(s_t),r_t+\gamma V^\pi(s_{t+1}))$

The **Advantage Function** described in the **PPO** paper uses **GAE (Generalized Advantage Estimation)**, but for now we use the **Actor-Critic baseline** from **A2C** because it is simpler. **GAE** will be introduced later.

## CartPole Example

```python
---------------
evaluate_freq_steps = 2000.0
mini_batch_size = 32
max_train_steps = 200000.0
batch_size = 512
epochs = 5
epsilon = 0.2
gamma = 0.99
lr = 0.0001
num_states = 4
num_actions = 2
---------------
[4, 128, 128, 1]
Actor(
  (layers): ModuleList(
    (0): Linear(in_features=4, out_features=128, bias=True)
    (1): Linear(in_features=128, out_features=128, bias=True)
    (2): Linear(in_features=128, out_features=2, bias=True)
  )
  (relu): ReLU()
  (softmax): Softmax(dim=1)
)
Critic(
  (layers): ModuleList(
    (0): Linear(in_features=4, out_features=128, bias=True)
    (1): Linear(in_features=128, out_features=128, bias=True)
    (2): Linear(in_features=128, out_features=1, bias=True)
  )
  (tanh): Tanh()
)
---------
```

![image](https://hackmd.io/_uploads/HkU2_pOOll.png)

![1754226307544](https://hackmd.io/_uploads/S1-hAAnvxx.gif)

## Acrobot Example

```python
---------------
evaluate_freq_steps = 2000.0
mini_batch_size = 16
max_train_steps = 200000.0
batch_size = 64
epochs = 10
epsilon = 0.2
gamma = 0.99
lr = 0.0001
num_states = 6
num_actions = 3
---------------
[6, 128, 128, 1]
Actor(
  (layers): ModuleList(
    (0): Linear(in_features=6, out_features=128, bias=True)
    (1): Linear(in_features=128, out_features=128, bias=True)
    (2): Linear(in_features=128, out_features=3, bias=True)
  )
  (relu): ReLU()
  (softmax): Softmax(dim=1)
)
Critic(
  (layers): ModuleList(
    (0): Linear(in_features=6, out_features=128, bias=True)
    (1): Linear(in_features=128, out_features=128, bias=True)
    (2): Linear(in_features=128, out_features=1, bias=True)
  )
  (tanh): Tanh()
)
---------
```

![image](https://hackmd.io/_uploads/r1zmxTO_eg.png)

![1755004989561](https://hackmd.io/_uploads/S1Q8gpOuxx.gif)

## Code

**Github Code** : https://github.com/jason19990305/PPO.git

### Actor-Critic

#### Actor

```python
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, args, hidden_layers=(64, 64)):
        super(Actor, self).__init__()
        self.num_states = args.num_states
        self.num_actions = args.num_actions
        # Prepend the input size and append the output size.
        # A new list is built so the caller's list (or the default) is never mutated.
        layer_sizes = [self.num_states, *hidden_layers, self.num_actions]
        # Create fully connected layers
        fc_list = []
        for i in range(len(layer_sizes) - 1):
            num_input = layer_sizes[i]
            num_output = layer_sizes[i + 1]
            layer = nn.Linear(num_input, num_output)
            fc_list.append(layer)
        # Convert list to ModuleList for proper registration
        self.layers = nn.ModuleList(fc_list)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)  # Softmax for action probabilities

    def forward(self, x):
        # Pass input through all layers except the last, applying ReLU activation
        for i in range(len(self.layers) - 1):
            x = self.relu(self.layers[i](x))
        # The final layer outputs action probabilities
        action_probability = self.softmax(self.layers[-1](x))
        return action_probability
```

* `args` is passed in from the main program's **Hyperparameters**. `num_states` is the dimension of the **state** required by the **Environment**, and `num_actions` is the number of **action** types, that is, how many actions the **Agent** can choose from.
* `hidden_layers` is a **list** (or tuple) that defines the dimensions of the **Hidden layers**, which can also be read as each layer's **Output shape**. The **Input** and **Output** **layers** are determined by `num_states` and `num_actions`.

#### Critic

```python
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, args, hidden_layers=(64, 64)):
        super(Critic, self).__init__()
        self.num_states = args.num_states
        self.num_actions = args.num_actions
        # Prepend the input size and append a single output unit (the state value)
        layer_sizes = [self.num_states, *hidden_layers, 1]
        print(layer_sizes)
        # create layers
        layer_list = []
        for i in range(len(layer_sizes) - 1):
            input_num = layer_sizes[i]
            output_num = layer_sizes[i + 1]
            layer = nn.Linear(input_num, output_num)
            layer_list.append(layer)
        # put in ModuleList
        self.layers = nn.ModuleList(layer_list)
        self.tanh = nn.Tanh()

    def forward(self, x):
        for i in range(len(self.layers) - 1):
            x = self.tanh(self.layers[i](x))
        # predict value (no activation on the last layer)
        v_s = self.layers[-1](x)
        return v_s
```

The **Critic** in **PPO** is a **state value function**, so its only input is the **state** and its output is a single **value**, which is why the last layer always has dimension 1. The **Critic** predicts the **Expected Reward** from the **state**, so this is a **Regression** problem and the last layer has no **Activation Function**.

### Replay Buffer

```python
import numpy as np
import torch

class ReplayBuffer:
    def __init__(self, args):
        self.clear_batch()

    def clear_batch(self):
        # Reset all stored transitions
        self.s = []
        self.a = []
        self.r = []
        self.s_ = []
        self.done = []
        self.count = 0

    def store(self, s, a, r, s_, done):
        self.s.append(s)
        self.a.append(a)
        self.r.append(r)
        self.s_.append(s_)
        self.done.append(done)
        self.count += 1

    def numpy_to_tensor(self):
        s = torch.tensor(np.array(self.s), dtype=torch.float)
        a = torch.tensor(np.array(self.a), dtype=torch.int64)
        r = torch.tensor(np.array(self.r), dtype=torch.float)
        s_ = torch.tensor(np.array(self.s_), dtype=torch.float)
        done = torch.tensor(np.array(self.done), dtype=torch.float)
        # The batch is handed over to update(), so the buffer can be emptied
        self.clear_batch()
        return s, a, r, s_, done
```

* `store()` : stores state, action, reward, next_state and done
* `numpy_to_tensor()` : converts the stored data into tensors for training the neural networks. Because the algorithm is on-policy, a batch cannot be reused once `update()` has finished with it, so `clear_batch()` is called to empty the buffer.

### Initialize

```python
# Method of the agent class; needs `import copy` and `import torch` at module level
def __init__(self, args, env, hidden_layer_list=[64, 64]):
    # Hyperparameter
    self.max_train_steps = args.max_train_steps
    self.evaluate_freq_steps = args.evaluate_freq_steps
    self.mini_batch_size = args.mini_batch_size
    self.num_actions = args.num_actions
    self.num_states = args.num_states
    self.batch_size = args.batch_size
    self.epsilon = args.epsilon
    self.epochs = args.epochs
    self.gamma = args.gamma
    self.lr = args.lr

    # Variable
    self.episode_count = 0
    self.total_steps = 0
    self.evaluate_count = 0

    # other
    self.env = env
    self.env_eval = copy.deepcopy(env)
    self.replay_buffer = ReplayBuffer(args)

    # The model interacts with the environment and gets updated continuously
    self.actor = Actor(args, hidden_layer_list.copy())
    self.critic = Critic(args, hidden_layer_list.copy())
    print(self.actor)
    print(self.critic)
    self.optimizer_actor = torch.optim.Adam(self.actor.parameters(), lr=self.lr, eps=1e-5)
    self.optimizer_critic = torch.optim.Adam(self.critic.parameters(), lr=self.lr, eps=1e-5)
```

* `max_train_steps` : the maximum number of **Steps** the **Training loop** **Samples**
* `evaluate_freq_steps` : how often (in steps) the Agent is evaluated
* `batch_size` : how many transitions are collected before a parameter update runs
* `mini_batch_size` : how many transitions are drawn at random from the batch to compute the Loss
* `gamma` : the **Discount Factor** $\gamma$ used when computing the **TD-Error**
* `epsilon` : the **clip** parameter $\epsilon$, which limits $r_t(\theta)$ to $[1-\epsilon, 1+\epsilon]$
* `epochs` : how many passes PPO makes over each batch when updating the parameters
* `lr` : **Learning rate**
---
* `actor` : the **Neural Network** that interacts with the environment
* `critic` : the **Neural Network** that predicts the **Expected Reward** and is used to compute the **Advantage**
* `optimizer_actor` : Adam optimizer for the Actor's parameters
* `optimizer_critic` : Adam optimizer for the Critic's parameters

### Choose action

```python
def choose_action(self, state):
    state = torch.tensor(state, dtype=torch.float)
    with torch.no_grad():
        s = torch.unsqueeze(state, dim=0)  # add a batch dimension
        action_probability = self.actor(s).numpy().flatten()
        # sample an action according to the predicted probabilities
        action = np.random.choice(self.num_actions, p=action_probability)
    return action
```

The Actor outputs a **Probability** for every action, and `np.random.choice` then picks an action according to those probabilities.

### Evaluate action

```python
def evaluate_action(self, state):
    state = torch.tensor(state, dtype=torch.float)
    with torch.no_grad():
        s = torch.unsqueeze(state, dim=0)  # add a batch dimension
        action_probability = self.actor(s)
        # greedy choice: take the most probable action
        action = torch.argmax(action_probability).item()
    return action
```

Similar to `choose_action`, but this time the action with the highest probability is executed.

### Train

```python
# Needs `import time` and `import matplotlib.pyplot as plt` at module level
def train(self):
    time_start = time.time()
    epoch_reward_list = []
    epoch_count_list = []
    epoch_count = 0
    # Training loop
    while self.total_steps < self.max_train_steps:
        # reset environment
        s, info = self.env.reset()
        while True:
            a = self.choose_action(s)
            # interact with environment
            s_, r, done, truncated, _ = self.env.step(a)
            done = done or truncated
            # store transition in replay buffer
            self.replay_buffer.store(s, a, [r], s_, [done])
            # update state
            s = s_

            if self.replay_buffer.count >= self.batch_size:
                self.update()
                epoch_count += 1

            if self.total_steps % self.evaluate_freq_steps == 0:
                self.evaluate_count += 1
                evaluate_reward = self.evaluate(self.env_eval)
                epoch_reward_list.append(evaluate_reward)
                epoch_count_list.append(epoch_count)
                time_end = time.time()
                h = int((time_end - time_start) // 3600)
                m = int(((time_end - time_start) % 3600) // 60)
                second = int((time_end - time_start) % 60)
                print("---------")
                print("Time : %02d:%02d:%02d"%(h,m,second))
                print("Training epoch : %d\tStep : %d / %d"%(epoch_count,self.total_steps,self.max_train_steps))
                print("Evaluate count : %d\tEvaluate reward : %0.2f"%(self.evaluate_count,evaluate_reward))

            self.total_steps += 1
            if done or truncated :
                break
        epoch_count += 1

    # Plot the training curve
    plt.plot(epoch_count_list, epoch_reward_list)
    plt.xlabel("Epoch")
    plt.ylabel("Reward")
    plt.title("Training Curve")
    plt.show()
```

The outermost loop iterates over **Episodes** and the inner loop over **Steps**. The outer loop keeps running while `total_steps < max_train_steps`. The inner loop keeps interacting with the environment and stores the data in the **ReplayBuffer**. As soon as the data reaches `batch_size`, a parameter update runs, and the **Actor-Critic** is optimized inside `self.update()`.

### Update

```python
# Needs `import numpy as np`, `import torch` and `import torch.nn.functional as F`
def update(self):
    s, a, r, s_, done = self.replay_buffer.numpy_to_tensor()
    a = a.view(-1, 1)  # Reshape action from (N) -> (N, 1) for gathering

    # get target value and advantage
    with torch.no_grad():
        # current value
        value = self.critic(s)
        # next value
        next_value = self.critic(s_)
        # TD target (no bootstrapping past the end of an episode)
        target_value = r + self.gamma * next_value * (1 - done)
        # baseline advantage
        adv = target_value - value
        # Calculate old log probability: log p_{theta_old}(a|s)
        old_prob = self.actor(s).gather(dim=1, index=a)  # Get action probability from the model
        old_log_prob = torch.log(old_prob + 1e-10)  # Add small value to avoid log(0)

    for i in range(self.epochs):
        for j in range(self.batch_size//self.mini_batch_size):
            index = np.random.choice(self.batch_size, self.mini_batch_size, replace=False)

            # Update Actor
            new_prob = self.actor(s[index]).gather(dim=1, index=a[index])  # Get action probability from the model
            log_prob = torch.log(new_prob + 1e-10)  # Add small value to avoid log(0)
            ratio = torch.exp(log_prob - old_log_prob[index])  # Calculate the ratio of new and old probabilities
            p1 = ratio * adv[index]
            p2 = torch.clamp(ratio, 1 - self.epsilon, 1 + self.epsilon) * adv[index]
            actor_loss = -torch.min(p1, p2)  # Calculate loss
            actor_loss = actor_loss.mean()  # Mean loss over the mini-batch
            self.optimizer_actor.zero_grad()
            actor_loss.backward()  # Backpropagation
            self.optimizer_actor.step()

            # Update Critic
            value = self.critic(s[index])
            critic_loss = F.mse_loss(value, target_value[index])  # Mean Squared Error loss
            self.optimizer_critic.zero_grad()
            critic_loss.backward()  # Backpropagation
            self.optimizer_critic.step()
```

1. Take the data out of the **ReplayBuffer**
2. Precompute $V^\pi(s_t)$ and $V^\pi(s_{t+1})$
3. Precompute the **Target value** (the TD target) : $r+\gamma V^\pi(s_{t+1})$, where the bootstrap term is zeroed by `(1 - done)` at the end of an episode
4. Precompute the **Advantage** : $r+\gamma V^\pi(s_{t+1})-V^\pi(s_t)$
5. Precompute $log(p_{\theta_{old}}(a_t|s_t))$
6. Randomly **Sample** a **mini_batch** from the **batch**
7. Compute $log(p_\theta(a_t|s_t))$
8. Compute $\large r_t(\theta)=\frac{p_\theta(a_t|s_t)}{p_{\theta_{old}}(a_t|s_t)}$
9. Take the smaller of `p1` and `p2`, then average it as the **loss** and optimize the **Actor**'s parameters. Remember to add the minus sign
10. Use **MSE** to compute the error between $V^\pi(s_t)$ and the **Target Value**, then update the **Critic**

> Note in particular that the algorithm here does not apply to a **Multiple Action Space**

## Conclusion

The **PPO** here is a heavily simplified version. The paper also uses **GAE** as the **Advantage Function**, it can be paired with a **Normal Distribution** for a **Continuous Action Space**, and it adds a **Probability Entropy** term to the **Loss**, which encourages exploration. A later note will apply all the techniques that improve the results.
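
As a supplement to the Importance Sampling example and the Clipped Surrogate Objective, the following is a minimal sketch using only the Python standard library. It reuses the distributions from the example ($p$ : $\mu=3.5$, $\sigma=0.9$; $q$ : $\mu=1$, $\sigma=1$) and $f(x)=\frac{1}{1+e^{-x}}$. The function names (`normal_pdf`, `direct_estimate`, `importance_sampling_estimate`, `clipped_surrogate`), the sample counts and the random seed are my own assumptions for illustration and are not taken from the repository.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def sigmoid(x):
    """f(x) = 1 / (1 + e^{-x}), the function used in the example."""
    return 1.0 / (1.0 + math.exp(-x))


def direct_estimate(n, rng):
    """Estimate E_{x~p}[f(x)] by sampling from p itself."""
    return sum(sigmoid(rng.gauss(3.5, 0.9)) for _ in range(n)) / n


def importance_sampling_estimate(n, rng):
    """Estimate E_{x~p}[f(x)] using only samples drawn from q."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(1.0, 1.0)  # x_i ~ q(x)
        weight = normal_pdf(x, 3.5, 0.9) / normal_pdf(x, 1.0, 1.0)  # p(x_i) / q(x_i)
        total += weight * sigmoid(x)
    return total / n


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


rng = random.Random(0)  # fixed seed (assumption) so runs are repeatable
direct = direct_estimate(100_000, rng)
weighted = importance_sampling_estimate(300_000, rng)
print(f"direct: {direct:.3f}  importance sampling: {weighted:.3f}")

# r above 1 + eps with A > 0: the clipped term is the smaller one
print(clipped_surrogate(1.5, 1.0))
# r below 1 - eps with A < 0: the clipped term is the smaller one
print(clipped_surrogate(0.5, -1.0))
# r above 1 + eps with A < 0: min keeps the unclipped, more pessimistic term
print(clipped_surrogate(1.5, -1.0))
```

Because $q$ puts little mass where $p$ is large, the importance weights vary a lot, so the weighted estimate needs many more samples than the direct one to settle. That gap between $q$ and $p$ is the same effect the clipping is meant to keep small between $p_{\theta_{old}}$ and $p_\theta$.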