tags: `reinforcement learning`

Deep Reinforcement Learning Ch5 : Actor-Critic Model

1. Introduction

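Actor-Critic combines ideas from policy gradient methods and from value-based methods such as Q-Learning.
It uses a [value function] to evaluate how good state s is (note that the value here is the state value $V_{\pi}(s)$),
and a [policy function] that outputs a probability distribution over the actions in state s, from which the action is chosen.
After the action is executed, the Return that was obtained and the value that was just predicted are used to compute the "advantage", which measures how good the action was.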
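To make "outputs a probability distribution and uses it to decide the action" concrete, here is a minimal sketch in plain Python. The three scores in `logits` are made-up numbers standing in for the output of a policy function; `softmax` and `sample_action` are illustrative helper names, not part of any library.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng=random):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical scores for three actions in some state s (assumed values).
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)            # the highest score gets the highest probability
action = sample_action(probs)      # stochastic: any of 0, 1, 2 can come out
log_prob = math.log(probs[action]) # log-probability of the chosen action, used later in the loss
```

Sampling, rather than always taking the most probable action, is what lets the agent keep exploring while the policy is still being learned.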

Actor-Critic can be viewed as two neural networks, which play the following roles:


- Actor : the [policy function], which decides the next action


- Critic : the [value function], which evaluates the value of a state

The actor and the critic learn by evaluating each other, which yields a better model.
Conceptually this resembles a GAN!
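The two roles above can be sketched as two small function approximators. This is only an illustration under simplifying assumptions: both are linear models rather than neural networks, and the class names, the 2-feature state, and the 3 actions are all made up for the example.

```python
import math

class Actor:
    """Policy function: maps a state to action probabilities (linear scores + softmax)."""
    def __init__(self, n_features, n_actions):
        self.w = [[0.0] * n_features for _ in range(n_actions)]

    def __call__(self, state):
        logits = [sum(wi * si for wi, si in zip(row, state)) for row in self.w]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        return [e / z for e in exps]

class Critic:
    """Value function: maps a state to a scalar estimate of V(s) (linear)."""
    def __init__(self, n_features):
        self.w = [0.0] * n_features

    def __call__(self, state):
        return sum(wi * si for wi, si in zip(self.w, state))

state = [1.0, 0.5]                        # hypothetical 2-feature state
actor = Actor(n_features=2, n_actions=3)
critic = Critic(n_features=2)
action_probs = actor(state)   # uniform at initialisation, because all weights are zero
state_value = critic(state)   # 0.0 at initialisation
```

The point of the split is that the two outputs have different shapes and meanings: the actor returns one probability per action, while the critic returns a single number for the state.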

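2. Advantage

The advantage measures how good it is to take "a certain action" in "a certain state" (it is called "advantage" so that it is not confused with value).
Consider the following example:
when we are in [state t], the state data is used for two things:
it is sent to both neural networks (actor and critic), where the critic estimates the state value and the actor outputs the action distribution
(see the figure below)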

[Figure: state t is sent to both the actor and the critic]

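Once we have the [state value] and the [Return of the action], we can compute the advantage:

$$
\text{Advantage} = R - V_{\pi}(s)
$$

Intuitively, this is the value actually obtained from this state minus the value that was predicted for it.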
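As a small worked example of "actual minus predicted", the sketch below computes discounted Returns for a three-step episode and subtracts a critic's predictions. The rewards, the constant prediction of 0.5, and the helper names are assumptions chosen only to keep the arithmetic easy to follow.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every step t, using G_t = r_t + gamma * G_{t+1}."""
    returns = []
    g = 0.0
    for r in reversed(rewards):   # accumulate from the last step backwards
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def advantages(returns, values):
    """Advantage per step: actual Return minus the predicted state value."""
    return [g - v for g, v in zip(returns, values)]

rewards = [0.0, 0.0, 1.0]    # hypothetical episode: reward only at the last step
values = [0.5, 0.5, 0.5]     # hypothetical critic predictions for the three states
gamma = 0.9

returns = discounted_returns(rewards, gamma)   # approximately [0.81, 0.9, 1.0]
adv = advantages(returns, values)              # approximately [0.31, 0.4, 0.5]
```

All three advantages are positive here, meaning each step turned out better than the critic predicted, so the actions taken would be made more likely.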

3. Algorithm steps

Conceptual code

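gamma = 0.9    # discount factor
for i in range(epochs):
    state = environment.get_state()
    # Critic: predict the value of the current state
    state_value = critic(state)

    # Actor: output the action distribution for the current state
    policy = actor(state)
    action = policy.sample()    # choose an action
    next_state, reward = environment.take_action(action)

    # Compute the advantage, using reward + gamma * V(next_state) as the Return estimate
    value_next = critic(next_state)
    advantage = (reward + gamma * value_next) - state_value

    # Compute and minimize the loss
    loss = -1 * policy.logprob(action) * advantage
    minimize(loss)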
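The conceptual loop above can be turned into something runnable on a toy problem. The sketch below makes several assumptions that are not in the original: the environment is a made-up 5-state corridor with a reward of 1.0 at the right end, the actor and critic are lookup tables instead of neural networks, the critic is also updated toward the one-step target, and the gradient of the loss is written out by hand for a softmax policy instead of calling an optimizer.

```python
import math
import random

N_STATES = 5     # states 0..4; state 4 is the terminal goal (hypothetical toy task)
GAMMA = 0.9

def step(state, action):
    """Move left (0) or right (1); reward 1.0 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def train(episodes=500, lr_actor=0.1, lr_critic=0.1, max_steps=200, seed=0):
    rng = random.Random(seed)
    prefs = [[0.0, 0.0] for _ in range(N_STATES)]   # actor: action preferences per state
    values = [0.0] * N_STATES                        # critic: V(s) estimates
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            probs = softmax(prefs[state])
            action = rng.choices([0, 1], weights=probs, k=1)[0]
            next_state, reward, done = step(state, action)
            # One-step advantage estimate; the value of the terminal state is taken as 0.
            target = reward + (0.0 if done else GAMMA * values[next_state])
            advantage = target - values[state]
            # Critic update: move V(s) toward the one-step target.
            values[state] += lr_critic * advantage
            # Actor update: gradient descent on -log pi(a|s) * advantage,
            # treating the advantage as a constant. For a softmax policy,
            # d log pi(a|s) / d pref[b] = 1[b == a] - pi(b|s).
            for b in range(2):
                grad_log = (1.0 if b == action else 0.0) - probs[b]
                prefs[state][b] += lr_actor * advantage * grad_log
            if done:
                break
            state = next_state
    return prefs, values

prefs, values = train()
p_right = [softmax(p)[1] for p in prefs[:-1]]   # P(move right) in each non-terminal state
```

After training, `p_right` should favor moving right in every non-terminal state, and the critic's values should grow toward the goal. Treating the advantage as a constant in the actor update is a common choice, so that the policy loss does not also push gradients through the critic.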

!!! Still studying the concepts (writing paused)
