tags: `reinforcement learning`

Deep Reinforcement Learning Ch3.2: Implementing Q-Learning

1. Game introduction

We use the GridWorld game to test the Q-learning implementation.
The GridWorld scripts can be downloaded from the author's GitHub.




import wget
# Download Gridworld.py and GridBoard.py
wget.download("https://github.com/DeepReinforcementLearning/DeepReinforcementLearningInAction/raw/master/Errata/Gridworld.py") 
wget.download("https://github.com/DeepReinforcementLearning/DeepReinforcementLearningInAction/raw/master/Errata/GridBoard.py")

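(1). Game rules

The game is played on the 'board' below. The Player moves one square per turn; reaching the goal (+) wins, and stepping into the trap (-) loses.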

[['+',' ',' ','P'],    # + : goal
 [' ','W',' ',' '],    # - : trap
 [' ',' ',' ',' '],    # P : player
 [' ',' ','-',' ']]    # W : wall

# Move commands: 'u': up, 'd': down, 'l': left, 'r': right

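Game modes:

'static': fixed board, using the preset layout
'player': preset layout, but the Player starts at a random position
'random': every piece is placed at random

Reward rules:

Game not over: reward = -1
Loss (stepped into the trap): reward = -10
Win (reached the goal): reward = 10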

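(2). Game state

The game state is stored as a rank-3 array (tensor) made of four 4 × 4 layers.
The four layers hold the [player position], [goal position], [trap position], and [wall position],
giving an array of shape (4, 4, 4).

e.g. the player-position layer (in the actual array, 'P' is 1 and every blank cell is 0):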
[[' ',' ',' ','P'],    
 [' ',' ',' ',' '],   
 [' ',' ',' ',' '],   
 [' ',' ',' ',' ']]
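
To make the layout concrete, here is a minimal sketch that builds such a state by hand with NumPy. The `encode_state` helper is hypothetical and not part of `Gridworld.py`. The coordinates are read off the example board in the rules section, and the layer order used here is an assumption of the sketch.

```python
import numpy as np

def encode_state(positions, size=4):
    """One-hot encode each piece on its own size x size layer."""
    state = np.zeros((len(positions), size, size), dtype=np.uint8)
    for layer, (row, col) in enumerate(positions):
        state[layer, row, col] = 1
    return state

# (row, col) of each piece, one piece per layer: P, +, -, W
positions = [(0, 3), (0, 0), (3, 2), (1, 1)]
state = encode_state(positions)
print(state.shape)   # (4, 4, 4)
print(state[0])      # player layer: a single 1 at row 0, column 3

# Flatten to a (1, 64) row vector, the shape fed to the network later on
flat = state.reshape(1, 64)
print(flat.shape)    # (1, 64)
```

Each layer contains exactly one 1, so the flattened vector always has four non-zero entries.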

(3). Game commands

The game can be controlled with the following commands:

from Gridworld import Gridworld

# Create a game
game = Gridworld(size=4, mode='static')    # size: board size (4 x 4)
# Show the board
game.display()
# Move the Player
game.makeMove('u')
# Get the reward
game.reward()
# Get the game state
game.board.render_np()

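2. Training Q-learning

(1). The idea

The Q-learning update rule can be read as a set of training elements:

$Q_{\pi}(s_{t}, a_{t}) \leftarrow Q_{\pi}(s_{t}, a_{t}) + \alpha \left( r_{t+1} + \gamma \max_{a} Q_{\pi}(s_{t+1}, a) - Q_{\pi}(s_{t}, a_{t}) \right)$

$Q_{\pi}$: the neural network model
The whole formula: the training procedure

In other words, we build a neural network model of the Q function and use it to predict the value of each action in the current state,
then update the weights of the Q model with the update rule above.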
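
As a worked example, the sketch below applies one step of the standard Q-learning update to plain numbers. The function name `q_update`, the value of α, and all Q-values are made up for illustration; only γ = 0.9 matches the setting used later in this post.

```python
def q_update(q_sa, reward, next_q_values, alpha=0.1, gamma=0.9):
    """Return the updated Q(s_t, a_t) after one Q-learning step."""
    td_target = reward + gamma * max(next_q_values)   # r + gamma * max_a Q(s', a)
    td_error = td_target - q_sa                       # how far off the estimate was
    return q_sa + alpha * td_error                    # move part of the way to the target

# Hypothetical numbers: current estimate 2.0, step reward -1,
# and the Q-values of the next state for the four actions.
new_q = q_update(q_sa=2.0, reward=-1, next_q_values=[1.0, 3.0, 0.5, 2.0])
print(round(new_q, 2))   # 1.97
```

Here the TD target is -1 + 0.9 × 3.0 = 1.7, the TD error is 1.7 - 2.0 = -0.3, and the estimate moves 10% of the way toward the target.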

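(2). Network architecture (the Q function)

The Q function outputs the expected value of every action in a given state, so the input and output are:

input: the current game state, a (4, 4, 4) array flattened to 64 values
output: the value of each action; this game has 4 actions, so the network outputs 4 values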

[Figure: neural network architecture]

Other training elements

Loss function: MSE (mean squared error)
Optimizer: Adam
Learning rate: 0.001
Discount factor $\gamma$: 0.9
Greedy coefficient $\epsilon$: 1.0 (starts at 1, i.e. fully random exploration)
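
The greedy coefficient controls how often the agent explores instead of taking the best-looking action. A minimal sketch of that selection rule using only the standard library; the helper name `epsilon_greedy` and the sample Q-values are made up for illustration:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

q_values = [0.2, -1.0, 3.5, 0.7]   # hypothetical Q-values for the 4 actions
print(epsilon_greedy(q_values, epsilon=0.0))   # 2: always greedy when epsilon is 0
print(epsilon_greedy(q_values, epsilon=1.0))   # a random index in 0..3
```

With ϵ = 1.0 every move is random, which is why training starts there and lowers ϵ as the Q estimates improve.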

Network code

import torch
from torch import nn

# Network layer sizes
layer1_size = 4*4*4     # input: flattened game state
layer2_size = 150
layer3_size = 100
layer4_size = 4         # output: one Q-value per action

model = nn.Sequential(
    nn.Linear(in_features=layer1_size, out_features=layer2_size),
    nn.ReLU(),
    nn.Linear(layer2_size, layer3_size),
    nn.ReLU(),
    nn.Linear(layer3_size, layer4_size)
)

loss_fn = nn.MSELoss()
lr = 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

gamma = 0.9     # discount factor
epsilon = 1.0   # epsilon-greedy coefficient

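Mapping indices to action commands

The network output is the expected value of each action. Once an action is chosen,
its index has to be converted into an [action command], so we use a dictionary for the lookup.

# Index-to-action lookup (dictionary)
action_set = {
    0:'u',
    1:'d',
    2:'l',
    3:'r'
}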

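(3). Training procedure

The training procedure breaks down as follows. The first 4 steps (the yellow boxes) compute the TD target;
after that, everything follows the Q-learning update step.
Note that this run trains in 'static' mode, so the positions of all pieces are fixed.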
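
The TD target and the loss for a single step can be sketched in plain Python as follows. The function names are hypothetical. The rule "bootstrap only while the reward is -1" mirrors this game's reward scheme, where any other reward ends the game.

```python
def td_target(reward, next_q_values, gamma=0.9):
    """TD target for this game: bootstrap only while the game is still running."""
    game_running = (reward == -1)      # -1 is the per-step reward; +10 / -10 end the game
    if game_running:
        return reward + gamma * max(next_q_values)
    return float(reward)               # terminal step: there is no next state

def squared_td_error(q_sa, target):
    """MSE loss for a single sample."""
    return (q_sa - target) ** 2

next_q = [0.0, 2.0, 1.0, -3.0]         # hypothetical Q-values of the next state
print(td_target(-1, next_q))           # 0.8
print(td_target(10, next_q))           # 10.0
print(squared_td_error(1.0, td_target(10, next_q)))   # 81.0
```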

[Figure: training flowchart]

Training code

from IPython.display import clear_output    # only used for printing progress (optional)
import random
import numpy as np

epochs = 1000
losses = []     # loss history (for plotting)


for i in range(epochs):
    # Create a new game
    game = Gridworld(size=4, mode='static')
    # Get the game state: flatten shape (4, 4, 4) to (1, 64) and add a little noise
    state_ = game.board.render_np().reshape(1,64) + np.random.rand(1,64)/10.0
    # Convert the current state to a tensor
    state1 = torch.from_numpy(state_).float()
    # Track whether the game is still running
    status = 1              # 1: still running

    
    while status == 1:
        
        # ------------------------------ Predict Q-values for the current state ------------------------------
        
        qval = model(state1)            # predicted Q-values
        qval_ = qval.detach().numpy()   # convert the prediction to a numpy array
        
        # ------------------------------ Choose an action (epsilon-greedy policy) ------------------------------
        
        if random.random() < epsilon:
            action_ = np.random.randint(0,4)    # pick a random action
        else:
            action_ = np.argmax(qval_)          # pick the action (index) with the largest Q-value
        action = action_set[action_]            # convert the index to its action command
        
        
        # ------------------------------ Take the action, update the state, get the reward ------------------------------
        
        # Take the action
        game.makeMove(action)           
        # Get the new state
        state2_ = game.board.render_np().reshape(1,64) + np.random.rand(1,64)/10.0
        state2 = torch.from_numpy(state2_).float()
        # Get the reward
        reward = game.reward()
        
        
        # ------------------------------ Get the largest Q-value of the next state ------------------------------
        
        # Predict Q for the next state (without building a computation graph)
        with torch.no_grad():
            newQ = model(state2)
        # Take the largest Q-value
        maxQ = torch.max(newQ)
        
        
        # ------------------------------ Compute the TD target (Y) ------------------------------

        if reward == -1:
            Y = reward + ( gamma * maxQ )
        else:
            Y = reward      # game over: there is no next state, so Y is just the reward
            
            
        # ------------------------------ Current Q-value and TD target ------------------------------
        
        # Target as a 0-d float tensor, detached so only the qval prediction is updated
        Y = torch.as_tensor(Y, dtype=torch.float32).detach()
        X = qval.squeeze()[action_]               # keep only the Q-value of the action taken (0-d, same shape as Y)
        
        
        # ------------------------------ Compute the loss (TD error) ------------------------------

        loss = loss_fn(X,Y)
        
        # Print progress (once every 100 epochs)
        if i % 100 == 0:
            print( i, loss.item() )
            clear_output(wait = True)   
            

        # ------------------------------ Update the network (Q function) ------------------------------

        optimizer.zero_grad()
        loss.backward()         
        optimizer.step()
        
        # The new state becomes the current state
        state1 = state2
        if abs(reward) == 10:
            status = 0          # game over: set status to 0
            
            
    # Decay epsilon
    if epsilon > 0.1:
        epsilon -= (1/epochs)
            
    # Record the loss
    losses.append(loss.item())
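
To see what the decay at the end of each epoch does to ϵ over a full run, the sketch below replays the same rule in isolation. The helper name `epsilon_schedule` and its parameter names are made up for illustration.

```python
def epsilon_schedule(epochs=1000, start=1.0, floor=0.1):
    """Replay the linear decay; returns epsilon as used at the start of each epoch."""
    eps, history = start, []
    for _ in range(epochs):
        history.append(eps)
        if eps > floor:          # same check as in the training loop
            eps -= 1 / epochs
    return history

history = epsilon_schedule()
print(history[0])                 # 1.0
print(round(history[500], 3))     # 0.5
print(round(history[-1], 3))      # close to 0.1
```

ϵ falls by 0.001 per epoch and stops near 0.1 after roughly 900 epochs. Because the check runs before the subtraction and floating-point error accumulates, the final value can land slightly below 0.1.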

Training results

import matplotlib.pyplot as plt

plt.plot(losses)
plt.xlabel("Epochs", fontsize=11)
plt.ylabel("Loss", fontsize=11)
plt.show()

The loss drops markedly over the course of training.
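
Per-epoch losses can be noisy, so the trend is often easier to read after smoothing. A minimal moving-average sketch in plain Python; the helper `moving_average` and the window size are assumptions, not part of the original code:

```python
def moving_average(values, window=50):
    """Average each run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]   # slide the window by one
        out.append(running / window)
    return out

print(moving_average([4, 2, 6, 8], window=2))   # [3.0, 4.0, 7.0]
```

Passing `moving_average(losses)` to `plt.plot` instead of `losses` would draw the smoothed curve.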

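3. Testing the model on the game

We define the following function to test the trained model. Its structure resembles the training loop, but it only needs the model's predictions.
The function returns whether the game was won or lost; set `display` to show the game as it is played.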

Test function code

# Let the model play one game
def test_model( model, mode='static', display=True):
    i = 0
    # Create the game
    test_game = Gridworld(size = 4, mode=mode)
    # Current state
    state_ = test_game.board.render_np().reshape(1,64) + np.random.rand(1,64)/10.0
    state = torch.from_numpy(state_).float()
    
    if display:
        print("Initial State:")
        print(test_game.display())
        
    status = 1
    while status == 1:
        
        # Choose the action with the largest predicted Q-value
        qval = model(state)
        qval_ = qval.detach().numpy()
        action_ = np.argmax(qval_)
        action = action_set[action_]
        if display:
            print(f"Move #: {i}; Taking action: {action}")
        
        # Take the action and update the current state
        test_game.makeMove(action)
        state_ = test_game.board.render_np().reshape(1,64) + np.random.rand(1,64)/10.0
        state = torch.from_numpy(state_).float()
        if display:
            print(test_game.display())
            
        # Get the reward and decide whether the game was won or lost
        reward = test_game.reward()
        if reward != -1:
            if reward > 0:      # won
                status = 2
                if display:
                    print(f"Game Won! Reward: {reward}")
            else:               # lost
                status = 0
                if display:
                    print(f"Game Lost! Reward: {reward}")
        
        i += 1
        if i > 15:
            if display:
                print("Game Lost; too many moves.")
            break
        
    # Return True on a win, False otherwise
    win = (status == 2)
    return win

Testing the model


test_model( model, 'static',display=True)
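
A single game says little about how reliable the model is; repeating the test gives a win rate. The sketch below is self-contained, so it uses a stand-in player instead of the trained model. `win_rate` and `fake_player` are hypothetical names, and the 90% figure is invented for the demo, not a measured result.

```python
import random

def win_rate(play_one_game, n_games=1000):
    """Fraction of games won; `play_one_game` must return True on a win."""
    wins = sum(1 for _ in range(n_games) if play_one_game())
    return wins / n_games

# Stand-in for a real player: wins about 90% of the time (made-up number)
rng = random.Random(0)

def fake_player():
    return rng.random() < 0.9

print(win_rate(fake_player, n_games=1000))
```

With the trained model, the callable could be `lambda: test_model(model, 'static', display=False)`.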

[Figure: test result]

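Catastrophic forgetting

If we want to train on the random GridWorld environment, however, this model turns out to be unusable.
So I changed the game mode in the training step to 'random', but the results were not very good…

This is because the model suffers from catastrophic forgetting, which the next post explains.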
