Model-based RL（AlphaZero / MuZero / MCTS / DP）

--- hackmd: url: https://hackmd.io/yriu1w1AQqKXKnZLstWYCQ title: Model-based RL（AlphaZero / MuZero / MCTS / DP） lastSync: 2025-12-12T10:47:33.346Z --- # Model-based RL（AlphaZero / MuZero / MCTS / DP） ## 核心概念 Model-based RL 的目標是： **學習環境模型 → 用模型進行規劃 → 產生更好的策略**。 ### 學習的模型包含： * **Transition model**：$\hat{P}(s'|s,a)$ * **Reward model**：$\hat{r}(s,a)$ * **Dynamics model（MuZero）**：latent state 更新 **Pros & Cons** - 優點: 高樣本效率 (Sample Efficiency) (能從較少的互動中學習得更快)。 - 缺點: 易受模型偏差 (Model Bias) 影響 (模型的誤差會在規劃過程中累積)。 ## Dynamic Programming 在已知模型下的 planning： ### Policy Evaluation $$ V(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[r + \gamma V(s')] $$ ### Policy Improvement $$ \pi_{new}(s) = \arg\max_a \sum_{s'} P(s'|s,a)[r + \gamma V(s')] $$ ### Value Iteration $$ V(s) \leftarrow \max_a \sum_{s'} P[r + \gamma V(s')] $$ DP 是 **planning 的理論根基**。 ## Monte-Carlo Tree Search（MCTS） ### 流程四步： 1. Selection: 使用樹策略 (Tree Policy，例如 UCB / PUCT) 遍歷樹。 2. Expansion: 新增一個子節點到樹中。 3. Simulation: - 透過 rollout (經典方法) 或神經網路評估 (AlphaGo) 來估計價值。 - **逐一擴展** (One-by-One): 在標準 MCTS 中，當搜尋到達一個葉節點，且該節點還有未被探索過的動作 (Unvisited Actions) 時，通常只會選擇其中一個未訪問的動作來新增子節點。 - 原因: 這是為了節省記憶體並快速進入模擬 (Rollout) 階段。新節點通常沒有先驗資訊，其價值完全依賴隨後的模擬結果。 - **一次性全擴展 (All-at-Once)**: 神經網路一次計算並展開該狀態下所有合法動作的子節點。 - 因為推論成本高昂，一次取得所有資訊較有效率。 4. Backpropagation: 向上更新統計數據 ($N$, $Q$) 至根節點。 ### 選擇策略公式比較標準 MCTS 與 AlphaGo 使用不同的選擇公式來平衡 Exploration & Exploitation。 1. 標準 UCB1 公式 (經典 MCTS) 用於沒有先驗知識 (Prior Knowledge) 的情況，純粹基於統計數據。 $$ UCB = \underbrace{\frac{w_i}{n_i}}_{\text{Exploitation}} + \underbrace{c \sqrt{\frac{\ln N}{n_i}}}_{\text{Exploration}} $$ $\frac{w_i}{n_i}$: 該節點的平均勝率 (利用)。 $c \sqrt{\frac{\ln N}{n_i}}$: 探索項。當總訪問次數 $N$ 增加但該節點訪問次數 $n_i$ 很少時，此項會變大，鼓勵探索未訪問過的節點。 2. PUCT 公式 (AlphaGo / MuZero) 用於結合神經網路提供的先驗機率 (Prior Probability)，以引導搜尋方向。 $$ U(s,a) = Q(s,a) + c_{puct} \cdot P(s,a) \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} $$ $Q(s,a)$: 平均動作價值 (利用)。 $P(s,a)$: 先驗機率，來自策略網路 $\pi_\theta(a|s)$。這提供了「直覺」，讓搜尋優先考慮神經網路認為好的動作。 $\frac{\sqrt{\sum N}}{1+N}$: 探索衰減項。隨著節點被訪問次數增加，先驗機率 $P$ 的影響力會下降，搜尋會逐漸轉向依賴真實的 $Q$ 值。 ## AlphaGo: 混合架構 (The Hybrid Architecture) AlphaGo 是第一個混合系統，結合了「直覺」(神經網路) 與「計算」(MCTS)。 - Policy Network: 縮減搜尋的廣度。 - Value Network: 縮減搜尋的深度。 ### MCTS Process Stages #### Selection: Traverse the tree from the root to a leaf node. Selection is guided by maximizing the PUCT value: $$a_t = \text{argmax}_a (Q(s_t, a) + u(s_t, a))$$ $Q(s, a)$ (Action Value): This is the average value of all simulations that took action $a$ from state $s$. $$Q(s, a) = \frac{W(s, a)}{N(s, a)}$$ $W(s, a)$: Total accumulated value from simulations passing through this edge. $N(s, a)$: Visit count of this edge. #### Expansion: - Expand the leaf node by adding children for ==all legal moves== simultaneously. - The Policy Network ($P_\sigma$) evaluates the leaf node and outputs a probability distribution (vector) for these moves. - Initialization: Each new child node is initialized with its corresponding prior probability ($P$) from the network output. - Effect on Selection: These stored probabilities allow the Selection phase in future simulations to immediately consider any of these moves based on their PUCT values ($Q + u(P)$), without needing individual rollouts for discovery. #### Evaluation: Evaluate the newly expanded node. The evaluation is a combination of: - Value Network ($v_\theta$): estimating the value of the position directly. - Fast Rollout: Running a simulation to a terminal state to get a result. Let $V$ be the combined evaluation result (mixing $v_\theta$ and rollout result $z$). $$V(s_L) = (1-\lambda) \cdot v_\theta(s_L) + \lambda \cdot z_{rollout}$$ :::info **Fast Rollout:** Rollout with simple model. ::: #### Backup: Propagate the evaluation result $V$ back up the tree path to the root. Update Rule: For each node visited: - Increment visit count: $N(s, a) \leftarrow N(s, a) + 1$ - Add value to total: $W(s, a) \leftarrow W(s, a) + V$ - Update mean Action Value $Q$: $$ Q(s, a) \leftarrow \frac{W(s, a)}{N(s, a)} $$ ### 訓練 **複雜的多階段流程 (監督式學習 $\to$ REINFORCE $\to$ 回歸)。** ### Policy Network Objective: Use stochastic policy gradient ascent to maximize the likelihood of the human move $a$ selected in state $s$. Update Rule: $$\Delta\theta = \alpha\nabla_{\theta} \log\pi_{\theta}(s_t, a_t) \cdot z$$ $\theta$: network parameter. $\alpha$: learning rate. $z$: the value of the episode win/loss (1/-1) of the game. ### Value Network Goal: estimate a value (or a win rate) for any given game position. - $v^*(s)$: perfect play optimal value function. - $v^{\rho}(s)$: estimate value function by RL policy $P_{\rho}$. Train a value network $v_{\theta}(s)$ with weights $\theta$ to approximate $v^{\rho}(s)$. $v_{\theta}(s) \approx v^{\rho}(s) \approx v^*(s)$ Update Rule: $$\Delta\theta \propto \frac{\partial v_{\theta}(s)}{\partial\theta}(z - v_{\theta}(s))$$ $z$ is outcome. ### Network Update Policy Network 和 Value Network 不是在 MCTS 裡更新。 - 用 MCTS 產生的資料 - 在 MCTS 結束之後 - 用標準 SGD / policy gradient / regression loss 來更新 - :::warning MCTS 是「資料產生器」，不是「參數更新器」。 ::: #### Phase 1 : 監督式學習 (Supervised Learning) 資料來源：人類棋譜 $(s_t, a_t^{human})$ 完全不需要 MCTS，直接 offline 訓練 Loss $$\mathcal{L}_{policy} = -\log \pi_\theta(a_t^{human} \mid s_t)$$ #### Phase 2 : 強化學習 (REINFORCE , self-play) $$\Delta \theta = \alpha \nabla_\theta \log \pi_\theta(s_t, a_t) \cdot z$$ 用目前的 policy network + MCTS $\to$ 完整下完一盤棋收集資料： $$(s_1, a_1), (s_2, a_2), \dots, (s_T, a_T)$$ 遊戲結束，得到 outcome $$z \in \{+1, -1\}$$ #### Phase3 : 回歸到 training loop，對整盤棋做： $$\sum_t \nabla_\theta \log \pi_\theta(s_t, a_t) \cdot z$$ Value Network 的 loss $$\mathcal{L}_{value} = (v_\theta(s_t) - z)^2$$ 梯度： $$\Delta \theta \propto \frac{\partial v_\theta(s)}{\partial \theta} (z - v_\theta(s))$$ **流程** ```pseudocode initialize policy net (SL) repeat: self-play games: for each move: run MCTS using current policy + value net choose move from visit counts collect (s, a, z) update policy net (policy gradient) update value net (regression) ``` ## AlphaZero: The Unified "Teacher" ### 關鍵特色 - 統一網路: 使用單一網路具備兩個輸出頭 $(p, v) = f_\theta(s)$。 - 策略網路: 輸出先驗機率 $P(s,a)$ 供 MCTS 選擇使用。 - 價值網路: 直接評估葉節點 (不再需要隨機 rollout)。 - Tabula Rasa (從零開始): 純粹從自我對弈中學習，無需人類資料。 ![AlphaZero pipeline](https://hackmd.io/_uploads/B1aOEi9MWg.png) ### Self-Play (自我對弈) MCTS Search: 在每個盤面 $s$，執行由 $f_\theta$ 引導的 MCTS 搜索，每個搜尋結束後，輸出每個動作的機率 $\pi$。 Search Probabilities: 搜索結束後，返回的機率 $\pi$ 與訪問次數 $N$ 成正比： $$\pi \propto N^{1/\tau}$$ $N$: 從根節點開始，每個動作的訪問計數 (visit count)。 $\tau$: 溫度參數 (Temperature parameter)。 - 前 30 步：$\tau = 1$ (探索)。 - 剩餘步數：$\tau \rightarrow 0$ (利用/確定性)。 ### Optimization 當遊戲在步驟 $T$ 結束時 (雙方 Pass、投降或超過最大步數)，給予最終獎勵： $$r_T \in \{-1, +1\}$$ Loss Function (總損失函數) 參數 $\theta$ 透過梯度下降調整，Loss 函數包含 `MSE` 和 `Cross-Entropy`： $$l = (z - v)^2 - \pi^T \log p + c||\theta||^2$$ $(p, v) = f_\theta(s)$: 神經網路對當前狀態的輸出（機率向量 $p$ 和價值 $v$）。 - $(z - v)^2$: Value error (MSE)。 - $-\pi^T \log p$: Policy error (Cross-Entropy)。 - $c||\theta||^2$: L2 Weight Regularization (防止過擬合)。 Checkpoint: 每 $n$ 個訓練步驟產生一個新的 Checkpoint。 ### Evaluator **目標 (Goal)** 確保總是生成最高品質的訓練數據 (Best quality data)。 **評估機制** 比較當前最佳網路 ($f_{\theta^*}$) 與經過優化後的新 Checkpoint ($f_{\theta_i}$) 的對弈強度。 $$f_{\theta^*} = \begin{cases} f_{\theta_i}, & \text{if } f_{\theta_i} \text{ wins } > 55\% \text{ against } f_{\theta^*} \\ f_{\theta^*}, & \text{otherwise} \end{cases}$$ - $f_{\theta^*}$: Current best network (當前最強網路)。 - $f_{\theta_i}$: Network checkpoint by optimization (訓練中的新候選網路)。新的 $f_{\theta^*}$ 將被用於下一輪的 Self-Play 數據生成。 ### Weaknesses of AlphaZero (已知弱點) - Long Groups: 對於長龍/大龍的判斷通常較弱。 - Ladder Problems: 仍然會判斷錯誤某些「徵子」(Ladder) 問題。 - Circular Problems: 在大多數循環劫/循環問題上會出錯 (近期發現)。潛在解法: Solutions by Transformers? (是否可以用 Transformer 架構解決？) ## MuZero (deterministic env) MuZero 不知道真實 transition model。它會學 3 個 model： ### (1) Representation model $$ s_0 = h(o_0) $$ ### (2) Dynamics model $$ s_{t+1}, r_t = g(s_t, a_t) $$ ### (3) Prediction model $$ (p, v) = f(s) $$ MuZero 用自己的 latent model 展開 MCTS。 ### 簡介與背景 > 歷史背景： David Silver 先前曾指出「在 Atari 遊戲中，基於模型 (Model-based) 的強化學習目前為止是失敗的」。MuZero 的目標正是要解決這個問題。核心組件： MuZero 演算法的訓練包含兩個主要部分： - 自我對弈 (Self-play)：產生用於訓練的軌跡 (trajectories)。 * 與環境互動。 * 在不使用模擬器的情況下執行 MCTS (蒙地卡羅樹搜尋)，這對於「重量級 (heavyweight)」環境至關重要。 - 神經網路訓練：學習自我對弈的 value、MCTS policy 以及系統動態 (dynamics)。 ### AlphaZero 與 MuZero 的比較 - AlphaZero - 結構：使用 1 個網路。 - 功能：預測 $f: s \to p, v$ (狀態 $\to$ 策略, 價值)。 - 限制：雖然對「輕量級」環境 (如圍棋) 有效，但不適用於無法輕易取得完美模擬器或模擬效率低落的「重量級」環境。 - MuZero 引入了一個學習模型來處理環境動態，將任務拆分為 3 個不同的網路： - 預測網路 (Prediction, $f$)： $s \to p, v$ - 根據內部狀態預測 policy 和value。 - 動態網路 (Dynamics, $g$)： $s, a \to r, s$ - 給定當前狀態和動作，預測即時獎勵 (or k-step reward) 和下一個內部狀態。 :::warning 取代了對完美模擬器的需求。 ::: - 表徵網路 (Representation, $h$)： $o \to s$ - 將原始觀測值 ($o$) 轉換為嵌入狀態 ($s$)。 ### MuZero 中的 MCTS 規劃規劃過程利用上述定義的三個網路，在潛在空間 (latent space) 中模擬未來的軌跡。 $h$: 表徵網路 (Representation Network) 透過嵌入觀測值來初始化樹搜尋。 $$s^0 = h_\theta(o_1, \dots, o_t)$$ $g$: 動態網路 (Dynamics Network) 在潛在空間中模擬轉移 (transitions)。 $$r^k, s^k = g_\theta(s^{k-1}, a^k)$$ $f$: 預測網路 (Prediction Network) 估計葉節點的策略和價值。 $$p^k, v^k = f_\theta(s^k)$$ ### 訓練過程訓練目標是將神經網路的預測與自我對弈和 MCTS 的實際結果對齊。 - Search Policy - 讓 $p_t^k$ 預測 $\pi_{t+k}$ with Cross-entropy。 - Value - 讓 $v_t^k$ 預測 $z_{t+k}$ (最終獎勵或 $k$ 步回報)。 - 對於棋盤遊戲： $z_{t+k}$ 是最終的 $0/1$ (均方誤差 Mean Squared Error，類似 AlphaZero)。 - 對於 Atari： $z_{t+k}$ 是 $k$ 步目標 (交叉熵 Cross-entropy)。 - Observed Reward - 讓 $r_t^k$ 預測 $u_{t+k}$，損失函數類似於價值預測。損失函數 (Loss Function) 總損失是 $K$ 步內的獎勵、價值和策略損失的總和，加上正規化項： $$l_t(\theta) = \sum_{k=0}^K \left[ \underbrace{l^r(u_{t+k}, r_t^k)}_{\text{獎勵損失}} + \underbrace{l^v(z_{t+k}, v_t^k)}_{\text{價值損失}} + \underbrace{l^p(\pi_{t+k}, p_t^k)}_{\text{策略損失}} \right] + \underbrace{c||\theta||^2}_{\text{L2 正規化}}$$ :::warning Representation Network will be updated implicit since it provides embedding state. By backpropagation, it will be updated although there is no loss formula related to it. ::: :::info MuZero is a strategy that can apply to others. E.g. DQN. ::: ## MuZero (stochastic env) ![Demo](https://hackmd.io/_uploads/BJo7ReiGWg.png) ![Update](https://hackmd.io/_uploads/HypLCxoz-l.png) ### Notation Definitions * $as^k$: Latent afterstate * $\sigma^k$: Distribution over chance outcomes * $Q^k$: Afterstate value(state value) * $c_t$: Chance outcome ### Model Architecture In stochastic environments (like the game 2048), the standard MuZero dynamics are modified to account for `afterstates` (the state after an action is taken but before the environment's stochastic element resolves). ### Core Functions **Representation** $$s_t^0 = h(o_{\le t})$$ **Prediction** $$p_t^k, v_t^k = f(s_t^k)$$ **Afterstate Dynamics** $$as_t^k = \phi(s_t^k, a_{t+k})$$ **Afterstate Prediction** $$\sigma_t^k, Q_t^k = \psi(as_t^k)$$ **Dynamics** $$s_t^{k+1}, r_t^{k+1} = g(as_t^k, c_{t+k+1})$$ ### Chance Outcomes & Encoding To handle stochasticity, the model uses an encoder trained end-to-end (based on VQ-VAE principles). #### Encoder Mechanism - Input: Observations $o_t$, $o_{t+1}$ (and likely action $a_t$ (unknown)). - Process: The encoder maps inputs to a continuous latent vector $z$, which is discretized. - Output: A discrete chance outcome code $c_t$ (one-hot vector). 環境是隨機的，所以同一個狀態 + action，可能會有多個「不同結果」。這裡做的事是：把「隨機結果」離散化成有限個 chance outcome，用 VQ-VAE 當作 encoder，把 observation 對齊到某一個離散 outcome。 $o_t \to$ 不同 $o_{t+1}$ 看成不同的 chance outcome。 Encoder 的輸入是 **observation** 輸出 $c^e_t$ 一個 **logits / embedding 向量** #### 解釋 VQ-VAE 標準 VQ-VAE 的核心操作其實是 **Argmin (找最小距離)**，而不是 Argmax (找最大機率)，雖然兩者在概念上是鏡像的。假設你的 Encoder 輸出了一個向量 $z_e(x)$，你的 Codebook 裡有一堆標準向量 $e_k$。 ##### 具體步驟 (演算法流程) 1. **Encoder 輸出** 拿到一個連續向量，例如 $z_e(x) = [1.2, 0.8]$。 2. **計算距離** 算出這個向量跟 Codebook 裡**每一個**向量的距離 (Euclidean Distance)。 - Codebook 向量 A $[1.0, 1.0]$ $\to$ 距離 $0.2^2 + (-0.2)^2 = 0.08$ - Codebook 向量 B $[0.0, 0.0]$ $\to$ 距離 $1.2^2 + 0.8^2 = 2.08$ - Codebook 向量 C $[5.0, 5.0]$ $\to$ 距離... (很遠) 3. **Argmin (硬指派)** - 看誰距離最小？向量 A。 - **動作**：直接選定 A 的索引 (Index)，或者直接把 $z_e(x)$ 替換成 $A$。 - 這一步是 `k* = argmin || z_e(x) - e_k ||`。 :::warning **為什麼這裡不需要機率？** 因為事情已經發生了（$o_{t+1}$ 已知），這是一個**既定事實**。這就是為什麼這裡用 `argmax` 直接鎖定結果。 ::: :::info 如果 VQ-VAE 這麼「硬」，為什麼我們常在論文裡看到機率 (Probability)？這段文字解釋了三個來源： #### A. 人為解讀 (Soft Assignment) - 「把距離當機率」這是一種**後處理**手段，而不是模型原本的輸出。 - **邏輯**：如果你的 Embedding 離 Codebook A 很近，離 B 很遠，我們可以用數學公式 (Softmax) 強行把它轉成機率：「因為離 A 很近，所以 A 的機率高」。 - **用途**：通常用於視覺化或某些特定的梯度傳遞技巧，但這不是原版 VQ-VAE 的核心機制。 #### B. 另一個模型 (Prior) - 「預測未來的機率」這是 VQ-VAE 生成圖片的真正關鍵。 - **邏輯**：Encoder 只是把圖片變成一串數字代碼 (比如 `[5, 12, 3]`)。接著，我們會訓練另一個模型 (Prior, 例如 Transformer 或 PixelCNN) 來學習這些代碼的規律。 - **例子**：Prior 模型會學到「如果前面出現了 5 和 12，下一個出現 3 的**機率**是很高的」。 - **結論**：機率存在於這個 Prior 模型裡，而不是 Encoder 裡。 #### C. 變體模型 (Gumbel-Softmax) - 「真正的機率 Encoder」 - **邏輯**：有一種改良版叫做 Gumbel-Softmax，它是真的讓 Encoder 輸出機率分佈，然後從中採樣。但這已經不算「標準的 VQ-VAE」了。 ::: ### Loss Function The total loss combines the standard MuZero loss with specific losses for handling stochasticity (chance). $$L^{total} = L^{MuZero} + L^{chance}$$ #### MuZero Loss ($L^{MuZero}$) Standard reconstruction of policy, value, and reward. $$L^{MuZero} = \sum_{k=0}^K l^p(\pi_{t+k}, p_t^k) + \sum_{k=0}^K l^v(z_{t+k}, v_t^k) + \sum_{k=1}^K l^r(u_{t+k}, r_t^k)$$ - Policy Loss: $l^p(\pi, p)$ - State Value Loss: $l^v(z, v)$ - Reward Loss: $l^r(u, r)$ #### Chance Loss ($L^{chance}$) Handles the prediction of afterstate values and chance outcomes. $$L^{chance}_w = \sum_{k=0}^{K-1} l^Q(z_{t+k}, Q_t^k) + \sum_{k=0}^{K-1} l^\sigma(\sigma_t^k, c_{t+k+1}) + \beta \sum_{k=0}^{K-1} \| c_{t+k+1} - c^e_{t+k+1} \|^2$$ - Afterstate Value Loss $l^Q(z_{t+k}, Q_t^k)$ - Ensuring the afterstate value estimate matches the target. - Chance Outcome Loss $l^\sigma(\sigma_t^k, c_{t+k+1})$ - Cross-entropy loss for predicting the correct chance outcome distribution. - VQ-VAE Commitment Cost $\beta \sum \| c_{t+k+1} - c^e_{t+k+1} \|^2$ - Keeps the encoder output close to the embedding space (stabilizing the discrete representation).