📃 Hidden Markov Models

--- tags: paper-reading --- # 📃 Hidden Markov Models 好像不完全算 paper reading，算 theory or algo-understanding。以下都是一堆數學，簡單來說就是一堆機率統計（包含貝氏）和以此延伸的最佳化。 ## 🎁 Introduction ### Notations ![](https://i.imgur.com/Z648aZ0.png) ### Two Assumptions held for the first order HMM 1. Markov Assumption: $P(q_i|q_1, ..., q_{i-1}) = P(q_i|q_{i-1})$ 轉換到現在 state，只與前一個 state 有關。 2. Output Independence: $P(o_i|q_{1:T}, o_{1:T}) = P(o_i|q_i)$ output observation $o_i$ 只和當前的 state $q_i$（state that produced $o_i$) 有關，和其他任何 states 或任何 observations 都無關。 ### HMM in A Glance 這邊從不同角度來解釋 HMM 的本質。第一個是 [Jurafsky 大佬的 Stanford NLP Appendix A](https://web.stanford.edu/~jurafsky/slp3/A.pdf) 節錄，這是我第一次讀 HMM 時用的教材。他引用了一段 Rabiner 説的話，説 HMM 是三個基本問題組成： :::info Problem 1 (Likelihood) Given an HMM $\lambda = (A, B)$ and an observation sequence $O$, determine the likelihood $P(O|\lambda)$. Problem 2 (Decoding) Given an HMM $\lambda = (A, B)$ and an observation sequence $O$ determine the best hidden state sequence Q. Problem 3 (Learning) Given an oobservation $O$ and the set of states in the HMM, learn params $A, B$. ::: 這個是 YT 影片的教學 [Youtube video: (ML 14.6) Forward-Backward algorithm for HMMs by MathematicalMonk](https://www.youtube.com/watch?v=7zDARfKVm7s)。比較演算法導向，將 HMM 看成兩個演算法組合，forward and backward (decoding) algorithms。已知 $P(q_i|q_{i-1})$ （Jurafsky 中用的 $A$）， $P(o_i|q_i)$ （Jurafsky 中用的 $B$）與 state initial probabilities（Jurafsky 中用的 $\pi$）， Forward algo 求取 $P(q_i, O_{1:i}) \forall i = 1, ..., n$，Backward algo 求取 $P(O_{i+1:T}|q_i) \forall i = 1, ..., n$。欲求 $P(q_i,|O)$。 $$ P(q_i|O) = \frac{P(q_i, O)}{P(O)} \propto \color{blue}{P(q_i, O)} \\ \color{blue}{P(q_i, O)} = \color{green}{P(O_{i+1:T}|q_i, O_{1:i})} \cdot P(q_i, O_{1:i}) \\ {P(O_{i+1:T}|q_i, \color{red}{O_{1:i}})=\color{green}{P(O_{i+1:T}|q_i)}} \; \text{//output independence} \\ P(q_i, O) = \color{green}{P(O_{i+1:T}|q_i)} \cdot P(q_i, O_{1:i}) \\ $$ $$ \color{#0097a7}{F_{u,i} = [(F_u \cap F_i), (F_u \cup F_i) / (F_u \cap F_i)]} $$ 我們可以發現 $\color{green}{P(O_{i+1:T}|q_i)}$ 就是 Backward algorithm 的計算結果，而 $P(q_i, O_{1:i})$ 是 Forward algorithm 的計算結果。故而 $P(q_i|O)$ 的計算為 Forward * Backward 的計算結果。好這邊有點講太細了先打住，不過因為以下的例子和思路是完全照著 Jurafsky 的教材，所以上面算是提供另一種看法這樣。 ## 🎁 Illustration Example Jason 吃冰淇淋的數量一球至三球（$o$）和 weather Hot, Cold（$q$）間的HMM。 ![](https://i.imgur.com/Xfhw8Vg.png) ## 🎁 Forward Algorithm ![](https://i.imgur.com/A81OT7t.png =500x) ![](https://i.imgur.com/6JOakpq.png =500x) ![](https://i.imgur.com/wIj1tgM.png) 該圖展示單一個 $\alpha_t(j)$ 的計算過程，需要考慮上一步（t-1 步）的 forward probs，從任一狀態轉移至 state j 的 transition probs $a_{ij}$ ，與 given 當下這一步的 observation $o_t$ 的 emission prob $b_j(o_t) = P(o_t|q_j)$。 1. For computing likelihood，即前面所提到 Problem 1 of HMM: Given an HMM $\lambda = (A, B)$ and an observation sequence $O$, determine the likelihood $P(O|\lambda)$. 更精確來說希望算出 $\alpha_t(j)$，其意義為 the probability of being at state $j$ at time $t$ (after seeing the first $t$ observations) given the model。以上述例子而言，就是給定 t 個冰淇淋球數的觀察值，算出此時狀態轉移到 Hot, Cold 兩個 states 的機率值（likelihood）各是多少。 2. 最單純的想法是暴力解，$T$: observation seq length。enumerate $N^T$ 個 possible hidden state sequences 並計算每個 seq 的 likelihood。將 $T$ 落在 Hot, Cold 的 seq probs 各自相加起來，則得到 $\alpha_t(Hot)$ 與 $\alpha_t(Cold)$，done。但是計算成本非常的高。 3. Forward Algo 是一種 DP，其複雜度為 $O(N^2T)$。$N$: number of states，$T$: observation seq length。 4. 初始值 $\alpha_1(j)$ 是由 initial prob 和 emission prob 相乘得出。中間的 recursive definition $\alpha_t(j)$ 則是加總上一步的 forward probs over all states $\alpha_{t-1}(i) \forall i$ weighted by transition prob $a_{ij}$ 乘以 emission prob $b_j(o_t) = P(o_t|q_j)$。最後可以得出 given an observation sequence $O$ 和 HMM $\lambda$，$P(O|\lambda)$ 為何。**要使得這個 HMM 有用，我們應該盡量最大化這個值（maximize $P(O|\lambda)$ = maximize the likelihood）。** ## 🎁 Backward (Decoding) Algorithm: Viterbi Algorithm ![](https://i.imgur.com/0MvkDTo.png =500x) ![](https://i.imgur.com/4MofBpw.png =500x) 1. 這段為前面提到的 Problem 2: Given an HMM $\lambda = (A, B)$ and an observation sequence $O$ determine the best hidden state sequence Q. 2. $v_t(j)$ 的意義為 the probability that the HMM is in state j after seeing the first t observations and passing through the most probable state sequence $q_{1:t−1}$, given the HMM model $\lambda$. 3. $bt_t(j)$ 代表的是 best path 的 backpointer，用以 backtrack 解答路徑（最佳的 hidden state sequence）。 4. Backward Algo 是一種 DP，其複雜度也為 $O(N^2T)$。$N$: number of states，$T$: observation seq length。可以看出 Backward 和 Forward 整體結構很像： - $v_1(j)$（初始化）和 Forward Algo 的 $\alpha_1(j)$ 一模一樣，因為此時還沒有 the most probable state sequence $q_{1:t−1}$ 可以 condition on。 - 接下來的差別在於 $v_t(j)$ 是 $max(.)$ ， $\alpha_t(j)$ 是 $sum(.)$。下圖應該滿好看出來的。 ![](https://i.imgur.com/5F0p5vf.png) ![](https://i.imgur.com/bklGoNe.png) ## 🎁 Forward-Backward Algorithm 1. HMM training 用的演算法，為以上兩者的合體。這裡和 Youtube 影片的思路合流，不過數學部分講得更複雜。這個演算法又叫 Baum-Welch algorithm (1972)，是 Expectation-Maximization （EM）演算法的一種 special case。目標即是解決 Problem 3，訓練出 HMM 的參數 $A, B$。以上的兩個演算法都是在 $A$, $B$ 為定值（given a fixed $\lambda=(A, B)$）之下計算，那要如何更新/訓練 HMM 呢？ 2. Backward probability $\beta_t(i) = P(o_{t+1:T}|q_t=1, \lambda)$， $\beta_t(i)$ 意義為 the probability of seeing the observations from time $t + 1$ to the end $T$, given that we are in state $i$ at time $t$ (and given the HMM $\lambda$)。 ![](https://i.imgur.com/DU2rdjm.png =400x) 訓練方式和 Forward Algo 很像，只是這次的計算方向是倒回來的（從 time T 一路算回 0）， 3. $\xi_t(i,j)$ 的意義為 the probability of being in state $i$ at time $t$ and state $j$ at time $t+1$, given the observation sequence and the HMM. $\xi_t(i,j) = P(q_t = i, q_{t+1} = j|O, \lambda)$ Since ![](https://i.imgur.com/Bglziwa.png =200x) $$\xi_t(i,j) \\ \quad = P(q_t = i, q_{t+1} = j|O, \lambda) \\ \quad = \frac{P(q_t = i, q_{t+1} = j, O|\lambda)}{\color{green}{P(O|\lambda)}}\\ $$ 分子的部分是在 time $t$ 發生 $(i, j)$ arc 的機率，包含連乘 transition probability 和 observation probability，乘上 condition on model 的 forward 和 backward probability。複習這兩個 probabilities 的定義： **$\alpha_t(j)$ 意義為 the probability of being at state $j$ at time $t$ (after seeing the first $t$ observations) and given the HMM $\lambda$。 $\beta_t(i)$ 意義為 the probability of seeing the observations from time $t + 1$ to the end $T$, given that we are in state $i$ at time $t$ (and given the HMM $\lambda$)。** $$ P(q_t = i, q_{t+1} = j, O|\lambda) = \alpha_t(i)a_{ij}b_j(o_{t+1})\beta_{t+1}(j) \;\text{ //分子} \\ \color{green}{P(O|\lambda)} = \Sigma^N_{j=1}{\alpha_t(j)\beta_t(j)} \;\text{ //分母} \\ $$ 將分子分母替換可得： ![](https://i.imgur.com/HYXguBp.png =500x) 要算出 transitions from state $i$ to $j$ 的期望次數，只要 sum over all $t$，可得 $\Sigma_{t=1}^{T-1}{\xi_t(i,j)}$。欲算出 $\hat{a_{i,j}} = \frac{從 state \:i 至 j 的期望次數}{從 i 至任何 state 的期望次數}$，公式為： ![](https://i.imgur.com/O1rc1jO.png =300x) 4. 我們還需要額外算出 $\hat{b_{j}(v_k)} = \frac{在 state j 看到 observation \: v_k 的期望次數}{處在 state \:j 的期望次數}$ 為此我們需要算出另一個額外的機率值 $\gamma_t(q_t = j|O,\lambda) = \frac{P(q_t = j, O |\lambda)}{P(O|\lambda)}$ ![](https://i.imgur.com/rYsK7JR.png =200x) 所以最終我們可以算出 ![](https://i.imgur.com/iSZNue5.png =300x)。 5. 有了$\hat{a_{i,j}}$ 和 $\hat{b_{j}(v_k)}$的公式後，我們得以（根據教材） "re-estimate" HMM 的參數：transition probs $A$ 與 emission probs $B$，更精確來說是以 recursive 的方式定義他們，使我們可以用前面兩種演算法相似的 DP 演算法逐次更新 $A, B$ 參數的值，直到 convergence（參數值變動小於某一閾值）。given 長度為 $T$ 的觀察序列 $O$ 與一組狀態集合（非狀態序列）。 6. 以上例而言，input 一串 Jason 吃的冰淇淋球數序列 $\{1,3,1,2,...\}$ 和可能的天氣集合 Hot 與 Cold，該演算法可以隨意初始化並直接演算。若想要結果合理，通常會需要給定狀態初始機率 $P(Hot)$ 和 $P(Cold)$ （教材中使用 $\pi$），這對最後算出來的 $A, B$ 有很大影響。 ![](https://i.imgur.com/mkg420w.png) ## 🎁 Summary 這邊直接複製貼上教材內的 A.6 Summary。 This chapter introduced the hidden Markov model for probabilistic sequence classification. - Hidden Markov models (HMMs) are a way of relating a sequence of obser- vations to a sequence of hidden classes or hidden states that explain the observations. - The process of discovering the sequence of hidden states, given the sequence of observations, is known as decoding or inference. The Viterbi algorithm is commonly used for decoding. - The parameters of an HMM are the A transition probability matrix and the B observation likelihood matrix. Both can be trained with the Baum-Welch or forward-backward algorithm. ## 🎁 Appendix ### Calculation for Forward algorithm ![](https://i.imgur.com/ElCa0zI.jpg)