DSP HW3 --- Author: B04705003 林子雋 --- # Notation In the following section, $W$ denotes a random variable of a word, while $q$ denotes a state. Moverover, $W_{1:t}$ means a tuple of $(W_{1}, ..., W_{t})$. If you have ever write python or matlab, you would be familar to this notation. For example, $W_{1}$ denotes the first word and $W_1 = q_1$ means the first word is $q_1$ this word(ex. $q_1 = \text{speech}$, $q_2=\text{hello}$...) # Bigram Derivation of Viterbi To derive bigram part of Viterbi algorithm, define: $$\delta_{t}(q_{i})= \underset{W_{1:t-1}}{\max}P(W_{1}, ..., W_{t-1}, W_{t}=q_{i})$$ where: $$\delta_{t}(q_{i}) = \underset{W_{1:t-1}}{\max}P(q_{i} | W_{t-1})P(W_{1:t-1}) = \underset{q_{j}}{\max}P(q_{i}| q_{j})\delta_{t-1}(q_{j})$$ For $t=1$, initialize first timestep like: $$\delta_{1}(q_{i}) = P(W_{1}=q_{i})$$ # Trigram Derivation of Viterbi To derive trigram, define: $$\delta_{t}(q_{i}, q_{j})= \underset{W_{1:t-2}}{\max}P(W_{1}, ..., W_{t-1}=q_{j}, W_{t}=q_{i})$$ Then, by same method, we can get: \begin{align} \delta_{t}(q_{i}, q_{j}) &= \underset{W_{1:t-2}}{\max}P(q_{i}|q_{j}, W_{t-2})P(W_{1},...,W_{t-2}, W_{t-1}=q_{j}) \text{ (by conditional probability decomposition)} \\ &= \underset{q_{k}}{\max}\underset{W_{1:t-3}}{\max}P(q_{i}|q_{j}, q_{k})P(W_{1},...,W_{t-3},W_{t-2}=q_{k}, W_{t-1}=q_{j}) \text{ (maximize one variable first)}\\ &= \underset{q_{k}}{\max}P(q_{i}|q_{j}, q_{k}) \underset{W_{1:t-3}}{\max}P(W_1,...,W_{t-3}, W_{t-2}=q_{k},W_{t-1}=q_{j} ) \text{ (take the independent term out)} \\ &= \underset{q_{k}}{\max}P(q_{i}|q_{j}, q_{k})\delta_{t-1}(q_{j},q_{k}) \\ \end{align} For $t=1$ and $t=2$, initialize first timestep like: $$\delta_{1}(q_{i},q_{j})=P(W_{1}=q_{i}) \\ \delta_{2}(q_{i},q_{j})=\underset{q_{k}}{\max}P(q_{i}|q_{j})\delta_{1}(q_{j},q_{k})$$