DSP HW3
---
Author: B04705003 林子雋
---
# Notation
In the following section, $W$ denotes a random variable of a word, while $q$ denotes a state.
Moverover, $W_{1:t}$ means a tuple of $(W_{1}, ..., W_{t})$. If you have ever write python or matlab, you would be familar to this notation.
For example, $W_{1}$ denotes the first word and $W_1 = q_1$ means the first word is $q_1$ this word(ex. $q_1 = \text{speech}$, $q_2=\text{hello}$...)
# Bigram Derivation of Viterbi
To derive bigram part of Viterbi algorithm, define:
$$\delta_{t}(q_{i})= \underset{W_{1:t-1}}{\max}P(W_{1}, ..., W_{t-1}, W_{t}=q_{i})$$
where:
$$\delta_{t}(q_{i}) = \underset{W_{1:t-1}}{\max}P(q_{i} | W_{t-1})P(W_{1:t-1}) = \underset{q_{j}}{\max}P(q_{i}| q_{j})\delta_{t-1}(q_{j}) $$
For $t=1$, initialize first timestep like:
$$\delta_{1}(q_{i}) = P(W_{1}=q_{i})$$
# Trigram Derivation of Viterbi
To derive trigram, define:
$$\delta_{t}(q_{i}, q_{j})= \underset{W_{1:t-2}}{\max}P(W_{1}, ..., W_{t-1}=q_{j}, W_{t}=q_{i})$$
Then, by same method, we can get:
\begin{align}
\delta_{t}(q_{i}, q_{j}) &= \underset{W_{1:t-2}}{\max}P(q_{i}|q_{j}, W_{t-2})P(W_{1},...,W_{t-2}, W_{t-1}=q_{j}) \text{ (by conditional probability decomposition)} \\
&= \underset{q_{k}}{\max}\underset{W_{1:t-3}}{\max}P(q_{i}|q_{j}, q_{k})P(W_{1},...,W_{t-3},W_{t-2}=q_{k}, W_{t-1}=q_{j}) \text{ (maximize one variable first)}\\
&= \underset{q_{k}}{\max}P(q_{i}|q_{j}, q_{k}) \underset{W_{1:t-3}}{\max}P(W_1,...,W_{t-3}, W_{t-2}=q_{k},W_{t-1}=q_{j} ) \text{ (take the independent term out)} \\
&= \underset{q_{k}}{\max}P(q_{i}|q_{j}, q_{k})\delta_{t-1}(q_{j},q_{k}) \\
\end{align}
For $t=1$ and $t=2$, initialize first timestep like:
$$ \delta_{1}(q_{i},q_{j})=P(W_{1}=q_{i}) \\
\delta_{2}(q_{i},q_{j})=\underset{q_{k}}{\max}P(q_{i}|q_{j})\delta_{1}(q_{j},q_{k}) $$