# DD2380 HMM
1. *This problem can be formulated in matrix form. Please specify the initial probability vector π, the transition probability matrix A and the observation probability matrix B.*
$$\pi = [0.5, 0.5]$$
$$
A =
\begin{pmatrix}
0.5 & 0.5 \\
0.5 & 0.5
\end{pmatrix}
$$
$$
B =
\begin{pmatrix}
0.9 & 0.1 \\
0.5 & 0.5
\end{pmatrix}
$$
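As a quick sanity check, the same parameters can be written down in numpy (a minimal sketch; the variable names are my own, not part of the assignment):

```python
import numpy as np

# Model parameters for question 1
pi = np.array([0.5, 0.5])           # initial state distribution
A = np.array([[0.5, 0.5],
              [0.5, 0.5]])          # state transition probabilities (row-stochastic)
B = np.array([[0.9, 0.1],
              [0.5, 0.5]])          # observation (emission) probabilities
```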
2. *First, we need to multiply the transition matrix with our current estimate of states. What is the result of this operation?*
The predicted (prior) distribution over the hidden states at the next time step.
3. *In the following, the result must be multiplied with the observation matrix. What is the result of this operation?*
The predicted distribution over the next observation.
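A minimal numpy sketch of these two steps, using the parameters from question 1 (illustrative only):

```python
import numpy as np

pi = np.array([0.5, 0.5])
A = np.array([[0.5, 0.5], [0.5, 0.5]])
B = np.array([[0.9, 0.1], [0.5, 0.5]])

p_next_state = pi @ A            # question 2: predicted state distribution at the next step
p_next_obs = p_next_state @ B    # question 3: predicted distribution over the next observation
```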
4. *Why is it valid to substitute $O_{1:t} = o_{1:t}$ with $O_t = o_t$ when we condition on the state $X_t = x_i$?*
Markov property: given $X_t$, the observation $O_t$ is conditionally independent of $O_{1:t-1}$.
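Written out in the same notation, the substitution amounts to:
$$P(O_t = o_t \mid X_t = x_i,\, O_{1:t-1} = o_{1:t-1}) = P(O_t = o_t \mid X_t = x_i)$$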
5. *How many values are stored in the matrices $δ$ and $δ^{idx}$ respectively?*
$\delta: T \times N$
$\delta^{idx}: (T-1) \times N$
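A sketch of how the two tables would be filled in the Viterbi recursion (my own illustration; the function name is not from the assignment):

```python
import numpy as np

def viterbi_tables(pi, A, B, obs):
    """Fill delta (T x N entries) and delta_idx ((T-1) x N entries)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    delta_idx = np.zeros((T - 1, N), dtype=int)

    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        cand = delta[t - 1][:, None] * A        # (i, j): best path ending in j via predecessor i
        delta_idx[t - 1] = cand.argmax(axis=0)  # best predecessor for each state j
        delta[t] = cand.max(axis=0) * B[:, obs[t]]
    return delta, delta_idx
```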
6. *Why do we need to divide by the sum over the final α values for the di-gamma function?*
$$P(X_t = x_i, X_{t+1} = x_j | O_{1:T} = o_{1:T}) =$$
$$= \frac{P(X_t = x_i, X_{t+1} = x_j \wedge O_{1:T} = o_{1:T})}{P(O_{1:T} = o_{1:T})}$$
The sum over the final $\alpha$ values is $\sum_{j=1}^N \alpha_T(j) = P(O_{1:T}=o_{1:T})$ (eq. 2.15), which is exactly the denominator above.
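A sketch of the corresponding computation, assuming unscaled $\alpha$ and $\beta$ tables of shape $T \times N$ (function name and layout are my own):

```python
import numpy as np

def di_gamma(alpha, beta, A, B, obs):
    """di_gamma[t, i, j] ~ P(X_t = x_i, X_{t+1} = x_j | O_{1:T} = o_{1:T})."""
    T, N = alpha.shape
    denom = alpha[T - 1].sum()   # = P(O_{1:T} = o_{1:T}), eq. 2.15
    dg = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # entry (i, j): alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        dg[t] = (alpha[t][:, None] * A) * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        dg[t] /= denom
    return dg
```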
7. *Train an HMM with the same parameter dimensions as above, i.e. A should be a 3 times 3 matrix, etc. Initialize your algorithm with the following matrices:*
**Does the algorithm converge?** Ans: Yes
**How many observations do you need for the algorithm to converge?** Ans: $\approx 450$
**How can you define convergence?** Ans: We can define convergence as $P(\mathcal{O}|\lambda_i) - P(\mathcal{O}|\lambda_{i+1}) \to 0$ as $i \to \infty$.
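In code, such a convergence test could look like the following sketch (the tolerance value is my own choice, not from the assignment):

```python
def has_converged(log_likelihoods, tol=1e-6):
    """Stop when the improvement in log P(O | lambda) between iterations drops below tol."""
    return (len(log_likelihoods) >= 2
            and abs(log_likelihoods[-1] - log_likelihoods[-2]) < tol)
```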
8. *Train an HMM with the same parameter dimensions as above, i.e. A is a 3x3 matrix etc. The initialization is left up to you.*
**How close do you get to the parameters above, i.e. how close do you get to the generating parameters in Eq. 3.1?**
Ans: There seems to be something wrong with hmm_c_10000.in, since training always converges with the initial vector $[0, 1, 0]$; otherwise the estimates are close to the generating parameters.
**What is the problem when it comes to estimating the distance between these matrices?**
Ans: The hidden states are only identified up to relabeling, so a hidden state $X_i$ in one model does not necessarily correspond to the state $X'_i$ in the other matrix.
**How can you solve these issues?** Ans: You can compare the traces of the matrices to obtain a lower bound on how dissimilar they are.
Given an observation sequence, you can also compare the probability each model assigns to it and check whether the difference is significant.
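Another option is to make the comparison invariant to relabeling of the hidden states; a sketch (the helper is my own, not part of the assignment):

```python
import numpy as np
from itertools import permutations

def permutation_min_distance(A_true, A_est):
    """Minimum Frobenius distance over all relabelings of the hidden states."""
    N = A_true.shape[0]
    best = np.inf
    for perm in permutations(range(N)):
        P = A_est[np.ix_(perm, perm)]        # relabel both rows and columns
        best = min(best, np.linalg.norm(A_true - P))
    return best
```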
9. *Train an HMM with different numbers of hidden states.*
**What happens if you use more or less than 3 hidden states?** **Why?**
Ans:
- Fewer (1 or 2 states): converged faster, lower probability for the training sequence, higher for the control sequence
- More (4 or 5 states): converged slower, higher probability for the training sequence, lower for the control sequence
**Are three hidden states and four observations the best choice? If not, why?**
Ans: Probably yes; with three hidden states and four observations the learned model comes out similar to the actual generating matrices.
**How can you determine the optimal setting? How does this depend on the amount of data you have?**
Ans: By checking for over-/underfitting: evaluate the probability of a held-out control sequence under each candidate model (see the sketch below). The more data you have, the more parameters (hidden states) you can support before overfitting.
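A sketch of that model-selection loop; `baum_welch` and `log_likelihood` are placeholders for your own implementations, not functions from the assignment:

```python
def select_num_states(train_obs, control_obs, candidates=(1, 2, 3, 4, 5)):
    """Pick the number of hidden states that maximizes held-out log-likelihood."""
    best_N, best_score = None, float("-inf")
    for N in candidates:
        model = baum_welch(train_obs, num_states=N)   # placeholder: fit on training data
        score = log_likelihood(model, control_obs)    # placeholder: evaluate on held-out data
        if score > best_score:
            best_N, best_score = N, score
    return best_N
```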
10. *Initialize your Baum-Welch algorithm with a uniform distribution. How does this affect the learning?*
It converges basically immediately (two iterations), $\pi$ and $A$ don't change and $B$ only changes slightly.
**Initialize your Baum-Welch algorithm with a diagonal $A$ matrix and $π = [0, 0, 1]$. How does this affect the learning?**
($B$ is the original $B$ matrix.)
With an exactly diagonal $A$ matrix: division by zero.
With $10^{-5}$ instead of zeros: fairly quick divergence. $\pi$ doesn't change, and the values for $A$ and $B$ are strange:
$$\pi = [0.0, 0.0, 1.0]$$
$$
A =
\begin{pmatrix}
0.75859 & 0.0 & 0.2414 \\
0.0 & 1.0 & 0.0 \\
0.18836 & 0.00421 & 0.80743
\end{pmatrix}
$$
$$
B =
\begin{pmatrix}
0.63075 & 0.29949 & 0.06972 & 4 \cdot 10^{-5} \\
0.28294 & 0.41594 & 0.17457 & 0.12654 \\
0.02225 & 0.20261 & 0.27208 & 0.50306
\end{pmatrix}
$$
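The "almost diagonal" initialization described above could be built like this (a sketch; only the $10^{-5}$ value comes from the text, the rest is illustrative):

```python
import numpy as np

eps = 1e-5
A0 = np.full((3, 3), eps)              # replace the exact zeros with a small epsilon
np.fill_diagonal(A0, 1.0)
A0 /= A0.sum(axis=1, keepdims=True)    # renormalize so A stays row-stochastic
pi0 = np.array([0.0, 0.0, 1.0])
```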
**Initialize your Baum-Welch algorithm with matrices that are close to the solution. How does this affect the learning?**
Few iterations to convergence, and the result looks fairly close to the true model.