# DD2380 HMM

1. *This problem can be formulated in matrix form. Please specify the initial probability vector π, the transition probability matrix A and the observation probability matrix B.*

$$\pi = [0.5, 0.5]$$

$$A = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}$$

$$B = \begin{pmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{pmatrix}$$

2. *First, we need to multiply the transition matrix with our current estimate of states. What is the result of this operation?*

The predicted probability distribution over the hidden states at the next time step.

3. *In the following, the result must be multiplied with the observation matrix. What is the result of this operation?*

The predicted probability distribution over the observation at the next time step.

4. *Why is it valid to substitute $O_{1:t} = o_{1:t}$ with $O_t = o_t$ when we condition on the state $X_t = x_i$?*

Because of the Markov property: if we know $X_t$, then $O_t$ is conditionally independent of $O_{1:t-1}$.

5. *How many values are stored in the matrices $\delta$ and $\delta^{idx}$ respectively?*

$\delta$: $T \times N$ values. $\delta^{idx}$: $(T-1) \times N$ values.

6. *Why do we need to divide by the sum over the final α values for the di-gamma function?*

$$P(X_t = x_i, X_{t+1} = x_j \mid O_{1:T} = o_{1:T}) = \frac{P(X_t = x_i, X_{t+1} = x_j \wedge O_{1:T} = o_{1:T})}{P(O_{1:T} = o_{1:T})}$$

The sum over the final $\alpha$ values is $\sum_{j=1}^N \alpha_T(j) = P(O_{1:T} = o_{1:T})$ (eq. 2.15), which is exactly the denominator above.

7. *Train an HMM with the same parameter dimensions as above, i.e. A should be a 3 times 3 matrix, etc. Initialize your algorithm with the following matrices:*

**Does the algorithm converge?**

Ans: Yes.

**How many observations do you need for the algorithm to converge?**

Ans: Approximately 450.

**How can you define convergence?**

Ans: We can define convergence as $P(\mathcal{O} \mid \lambda_i) - P(\mathcal{O} \mid \lambda_{i+1}) \to 0$ as $i \to \infty$.

8. *Train an HMM with the same parameter dimensions as above, i.e. A is a 3x3 matrix etc. The initialization is left up to you.*

**How close do you get to the parameters above, i.e. how close do you get to the generating parameters in Eq. 3.1?**

Ans: There seems to be something wrong with hmm_c_10000.in, since training on it always converges with the initial vector $[0, 1, 0]$; otherwise the result is almost correct.

**What is the problem when it comes to estimating the distance between these matrices?**

Ans: The hidden states might be permuted, so that a hidden state $X_i$ in one model does not correspond to the state $X'_i$ in the other model.

**How can you solve these issues?**

Ans: You can compare the traces of the matrices to set a lower bound on how dissimilar they are. If you are given an observation sequence, you can also compute its probability under each model and check whether the likelihoods differ significantly.

9. *Train an HMM with different numbers of hidden states.*

**What happens if you use more or less than 3 hidden states? Why?**

Ans:

- Fewer (1 or 2 states): converges faster; lower probability for the training sequence, higher for the control sequence.
- More (4 or 5 states): converges slower; higher probability for the training sequence, lower for the control sequence.

**Are three hidden states and four observations the best choice? If not, why?**

Ans: Probably yes; the estimated parameters come out similar to the actual matrices.

**How can you determine the optimal setting? How does this depend on the amount of data you have?**

Ans: By checking for over-/underfitting: evaluate the probability of a held-out control sequence under each candidate model. The less data you have, the fewer hidden states you can reliably estimate before the model starts to overfit.

10. *Initialize your Baum-Welch algorithm with a uniform distribution. How does this affect the learning?*

It converges basically immediately (two iterations); $\pi$ and $A$ don't change, and $B$ only changes slightly.
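A minimal sketch of why this happens (the dimensions and the observation sequence below are made up for illustration): with perfectly uniform parameters, $\alpha$, $\beta$ and $\gamma$ are identical for every state, so a single unscaled Baum-Welch re-estimation step reproduces $\pi$ and $A$ exactly, while every row of $B$ collapses to the empirical symbol frequencies, after which nothing changes any more.

```python
import numpy as np

N, M = 3, 4                                      # hidden states, observation symbols
obs = np.array([0, 1, 3, 2, 1, 0, 2, 3, 1, 2])   # made-up observation sequence
T = len(obs)

# Perfectly uniform initialization
pi = np.full(N, 1.0 / N)
A = np.full((N, N), 1.0 / N)
B = np.full((N, M), 1.0 / M)

# Forward pass (alpha); no scaling needed for this short sequence
alpha = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

# Backward pass (beta)
beta = np.zeros((T, N))
beta[-1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

# gamma and di-gamma
gamma = alpha * beta / (alpha * beta).sum(axis=1, keepdims=True)
digamma = np.zeros((T - 1, N, N))
for t in range(T - 1):
    num = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
    digamma[t] = num / num.sum()

# One re-estimation step
new_pi = gamma[0]
new_A = digamma.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
new_B = np.zeros((N, M))
for k in range(M):
    new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)

print(new_pi)   # still uniform: gamma_1(i) = 1/N for every i
print(new_A)    # still uniform: every di-gamma_t(i, j) = 1/N^2
print(new_B)    # every row equals the empirical symbol frequencies
```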
**Initialize your Baum-Welch algorithm with a diagonal $A$ matrix and $\pi = [0, 0, 1]$. How does this affect the learning?** (B is the original B matrix.)

With an exactly diagonal $A$ we get division by zero. With $10^{-5}$ in place of the zeros, the estimate quickly diverges from the true parameters: $\pi$ doesn't change, and the values for $A$ and $B$ are strange:

$$\pi = [0,\ 0,\ 1]$$

$$A = \begin{pmatrix} 0.75859 & 0.0 & 0.2414 \\ 0.0 & 1.0 & 0.0 \\ 0.18836 & 0.00421 & 0.80743 \end{pmatrix}$$

$$B = \begin{pmatrix} 0.63075 & 0.29949 & 0.06972 & 0.00004 \\ 0.28294 & 0.41594 & 0.17457 & 0.12654 \\ 0.02225 & 0.20261 & 0.27208 & 0.50306 \end{pmatrix}$$

**Initialize your Baum-Welch algorithm with matrices that are close to the solution. How does this affect the learning?**

Only a few iterations are needed to converge, and the result looks fairly close to the true model.
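A small sketch of the two ingredients used here: building an initial guess close to the generating parameters by perturbing them, and the convergence test from question 7 based on the change in likelihood between iterations. The helper names are hypothetical, and the matrix below is only a placeholder standing in for the generating $A$ in Eq. 3.1.

```python
import numpy as np

def perturb_rows(matrix, noise=0.05, seed=0):
    """Add small positive noise to every entry, then renormalize each row
    so the matrix stays row-stochastic."""
    rng = np.random.default_rng(seed)
    noisy = matrix + noise * rng.random(matrix.shape)
    return noisy / noisy.sum(axis=1, keepdims=True)

def has_converged(log_likelihoods, tol=1e-6):
    """log P(O | lambda) has stopped improving between iterations."""
    return (len(log_likelihoods) >= 2
            and abs(log_likelihoods[-1] - log_likelihoods[-2]) < tol)

# Placeholder row-stochastic matrix standing in for the generating A
A_true = np.array([[0.70, 0.05, 0.25],
                   [0.10, 0.80, 0.10],
                   [0.20, 0.30, 0.50]])
A_init = perturb_rows(A_true)   # close to the solution, but not identical

# Inside the Baum-Welch loop one would append log P(O | lambda_i) after each
# iteration and stop once has_converged(log_likelihoods) is True.
```

Checking the change in the log-likelihood rather than in the raw probability avoids underflow for long observation sequences.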