# STDS project final submission

Origin link: https://hackmd.io/@pRVAMfZxS8yHgEyKfOo_vA/stds-final

Guryev Boris: @guru_ruru

## Hidden Markov Model

![](https://i.imgur.com/1dXW0Wp.png)

where

**Observed** variables: $X = \{x_1, \dots, x_N\}$, $x_n \in \mathbb{R}^d$

**Hidden** (binary) variables: $T = \{t_1, \dots, t_N\}$, $t_n \in \{0, 1\}^K$

Assumption: the observation $x_n$ at time $n$ depends only on the hidden state $t_n$, which in turn depends only on $t_{n-1}$.

#### To define the model we need to specify:
* $p(x_n \mid t_n)$
* $p(t_n \mid t_{n-1})$
* $p(t_1)$

All parameters: $\theta = \{\pi, A, \phi\}$, where
* $\pi$ - probabilities of the initial state
* $A$ - transition matrix
* $\phi$ - mapping from a state to the parameters of its emission distribution
* $B$ - mapping from a state to the outcome probabilities, used when observations are discrete and $\phi$ is redundant.

#### Possible applications **[[1]](http://www.machinelearning.ru/wiki/images/8/83/GM12_3.pdf)**:

1. **Supervised learning**
    Given: observations $X = \{x_1, \dots, x_N\}$ and hidden states $T = \{t_1, \dots, t_N\}$
    Goal: find the maximum-likelihood estimate of $\theta$
2. **Segmentation (Viterbi algorithm)**
    Given: observations $X$ and parameters $\theta$
    Goal: find $T = \arg\max_T \ p(T \mid X, \theta)$
3. **Unsupervised learning (EM algorithm)**
    Given: $X$ and $K$, the number of states
    Goal: approximate $\theta$
4. **Prediction**
    Given: observations $X = \{x_1, \dots, x_N\}$
    Goal: predict $x_{N+1}$, i.e. estimate $p(x_{N+1} \mid X)$

**Example (unsupervised learning)** **[[2]](http://www.machinelearning.ru/wiki/images/d/d8/Lecture5.pdf)**:

![](https://i.imgur.com/QIUlXCZ.png)

This task is very close to fitting a Gaussian Mixture Model (GMM), but the HMM additionally takes temporal locality into account, which makes it an expressive model for time series.

Results:

| Value | Original | Approximated |
| -------- | -------- | -------- |
| $A$ | ![](https://i.imgur.com/nS3EP4t.png) | ![](https://i.imgur.com/JhLcwv0.png) |
| $\pi$ | $[0.3, 0.2, 0.5]$ | $[0, 0, 1]$ |
| $\mu$ | $[0, 0, 1]$ | $[-0.01, 0.1, 1]$ |
| $\sigma$ | $[0.1, 0.5, 0.1]$ | $[0.11, 0.51, 0.11]$ |

**Example (prediction, segmentation)** **[[3]](https://pdfs.semanticscholar.org/6b12/b9819f9e373a770d87d842ce2a5f5caf2a7a.pdf)**:

$$\alpha_j(t) = \sum^N_{i=1} \alpha_i(t-1) \cdot a_{ij}$$

where $j$ indexes the hidden state and $t$ the time step (the order of the observation). This recursion is known as the *forward procedure*; it lets us compute the *likelihood* of the next step given all previous ones. In matrix form (for the case when no observations are available yet) it can be written as:

$$\alpha(t) = \pi A^{t-1}$$

The Viterbi algorithm answers the next natural question: finding the most probable sequence of hidden states given a set of observations:

$$\delta_j(t) = \max_{1 \le i \le N}\left(\delta_i(t-1) \cdot a_{ij}\right)$$
$$\psi_j(t) = \arg\max_{1 \le i \le N}\left(\delta_i(t-1) \cdot a_{ij}\right)$$

The back-pointers $\psi$ record, for each state, which previous state maximizes the likelihood, so the most probable path is recovered by backtracking through $\psi$. At every step only the single best incoming transition per state is kept; [beam search](https://en.wikipedia.org/wiki/Beam_search), a derivative of breadth-first search, can be used as an alternative that keeps several candidate paths.
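To make the two recursions concrete, below is a minimal NumPy sketch of the forward procedure and of Viterbi decoding for an HMM with discrete emissions. It follows the notation above ($\pi$, $A$, and $b_j(\cdot)$ as the matrix `B`), but note that, unlike the simplified prediction formula, the sketch also weights each step by the emission probability $b_j(O(t))$, consistent with the training formulas below. The toy parameter values are made up for illustration and are not taken from the project code [[5]].

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward procedure: alpha[t, j] = p(O(1..t), state at time t = j | theta)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # initialization
    for t in range(1, T):
        # alpha_j(t) = sum_i alpha_i(t-1) * a_ij, weighted by the emission b_j(O(t))
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha                                     # p(O | theta) = alpha[-1].sum()

def viterbi(pi, A, B, obs):
    """Most probable state sequence: delta/psi recursion with backtracking."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A           # scores[i, j] = delta_i(t-1) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]                 # backtrack through psi
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy example (values are made up): 2 hidden states, 3 possible discrete observations.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 2]
print(forward(pi, A, B, obs)[-1].sum())              # p(O | theta)
print(viterbi(pi, A, B, obs))                        # most probable hidden state sequence
```

In practice the recursions are usually run with scaling or in log space to avoid numerical underflow on long sequences; the sketch omits this for brevity.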
**Algorithm for training the parameters:**

To train the HMM parameters, we first define some useful quantities:

$$\gamma_i(t) = \frac{\alpha_i(t) \cdot \beta_i(t)}{P(O \mid \theta)}$$

the *probability of being in state $S_i$ at time $t$, given that the sequence $O(1..T)$ is observed*. Next, we define $\xi_{ij}(t)$ as the *probability of being in state $S_i$ at time $t$ and making a transition to state $S_j$ at time $t+1$, given that $O(1..T)$ is observed*:

$$\xi_{ij}(t) = \frac{\alpha_i(t) \cdot a_{ij} \cdot b_j(O(t+1)) \cdot \beta_j(t+1)}{P(O \mid \theta)}$$

The re-estimation formulas are then:

$$\hat{a}_{ij} = \frac{\sum^W_{v=1}\sum^{T-1}_{t=1}\xi^v_{ij}(t)}{\sum^W_{v=1}\sum^{T-1}_{t=1}\gamma^v_{i}(t)}$$

$$\hat{b}_{j}(k) = \frac{\sum^W_{v=1}\sum^{T}_{t=1,\, O^v(t)=k}\gamma^v_{j}(t)}{\sum^W_{v=1}\sum^{T}_{t=1}\gamma^v_{j}(t)}$$

$$\hat{\pi}_i = \frac{1}{W} \sum^W_{v=1}\gamma^v_i(1)$$

where $W$ denotes the number of words (training sequences) and $O^v$ is the $v$-th observation sequence.

### Speech recognition with HMM **[[3]](https://pdfs.semanticscholar.org/6b12/b9819f9e373a770d87d842ce2a5f5caf2a7a.pdf)**

Sound can be viewed as a time series, so the common way to process it is to divide the sequence into chunks (windows) of equal length (with or without *overlap*) and extract features from each chunk in the time, frequency, or discrete domain (the paper uses *Slope*, *$\Delta$Slope* and *$\Delta$Energy*). These features are our *observations*.

The rest of this section describes the recognition of 10 spoken digit words. Each word is a sequence of chunks with several features per chunk, so there are different ways to handle the multi-dimensional HMM training case:
1. Train a separate HMM per feature and then fuse the resulting predictions.
2. Adapt the formulas above to the multidimensional case (one extra summation loop).

We follow the first approach. Once we have the 10 trained models $\theta_d$, one for each spoken digit $d$, recognition consists of running the forward algorithm: we take the 10 HMMs one by one and run the algorithm on the observation sequence $O(1..T)$. The 10 HMMs yield the probabilities $p(O \mid \theta_d)$, indicating how well each model explains the observation sequence. We then pick the model with the highest probability and label the spoken word with that model's digit. To fuse the probabilities of our features (denoted $S$, $\Delta S$ and $\Delta E$) we compute:

$$ p(O \mid \theta_d) = p(O^S \mid \theta^S_d) \cdot p(O^{\Delta S} \mid \theta^{\Delta S}_d) \cdot p(O^{\Delta E} \mid \theta^{\Delta E}_d) $$

A minimal sketch of this scoring step is given after the references. Source code and results for this project can be found in [[5]](https://github.com/BorisAnimal/speech-recognition). In our case we used a publicly available audio dataset of fruit words, since the authors of the paper used a custom dataset consisting of 10 spoken Dutch digits.

#### Related to thesis

The authors of **[[6]](http://sci-hub.se/https://dl.acm.org/citation.cfm?doid=3267305.3267515)** did almost the same as described above: they trained an HMM for each label and, given a sequence such as

![](https://i.imgur.com/arpApgz.png)

decided which model is most likely to have generated it. Note that they used hierarchical windows: they classified 5-second windows but had to evaluate 60-second windows, so they needed a probabilistic model to turn the window-level predictions into a single decision.

[THE END.](https://i.imgur.com/SQsQC8P.jpg)

## References

1. [HMM Lecture #3 slides](http://www.machinelearning.ru/wiki/images/8/83/GM12_3.pdf)
2. [HMM Lecture #5 slides](http://www.machinelearning.ru/wiki/images/d/d8/Lecture5.pdf)
3. [Markov models and their application in speech recognition](https://pdfs.semanticscholar.org/6b12/b9819f9e373a770d87d842ce2a5f5caf2a7a.pdf)
4. [Beam search](https://en.wikipedia.org/wiki/Beam_search)
5. [Source code](https://github.com/BorisAnimal/speech-recognition)
6. [Applying Multiple Knowledge to Sussex-Huawei Locomotion Challenge](http://sci-hub.se/https://dl.acm.org/citation.cfm?doid=3267305.3267515)
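As referenced in the recognition section above, here is a minimal sketch of the scoring and fusion step, assuming that for every digit $d$ and every feature stream we already have a trained HMM exposing a forward log-likelihood. The names `models`, `log_likelihood`, and the feature keys are hypothetical and only illustrate the structure; the actual implementation in [[5]] may be organized differently.

```python
import numpy as np

# Hypothetical container: models[d][f] is a trained HMM for digit d and feature stream f,
# exposing log_likelihood(obs) = log p(O^f | theta^f_d), computed by the forward procedure.
FEATURES = ["S", "dS", "dE"]   # Slope, Delta-Slope, Delta-Energy (key names assumed)

def recognize(models, feature_obs):
    """Score a spoken word against every digit model and fuse the feature streams.

    feature_obs: dict mapping a feature name to its observation sequence O^f(1..T).
    Returns the digit whose fused likelihood is highest.
    """
    best_digit, best_score = None, -np.inf
    for digit, per_feature in models.items():
        # log p(O | theta_d) = sum_f log p(O^f | theta^f_d), i.e. the product in log space
        score = sum(per_feature[f].log_likelihood(feature_obs[f]) for f in FEATURES)
        if score > best_score:
            best_digit, best_score = digit, score
    return best_digit
```

Summing log-likelihoods corresponds to the product $p(O^S \mid \theta^S_d) \cdot p(O^{\Delta S} \mid \theta^{\Delta S}_d) \cdot p(O^{\Delta E} \mid \theta^{\Delta E}_d)$ while avoiding numerical underflow on long observation sequences.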