<!-- ## Monotonic attention

\begin{align}
\alpha_{i,j} &= p_{i,j}\sum_{k=1}^{j}\left(\alpha_{i-1,k}\prod_{l=k}^{j-1}(1-p_{i,l})\right) \\
&= p_{i,j}\left(\sum_{k=1}^{j-1}\left(\alpha_{i-1,k}\prod_{l=k}^{j-1}(1-p_{i,l})\right)+\alpha_{i-1,j}\right) \\
&= p_{i,j}\left((1-p_{i,j-1})\sum_{k=1}^{j-1}\left(\alpha_{i-1,k}\prod_{l=k}^{j-2}(1-p_{i,l})\right)+\alpha_{i-1,j}\right) \\
&= p_{i,j}\left((1-p_{i,j-1})\frac{\alpha_{i,j-1}}{p_{i,j-1}}+\alpha_{i-1,j}\right) \\
q_{i,j} &= \frac{\alpha_{i,j}}{p_{i,j}}=(1-p_{i,j-1})q_{i,j-1}+\alpha_{i-1,j}
\end{align}
-->

# SSNT Recursion Derivation

## Forward

Let $p_{i,j}$ be the emission probability and $p(y_j|h_i,s_j)$ the word prediction probability. According to eqs. (9) and (12), the forward variable $\alpha_{i,j}$ can be expressed as:

$$
\alpha_{i,j}=p(y_j|h_i,s_j)\,p_{i,j}\sum_{k=1}^{i}\left(\alpha_{k,j-1}\prod_{d=k}^{i-1}(1-p_{d,j})\right)
$$

Take the last term of the summation ($k=i$) out; its inner product is empty and therefore equals $1$:

$$
p(y_j|h_i,s_j)\,p_{i,j}\left(\sum_{k=1}^{i-1}\left(\alpha_{k,j-1}\prod_{d=k}^{i-1}(1-p_{d,j})\right)+\alpha_{i,j-1}\right)
$$

Take the last factor of the product ($d=i-1$) out:

$$
p(y_j|h_i,s_j)\,p_{i,j}\left((1-p_{i-1,j})\sum_{k=1}^{i-1}\left(\alpha_{k,j-1}\prod_{d=k}^{i-2}(1-p_{d,j})\right)+\alpha_{i,j-1}\right)
$$

Recognize that the remaining summation is exactly $\alpha_{i-1,j}$, up to a scaling by its own word prediction and emission probabilities, thus:

$$
p(y_j|h_i,s_j)\,p_{i,j}\left((1-p_{i-1,j})\frac{\alpha_{i-1,j}}{p(y_j|h_{i-1},s_j)\,p_{i-1,j}}+\alpha_{i,j-1}\right)
$$

Let $q_{i,j}=\frac{\alpha_{i,j}}{p(y_j|h_{i},s_j)\,p_{i,j}}$; then:

$$
q_{i,j}=(1-p_{i-1,j})\,q_{i-1,j}+\alpha_{i,j-1}
$$

This allows us to compute $\alpha_{i,j}$ directly from $\alpha_{i-1,j}$ and $\alpha_{i,j-1}$.

### Parallelize

Furthermore, every vector $q_{j}$ (over $i$, at output step $j$) can be computed directly from the vector $\alpha_{j-1}$ via parallelizable cumulative-sum and cumulative-product operations:

$$
q_{j}=\operatorname{cumprod}(1-p_{j})\,\operatorname{cumsum}\left(\frac{\alpha_{j-1}}{\operatorname{cumprod}(1-p_{j})}\right)
$$

where the cumulative product is exclusive. See eqs. (23)-(30) in [Online and Linear-Time Attention by Enforcing Monotonic Alignments](https://arxiv.org/abs/1704.00784) for the derivation of the original formula.

<!-- ### FastEmit

Following the idea of FastEmit, we can modify the recursion to encourage earlier emission:

$$
q_{i,j}=(1-p_{i-1,j})q_{i-1,j}+(1+\lambda)\alpha_{i,j-1}
$$

This in turn yields

$$
q_{j}=\operatorname{cumprod}(1-p_{j})\,\operatorname{cumsum}\left(\frac{(1+\lambda)\alpha_{j-1}}{\operatorname{cumprod}(1-p_{j})}\right)
$$
-->

## Backward

$$
\beta_{i,j}=\sum_{k=i}^{I}\left(\left(\prod_{d=i}^{k-1}(1-p_{d,j+1})\right)p_{k,j+1}\,\beta_{k,j+1}\,p(y_{j+1}|h_k,s_{j+1})\right)
$$

Take the first term of the summation ($k=i$) out; its inner product is empty and therefore equals $1$:

$$
p_{i,j+1}\,\beta_{i,j+1}\,p(y_{j+1}|h_i,s_{j+1})+\sum_{k=i+1}^{I}\left(\left(\prod_{d=i}^{k-1}(1-p_{d,j+1})\right)p_{k,j+1}\,\beta_{k,j+1}\,p(y_{j+1}|h_k,s_{j+1})\right)
$$

Take the first factor of the product ($d=i$) out:

$$
p_{i,j+1}\,\beta_{i,j+1}\,p(y_{j+1}|h_i,s_{j+1})+(1-p_{i,j+1})\sum_{k=i+1}^{I}\left(\left(\prod_{d=i+1}^{k-1}(1-p_{d,j+1})\right)p_{k,j+1}\,\beta_{k,j+1}\,p(y_{j+1}|h_k,s_{j+1})\right)
$$

Recognize that the remaining summation is exactly $\beta_{i+1,j}$, thus:

$$
\beta_{i,j}=p_{i,j+1}\,\beta_{i,j+1}\,p(y_{j+1}|h_i,s_{j+1})+(1-p_{i,j+1})\,\beta_{i+1,j}
$$

This allows us to compute $\beta_{i,j}$ directly from $\beta_{i+1,j}$ and $\beta_{i,j+1}$. It also makes the similarity between SSNT and RNN-T apparent: the recursion has the same two-branch structure as the RNN-T backward recursion, with $p_{i,j+1}\,p(y_{j+1}|h_i,s_{j+1})$ playing the role of the label probability and $(1-p_{i,j+1})$ that of the blank probability.
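To make both recursions concrete, here is a small NumPy sanity check. It is a sketch, not an implementation of any particular model: `emit[i, j]` stands for $p_{i,j}$ and `pred[i, j]` for $p(y_j|h_i,s_j)$ (both filled with random values, since boundary conditions are model-specific and left out), and the 1-based math indices map to 0-based array indices. It seeds one column of $\alpha$ (respectively $\beta$) arbitrarily and verifies that each recursion reproduces the naive summation it was derived from.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 6, 5                            # encoder frames x output words (toy sizes)
emit = rng.uniform(0.1, 0.9, (I, J))   # emit[i, j] ~ p_{i,j}
pred = rng.uniform(0.1, 0.9, (I, J))   # pred[i, j] ~ p(y_j | h_i, s_j)
word = emit * pred                     # per-cell weight p(y_j|h_i,s_j) * p_{i,j}

# --- Forward: fix an output step j and an arbitrary previous column alpha_{.,j-1}.
j = 2
alpha_prev = rng.uniform(size=I)       # stands in for alpha_{., j-1}

# Naive summation, straight from the definition:
#   alpha_{i,j} = word_{i,j} * sum_{k<=i} alpha_{k,j-1} prod_{d=k}^{i-1} (1 - p_{d,j})
alpha_naive = np.array([
    word[i, j] * sum(alpha_prev[k] * np.prod(1 - emit[k:i, j]) for k in range(i + 1))
    for i in range(I)
])

# Recursion: q_{i,j} = (1 - p_{i-1,j}) q_{i-1,j} + alpha_{i,j-1}, then alpha = word * q.
q = np.empty(I)
q[0] = alpha_prev[0]                   # first frame: empty product, no q_{i-1,j} term
for i in range(1, I):
    q[i] = (1 - emit[i - 1, j]) * q[i - 1] + alpha_prev[i]
assert np.allclose(alpha_naive, word[:, j] * q)

# --- Backward: fix j and an arbitrary next column beta_{., j+1}.
beta_next = rng.uniform(size=I)        # stands in for beta_{., j+1}

# Naive summation:
#   beta_{i,j} = sum_{k>=i} (prod_{d=i}^{k-1} (1 - p_{d,j+1})) word_{k,j+1} beta_{k,j+1}
beta_naive = np.array([
    sum(np.prod(1 - emit[i:k, j + 1]) * word[k, j + 1] * beta_next[k] for k in range(i, I))
    for i in range(I)
])

# Recursion: beta_{i,j} = word_{i,j+1} beta_{i,j+1} + (1 - p_{i,j+1}) beta_{i+1,j}.
beta = np.empty(I)
beta[-1] = word[-1, j + 1] * beta_next[-1]   # last frame: only the k = I term survives
for i in range(I - 2, -1, -1):
    beta[i] = word[i, j + 1] * beta_next[i] + (1 - emit[i, j + 1]) * beta[i + 1]
assert np.allclose(beta_naive, beta)
```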
### Parallelize

Rearrange the recursion:

$$
\beta_{i,j}-p_{i,j+1}\,\beta_{i,j+1}\,p(y_{j+1}|h_i,s_{j+1})=(1-p_{i,j+1})\,\beta_{i+1,j}
$$

Divide both sides by $\prod_{k=j}^{J-1}p_{i,k+1}\,p(y_{k+1}|h_i,s_{k+1})$:

$$
\frac{\beta_{i,j}}{\prod_{k=j}^{J-1}p_{i,k+1}\,p(y_{k+1}|h_i,s_{k+1})}-\frac{p_{i,j+1}\,\beta_{i,j+1}\,p(y_{j+1}|h_i,s_{j+1})}{\prod_{k=j}^{J-1}p_{i,k+1}\,p(y_{k+1}|h_i,s_{k+1})}=\frac{(1-p_{i,j+1})\,\beta_{i+1,j}}{\prod_{k=j}^{J-1}p_{i,k+1}\,p(y_{k+1}|h_i,s_{k+1})}
$$

Cancel the $k=j$ factor in the second term on the left:

$$
\frac{\beta_{i,j}}{\prod_{k=j}^{J-1}p_{i,k+1}\,p(y_{k+1}|h_i,s_{k+1})}-\frac{\beta_{i,j+1}}{\prod_{k=j+1}^{J-1}p_{i,k+1}\,p(y_{k+1}|h_i,s_{k+1})}=\frac{(1-p_{i,j+1})\,\beta_{i+1,j}}{\prod_{k=j}^{J-1}p_{i,k+1}\,p(y_{k+1}|h_i,s_{k+1})}
$$

Sum both sides over $l=j,\dots,J-1$; the left side telescopes:

$$
\sum_{l=j}^{J-1}\left(\frac{\beta_{i,l}}{\prod_{k=l}^{J-1}p_{i,k+1}\,p(y_{k+1}|h_i,s_{k+1})}-\frac{\beta_{i,l+1}}{\prod_{k=l+1}^{J-1}p_{i,k+1}\,p(y_{k+1}|h_i,s_{k+1})}\right)=\sum_{l=j}^{J-1}\frac{(1-p_{i,l+1})\,\beta_{i+1,l}}{\prod_{k=l}^{J-1}p_{i,k+1}\,p(y_{k+1}|h_i,s_{k+1})}
$$

$$
\frac{\beta_{i,j}}{\prod_{k=j}^{J-1}p_{i,k+1}\,p(y_{k+1}|h_i,s_{k+1})}-\beta_{i,J}=\sum_{l=j}^{J-1}\frac{(1-p_{i,l+1})\,\beta_{i+1,l}}{\prod_{k=l}^{J-1}p_{i,k+1}\,p(y_{k+1}|h_i,s_{k+1})}
$$

since the denominator of the $\beta_{i,J}$ term is an empty product, which equals $1$. Finally, with the boundary condition $\beta_{i,J}=1$:

$$
\beta_{i,j}=\prod_{k=j}^{J-1}p_{i,k+1}\,p(y_{k+1}|h_i,s_{k+1})\left(1+\sum_{l=j}^{J-1}\frac{(1-p_{i,l+1})\,\beta_{i+1,l}}{\prod_{k=l}^{J-1}p_{i,k+1}\,p(y_{k+1}|h_i,s_{k+1})}\right)
$$

This means every vector $\beta_{i}$ (over $j$) can be computed directly from the vector $\beta_{i+1}$ via reversed cumulative-sum and cumulative-product operations, mirroring the forward case.
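The following NumPy sketch (same hypothetical `emit`/`pred`/`word` naming as above) checks both parallel forms against their sequential recursions: the forward column $q_j$ via an exclusive cumulative product plus a cumulative sum, and the backward row $\beta_{i}$ via suffix (reversed) cumulative product and sum, using the boundary condition $\beta_{i,J}=1$ assumed by the last step above.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 6, 5
emit = rng.uniform(0.1, 0.9, (I, J))   # emit[i, j] ~ p_{i,j} (0-based arrays)
pred = rng.uniform(0.1, 0.9, (I, J))   # pred[i, j] ~ p(y_j | h_i, s_j)
word = emit * pred

# --- Forward: one column q_j from alpha_{j-1} with a single exclusive
# cumprod and a cumsum, instead of the sequential recursion over i.
j = 2
alpha_prev = rng.uniform(size=I)                            # arbitrary alpha_{., j-1}
c = np.concatenate([[1.0], np.cumprod(1 - emit[:-1, j])])   # c[i] = prod_{d<i}(1 - p_{d,j})
q_par = c * np.cumsum(alpha_prev / c)

q_seq = np.empty(I)
q_seq[0] = alpha_prev[0]
for i in range(1, I):
    q_seq[i] = (1 - emit[i - 1, j]) * q_seq[i - 1] + alpha_prev[i]
assert np.allclose(q_par, q_seq)

# --- Backward: one row beta_{i, .} from beta_{i+1, .} with suffix
# cumprod/cumsum, using the boundary condition beta_{i, J} = 1.
i = 3
w, e = word[i, :], emit[i, :]          # w[j] ~ p_{i,j+1} p(y_{j+1}|h_i,s_{j+1})
beta_next = rng.uniform(size=J)        # arbitrary beta_{i+1, .}

P = np.append(np.cumprod(w[::-1])[::-1], 1.0)   # P[j] = prod_{k=j}^{J-1} w[k], P[J] = 1
S = np.cumsum(((1 - e) * beta_next / P[:-1])[::-1])[::-1]   # suffix sums over l >= j
beta_par = P[:-1] * (1 + S)

beta_seq = np.empty(J + 1)
beta_seq[J] = 1.0                      # boundary condition beta_{i, J} = 1
for jj in range(J - 1, -1, -1):
    beta_seq[jj] = w[jj] * beta_seq[jj + 1] + (1 - e[jj]) * beta_next[jj]
assert np.allclose(beta_par, beta_seq[:-1])
```

Note that dividing by cumulative products can underflow or overflow for long sequences; a practical implementation would need to work in log space or clamp the denominators.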