<!-- ## Monotonic attention
\begin{align}
\alpha_{i,j}
&=p_{i,j}\sum_{k=1}^{j}\left(\alpha_{i-1,k}\prod_{l=k}^{j-1}(1-p_{i,l})\right) \\
&=p_{i,j}\left(\sum_{k=1}^{j-1}\left(\alpha_{i-1,k}\prod_{l=k}^{j-1}(1-p_{i,l})\right)+\alpha_{i-1,j}\right) \\
&=p_{i,j}\left((1-p_{i,j-1})\sum_{k=1}^{j-1}\left(\alpha_{i-1,k}\prod_{l=k}^{j-2}(1-p_{i,l})\right)+\alpha_{i-1,j}\right) \\
&=p_{i,j}\left((1-p_{i,j-1})\frac{\alpha_{i,j-1}}{p_{i,j-1}}+\alpha_{i-1,j}\right) \\
q_{i,j}
&=\frac{\alpha_{i,j}}{p_{i,j}}=(1-p_{i,j-1})q_{i,j-1}+\alpha_{i-1,j}
\end{align} -->
# SSNT Recursion Derivation
## Forward
Let $p_{i,j}$ denote the emission probability and $p(y_j|h_i,s_j)$ the word prediction probability. The forward variable $\alpha_{i,j}$ can then be expressed as follows, according to eqs. (9) and (12):
$$\alpha_{i,j}=p(y_j|h_i,s_j)p_{i,j}\sum_{k=1}^{i}\left(
\alpha_{k,j-1}\prod_{d=k}^{i-1}(1-p_{d,j})\right)
$$
Take the last term ($k=i$) out of the summation:
$$
p(y_j|h_i,s_j)p_{i,j}\left(\sum_{k=1}^{i-1}\left(
\alpha_{k,j-1}\prod_{d=k}^{i-1}(1-p_{d,j})\right) + \alpha_{i,j-1}\right)
$$
Take the last factor ($d=i-1$) out of the product:
$$
p(y_j|h_i,s_j)p_{i,j}\left((1-p_{i-1,j})\sum_{k=1}^{i-1}\left(
\alpha_{k,j-1}\prod_{d=k}^{i-2}(1-p_{d,j})\right) + \alpha_{i,j-1}\right)
$$
Recognize that the remaining summation is exactly the one appearing in the definition of $\alpha_{i-1,j}$, up to scaling by the word prediction and emission probabilities; thus:
$$
p(y_j|h_i,s_j)p_{i,j}\left((1-p_{i-1,j})\frac{\alpha_{i-1,j}}{p(y_j|h_{i-1},s_j)p_{i-1,j}} + \alpha_{i,j-1}\right)
$$
Let $q_{i,j}=\frac{\alpha_{i,j}}{p(y_j|h_{i},s_j)p_{i,j}}$, then:
$$
q_{i,j}=(1-p_{i-1,j})q_{i-1,j}+\alpha_{i,j-1}
$$
This allows us to compute $\alpha_{i,j}$ directly from $\alpha_{i-1,j}$ and $\alpha_{i,j-1}$.
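As a sanity check, the recursion can be verified numerically against the direct summation. The sketch below is a minimal NumPy implementation; the base case $\alpha_{k,0}=1$ iff $k=1$ (the alignment starts at the first encoder position) is an assumption for illustration, since the derivation above does not state it.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 6, 4                       # encoder positions, output tokens
# 1-based indexing: row/column 0 is padding, used only for the base case.
p = rng.uniform(0.1, 0.9, (I + 1, J + 1))  # p[i, j] = emission prob p_{i,j}
w = rng.uniform(0.1, 0.9, (I + 1, J + 1))  # w[i, j] = p(y_j | h_i, s_j)

# Assumed base case (not stated in the derivation above): the alignment
# starts at the first encoder position, so alpha[k, 0] = 1 iff k == 1.
alpha_direct = np.zeros((I + 1, J + 1))
alpha_direct[1, 0] = 1.0

# Direct evaluation of the summation formula: O(I^2) per output step.
for j in range(1, J + 1):
    for i in range(1, I + 1):
        total = 0.0
        for k in range(1, i + 1):
            # np.prod over an empty slice is 1, matching the empty product.
            total += alpha_direct[k, j - 1] * np.prod(1 - p[k:i, j])
        alpha_direct[i, j] = w[i, j] * p[i, j] * total

# Same quantity via the two-term recursion on q_{i,j}: O(I) per output step.
alpha = np.zeros((I + 1, J + 1))
alpha[1, 0] = 1.0
for j in range(1, J + 1):
    q = 0.0
    for i in range(1, I + 1):
        q = (1 - p[i - 1, j]) * q + alpha[i, j - 1]   # q_{i,j}
        alpha[i, j] = w[i, j] * p[i, j] * q           # alpha_{i,j}

assert np.allclose(alpha, alpha_direct)
```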
### Parallelize
Furthermore, for each output step $j$, the whole vector $q_{j}=(q_{1,j},\dots,q_{I,j})$ can be computed directly from the vector $\alpha_{j-1}$ via parallelizable cumulative-sum and cumulative-product operations. See eqs. (23)-(30) in [Online and Linear-Time Attention by Enforcing Monotonic Alignments](https://arxiv.org/abs/1704.00784) for the derivation of the original formula.
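Concretely, unrolling the recursion over $i$ gives $q_{i,j}=\left(\prod_{d<i}(1-p_{d,j})\right)\sum_{k\le i}\alpha_{k,j-1}\big/\prod_{d<k}(1-p_{d,j})$, i.e. one *exclusive* cumulative product and one cumulative sum per output step. A minimal NumPy sketch, checked against the sequential recursion (the base case $\alpha_{k,0}=1$ iff $k=1$ is again an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
I, J = 6, 4
p = rng.uniform(0.1, 0.9, (I + 1, J + 1))  # p[i, j] = emission prob p_{i,j}
w = rng.uniform(0.1, 0.9, (I + 1, J + 1))  # w[i, j] = p(y_j | h_i, s_j)

# Reference: the sequential two-term recursion, with assumed base case
# alpha[k, 0] = 1 iff k == 1.
alpha_seq = np.zeros((I + 1, J + 1))
alpha_seq[1, 0] = 1.0
for j in range(1, J + 1):
    q = 0.0
    for i in range(1, I + 1):
        q = (1 - p[i - 1, j]) * q + alpha_seq[i, j - 1]
        alpha_seq[i, j] = w[i, j] * p[i, j] * q

# Parallel form: one exclusive cumprod and one cumsum per output step j.
alpha_par = np.zeros((I + 1, J + 1))
alpha_par[1, 0] = 1.0
for j in range(1, J + 1):
    # cp[i-1] = prod_{d=1}^{i-1} (1 - p[d, j])  (exclusive cumulative product)
    cp = np.cumprod(np.concatenate([[1.0], 1 - p[1:I, j]]))
    q = cp * np.cumsum(alpha_par[1:, j - 1] / cp)
    alpha_par[1:, j] = w[1:, j] * p[1:, j] * q

assert np.allclose(alpha_par, alpha_seq)
```

Note that the division by the cumulative product underflows for long input sequences, so practical implementations typically evaluate this in log space.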
<!-- ### FastEmit
Following the idea of FastEmit, we can modify the recursion to encourage earlier emission:
$$
q_{i,j}=(1-p_{i-1,j})q_{i-1,j}+(1+\lambda)\alpha_{i,j-1}
$$
This in turn yields
$$
q_{j}=cumprod(1-p_{j})cumsum\left(\frac{(1+\lambda)\alpha_{j-1}}{cumprod(1-p_{j})}\right)
$$ -->
## Backward
Similarly, the backward variable $\beta_{i,j}$ satisfies:
$$
\beta_{i,j}=\sum_{k=i}^{I}\left(
\left(
\prod_{d=i}^{k-1}(1-p_{d,j+1})
\right)
p_{k,j+1}
\beta_{k,j+1}
p(y_{j+1}|h_k,s_{j+1})
\right)
$$
Take the first term ($k=i$) out of the summation:
$$
p_{i,j+1} \beta_{i,j+1} p(y_{j+1}|h_i,s_{j+1})
+
\sum_{k=i+1}^{I}\left(
\left(
\prod_{d=i}^{k-1}(1-p_{d,j+1})
\right)
p_{k,j+1}
\beta_{k,j+1}
p(y_{j+1}|h_k,s_{j+1})
\right)
$$
Take the first factor ($d=i$) out of the product:
$$
p_{i,j+1} \beta_{i,j+1} p(y_{j+1}|h_i,s_{j+1})
+
(1-p_{i,j+1})
\sum_{k=i+1}^{I}\left(
\left(
\prod_{d=i+1}^{k-1}(1-p_{d,j+1})
\right)
p_{k,j+1}
\beta_{k,j+1}
p(y_{j+1}|h_k,s_{j+1})
\right)
$$
Recognize that the remaining summation is exactly the definition of $\beta_{i+1,j}$, thus:
$$
\beta_{i,j}=
p_{i,j+1} \beta_{i,j+1} p(y_{j+1}|h_i,s_{j+1})
+
(1-p_{i,j+1})\beta_{i+1,j}
$$
This allows us to compute $\beta_{i,j}$ directly from $\beta_{i+1,j}$ and $\beta_{i,j+1}$. It also makes the similarity between SSNT and RNN-T apparent.
### Parallelize
Move the $\beta_{i,j+1}$ term to the left-hand side:
$$
\beta_{i,j}
-p_{i,j+1} \beta_{i,j+1} p(y_{j+1}|h_i,s_{j+1})
=(1-p_{i,j+1})\beta_{i+1,j}
$$
Divide both sides by $\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})$:
$$
\frac{\beta_{i,j}}{\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
-\frac{p_{i,j+1} \beta_{i,j+1} p(y_{j+1}|h_i,s_{j+1})}{\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
=\frac{(1-p_{i,j+1})\beta_{i+1,j}}{\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
$$
Cancel the $k=j$ factor in the second term:
$$
\frac{\beta_{i,j}}{\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
-\frac{\beta_{i,j+1}}{\prod_{k=j+1}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
=\frac{(1-p_{i,j+1})\beta_{i+1,j}}{\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
$$
Sum over $l=j,\dots,J-1$:
$$
\sum_{l=j}^{J-1}
\left(
\frac{\beta_{i,l}}{\prod_{k=l}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
-\frac{\beta_{i,l+1}}{\prod_{k=l+1}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
\right)
=\sum_{l=j}^{J-1}
\left(
\frac{(1-p_{i,l+1})\beta_{i+1,l}}{\prod_{k=l}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
\right)
$$
The left-hand side telescopes; the $l=J-1$ term contributes $\beta_{i,J}$ divided by an empty product, i.e. $\beta_{i,J}$:
$$
\frac{\beta_{i,j}}{\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
-\beta_{i,J}
=\sum_{l=j}^{J-1}
\left(
\frac{(1-p_{i,l+1})\beta_{i+1,l}}{\prod_{k=l}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
\right)
$$
Using the boundary condition $\beta_{i,J}=1$ (all output tokens have been emitted) and solving for $\beta_{i,j}$:
$$
\beta_{i,j}=
\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})
\left(
1+\sum_{l=j}^{J-1}
\left(
\frac{(1-p_{i,l+1})\beta_{i+1,l}}{\prod_{k=l}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
\right)
\right)
$$
This expresses the whole vector $\beta_{i}$ in terms of $\beta_{i+1}$ through cumulative sums and products over $j$, mirroring the forward case.
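The backward recursion $\beta_{i,j}=p_{i,j+1}\,\beta_{i,j+1}\,p(y_{j+1}|h_i,s_{j+1})+(1-p_{i,j+1})\,\beta_{i+1,j}$ and its parallel closed form can be checked numerically the same way. The boundary conditions $\beta_{i,J}=1$ and $\beta_{I+1,j}=0$ are assumptions for illustration; the sketch computes one row at a time, vectorized over $j$ via reversed cumulative products and sums:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J = 5, 4                       # encoder positions, output tokens
# 1-based indexing with padding row I+1 and column 0 for convenience.
p = rng.uniform(0.1, 0.9, (I + 2, J + 2))  # p[i, j] = emission prob p_{i,j}
w = rng.uniform(0.1, 0.9, (I + 2, J + 2))  # w[i, j] = p(y_j | h_i, s_j)

# Reference: the two-term recursion, with assumed boundary conditions
# beta[i, J] = 1 (all tokens emitted) and beta[I+1, j] = 0.
beta_rec = np.zeros((I + 2, J + 2))
beta_rec[1:I + 1, J] = 1.0
for j in range(J - 1, -1, -1):
    for i in range(I, 0, -1):
        beta_rec[i, j] = (p[i, j + 1] * w[i, j + 1] * beta_rec[i, j + 1]
                          + (1 - p[i, j + 1]) * beta_rec[i + 1, j])

# Parallel closed form: row i from row i+1, vectorized over j.
beta_par = np.zeros((I + 2, J + 2))
beta_par[1:I + 1, J] = 1.0
for i in range(I, 0, -1):
    g = p[i, 1:J + 1] * w[i, 1:J + 1]   # g[j] = p_{i,j+1} p(y_{j+1}|h_i,s_{j+1})
    # P[j] = prod_{k=j}^{J-1} g[k], with the empty product P[J] = 1.
    P = np.concatenate([np.cumprod(g[::-1])[::-1], [1.0]])
    s = (1 - p[i, 1:J + 1]) * beta_par[i + 1, 0:J] / P[:J]
    # Reversed cumulative sum realizes sum_{l=j}^{J-1}.
    beta_par[i, 0:J] = P[:J] * (1 + np.cumsum(s[::-1])[::-1])

assert np.allclose(beta_par, beta_rec)
```

As in the forward case, the division by the running product underflows for long utterances, so a practical implementation would evaluate this in log space.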