<!-- ## Monotonic attention
\begin{align}
\alpha_{i,j}
&=p_{i,j}\sum_{k=1}^{j}\left(\alpha_{i-1,k}\prod_{l=k}^{j-1}(1-p_{i,l})\right) \\
&=p_{i,j}\left(\sum_{k=1}^{j-1}\left(\alpha_{i-1,k}\prod_{l=k}^{j-1}(1-p_{i,l})\right)+\alpha_{i-1,j}\right) \\
&=p_{i,j}\left((1-p_{i,j-1})\sum_{k=1}^{j-1}\left(\alpha_{i-1,k}\prod_{l=k}^{j-2}(1-p_{i,l})\right)+\alpha_{i-1,j}\right) \\
&=p_{i,j}\left((1-p_{i,j-1})\frac{\alpha_{i,j-1}}{p_{i,j-1}}+\alpha_{i-1,j}\right) \\
q_{i,j}
&=\frac{\alpha_{i,j}}{p_{i,j}}=(1-p_{i,j-1})q_{i,j-1}+\alpha_{i-1,j}
\end{align} -->
# SSNT Recursion Derivation
## Forward
Let $p_{i,j}$ denote the emission probability and $p(y_j|h_i,s_j)$ the word prediction probability. The forward variable $\alpha_{i,j}$ can then be expressed as follows, according to eqs. (9) and (12):
$$\alpha_{i,j}=p(y_j|h_i,s_j)p_{i,j}\sum_{k=1}^{i}\left(
\alpha_{k,j-1}\prod_{d=k}^{i-1}(1-p_{d,j})\right)
$$
Take the last term ($k=i$) out of the summation:
$$
p(y_j|h_i,s_j)p_{i,j}\left(\sum_{k=1}^{i-1}\left(
\alpha_{k,j-1}\prod_{d=k}^{i-1}(1-p_{d,j})\right) + \alpha_{i,j-1}\right)
$$
Take the last factor ($d=i-1$) out of the product:
$$
p(y_j|h_i,s_j)p_{i,j}\left((1-p_{i-1,j})\sum_{k=1}^{i-1}\left(
\alpha_{k,j-1}\prod_{d=k}^{i-2}(1-p_{d,j})\right) + \alpha_{i,j-1}\right)
$$
Recognize that the remaining summation is exactly the one appearing in the definition of $\alpha_{i-1,j}$, up to scaling by the word prediction and emission probabilities; thus:
$$
p(y_j|h_i,s_j)p_{i,j}\left((1-p_{i-1,j})\frac{\alpha_{i-1,j}}{p(y_j|h_{i-1},s_j)p_{i-1,j}} + \alpha_{i,j-1}\right)
$$
Let $q_{i,j}=\frac{\alpha_{i,j}}{p(y_j|h_{i},s_j)p_{i,j}}$, then:
$$
q_{i,j}=(1-p_{i-1,j})q_{i-1,j}+\alpha_{i,j-1}
$$
This allows us to compute $\alpha_{i,j}$ directly from $\alpha_{i-1,j}$ and $\alpha_{i,j-1}$.
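As a sanity check, the recursion can be verified numerically against the direct summation. The sketch below is a minimal NumPy implementation; the base case $\alpha_{k,0}=1$ iff $k=1$ (the alignment starts at the first encoder position) is an assumption for illustration, since the derivation above does not state it.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 6, 4                       # encoder positions, output tokens
# 1-based indexing: row/column 0 is padding, used only for the base case.
p = rng.uniform(0.1, 0.9, (I + 1, J + 1))  # p[i, j] = emission prob p_{i,j}
w = rng.uniform(0.1, 0.9, (I + 1, J + 1))  # w[i, j] = p(y_j | h_i, s_j)

# Assumed base case (not stated in the derivation above): the alignment
# starts at the first encoder position, so alpha[k, 0] = 1 iff k == 1.
alpha_direct = np.zeros((I + 1, J + 1))
alpha_direct[1, 0] = 1.0

# Direct evaluation of the summation formula: O(I^2) per output step.
for j in range(1, J + 1):
    for i in range(1, I + 1):
        total = 0.0
        for k in range(1, i + 1):
            # np.prod over an empty slice is 1, matching the empty product.
            total += alpha_direct[k, j - 1] * np.prod(1 - p[k:i, j])
        alpha_direct[i, j] = w[i, j] * p[i, j] * total

# Same quantity via the two-term recursion on q_{i,j}: O(I) per output step.
alpha = np.zeros((I + 1, J + 1))
alpha[1, 0] = 1.0
for j in range(1, J + 1):
    q = 0.0
    for i in range(1, I + 1):
        q = (1 - p[i - 1, j]) * q + alpha[i, j - 1]   # q_{i,j}
        alpha[i, j] = w[i, j] * p[i, j] * q           # alpha_{i,j}

assert np.allclose(alpha, alpha_direct)
```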
### Parallelize
Furthermore, for each output step $j$, the whole vector $q_{j}=(q_{1,j},\dots,q_{I,j})$ can be computed directly from the vector $\alpha_{j-1}$ via parallelizable cumulative-sum and cumulative-product operations. See eqs. (23)-(30) in [Online and Linear-Time Attention by Enforcing Monotonic Alignments](https://arxiv.org/abs/1704.00784) for the derivation of the original formula.
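Concretely, unrolling the recursion over $i$ gives $q_{i,j}=\left(\prod_{d<i}(1-p_{d,j})\right)\sum_{k\le i}\alpha_{k,j-1}\big/\prod_{d<k}(1-p_{d,j})$, i.e. one *exclusive* cumulative product and one cumulative sum per output step. A minimal NumPy sketch, checked against the sequential recursion (the base case $\alpha_{k,0}=1$ iff $k=1$ is again an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
I, J = 6, 4
p = rng.uniform(0.1, 0.9, (I + 1, J + 1))  # p[i, j] = emission prob p_{i,j}
w = rng.uniform(0.1, 0.9, (I + 1, J + 1))  # w[i, j] = p(y_j | h_i, s_j)

# Reference: the sequential two-term recursion, with assumed base case
# alpha[k, 0] = 1 iff k == 1.
alpha_seq = np.zeros((I + 1, J + 1))
alpha_seq[1, 0] = 1.0
for j in range(1, J + 1):
    q = 0.0
    for i in range(1, I + 1):
        q = (1 - p[i - 1, j]) * q + alpha_seq[i, j - 1]
        alpha_seq[i, j] = w[i, j] * p[i, j] * q

# Parallel form: one exclusive cumprod and one cumsum per output step j.
alpha_par = np.zeros((I + 1, J + 1))
alpha_par[1, 0] = 1.0
for j in range(1, J + 1):
    # cp[i-1] = prod_{d=1}^{i-1} (1 - p[d, j])  (exclusive cumulative product)
    cp = np.cumprod(np.concatenate([[1.0], 1 - p[1:I, j]]))
    q = cp * np.cumsum(alpha_par[1:, j - 1] / cp)
    alpha_par[1:, j] = w[1:, j] * p[1:, j] * q

assert np.allclose(alpha_par, alpha_seq)
```

Note that the division by the cumulative product underflows for long input sequences, so practical implementations typically evaluate this in log space.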
<!-- ### FastEmit
Following the idea of FastEmit, we can modify the recursion to encourage earlier emission:
$$
q_{i,j}=(1-p_{i-1,j})q_{i-1,j}+(1+\lambda)\alpha_{i,j-1}
$$
This in turn yields
$$
q_{j}=cumprod(1-p_{j})cumsum\left(\frac{(1+\lambda)\alpha_{j-1}}{cumprod(1-p_{j})}\right)
$$ -->
## Backward
Similarly, the backward variable $\beta_{i,j}$ satisfies:
$$
\beta_{i,j}=\sum_{k=i}^{I}\left(
\left(
\prod_{d=i}^{k-1}(1-p_{d,j+1})
\right)
p_{k,j+1}
\beta_{k,j+1}
p(y_{j+1}|h_k,s_{j+1})
\right)
$$
Take the first term ($k=i$) out of the summation:
$$
p_{i,j+1} \beta_{i,j+1} p(y_{j+1}|h_i,s_{j+1})
+
\sum_{k=i+1}^{I}\left(
\left(
\prod_{d=i}^{k-1}(1-p_{d,j+1})
\right)
p_{k,j+1}
\beta_{k,j+1}
p(y_{j+1}|h_k,s_{j+1})
\right)
$$
Take the first factor ($d=i$) out of the product:
$$
p_{i,j+1} \beta_{i,j+1} p(y_{j+1}|h_i,s_{j+1})
+
(1-p_{i,j+1})
\sum_{k=i+1}^{I}\left(
\left(
\prod_{d=i+1}^{k-1}(1-p_{d,j+1})
\right)
p_{k,j+1}
\beta_{k,j+1}
p(y_{j+1}|h_k,s_{j+1})
\right)
$$
Recognize that the remaining summation is exactly the definition of $\beta_{i+1,j}$, thus:
$$
\beta_{i,j}=
p_{i,j+1} \beta_{i,j+1} p(y_{j+1}|h_i,s_{j+1})
+
(1-p_{i,j+1})\beta_{i+1,j}
$$
This allows us to compute $\beta_{i,j}$ directly from $\beta_{i+1,j}$ and $\beta_{i,j+1}$. It also makes the similarity between SSNT and RNN-T apparent.
### Parallelize
Move the $\beta_{i,j+1}$ term to the left-hand side:
$$
\beta_{i,j}
-p_{i,j+1} \beta_{i,j+1} p(y_{j+1}|h_i,s_{j+1})
=(1-p_{i,j+1})\beta_{i+1,j}
$$
Divide both sides by $\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})$:
$$
\frac{\beta_{i,j}}{\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
-\frac{p_{i,j+1} \beta_{i,j+1} p(y_{j+1}|h_i,s_{j+1})}{\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
=\frac{(1-p_{i,j+1})\beta_{i+1,j}}{\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
$$
Cancel the $k=j$ factor in the second term:
$$
\frac{\beta_{i,j}}{\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
-\frac{\beta_{i,j+1}}{\prod_{k=j+1}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
=\frac{(1-p_{i,j+1})\beta_{i+1,j}}{\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
$$
Sum over $l=j,\dots,J-1$:
$$
\sum_{l=j}^{J-1}
\left(
\frac{\beta_{i,l}}{\prod_{k=l}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
-\frac{\beta_{i,l+1}}{\prod_{k=l+1}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
\right)
=\sum_{l=j}^{J-1}
\left(
\frac{(1-p_{i,l+1})\beta_{i+1,l}}{\prod_{k=l}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
\right)
$$
The left-hand side telescopes; the $l=J-1$ term contributes $\beta_{i,J}$ divided by an empty product, i.e. $\beta_{i,J}$:
$$
\frac{\beta_{i,j}}{\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
-\beta_{i,J}
=\sum_{l=j}^{J-1}
\left(
\frac{(1-p_{i,l+1})\beta_{i+1,l}}{\prod_{k=l}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
\right)
$$
Using the boundary condition $\beta_{i,J}=1$ (all output tokens have been emitted) and solving for $\beta_{i,j}$:
$$
\beta_{i,j}=
\prod_{k=j}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})
\left(
1+\sum_{l=j}^{J-1}
\left(
\frac{(1-p_{i,l+1})\beta_{i+1,l}}{\prod_{k=l}^{J-1}p_{i,k+1}p(y_{k+1}|h_i,s_{k+1})}
\right)
\right)
$$
This expresses the whole vector $\beta_{i}$ in terms of $\beta_{i+1}$ through cumulative sums and products over $j$, mirroring the forward case.
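The backward recursion $\beta_{i,j}=p_{i,j+1}\,\beta_{i,j+1}\,p(y_{j+1}|h_i,s_{j+1})+(1-p_{i,j+1})\,\beta_{i+1,j}$ and its parallel closed form can be checked numerically the same way. The boundary conditions $\beta_{i,J}=1$ and $\beta_{I+1,j}=0$ are assumptions for illustration; the sketch computes one row at a time, vectorized over $j$ via reversed cumulative products and sums:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J = 5, 4                       # encoder positions, output tokens
# 1-based indexing with padding row I+1 and column 0 for convenience.
p = rng.uniform(0.1, 0.9, (I + 2, J + 2))  # p[i, j] = emission prob p_{i,j}
w = rng.uniform(0.1, 0.9, (I + 2, J + 2))  # w[i, j] = p(y_j | h_i, s_j)

# Reference: the two-term recursion, with assumed boundary conditions
# beta[i, J] = 1 (all tokens emitted) and beta[I+1, j] = 0.
beta_rec = np.zeros((I + 2, J + 2))
beta_rec[1:I + 1, J] = 1.0
for j in range(J - 1, -1, -1):
    for i in range(I, 0, -1):
        beta_rec[i, j] = (p[i, j + 1] * w[i, j + 1] * beta_rec[i, j + 1]
                          + (1 - p[i, j + 1]) * beta_rec[i + 1, j])

# Parallel closed form: row i from row i+1, vectorized over j.
beta_par = np.zeros((I + 2, J + 2))
beta_par[1:I + 1, J] = 1.0
for i in range(I, 0, -1):
    g = p[i, 1:J + 1] * w[i, 1:J + 1]   # g[j] = p_{i,j+1} p(y_{j+1}|h_i,s_{j+1})
    # P[j] = prod_{k=j}^{J-1} g[k], with the empty product P[J] = 1.
    P = np.concatenate([np.cumprod(g[::-1])[::-1], [1.0]])
    s = (1 - p[i, 1:J + 1]) * beta_par[i + 1, 0:J] / P[:J]
    # Reversed cumulative sum realizes sum_{l=j}^{J-1}.
    beta_par[i, 0:J] = P[:J] * (1 + np.cumsum(s[::-1])[::-1])

assert np.allclose(beta_par, beta_rec)
```

As in the forward case, the division by the running product underflows for long utterances, so a practical implementation would evaluate this in log space.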