Try   HackMD

SSNT Recursion Derivation

Forward

Let

pi,j be the emission probability,
p(yj|hi,sj)
be the word prediction probability. Then the forward variable
αi,j
can be expressed as follows according to eq (9)(12):
αi,j=p(yj|hi,sj)pi,jk=1i(αk,j1d=ki1(1pd,j))

Take the last entry from the summation (
k=i
) out :
p(yj|hi,sj)pi,j(k=1i1(αk,j1d=ki1(1pd,j))+αi,j1)

Take the last entry from the product (
d=i1
) out :
p(yj|hi,sj)pi,j((1pi1,j)k=1i1(αk,j1d=ki2(1pd,j))+αi,j1)

Recognize that middle summation part can be recursively defined by
αi1,j
, up to a scaling by the word prediction and the emission, thus:
p(yj|hi,sj)pi,j((1pi1,j)αi1,jp(yj|hi1,sj)pi1,j+αi,j1)

Let
qi,j=αi,jp(yj|hi,sj)pi,j
, then:
qi,j=(1pi1,j)qi1,j+αi,j1

This allows us to compute
αi,j
directly from
αi1,j
and
αi,j1
.

Parallelize

Furthermore, every vector

qi can be computed directly from vector
αi1
via parallelizable cumulative sum and cumulative product operations. See eq (23)-(30) in Online and Linear-Time Attention by Enforcing Monotonic Alignments for the derivation the original formula.