## 05.06
### Decode $\mathbf{y}$ given $\mathbf{x}$ and $\mathbf{z}$
#### Default ('soft NLL')
\begin{align}
E(\hat{\mathbf{y}}_1,\ldots,\hat{\mathbf{y}}_T)&=-\sum_{t=1}^{T} \text{softmax}\big(f_\theta(\hat{\mathbf{y}}_{<t}, \mathbf{x})\big)\cdot\log\text{softmax}(\hat{\mathbf{y}}_t)
\end{align}
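For concreteness, here is a minimal NumPy sketch of this energy. It treats each $\hat{\mathbf{y}}_t$ as a vector of logits over the vocabulary and assumes the model outputs $f_\theta(\hat{\mathbf{y}}_{<t}, \mathbf{x})$ have already been collected into a $(T, V)$ array; the argument names and shapes are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def soft_nll_energy(model_logits, y_hat):
    """Soft NLL: -sum_t softmax(f_theta(y_hat_{<t}, x)) . log softmax(y_hat_t).

    model_logits: (T, V) array; row t is f_theta(y_hat_{<t}, x)  [assumed precomputed]
    y_hat:        (T, V) array of soft tokens (logits over the vocabulary)
    """
    p_model = np.exp(log_softmax(model_logits))  # model's next-token distributions
    log_q = log_softmax(y_hat)                   # log-distributions of the soft tokens
    return -np.sum(p_model * log_q)              # cross-entropy, summed over positions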
#### Constrained decoding
\begin{align}
E_{\text{constrain}}(\hat{\mathbf{y}}_{1:T})&=E(\hat{\mathbf{y}}_{1:T})+\sum_{\mathbf{y}\in \mathcal{Y}}\min_{t\in\{1,\ldots,T\}}\text{KL}(\mathbf{y}\,\|\, \hat{\mathbf{y}}_t)
\end{align}
- where $\mathcal{Y}$ is a set of tokens that we want to appear in the decoded sequence, and each $\mathbf{y}$ is a one-hot representation of a token in $\mathcal{Y}$.
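Because each $\mathbf{y}$ is one-hot, $\text{KL}(\mathbf{y}\,\|\,\text{softmax}(\hat{\mathbf{y}}_t))$ collapses to $-\log\text{softmax}(\hat{\mathbf{y}}_t)[w]$ at the keyword index $w$ (with the $0\log 0 = 0$ convention). The sketch below exploits this; it continues the module above, reuses `log_softmax`, and `keyword_ids` is an illustrative name.

```python
def constraint_energy(y_hat, keyword_ids):
    """sum over keywords of min_t KL(one_hot(w) || softmax(y_hat_t)).

    With one-hot y, KL(y || q) = -log q[w], so each keyword contributes the
    negative log-probability at its best-scoring position.
    (Reuses log_softmax from the sketch above.)
    """
    log_q = log_softmax(y_hat)                           # (T, V)
    return sum(-log_q[:, w].max() for w in keyword_ids)  # min_t KL = -max_t log q_t[w]
```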
#### Counterfactual decoding
\begin{align}
E_{\text{counterfactual}}(\hat{\mathbf{y}}_{1:T})&=E(\hat{\mathbf{y}}_{1:T})+\sum_{t=1}^{T}\text{KL}(\mathbf{z}_{t}\,\|\, \hat{\mathbf{y}}_t)
\end{align}
- where $\mathbf{z}$ is the original story ending, $\mathbf{z}_t$ is the one-hot representation of its $t$-th token, and $T$ is the length of $\mathbf{z}$.
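Under the same one-hot reading of $\mathbf{z}_t$, this term is just the negative log-probability that the soft tokens assign to the original ending, position by position. A sketch continuing the module above (names again illustrative):

```python
def counterfactual_energy(y_hat, z_ids):
    """sum_t KL(z_t || softmax(y_hat_t)) with one-hot z_t: -sum_t log q_t[z_ids[t]].

    z_ids: token ids of the original ending; assumed to have length T,
    matching y_hat. (Reuses log_softmax from the sketch above.)
    """
    log_q = log_softmax(y_hat)
    return -sum(log_q[t, tok] for t, tok in enumerate(z_ids))
```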
#### Combination decoding
\begin{align}
E_{\text{combination}}(\hat{\mathbf{y}}_{1:T})&=E(\hat{\mathbf{y}}_{1:T})+\sum_{\mathbf{y}\in \mathcal{Y}}\min_{t\in\{1,\ldots,T\}}\text{KL}(\mathbf{y}\,\|\, \hat{\mathbf{y}}_t)+\sum_{t=1}^{T}\text{KL}(\mathbf{z}_{t}\,\|\, \hat{\mathbf{y}}_t)
\end{align}
- where $\mathcal{Y}$ is a set of key tokens taken from $\mathbf{z}$.
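The combined energy is then just the sum of the three terms. A sketch with a toy invocation (random inputs, hypothetical sizes and token ids) to show how the pieces fit together:

```python
def combination_energy(model_logits, y_hat, keyword_ids, z_ids):
    """E + keyword-constraint term + counterfactual term (see sketches above)."""
    return (soft_nll_energy(model_logits, y_hat)
            + constraint_energy(y_hat, keyword_ids)
            + counterfactual_energy(y_hat, z_ids))

# Toy check with random values: T = 5 positions, V = 10 vocabulary entries.
rng = np.random.default_rng(0)
T, V = 5, 10
E = combination_energy(model_logits=rng.normal(size=(T, V)),
                       y_hat=rng.normal(size=(T, V)),
                       keyword_ids=[3, 7],      # hypothetical keyword ids
                       z_ids=[1, 4, 2, 9, 0])   # hypothetical original-ending ids
```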