# Denoising DF
### Binary Data
Given a binary input $x \in \{0, 1\}^2$ with two components $x=[x_1, x_2]$, we train a neural network $n$ to predict $x_2$ from the input $x_1$. The network output $n(x_1)$ lies in $[0,1]$ and can be interpreted as the probability that $x_2 = 1$ given $x_1$.
Now we can define the coupling layer output $y = [y_1, y_2]$ by setting $y_1 = x_1$ and applying a conditional bitflip operation:
\begin{align}
y_2 =
\begin{cases}
\operatorname{bitflip}(x_2) & \text{if } n(x_1) \geq 0.5\\
x_2 & \text{otherwise}
\end{cases}
\end{align}
This means that if our model $n$ is correctly tuned to the data, then $y_2 = 0$ with probability $\geq 0.5$ for every value of $x_1$. The transformed data is therefore more predictable and easier to model with a factorized model.
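The binary coupling layer above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the toy predictor `toy_n` is a hypothetical stand-in for a trained network:

```python
def binary_coupling(x1, x2, n):
    """Denoising coupling layer for binary data.

    x1, x2: bits in {0, 1}; n(x1) returns p(x2 = 1 | x1).
    Flips x2 whenever the model predicts x2 = 1 is the more likely
    outcome, so after the layer y2 = 0 with probability >= 0.5.
    """
    y1 = x1
    y2 = 1 - x2 if n(x1) >= 0.5 else x2
    return y1, y2

# Hypothetical toy predictor: claims x2 = 1 is likely when x1 = 1.
toy_n = lambda x1: 0.9 if x1 == 1 else 0.2

print(binary_coupling(1, 1, toy_n))  # x2 = 1 predicted likely -> flipped to (1, 0)
print(binary_coupling(0, 1, toy_n))  # x2 = 1 predicted unlikely -> unchanged (0, 1)
```

Note that the layer is its own inverse: applying the same conditional bitflip twice recovers $x_2$, so the transformation stays invertible.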
### Extension to Categorical Data
We have as input a categorical one-hot vector $x \in \{0,1\}^{K\times2}$ with $\sum_{k=1}^K x_{ki} = 1$ for $i \in \{1, 2\}$, again with two components $x=[x_1, x_2]$. Just like in the binary case, we train a neural network $n$ to predict $x_2$ from the input $x_1$.
The output of the neural network $n(x_1) = o$ is a $K$-dimensional vector with $o \in [0,1]^K : \sum_{k=1}^K o_{k} = 1$, which is achieved by applying a softmax in the last layer of $n$. We can interpret $o$ as a categorical probability distribution $p(x_2 \mid n(x_1))$.
After $n$ is trained, we create a denoising coupling layer. To this end, we use the network output $o = n(x_1)$ to construct a permutation matrix $O$ of size $K \times K$ that we then multiply with $x_2$ to obtain $y_2$.
$$y_2 = x_2 \cdot O$$
In the $K=2$ case, we can have either
1. $O = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, i.e. the identity matrix or
2. $O = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$, i.e. the bitflip matrix
Recall that $o \in [0,1]^K : \sum_{k=1}^K o_{k} = 1$, and that we can interpret $o$ as a categorical probability distribution $p(x_2 \mid n(x_1))$.
If $o_0 \geq 0.5$ (meaning $p(x_2 = 0 \mid n(x_1)) \geq 0.5$), then we choose $O$ as the identity matrix; otherwise we choose the bitflip matrix.
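A small sketch of this $K=2$ coupling layer, assuming $x_2$ is represented as a one-hot row vector (function names are illustrative):

```python
import numpy as np

def coupling_K2(x2_onehot, o):
    """Apply the K=2 denoising coupling layer y2 = x2 . O.

    x2_onehot: one-hot row vector of length 2.
    o: network output n(x1), a probability vector of length 2.
    """
    if o[0] >= 0.5:
        O = np.eye(2, dtype=int)        # identity: keep x2 as-is
    else:
        O = np.array([[0, 1],
                      [1, 0]])          # bitflip: swap categories 0 and 1
    return x2_onehot @ O

x2 = np.array([0, 1])                              # x2 = category 1
print(coupling_K2(x2, np.array([0.3, 0.7])))       # model favors category 1 -> flipped to [1 0]
print(coupling_K2(x2, np.array([0.8, 0.2])))       # model favors category 0 -> unchanged [0 1]
```

Multiplying a one-hot row vector by a permutation matrix simply relabels the active category, so $y_2$ stays one-hot.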
### $K=3$
$o = n(x_1)$ with $o \in [0,1]^K : \sum_{k=1}^K o_{k} = 1$. We choose the permutation that swaps the most probable category, $\arg\max_k o_k$, into position 0:
1. $O = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}$, i.e. the identity matrix or
2. $O = \begin{bmatrix} 0 & 1 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1 \end{bmatrix}$, i.e. swap categories 0 and 1
3. $O = \begin{bmatrix} 0 & 0 & 1\\ 0 & 1 & 0\\ 1 & 0 & 0 \end{bmatrix}$, i.e. swap categories 0 and 2
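The three matrices above can be generated by one rule: swap the model's most probable category into position 0 (this argmax rule is my reading of the listed matrices, and it reduces to the $o_0 \geq 0.5$ test when $K=2$ and one probability dominates). A sketch:

```python
import numpy as np

def permutation_K3(o):
    """Pick the K=3 permutation matrix that moves argmax(o) to category 0.

    o: probability vector of length 3 from n(x1).
    Returns the identity, the (0,1)-swap, or the (0,2)-swap matrix.
    Each of these matrices is its own inverse, so the coupling
    layer remains invertible.
    """
    O = np.eye(3, dtype=int)
    k = int(np.argmax(o))      # most probable category under the model
    O[[0, k]] = O[[k, 0]]      # swap rows 0 and k (identity if k == 0)
    return O

print(permutation_K3(np.array([0.1, 0.7, 0.2])))   # swaps categories 0 and 1
```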
Example:
We have the distribution $P_{x}$ over $x=[x_1, x_2]$:
| $x_1 \backslash x_2$ | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | 0 | 0.1 | 0.2 |
| 1 | 0 | 0.2 | 0 |
| 2 | 0.1 | 0 | 0.4 |
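We can work this example through under the assumption that $n$ is optimal, i.e. $n(x_1)$ returns the true conditional $P(x_2 \mid x_1)$, and using the swap-argmax-to-zero rule for choosing $O$ (an assumption; the table itself only gives $P_x$). The sketch below applies the coupling layer to every row and measures how much probability mass lands on $y_2 = 0$:

```python
import numpy as np

# Joint distribution P_x from the table: rows = x1, columns = x2.
P = np.array([[0.0, 0.1, 0.2],
              [0.0, 0.2, 0.0],
              [0.1, 0.0, 0.4]])

P_y = np.zeros_like(P)
for x1 in range(3):
    cond = P[x1] / P[x1].sum()   # optimal network output n(x1) = P(x2 | x1)
    O = np.eye(3)
    k = int(np.argmax(cond))     # swap the most likely category to position 0
    O[[0, k]] = O[[k, 0]]
    P_y[x1] = P[x1] @ O          # permute the x2-axis of the joint distribution

print(P_y)
print(P_y[:, 0].sum())           # total mass on y2 = 0 after denoising (0.8)
```

After the coupling layer, $y_2 = 0$ carries $0.2 + 0.2 + 0.4 = 0.8$ of the total mass, compared to at most $0.6$ for any single value of $x_2$ before; the marginal over $y_2$ is much closer to deterministic, which is exactly what makes it easier for a factorized model.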