# Denoising DF
### Binary Data
Given a binary input $x \in \{0, 1\}^2$ with two components $x=[x_1, x_2]$, we train a neural network $n$ to predict $x_2$ from the input $x_1$. The network output $n(x_1)$ lies in $[0,1]$ and can be interpreted as the probability that $x_2 = 1$ given $x_1$.
Now we can define the coupling layer output $y = [y_1, y_2]$ by setting $y_1 = x_1$ and applying a conditional bitflip operation:
\begin{align}
y_2 =
\begin{cases}
\operatorname{bitflip}(x_2) & \text{if } n(x_1) \geq 0.5\\
x_2 & \text{otherwise}
\end{cases}
\end{align}
This means that if our model $n$ is correctly tuned to the data, then $y_2 = 0$ with probability $\geq 0.5$ for every value of $x_1$. The transformed data is therefore more predictable and easier to model with a factorized model.
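The binary coupling layer above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the toy predictor `toy_n` is a hypothetical stand-in for a trained network:

```python
def binary_coupling(x1, x2, n):
    """Denoising coupling layer for binary data.

    x1, x2: bits in {0, 1}; n(x1) returns p(x2 = 1 | x1).
    Flips x2 whenever the model predicts x2 = 1 is the more likely
    outcome, so after the layer y2 = 0 with probability >= 0.5.
    """
    y1 = x1
    y2 = 1 - x2 if n(x1) >= 0.5 else x2
    return y1, y2

# Hypothetical toy predictor: claims x2 = 1 is likely when x1 = 1.
toy_n = lambda x1: 0.9 if x1 == 1 else 0.2

print(binary_coupling(1, 1, toy_n))  # x2 = 1 predicted likely -> flipped to (1, 0)
print(binary_coupling(0, 1, toy_n))  # x2 = 1 predicted unlikely -> unchanged (0, 1)
```

Note that the layer is its own inverse: applying the same conditional bitflip twice recovers $x_2$, so the transformation stays invertible.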
### Extension to Categorical Data
We have as input a categorical one-hot vector $x \in \{0,1\}^{K\times2}$ with $\sum_{k=1}^K x_{ki} = 1$ for $i \in \{1, 2\}$, again with two components $x=[x_1, x_2]$. Just like in the binary case, we train a neural network $n$ to predict $x_2$ from the input $x_1$.
The output of the neural network $n(x_1) = o$ is a $K$-dimensional vector with $o \in [0,1]^K : \sum_{k=1}^K o_{k} = 1$, which is achieved by applying a softmax in the last layer of $n$. We can interpret $o$ as a categorical probability distribution $p(x_2 \mid n(x_1))$.
After $n$ is trained, we create a denoising coupling layer. To this end, we use the network output $o = n(x_1)$ to construct a permutation matrix $O$ of size $K \times K$ that we then multiply with $x_2$ to obtain $y_2$.
$$y_2 = x_2 \cdot O$$
In the $K=2$ case, we can have either
1. $O = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, i.e. the identity matrix or
2. $O = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$, i.e. the bitflip matrix
Recall that $o \in [0,1]^K : \sum_{k=1}^K o_{k} = 1$, and that we can interpret $o$ as a categorical probability distribution $p(x_2 \mid n(x_1))$.
If $o_0 \geq 0.5$ (meaning $p(x_2 = 0 \mid n(x_1)) \geq 0.5$), then we choose $O$ as the identity matrix; otherwise we choose the bitflip matrix.
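A small sketch of this $K=2$ coupling layer, assuming $x_2$ is represented as a one-hot row vector (function names are illustrative):

```python
import numpy as np

def coupling_K2(x2_onehot, o):
    """Apply the K=2 denoising coupling layer y2 = x2 . O.

    x2_onehot: one-hot row vector of length 2.
    o: network output n(x1), a probability vector of length 2.
    """
    if o[0] >= 0.5:
        O = np.eye(2, dtype=int)        # identity: keep x2 as-is
    else:
        O = np.array([[0, 1],
                      [1, 0]])          # bitflip: swap categories 0 and 1
    return x2_onehot @ O

x2 = np.array([0, 1])                              # x2 = category 1
print(coupling_K2(x2, np.array([0.3, 0.7])))       # model favors category 1 -> flipped to [1 0]
print(coupling_K2(x2, np.array([0.8, 0.2])))       # model favors category 0 -> unchanged [0 1]
```

Multiplying a one-hot row vector by a permutation matrix simply relabels the active category, so $y_2$ stays one-hot.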
### $K=3$
$o = n(x_1)$ with $o \in [0,1]^K : \sum_{k=1}^K o_{k} = 1$. We choose the permutation that swaps the most probable category, $\arg\max_k o_k$, into position 0:
1. $O = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}$, i.e. the identity matrix or
2. $O = \begin{bmatrix} 0 & 1 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1 \end{bmatrix}$, i.e. swap categories 0 and 1
3. $O = \begin{bmatrix} 0 & 0 & 1\\ 0 & 1 & 0\\ 1 & 0 & 0 \end{bmatrix}$, i.e. swap categories 0 and 2
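The three matrices above can be generated by one rule: swap the model's most probable category into position 0 (this argmax rule is my reading of the listed matrices, and it reduces to the $o_0 \geq 0.5$ test when $K=2$ and one probability dominates). A sketch:

```python
import numpy as np

def permutation_K3(o):
    """Pick the K=3 permutation matrix that moves argmax(o) to category 0.

    o: probability vector of length 3 from n(x1).
    Returns the identity, the (0,1)-swap, or the (0,2)-swap matrix.
    Each of these matrices is its own inverse, so the coupling
    layer remains invertible.
    """
    O = np.eye(3, dtype=int)
    k = int(np.argmax(o))      # most probable category under the model
    O[[0, k]] = O[[k, 0]]      # swap rows 0 and k (identity if k == 0)
    return O

print(permutation_K3(np.array([0.1, 0.7, 0.2])))   # swaps categories 0 and 1
```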
Example:
We have the distribution $P_{x}$ over $x=[x_1, x_2]$:
| $x_1 \backslash x_2$ | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | 0 | 0.1 | 0.2 |
| 1 | 0 | 0.2 | 0 |
| 2 | 0.1 | 0 | 0.4 |
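We can work this example through under the assumption that $n$ is optimal, i.e. $n(x_1)$ returns the true conditional $P(x_2 \mid x_1)$, and using the swap-argmax-to-zero rule for choosing $O$ (an assumption; the table itself only gives $P_x$). The sketch below applies the coupling layer to every row and measures how much probability mass lands on $y_2 = 0$:

```python
import numpy as np

# Joint distribution P_x from the table: rows = x1, columns = x2.
P = np.array([[0.0, 0.1, 0.2],
              [0.0, 0.2, 0.0],
              [0.1, 0.0, 0.4]])

P_y = np.zeros_like(P)
for x1 in range(3):
    cond = P[x1] / P[x1].sum()   # optimal network output n(x1) = P(x2 | x1)
    O = np.eye(3)
    k = int(np.argmax(cond))     # swap the most likely category to position 0
    O[[0, k]] = O[[k, 0]]
    P_y[x1] = P[x1] @ O          # permute the x2-axis of the joint distribution

print(P_y)
print(P_y[:, 0].sum())           # total mass on y2 = 0 after denoising (0.8)
```

After the coupling layer, $y_2 = 0$ carries $0.2 + 0.2 + 0.4 = 0.8$ of the total mass, compared to at most $0.6$ for any single value of $x_2$ before; the marginal over $y_2$ is much closer to deterministic, which is exactly what makes it easier for a factorized model.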