# L20-DeepGenerativeModels
> Organization contact [name= [ierosodin](ierosodin@gmail.com)]
###### tags: `deep learning` `study notes`
==[Back to Catalog](https://hackmd.io/@ierosodin/Deep_Learning)==
* http://www.deeplearningbook.org/contents/generative_models.html
* Generative model with KL-Divergence
* Before GANs, many generative models were trained by maximum likelihood estimation (MLE): find the model parameters $\theta$ that best fit the training data.
* $\hat \theta = \arg\max_\theta \prod_i p(x_i|\theta)$
* This is the same as minimizing the KL-divergence $KL(p||q)$ between the data distribution $p$ and the model distribution $q(x) = p(x|\theta)$.
* $\begin{split}\hat \theta &= \arg\max_\theta \prod_i p(x_i|\theta) \\
&= \arg\min_\theta - \sum_i \log q(x_i) \\
&= \arg\min_\theta - E_{x \sim p(x)} [\log q(x)] \\
&= \arg\min_\theta H(p, q) \\
&= \arg\min_\theta H(p) + KL(p||q) \\
&= \arg\min_\theta KL(p||q)
\end{split}$
* Here $p$ is the (empirical) data distribution and $q(x) = p(x|\theta)$ is the model; the last step uses $H(p, q) = H(p) + KL(p||q)$ with $H(p)$ independent of $\theta$.
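As a quick numerical check of this derivation (using a hypothetical 4-category empirical distribution), the sketch below verifies that the average negative log-likelihood equals the cross-entropy $H(p, q) = H(p) + KL(p||q)$; since $H(p)$ is constant in $\theta$, minimizing the NLL and minimizing $KL(p||q)$ select the same model.

```python
import numpy as np

# Empirical data distribution p over 4 categories (hypothetical counts).
counts = np.array([10.0, 40.0, 30.0, 20.0])
p = counts / counts.sum()

def cross_entropy(p, q):
    # H(p, q): also the average negative log-likelihood of data drawn from p under model q.
    return -(p * np.log(q)).sum()

def kl(p, q):
    return (p * np.log(p / q)).sum()

H_p = -(p * np.log(p)).sum()  # entropy of p, constant w.r.t. the model

# Ranking candidate models by NLL (= cross-entropy) or by KL(p || q) is identical,
# since the two quantities differ only by the constant H(p).
for q in [np.array([0.25, 0.25, 0.25, 0.25]),
          np.array([0.10, 0.40, 0.30, 0.20]),
          np.array([0.40, 0.10, 0.20, 0.30])]:
    assert np.isclose(cross_entropy(p, q), H_p + kl(p, q))
    print(q, cross_entropy(p, q), kl(p, q))
```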
* KL-divergence is not symmetric: $KL(p||q) \neq KL(q||p)$ in general.
* The forward KL-divergence $KL(p||q)$ penalizes the generator if it misses some modes of the data: the penalty is high where $p(x) > 0$ but $q(x) \to 0$. However, it tolerates samples that do not look real: the penalty is low where $p(x) \to 0$ but $q(x) > 0$. (Poorer sample quality, but more diverse samples.)
* On the other hand, the reverse KL-divergence $KL(q||p)$ penalizes the generator if the images do not look real: the penalty is high if $p(x) \to 0$ but $q(x) > 0$. But it explores less variety: the penalty is low if $q(x) \to 0$ but $p(x) > 0$. (Better sample quality, but less diverse samples; see the numerical sketch below.)
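A small NumPy sketch of this asymmetry, assuming a hypothetical bimodal data distribution $p$ and a single-mode model $q$ discretized on a grid: the forward KL is large because $q$ misses one of $p$'s modes, while the reverse KL stays small because every sample from $q$ falls where $p$ has mass.

```python
import numpy as np

# Hypothetical setup: a bimodal data distribution p and a unimodal model q
# that covers only one of p's modes, both discretized on a 1-D grid.
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * gauss(x, -4) + 0.5 * gauss(x, 4)   # two modes at -4 and +4
q = gauss(x, 4)                              # covers only the mode at +4

def kl(a, b):
    # Grid approximation of KL(a || b); epsilon avoids log(0) where a density underflows.
    eps = 1e-12
    return np.sum(a * np.log((a + eps) / (b + eps))) * dx

print("forward KL(p||q):", kl(p, q))  # large: q misses the mode at -4
print("reverse KL(q||p):", kl(q, p))  # small: samples from q all look "real"
```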
* JS-Divergence
* JS-divergence is defined as:
* $JS(p||q) = \frac{1}{2} KL(p||\frac{p + q}{2}) + \frac{1}{2} KL(q||\frac{p + q}{2})$
* If the discriminator is optimal (i.e., it distinguishes real from generated samples as well as possible for the current generator), the generator's objective function becomes
* $\min_G -\log(4) + 2JS(p_{data}||p_g)$
* Proof:
* If $G$ is fixed, the optimal Discriminator $D^*$ is
* $V(G, D) = \int_x \big[ p_r(x)\log D(x) + p_g(x)\log(1 - D(x)) \big] dx \Rightarrow D^*(x) = \frac{p_r(x)}{p_r(x) + p_g(x)}$ (maximize the integrand pointwise over $D(x)$)
* Find the value of $V$ at the optimal discriminator $D^*$
* $\begin{split}&\arg\min_{p_g}E_{x \sim p_{data}}[\log D^*_G(x)] + E_{x \sim p_g}[\log(1 - D^*_G(x))] \\
&= \arg\min_{p_g}E_{x \sim p_{data}}\left[\log\frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right] + E_{x \sim p_g}\left[\log\frac{p_g(x)}{p_{data}(x) + p_g(x)}\right] \\
&= \arg\min_{p_g} - \log(4) + KL(p_{data}||\frac{p_{data} + p_g}{2}) + KL(p_g||\frac{p_{data} + p_g}{2}) \\
&= \arg\min_{p_g} - \log(4) + 2JS(p_{data}||p_g)\end{split}$
* The minimum $-\log(4)$ is attained when $p_{data} = p_g$, since then $JS(p_{data}||p_g) = 0$
* So optimizing the generator is equivalent to minimizing the JS-divergence between $p_{data}$ and $p_g$.
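A numerical sanity check of this result, assuming two hypothetical 1-D Gaussians standing in for $p_{data}$ and $p_g$ on a grid: plugging the optimal discriminator $D^*$ into the value function reproduces $-\log(4) + 2JS(p_{data}||p_g)$.

```python
import numpy as np

# Hypothetical 1-D Gaussians standing in for p_data and p_g, on a grid.
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p_data, p_g = gauss(x, 0.0), gauss(x, 2.0)

def kl(a, b):
    return np.sum(a * np.log(a / b)) * dx

def js(a, b):
    m = 0.5 * (a + b)
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

# Optimal discriminator for this fixed generator.
d_star = p_data / (p_data + p_g)

# Value function at D*: E_{p_data}[log D*(x)] + E_{p_g}[log(1 - D*(x))].
value = np.sum(p_data * np.log(d_star)) * dx + np.sum(p_g * np.log(1 - d_star)) * dx

print(value)                               # matches -log(4) + 2 * JS(p_data || p_g)
print(-np.log(4) + 2 * js(p_data, p_g))
```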
* Vanishing gradients in JS-Divergence
* Let’s consider an example in which $p$ and $q$ are Gaussian distributed and the mean of $p$ is zero. Let’s consider $q$ with different means to study the gradient of $JS(p, q)$.
* Here, we plot the JS-divergence $JS(p, q)$ between $p$ and $q$, with the mean of $q$ ranging from 0 to 30. As shown below, the gradient of the JS-divergence vanishes once the mean of $q$ moves far from that of $p$: the curve saturates at $\log 2$. The GAN generator learns extremely slowly, or not at all, when the cost is saturated in those regions. In particular, early in training, $p$ and $q$ are very different, so the generator learns very slowly.
* (Figure: $JS(p, q)$ as a function of the mean of $q$; the curve flattens out for large means, so the gradient vanishes.)
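Since the figure is not reproduced here, the following NumPy sketch (a hypothetical grid-based discretization) recomputes $JS(p, q)$ for $p = N(0, 1)$ and $q = N(\mu, 1)$ with $\mu$ from 0 to 30; the values quickly saturate at $\log 2 \approx 0.693$, which is exactly where the gradient vanishes.

```python
import numpy as np

# Grid-based recomputation of JS(p, q) for p = N(0, 1) and q = N(mu, 1),
# with mu swept from 0 to 30 (a hypothetical discretization of the example above).
x = np.linspace(-40, 80, 12001)
dx = x[1] - x[0]

def gauss(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def js(a, b):
    eps = 1e-300  # guards log(0) where a density underflows to zero
    m = 0.5 * (a + b)
    kl = lambda u, v: np.sum(u * np.log((u + eps) / (v + eps))) * dx
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

p = gauss(x, 0.0)
for mu in [0, 1, 2, 5, 10, 20, 30]:
    q = gauss(x, float(mu))
    # JS saturates at log(2) once p and q barely overlap,
    # so its gradient with respect to mu vanishes.
    print(mu, js(p, q))
```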
* Mode collapse
* Given a $z$, if $G(z)$ does not change when $z$ changes, then the GAN has locally suffered mode collapse, i.e., it cannot produce continuously varying samples. Geometrically, this means that at this local point the corresponding manifold no longer changes along different tangent directions. In other words, the tangent vectors are no longer mutually independent: some tangent vectors either vanish or become linearly dependent on one another, so the manifold becomes locally dimension-deficient. A toy illustration follows below.
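A minimal toy illustration of this dimension-deficiency picture, assuming a hypothetical linear generator $G(z) = Wz$: the rank of the Jacobian (here simply $W$) is the local dimension of the generated manifold, and a rank-deficient $W$ means most directions of $z$ no longer change $G(z)$ independently.

```python
import numpy as np

# Hypothetical linear generator G(z) = W @ z: the rank of its Jacobian (= W)
# is the local dimension of the generated manifold.
rng = np.random.default_rng(0)

W_healthy = rng.normal(size=(4, 3))                              # full column rank: 3 independent tangent directions
W_collapsed = np.outer(rng.normal(size=4), rng.normal(size=3))   # rank 1: tangent vectors are linearly dependent

print("healthy generator, local manifold dimension:",
      np.linalg.matrix_rank(W_healthy))      # 3: G(z) changes along every direction of z
print("collapsed generator, local manifold dimension:",
      np.linalg.matrix_rank(W_collapsed))    # 1: most changes of z leave G(z) on the same line
```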
* Wasserstein GAN
* WGAN resolves the instability of GAN training: there is no longer any need to carefully balance how far the generator and the discriminator have been trained.
* It largely solves the mode collapse problem and ensures the diversity of the generated samples.
* $\Pi (P_r, P_g)$ is the set of all possible joint distributions whose marginals are $P_r$ and $P_g$. For each joint distribution $\gamma$, we can sample $(x, y) \sim \gamma$ to obtain a real sample $x$ and a generated sample $y$, and compute the distance $||x - y||$ of this pair; we can therefore compute the expected pairwise distance under $\gamma$, $\mathbb{E}_{(x, y) \sim \gamma} [||x - y||]$. The infimum of this expectation over all possible joint distributions, $\inf_{\gamma \in \Pi (P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma} [|| x - y||]$, is defined as the Wasserstein distance.
* Intuitively, $\mathbb{E}_{(x, y) \sim \gamma} [||x - y||]$ can be understood as the "cost" of moving the pile of "earth" $P_r$ to the "location" of $P_g$ under the "transport plan" $\gamma$, and $W(P_r, P_g)$ is the "minimum cost" under the "optimal transport plan", which is why it is also called the Earth-Mover distance.
* The EM distance is the cost of the optimal transport plan
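A small numerical sketch of this definition, assuming two hypothetical discrete distributions on four points: the coupling $\gamma$ is found by a linear program over all joint distributions with marginals $P_r$ and $P_g$, and the result is compared against `scipy.stats.wasserstein_distance`.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

# Hypothetical discrete P_r and P_g supported on four points. The coupling
# gamma is an n x n matrix whose rows sum to P_r and columns sum to P_g;
# the Wasserstein distance is the minimum of E_gamma[|x - y|] over all gammas.
xs = np.array([0.0, 1.0, 2.0, 3.0])
p_r = np.array([0.1, 0.4, 0.4, 0.1])
p_g = np.array([0.4, 0.1, 0.1, 0.4])
n = len(xs)

cost = np.abs(xs[:, None] - xs[None, :]).ravel()   # transport cost |x_i - y_j|

# Marginal constraints on the flattened gamma (row-major).
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0   # sum_j gamma[i, j] = p_r[i]
    A_eq[n + i, i::n] = 1.0            # sum_i gamma[i, j] = p_g[j]  (here j = i)
b_eq = np.concatenate([p_r, p_g])

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("W1 from the optimal-transport LP:", res.fun)
print("W1 from scipy                  :", wasserstein_distance(xs, xs, p_r, p_g))
```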
* The KL and JS divergences change abruptly: they are either at their maximum or at their minimum. The Wasserstein distance, by contrast, is smooth. If we want to optimize the parameter $\theta$ by gradient descent, the first two provide no usable gradient at all, while the Wasserstein distance does. Likewise, when two distributions in a high-dimensional space do not overlap, or their overlap is negligible, KL and JS can neither reflect how far apart they are nor provide a gradient, whereas the Wasserstein distance still provides a meaningful gradient.
* Example: let $P_1$ be the uniform distribution on the vertical segment $\{(0, y) : y \in [0, 1]\}$ and $P_2$ the uniform distribution on the parallel segment $\{(\theta, y) : y \in [0, 1]\}$, a horizontal distance $\theta$ apart. Then:
* $KL(P_1 || P_2) = KL(P_2 || P_1) = \begin{cases} +\infty & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}$
* $JS(P_1||P_2)= \begin{cases} \log 2 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}$
* $W(P_1, P_2) = |\theta|$
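As a 1-D analogue of this parallel-segments example (a hypothetical discretization of uniform distributions on $[0, 1)$ and $[\theta, \theta + 1)$), the sketch below shows JS sticking at $\log 2$ once the supports stop overlapping, while the Wasserstein distance still tracks $|\theta|$, which is what gives WGAN useful gradients.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# 1-D analogue of the example above: P1 uniform on [0, 1), P2 uniform on
# [theta, theta + 1), discretized on a grid (hypothetical setup).
grid = np.linspace(-1.0, 6.0, 7001)
dx = grid[1] - grid[0]

def uniform(lo, hi):
    pdf = ((grid >= lo) & (grid < hi)).astype(float)
    return pdf / (pdf.sum() * dx)

def js(a, b):
    eps = 1e-12
    m = 0.5 * (a + b)
    kl = lambda u, v: np.sum(u * np.log((u + eps) / (v + eps))) * dx
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

p1 = uniform(0.0, 1.0)
for theta in [0.0, 0.5, 1.0, 2.0, 4.0]:
    p2 = uniform(theta, theta + 1.0)
    w = wasserstein_distance(grid, grid, p1 * dx, p2 * dx)
    # JS gets stuck at log(2) as soon as the supports stop overlapping,
    # while W keeps tracking |theta| and so still provides a gradient.
    print(f"theta={theta}: JS={js(p1, p2):.3f}  W={w:.3f}")
```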