# L20-DeepGenerativeModels
> Organization contact [name= [ierosodin](ierosodin@gmail.com)]
###### tags: `deep learning` `study notes`
==[Back to Catalog](https://hackmd.io/@ierosodin/Deep_Learning)==
* http://www.deeplearningbook.org/contents/generative_models.html
* Generative model with KL-Divergence
* Before GANs, many generative models were trained by maximum likelihood estimation (MLE): find the model parameters $\theta$ that best fit the training data.
* $\hat \theta = \arg\max_\theta \prod_i p(x_i|\theta)$
* This is the same as minimizing the KL-divergence $KL(p||q)$ between the data distribution $p$ and the model distribution $q(x) = p(x|\theta)$.
* $\begin{split}\hat \theta &= \arg\max_\theta \prod_i p(x_i|\theta) \\
&= \arg\min_\theta - \sum_i \log q(x_i) \\
&= \arg\min_\theta - E_{x \sim p(x)} [\log q(x)] \\
&= \arg\min_\theta H(p, q) \\
&= \arg\min_\theta H(p) + KL(p||q) \\
&= \arg\min_\theta KL(p||q)
\end{split}$
* Here $p$ is the (empirical) data distribution and $q(x) = p(x|\theta)$ is the model; the last step uses $H(p, q) = H(p) + KL(p||q)$ with $H(p)$ independent of $\theta$.
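As a quick numerical check of this derivation (using a hypothetical 4-category empirical distribution), the sketch below verifies that the average negative log-likelihood equals the cross-entropy $H(p, q) = H(p) + KL(p||q)$; since $H(p)$ is constant in $\theta$, minimizing the NLL and minimizing $KL(p||q)$ select the same model.

```python
import numpy as np

# Empirical data distribution p over 4 categories (hypothetical counts).
counts = np.array([10.0, 40.0, 30.0, 20.0])
p = counts / counts.sum()

def cross_entropy(p, q):
    # H(p, q): also the average negative log-likelihood of data drawn from p under model q.
    return -(p * np.log(q)).sum()

def kl(p, q):
    return (p * np.log(p / q)).sum()

H_p = -(p * np.log(p)).sum()  # entropy of p, constant w.r.t. the model

# Ranking candidate models by NLL (= cross-entropy) or by KL(p || q) is identical,
# since the two quantities differ only by the constant H(p).
for q in [np.array([0.25, 0.25, 0.25, 0.25]),
          np.array([0.10, 0.40, 0.30, 0.20]),
          np.array([0.40, 0.10, 0.20, 0.30])]:
    assert np.isclose(cross_entropy(p, q), H_p + kl(p, q))
    print(q, cross_entropy(p, q), kl(p, q))
```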
* KL-divergence is not symmetric: $KL(p||q) \neq KL(q||p)$ in general.
* The forward KL-divergence $KL(p||q)$ penalizes the generator if it misses some modes of the data: the penalty is high where $p(x) > 0$ but $q(x) \to 0$. However, it tolerates samples that do not look real: the penalty is low where $p(x) \to 0$ but $q(x) > 0$. (Poorer sample quality, but more diverse samples.)
* On the other hand, the reverse KL-divergence $KL(q||p)$ penalizes the generator if the images do not look real: the penalty is high if $p(x) \to 0$ but $q(x) > 0$. But it explores less variety: the penalty is low if $q(x) \to 0$ but $p(x) > 0$. (Better sample quality, but less diverse samples; see the numerical sketch below.)
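A small NumPy sketch of this asymmetry, assuming a hypothetical bimodal data distribution $p$ and a single-mode model $q$ discretized on a grid: the forward KL is large because $q$ misses one of $p$'s modes, while the reverse KL stays small because every sample from $q$ falls where $p$ has mass.

```python
import numpy as np

# Hypothetical setup: a bimodal data distribution p and a unimodal model q
# that covers only one of p's modes, both discretized on a 1-D grid.
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * gauss(x, -4) + 0.5 * gauss(x, 4)   # two modes at -4 and +4
q = gauss(x, 4)                              # covers only the mode at +4

def kl(a, b):
    # Grid approximation of KL(a || b); epsilon avoids log(0) where a density underflows.
    eps = 1e-12
    return np.sum(a * np.log((a + eps) / (b + eps))) * dx

print("forward KL(p||q):", kl(p, q))  # large: q misses the mode at -4
print("reverse KL(q||p):", kl(q, p))  # small: samples from q all look "real"
```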
* JS-Divergence
* JS-divergence is defined as:
* $JS(p||q) = \frac{1}{2} KL(p||\frac{p + q}{2}) + \frac{1}{2} KL(q||\frac{p + q}{2})$
* If the discriminator is optimal (i.e., it distinguishes real from generated samples as well as possible for the current generator), the generator's objective function becomes
* $\min_G -\log(4) + 2JS(p_{data}||p_g)$
* Proof:
* If $G$ is fixed, the optimal Discriminator $D^*$ is
* $V(G, D) = \int_x \big[ p_r(x)\log D(x) + p_g(x)\log(1 - D(x)) \big] dx \Rightarrow D^*(x) = \frac{p_r(x)}{p_r(x) + p_g(x)}$ (maximize the integrand pointwise over $D(x)$)
* Find the value of $V$ at the optimal discriminator $D^*$
* $\begin{split}&\arg\min_{p_g}E_{x \sim p_{data}}[\log D^*_G(x)] + E_{x \sim p_g}[\log(1 - D^*_G(x))] \\
&= \arg\min_{p_g}E_{x \sim p_{data}}\left[\log\frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right] + E_{x \sim p_g}\left[\log\frac{p_g(x)}{p_{data}(x) + p_g(x)}\right] \\
&= \arg\min_{p_g} - \log(4) + KL(p_{data}||\frac{p_{data} + p_g}{2}) + KL(p_g||\frac{p_{data} + p_g}{2}) \\
&= \arg\min_{p_g} - \log(4) + 2JS(p_{data}||p_g)\end{split}$
* The minimum $-\log(4)$ is attained when $p_{data} = p_g$, since then $JS(p_{data}||p_g) = 0$
* So optimizing the generator is equivalent to minimizing the JS-divergence between $p_{data}$ and $p_g$.
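A numerical sanity check of this result, assuming two hypothetical 1-D Gaussians standing in for $p_{data}$ and $p_g$ on a grid: plugging the optimal discriminator $D^*$ into the value function reproduces $-\log(4) + 2JS(p_{data}||p_g)$.

```python
import numpy as np

# Hypothetical 1-D Gaussians standing in for p_data and p_g, on a grid.
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p_data, p_g = gauss(x, 0.0), gauss(x, 2.0)

def kl(a, b):
    return np.sum(a * np.log(a / b)) * dx

def js(a, b):
    m = 0.5 * (a + b)
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

# Optimal discriminator for this fixed generator.
d_star = p_data / (p_data + p_g)

# Value function at D*: E_{p_data}[log D*(x)] + E_{p_g}[log(1 - D*(x))].
value = np.sum(p_data * np.log(d_star)) * dx + np.sum(p_g * np.log(1 - d_star)) * dx

print(value)                               # matches -log(4) + 2 * JS(p_data || p_g)
print(-np.log(4) + 2 * js(p_data, p_g))
```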
* Vanishing gradients in JS-Divergence
* Let’s consider an example in which $p$ and $q$ are Gaussian distributed and the mean of $p$ is zero. Let’s consider $q$ with different means to study the gradient of $JS(p, q)$.
* Here, we plot the JS-divergence $JS(p, q)$ between $p$ and $q$, with the mean of $q$ ranging from 0 to 30. As shown below, the gradient of the JS-divergence vanishes once the mean of $q$ moves far from that of $p$: the curve saturates at $\log 2$. The GAN generator learns extremely slowly, or not at all, when the cost is saturated in those regions. In particular, early in training, $p$ and $q$ are very different, so the generator learns very slowly.
* (Figure: $JS(p, q)$ as a function of the mean of $q$; the curve flattens out for large means, so the gradient vanishes.)
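Since the figure is not reproduced here, the following NumPy sketch (a hypothetical grid-based discretization) recomputes $JS(p, q)$ for $p = N(0, 1)$ and $q = N(\mu, 1)$ with $\mu$ from 0 to 30; the values quickly saturate at $\log 2 \approx 0.693$, which is exactly where the gradient vanishes.

```python
import numpy as np

# Grid-based recomputation of JS(p, q) for p = N(0, 1) and q = N(mu, 1),
# with mu swept from 0 to 30 (a hypothetical discretization of the example above).
x = np.linspace(-40, 80, 12001)
dx = x[1] - x[0]

def gauss(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def js(a, b):
    eps = 1e-300  # guards log(0) where a density underflows to zero
    m = 0.5 * (a + b)
    kl = lambda u, v: np.sum(u * np.log((u + eps) / (v + eps))) * dx
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

p = gauss(x, 0.0)
for mu in [0, 1, 2, 5, 10, 20, 30]:
    q = gauss(x, float(mu))
    # JS saturates at log(2) once p and q barely overlap,
    # so its gradient with respect to mu vanishes.
    print(mu, js(p, q))
```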
* Mode collapse
* Given a $z$, if $G(z)$ does not change when $z$ changes, then the GAN has locally suffered mode collapse, i.e., it cannot produce continuously varying samples. Geometrically, this means that at this local point the corresponding manifold no longer changes along different tangent directions. In other words, the tangent vectors are no longer mutually independent: some tangent vectors either vanish or become linearly dependent on one another, so the manifold becomes locally dimension-deficient. A toy illustration follows below.
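A minimal toy illustration of this dimension-deficiency picture, assuming a hypothetical linear generator $G(z) = Wz$: the rank of the Jacobian (here simply $W$) is the local dimension of the generated manifold, and a rank-deficient $W$ means most directions of $z$ no longer change $G(z)$ independently.

```python
import numpy as np

# Hypothetical linear generator G(z) = W @ z: the rank of its Jacobian (= W)
# is the local dimension of the generated manifold.
rng = np.random.default_rng(0)

W_healthy = rng.normal(size=(4, 3))                              # full column rank: 3 independent tangent directions
W_collapsed = np.outer(rng.normal(size=4), rng.normal(size=3))   # rank 1: tangent vectors are linearly dependent

print("healthy generator, local manifold dimension:",
      np.linalg.matrix_rank(W_healthy))      # 3: G(z) changes along every direction of z
print("collapsed generator, local manifold dimension:",
      np.linalg.matrix_rank(W_collapsed))    # 1: most changes of z leave G(z) on the same line
```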
* Wasserstein GAN
* WGAN resolves the instability of GAN training: there is no longer any need to carefully balance how far the generator and the discriminator have been trained.
* It largely solves the mode collapse problem and ensures the diversity of the generated samples.
* $\Pi (P_r, P_g)$ is the set of all possible joint distributions whose marginals are $P_r$ and $P_g$. For each joint distribution $\gamma$, we can sample $(x, y) \sim \gamma$ to obtain a real sample $x$ and a generated sample $y$, and compute the distance $||x - y||$ of this pair; we can therefore compute the expected pairwise distance under $\gamma$, $\mathbb{E}_{(x, y) \sim \gamma} [||x - y||]$. The infimum of this expectation over all possible joint distributions, $\inf_{\gamma \in \Pi (P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma} [|| x - y||]$, is defined as the Wasserstein distance.
* Intuitively, $\mathbb{E}_{(x, y) \sim \gamma} [||x - y||]$ can be understood as the "cost" of moving the pile of "earth" $P_r$ to the "location" of $P_g$ under the "transport plan" $\gamma$, and $W(P_r, P_g)$ is the "minimum cost" under the "optimal transport plan", which is why it is also called the Earth-Mover distance.
* The EM distance is the cost of the optimal transport plan
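A small numerical sketch of this definition, assuming two hypothetical discrete distributions on four points: the coupling $\gamma$ is found by a linear program over all joint distributions with marginals $P_r$ and $P_g$, and the result is compared against `scipy.stats.wasserstein_distance`.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

# Hypothetical discrete P_r and P_g supported on four points. The coupling
# gamma is an n x n matrix whose rows sum to P_r and columns sum to P_g;
# the Wasserstein distance is the minimum of E_gamma[|x - y|] over all gammas.
xs = np.array([0.0, 1.0, 2.0, 3.0])
p_r = np.array([0.1, 0.4, 0.4, 0.1])
p_g = np.array([0.4, 0.1, 0.1, 0.4])
n = len(xs)

cost = np.abs(xs[:, None] - xs[None, :]).ravel()   # transport cost |x_i - y_j|

# Marginal constraints on the flattened gamma (row-major).
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0   # sum_j gamma[i, j] = p_r[i]
    A_eq[n + i, i::n] = 1.0            # sum_i gamma[i, j] = p_g[j]  (here j = i)
b_eq = np.concatenate([p_r, p_g])

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("W1 from the optimal-transport LP:", res.fun)
print("W1 from scipy                  :", wasserstein_distance(xs, xs, p_r, p_g))
```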
* The KL and JS divergences change abruptly: they are either at their maximum or at their minimum. The Wasserstein distance, by contrast, is smooth. If we want to optimize the parameter $\theta$ by gradient descent, the first two provide no usable gradient at all, while the Wasserstein distance does. Likewise, when two distributions in a high-dimensional space do not overlap, or their overlap is negligible, KL and JS can neither reflect how far apart they are nor provide a gradient, whereas the Wasserstein distance still provides a meaningful gradient.
* Example: let $P_1$ be the uniform distribution on the vertical segment $\{(0, y) : y \in [0, 1]\}$ and $P_2$ the uniform distribution on the parallel segment $\{(\theta, y) : y \in [0, 1]\}$, a horizontal distance $\theta$ apart. Then:
* $KL(P_1 || P_2) = KL(P_2 || P_1) = \begin{cases} +\infty & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}$
* $JS(P_1||P_2)= \begin{cases} \log 2 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}$
* $W(P_1, P_2) = |\theta|$
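As a 1-D analogue of this parallel-segments example (a hypothetical discretization of uniform distributions on $[0, 1)$ and $[\theta, \theta + 1)$), the sketch below shows JS sticking at $\log 2$ once the supports stop overlapping, while the Wasserstein distance still tracks $|\theta|$, which is what gives WGAN useful gradients.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# 1-D analogue of the example above: P1 uniform on [0, 1), P2 uniform on
# [theta, theta + 1), discretized on a grid (hypothetical setup).
grid = np.linspace(-1.0, 6.0, 7001)
dx = grid[1] - grid[0]

def uniform(lo, hi):
    pdf = ((grid >= lo) & (grid < hi)).astype(float)
    return pdf / (pdf.sum() * dx)

def js(a, b):
    eps = 1e-12
    m = 0.5 * (a + b)
    kl = lambda u, v: np.sum(u * np.log((u + eps) / (v + eps))) * dx
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

p1 = uniform(0.0, 1.0)
for theta in [0.0, 0.5, 1.0, 2.0, 4.0]:
    p2 = uniform(theta, theta + 1.0)
    w = wasserstein_distance(grid, grid, p1 * dx, p2 * dx)
    # JS gets stuck at log(2) as soon as the supports stop overlapping,
    # while W keeps tracking |theta| and so still provides a gradient.
    print(f"theta={theta}: JS={js(p1, p2):.3f}  W={w:.3f}")
```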