# L13-LinearFactorModels
> Organization contact [name= [ierosodin](ierosodin@gmail.com)]
###### tags: `deep learning` `study notes`
==[Back to Catalog](https://hackmd.io/@ierosodin/Deep_Learning)==
* http://www.deeplearningbook.org/contents/linear_factors.html
* probabilistic models with latent variables
* A linear factor model is defined by the use of a stochastic linear decoder function that generates $x$ by adding noise to a linear transformation of the latent variables $z$.
* A linear factor model describes the data-generation process
* sample the explanatory factors z from a factorial distribution $z \sim p(z)$
* sample the real-valued observable variables given the factors: $x = Wz + \mu + noise$ (see the sampling sketch below)
* the noise is typically Gaussian and diagonal (independent across dimensions)
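* A minimal NumPy sketch of this data-generation process (the dimensions, `W`, `mu`, and the noise scales below are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

d_z, d_x = 2, 5                                # latent and observed dimensions (arbitrary)
W = rng.normal(size=(d_x, d_z))                # linear decoder weights
mu = rng.normal(size=d_x)                      # offset
sigma = np.array([0.1, 0.2, 0.1, 0.3, 0.2])    # per-dimension noise std (diagonal Gaussian noise)

n = 100_000
z = rng.normal(size=(n, d_z))                  # z ~ p(z); here a unit Gaussian, as in factor analysis below
noise = rng.normal(size=(n, d_x)) * sigma      # noise independent across dimensions
x = z @ W.T + mu + noise                       # x = Wz + mu + noise

# Empirical check: cov(x) approaches W W^T + diag(sigma^2) (derived for factor analysis below)
print(np.abs(np.cov(x, rowvar=False) - (W @ W.T + np.diag(sigma**2))).max())
```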
* Probabilistic PCA (principal components analysis)
* special case of the above equations
* In factor analysis (Bartholomew, 1987; Basilevsky, 1994), the latent variable prior is just the unit variance Gaussian.
* $z \sim N(z; 0, I)$
* the observed variables $x_i$ are assumed to be conditionally independent, given $z$.
* the noise is assumed to be drawn from a diagonal covariance Gaussian distribution, with covariance matrix $\psi = diag(\sigma^2)$
* The role of the latent variables is thus to capture the dependencies between the different observed variables $x_i$.
* Because $x = Wz + \mu + noise$, $x \sim N(x; \mu, WW^T + \psi)$.
* If we make the conditional variances $\sigma_i^2$ equal to each other, $x \sim N(x; \mu, WW^T + \sigma^2 I)$.
* Derivation:
* If we write the joint multivariate Gaussian over $(x, z)$ in partitioned form, $\left[
\begin{array}{c}
x \\
z \\
\end{array}
\right] \sim N(\left[
\begin{array}{c}
\mu_x \\
\mu_z \\
\end{array}
\right], \left[
\begin{array}{cc}
\Sigma_{xx} & \Sigma_{xz} \\
\Sigma_{zx} & \Sigma_{zz} \\
\end{array}
\right])$, then the marginal distribution $p(x)$ (integrating over $z$) is given by $x \sim N(\mu_x, \Sigma_{xx})$
* $\begin{split}\Sigma_{xx} &= E((x - \mu)(x - \mu)^T) = E((Wz + \sigma h)(Wz + \sigma h)^T) \\
&= WE(zz^T)W^T + \sigma^2E(hh^T) = WW^T + \sigma^2I\end{split}$, writing the noise as $\sigma h$ with $h \sim N(h; 0, I)$ as below; the cross terms vanish because $z$ and $h$ are independent with zero mean.
* $x = Wz + \mu + \sigma h$, where $h$ is Gaussian noise $h \sim N(h; 0, I)$.
* $p(x) = N(x; \mu, WW^T + \sigma^2 I)$
* $p(x, z) = p(z)p(x|z)$, so that $p(z|x) \propto p(z)p(x|z)$
$\begin{split}p(z|x) \propto p(z)p(x|z) &= c\exp\{-\frac{1}{2}z^Tz - \frac{1}{2\sigma^2}{||x - Wz - \mu||}^2\} \\
&\propto \exp\{-\frac{1}{2}\lgroup \sigma^{-2}{||x - Wz - \mu||}^2 + z^Tz \rgroup\} \\
&= \exp\{-\frac{1}{2}\lgroup \sigma^{-2}\lgroup z^TW^TWz - 2z^TW^T(x - \mu) + (x - \mu)^T(x - \mu) \rgroup + z^Tz \rgroup\} \\
&= \exp\{-\frac{1}{2}\lgroup z^T(\sigma^{-2}W^TW + I)z - 2z^T\lgroup \sigma^{-2}W^T(x - \mu) \rgroup \rgroup + const\}
\end{split}$
* We can see that $p(z|x)$ is still Gaussian.
* Comparing the exponent with the Gaussian quadratic form:
$(z - \bar \mu)^T\Sigma^{-1}(z - \bar \mu) = z^T\Sigma^{-1}z - 2z^T\Sigma^{-1}\bar \mu + \bar \mu^T \Sigma^{-1} \bar \mu$
$\therefore \Sigma^{-1} = \sigma^{-2}W^TW + I = \sigma^{-2}(W^TW + \sigma^{2}I)$
Let $M = W^TW + \sigma^{2}I \Rightarrow \Sigma = \sigma^{2}M^{-1}$
$\bar \mu = \Sigma\lgroup \sigma^{-2}W^T(x - \mu) \rgroup = M^{-1}W^T(x - \mu)$
$\Rightarrow p(z|x) = N(z; M^{-1}W^T(x - \mu), \sigma^{2}M^{-1})$, where $M = W^TW + \sigma^{2}I$
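* A small NumPy sketch of this posterior, reusing hypothetical parameters in the same spirit as the sampling sketch above (here the noise variance is a single scalar $\sigma^2$, as in probabilistic PCA):

```python
import numpy as np

rng = np.random.default_rng(1)
d_z, d_x = 2, 5
W = rng.normal(size=(d_x, d_z))       # illustrative decoder weights
mu = rng.normal(size=d_x)             # illustrative offset
sigma2 = 0.1                          # scalar noise variance

def posterior_z_given_x(x):
    """p(z|x) = N(z; M^{-1} W^T (x - mu), sigma^2 M^{-1}) with M = W^T W + sigma^2 I."""
    M = W.T @ W + sigma2 * np.eye(d_z)
    M_inv = np.linalg.inv(M)
    return M_inv @ W.T @ (x - mu), sigma2 * M_inv

x = W @ rng.normal(size=d_z) + mu     # a sample observation (noise omitted for brevity)
mean, cov = posterior_z_given_x(x)
print(mean, cov)
```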
* we use a low-dimensional latent variable to represent a high-dimensional data distribution.
* we can use an iterative EM algorithm for estimating the parameters $W$ and $\sigma^2$.
* probabilistic PCA becomes PCA as $\sigma \rightarrow 0$
* EM for Probabilistic PCA
* $p(x, z) = p(x|z)p(z)$
* $Tr$ denotes the trace, i.e. the sum of the diagonal elements; it appears in the $\sigma^2$ update sketched below.
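* As a reference sketch (these are the standard probabilistic-PCA EM updates from Tipping & Bishop, stated here rather than derived; $x_n$ are the $N$ data points of dimension $D$, $\bar x$ is their mean, and $M = W^TW + \sigma^2I$ as above):
* E step: $E[z_n] = M^{-1}W^T(x_n - \bar x)$ and $E[z_nz_n^T] = \sigma^2M^{-1} + E[z_n]E[z_n]^T$
* M step: $W_{new} = \left[\sum_n (x_n - \bar x)E[z_n]^T\right]\left[\sum_n E[z_nz_n^T]\right]^{-1}$ and $\sigma^2_{new} = \frac{1}{ND}\sum_n \left\{ {||x_n - \bar x||}^2 - 2E[z_n]^TW_{new}^T(x_n - \bar x) + Tr(E[z_nz_n^T]W_{new}^TW_{new}) \right\}$, which is where the $Tr$ above appears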
* Maximum Likelihood PCA
* Standard PCA
* From another view, we want to minimize the mean squared reconstruction error, and this turns out to be the same problem.
* And this becomes an eigenvalue problem (see the sketch after this list)
* maximize $w^TSw$ under the constraint that ${||w||}^2 = w^Tw = 1$
* using a Lagrange multiplier, maximize $w^TSw - \lambda (w^Tw - 1)$
* setting the derivative to zero gives $Sw - \lambda w = 0$, i.e. $Sw = \lambda w$, an eigenvalue problem; the maximizer is the eigenvector of $S$ with the largest eigenvalue, since then $w^TSw = \lambda$
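* A minimal NumPy sketch of this eigenvalue view (the toy data below is made up):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated toy data

Xc = X - X.mean(axis=0)                   # center the data
S = Xc.T @ Xc / len(Xc)                   # sample covariance matrix S
eigvals, eigvecs = np.linalg.eigh(S)      # solves S w = lambda w (eigenvalues ascending)

# The first principal component is the eigenvector with the largest eigenvalue;
# it maximizes w^T S w subject to w^T w = 1.
w1 = eigvecs[:, -1]
print(float(w1 @ S @ w1), eigvals[-1])    # the two values coincide
```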
* Standard PCA vs. Probabilistic PCA
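* As a hedged summary of the comparison (standard results from Tipping & Bishop's probabilistic PCA, stated here without derivation; $q$ latent and $D$ observed dimensions):
* Probabilistic PCA has a closed-form maximum-likelihood solution, $W_{ML} = U_q(\Lambda_q - \sigma^2I)^{1/2}R$, where the columns of $U_q$ are the top $q$ eigenvectors of the data covariance, $\Lambda_q$ holds the corresponding eigenvalues, and $R$ is an arbitrary rotation; $\sigma^2_{ML} = \frac{1}{D - q}\sum_{j=q+1}^{D}\lambda_j$ is the average variance of the discarded directions.
* As $\sigma^2 \rightarrow 0$, the posterior mean $M^{-1}W^T(x - \mu) \rightarrow (W^TW)^{-1}W^T(x - \mu)$, the orthogonal projection onto the principal subspace, so probabilistic PCA recovers standard PCA (this is the $\sigma \rightarrow 0$ statement above).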
* Manifold Interpretation of PCA
* Linear factor models, such as PCA, can be interpreted as learning a low-dimensional manifold
* Probabilistic PCA learns a pancake-shaped manifold of high probability
* Standard PCA learns a hyperplane specified by $x = Wz + \mu$
* The Expectation Maximization (EM) Algorithm
* Why EM works
* Before explaining why EM succeeds, let us first introduce a tool:
* Jensen’s inequality
* Jensen's inequality says that when a function $f$ is convex, ${f(\sum_i a_i x_i) \leq \sum_i a_i f(x_i)}$ for any weights $a_i \geq 0$ with $\sum_i a_i = 1$
* Conversely, when the function is concave, ${f(\sum_i a_i x_i) \geq \sum_i a_i f(x_i)}$
* EM algorithm for incomplete data
* Returning to our EM algorithm: first, note that the ${log}$ function is concave. Now look at the (incomplete-data) log-likelihood $J$:
* ${J\ =\ log\sum_{z_i} P(X, z_i|\theta)}$
* With incomplete data, however, we have no information about ${z_i}$, so we assume an unknown distribution ${q(z_i)}$ and the expression becomes:
* ${log\sum_{z_i} q(z_i)\frac{P(X, z_i|\theta)}{q(z_i)}}$
* And because the ${log}$ function is concave, Jensen's inequality gives:
* ${\begin{split}&log\left(\Sigma q(z_i)\frac{P(X, z_i|\theta)}{q(z_i)}\right) \geq \sum q(z_i)log\left(\frac{P(X, z_i|\theta)}{q(z_i)}\right) \\
=\ &\sum q(z_i)log\left(\frac{P(z_i|X, \theta)P(X|\theta)}{q(z_i)}\right) \\
=\ &\sum q(z_i)log\left(P(X|\theta)\right)\
+\ \sum q(z_i)log\left(\frac{P(z_i|X, \theta)}{q(z_i)}\right) \\
=\ &\sum q(z_i)log\left(P(X|\theta)\right)\
-\ \sum q(z_i)log\left(\frac{q(z_i)}{P(z_i|X, \theta)}\right)\end{split}}$
* Notice that ${\sum q(z_i)log\left(\frac{q(z_i)}{P(z_i|X, \theta)}\right)}$ is exactly ${KL(q||p)}$, and ${q(z_i)}$ plays the role of the weights ${w_i}$.
* Therefore, every E step is actually minimizing ${KL(q||p)}$ (recall that ${KL(q||p)\ =\ 0}$ exactly when ${q}$ and ${p}$ are the same distribution).
* So in the E step, computing ${w_i}$ means making ${w_i}$ match the coin (posterior) distribution induced by the current ${P_0,\ P_1,\ \lambda}$.
* In the M step, since the KL term is now zero the lower bound equals ${log P(X|\theta)}$, and we find the MLE of ${P_0,\ P_1,\ \lambda}$ by maximizing the weighted (expected complete-data) log-likelihood ${\sum q(z_i)log P(X, z_i|\theta)}$; a sketch of both steps follows this list.
* After the parameters change, the distributions ${q}$ and ${p}$ no longer match, so we return to the E step to find new ${w_i}$ with ${KL\ =\ 0}$, and repeat.
* Each update of ${P_0,\ P_1,\ \lambda}$ can only raise the lower bound, so things only get better, never worse, until convergence.
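* A minimal sketch of these two steps for the coin example that ${P_0,\ P_1,\ \lambda}$ refer to, assuming a setup where each of $N$ sessions tosses one hidden coin $m$ times (coin 0 is chosen with probability $\lambda$, and the head probabilities are $P_0$ and $P_1$); all numbers and names below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(3)

# Toy data: N sessions, each recording the number of heads in m tosses of a hidden coin.
N, m = 200, 20
true_lam, true_p0, true_p1 = 0.4, 0.3, 0.8
is_coin0 = rng.random(N) < true_lam
heads = rng.binomial(m, np.where(is_coin0, true_p0, true_p1))

lam, p0, p1 = 0.5, 0.4, 0.6                    # initial guesses
for _ in range(100):
    # E step: w_i = P(coin 0 | heads_i) under the current parameters,
    # i.e. the q(z_i) that makes KL(q||p) = 0.
    a = lam * binom.pmf(heads, m, p0)
    b = (1 - lam) * binom.pmf(heads, m, p1)
    w = a / (a + b)

    # M step: maximize the weighted (expected complete-data) log-likelihood.
    lam = w.mean()
    p0 = (w * heads).sum() / (w * m).sum()
    p1 = ((1 - w) * heads).sum() / ((1 - w) * m).sum()

print(lam, p0, p1)                             # approaches the true values (up to label swapping)
```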