# L13-LinearFactorModels
> Organization contact [name= [ierosodin](ierosodin@gmail.com)]
###### tags: `deep learning` `study notes`
==[Back to Catalog](https://hackmd.io/@ierosodin/Deep_Learning)==
* http://www.deeplearningbook.org/contents/linear_factors.html
* probabilistic models with latent variables
* A linear factor model is defined by the use of a stochastic linear decoder function that generates $x$ by adding noise to a linear transformation of the latent variables $z$.
* A linear factor model describes the data-generation process
* sample the explanatory factors z from a factorial distribution $z \sim p(z)$
* sample the real-valued observable variables given the factors: $x = Wz + \mu + noise$ (see the sampling sketch below)
* the noise is typically Gaussian and diagonal (independent across dimensions)
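* A minimal NumPy sketch of this data-generation process (the dimensions, `W`, `mu`, and the noise scales below are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

d_z, d_x = 2, 5                                # latent and observed dimensions (arbitrary)
W = rng.normal(size=(d_x, d_z))                # linear decoder weights
mu = rng.normal(size=d_x)                      # offset
sigma = np.array([0.1, 0.2, 0.1, 0.3, 0.2])    # per-dimension noise std (diagonal Gaussian noise)

n = 100_000
z = rng.normal(size=(n, d_z))                  # z ~ p(z); here a unit Gaussian, as in factor analysis below
noise = rng.normal(size=(n, d_x)) * sigma      # noise independent across dimensions
x = z @ W.T + mu + noise                       # x = Wz + mu + noise

# Empirical check: cov(x) approaches W W^T + diag(sigma^2) (derived for factor analysis below)
print(np.abs(np.cov(x, rowvar=False) - (W @ W.T + np.diag(sigma**2))).max())
```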
* Probabilistic PCA (principal components analysis)
* special case of the above equations
* In factor analysis (Bartholomew, 1987; Basilevsky, 1994), the latent variable prior is just the unit variance Gaussian.
* $z \sim N(z; 0, I)$
* the observed variables $x_i$ are assumed to be conditionally independent, given $z$.
* the noise is assumed to be drawn from a diagonal covariance Gaussian distribution, with covariance matrix $\psi = diag(\sigma^2)$
* The role of the latent variables is thus to capture the dependencies between the different observed variables $x_i$.
* Because $x = Wz + \mu + noise$, $x \sim N(x; \mu, WW^T + \psi)$.
* If we make the conditional variances $\sigma_i^2$ equal to each other, $x \sim N(x; \mu, WW^T + \sigma^2 I)$.
* Derivation:
* If we write the joint multivariate Gaussian over $(x, z)$ in partitioned form, $\left[
\begin{array}{c}
x \\
z \\
\end{array}
\right] \sim N(\left[
\begin{array}{c}
\mu_x \\
\mu_z \\
\end{array}
\right], \left[
\begin{array}{cc}
\Sigma_{xx} & \Sigma_{xz} \\
\Sigma_{zx} & \Sigma_{zz} \\
\end{array}
\right])$, then the marginal distribution $p(x)$ (integrating over $z$) is given by $x \sim N(\mu_x, \Sigma_{xx})$
* $\begin{split}\Sigma_{xx} &= E((x - \mu)(x - \mu)^T) = E((Wz + \sigma h)(Wz + \sigma h)^T) \\
&= WE(zz^T)W^T + \sigma^2E(hh^T) = WW^T + \sigma^2I\end{split}$, writing the noise as $\sigma h$ with $h \sim N(h; 0, I)$ as below; the cross terms vanish because $z$ and $h$ are independent with zero mean.
* $x = Wz + \mu + \sigma h$, where $h$ is Gaussian noise $h \sim N(h; 0, I)$.
* $p(x) = N(x; \mu, WW^T + \sigma^2 I)$
* $p(x, z) = p(z)p(x|z)$, so that $p(z|x) \propto p(z)p(x|z)$
$\begin{split}p(z|x) \propto p(z)p(x|z) &= c\exp\{-\frac{1}{2}z^Tz - \frac{1}{2\sigma^2}{||x - Wz - \mu||}^2\} \\
&\propto \exp\{-\frac{1}{2}\lgroup \sigma^{-2}{||x - Wz - \mu||}^2 + z^Tz \rgroup\} \\
&= \exp\{-\frac{1}{2}\lgroup \sigma^{-2}\lgroup z^TW^TWz - 2z^TW^T(x - \mu) + (x - \mu)^T(x - \mu) \rgroup + z^Tz \rgroup\} \\
&= \exp\{-\frac{1}{2}\lgroup z^T(\sigma^{-2}W^TW + I)z - 2z^T\lgroup \sigma^{-2}W^T(x - \mu) \rgroup \rgroup + const\}
\end{split}$
* We can see that $p(z|x)$ is still Gaussian.
* Comparing the exponent with the Gaussian quadratic form:
$(z - \bar \mu)^T\Sigma^{-1}(z - \bar \mu) = z^T\Sigma^{-1}z - 2z^T\Sigma^{-1}\bar \mu + \bar \mu^T \Sigma^{-1} \bar \mu$
$\therefore \Sigma^{-1} = \sigma^{-2}W^TW + I = \sigma^{-2}(W^TW + \sigma^{2}I)$
Let $M = W^TW + \sigma^{2}I \Rightarrow \Sigma = \sigma^{2}M^{-1}$
$\bar \mu = \Sigma\lgroup \sigma^{-2}W^T(x - \mu) \rgroup = M^{-1}W^T(x - \mu)$
$\Rightarrow p(z|x) = N(z; M^{-1}W^T(x - \mu), \sigma^{2}M^{-1})$, where $M = W^TW + \sigma^{2}I$
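* A small NumPy sketch of this posterior, reusing hypothetical parameters in the same spirit as the sampling sketch above (here the noise variance is a single scalar $\sigma^2$, as in probabilistic PCA):

```python
import numpy as np

rng = np.random.default_rng(1)
d_z, d_x = 2, 5
W = rng.normal(size=(d_x, d_z))       # illustrative decoder weights
mu = rng.normal(size=d_x)             # illustrative offset
sigma2 = 0.1                          # scalar noise variance

def posterior_z_given_x(x):
    """p(z|x) = N(z; M^{-1} W^T (x - mu), sigma^2 M^{-1}) with M = W^T W + sigma^2 I."""
    M = W.T @ W + sigma2 * np.eye(d_z)
    M_inv = np.linalg.inv(M)
    return M_inv @ W.T @ (x - mu), sigma2 * M_inv

x = W @ rng.normal(size=d_z) + mu     # a sample observation (noise omitted for brevity)
mean, cov = posterior_z_given_x(x)
print(mean, cov)
```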
* we use a low-dimensional latent variable to represent a high-dimensional data distribution.
* we can use an iterative EM algorithm for estimating the parameters $W$ and $\sigma^2$.
* probabilistic PCA becomes PCA as $\sigma \rightarrow 0$
* EM for Probabilistic PCA
* $p(x, z) = p(x|z)p(z)$
* $Tr$ denotes the trace, i.e. the sum of the diagonal elements; it appears in the $\sigma^2$ update sketched below.
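* As a reference sketch (these are the standard probabilistic-PCA EM updates from Tipping & Bishop, stated here rather than derived; $x_n$ are the $N$ data points of dimension $D$, $\bar x$ is their mean, and $M = W^TW + \sigma^2I$ as above):
* E step: $E[z_n] = M^{-1}W^T(x_n - \bar x)$ and $E[z_nz_n^T] = \sigma^2M^{-1} + E[z_n]E[z_n]^T$
* M step: $W_{new} = \left[\sum_n (x_n - \bar x)E[z_n]^T\right]\left[\sum_n E[z_nz_n^T]\right]^{-1}$ and $\sigma^2_{new} = \frac{1}{ND}\sum_n \left\{ {||x_n - \bar x||}^2 - 2E[z_n]^TW_{new}^T(x_n - \bar x) + Tr(E[z_nz_n^T]W_{new}^TW_{new}) \right\}$, which is where the $Tr$ above appears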
* Maximum Likelihood PCA
* Standard PCA
* From another view, we want to minimize the mean squared reconstruction error, and this turns out to be the same problem.
* And this becomes an eigenvalue problem (see the sketch after this list)
* maximize $w^TSw$ under the constraint that ${||w||}^2 = w^Tw = 1$
* using a Lagrange multiplier, maximize $w^TSw - \lambda (w^Tw - 1)$
* setting the derivative to zero gives $Sw - \lambda w = 0$, i.e. $Sw = \lambda w$, an eigenvalue problem; the maximizer is the eigenvector of $S$ with the largest eigenvalue, since then $w^TSw = \lambda$
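* A minimal NumPy sketch of this eigenvalue view (the toy data below is made up):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated toy data

Xc = X - X.mean(axis=0)                   # center the data
S = Xc.T @ Xc / len(Xc)                   # sample covariance matrix S
eigvals, eigvecs = np.linalg.eigh(S)      # solves S w = lambda w (eigenvalues ascending)

# The first principal component is the eigenvector with the largest eigenvalue;
# it maximizes w^T S w subject to w^T w = 1.
w1 = eigvecs[:, -1]
print(float(w1 @ S @ w1), eigvals[-1])    # the two values coincide
```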
* Standard PCA vs. Probabilistic PCA
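* As a hedged summary of the comparison (standard results from Tipping & Bishop's probabilistic PCA, stated here without derivation; $q$ latent and $D$ observed dimensions):
* Probabilistic PCA has a closed-form maximum-likelihood solution, $W_{ML} = U_q(\Lambda_q - \sigma^2I)^{1/2}R$, where the columns of $U_q$ are the top $q$ eigenvectors of the data covariance, $\Lambda_q$ holds the corresponding eigenvalues, and $R$ is an arbitrary rotation; $\sigma^2_{ML} = \frac{1}{D - q}\sum_{j=q+1}^{D}\lambda_j$ is the average variance of the discarded directions.
* As $\sigma^2 \rightarrow 0$, the posterior mean $M^{-1}W^T(x - \mu) \rightarrow (W^TW)^{-1}W^T(x - \mu)$, the orthogonal projection onto the principal subspace, so probabilistic PCA recovers standard PCA (this is the $\sigma \rightarrow 0$ statement above).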
* Manifold Interpretation of PCA
* Linear factor models, such as PCA, can be interpreted as learning a low-dimensional manifold
* Probabilistic PCA learns a pancake-shaped manifold of high probability
* Standard PCA learns a hyperplane specified by $x = Wz + \mu$
* The Expectation Maximization (EM) Algorithm
* Why EM works
* Before explaining why EM succeeds, let us first introduce a tool:
* Jensen’s inequality
* Jensen's inequality says that when a function $f$ is convex, ${f(\sum_i a_i x_i) \leq \sum_i a_i f(x_i)}$ for any weights $a_i \geq 0$ with $\sum_i a_i = 1$
* Conversely, when the function is concave, ${f(\sum_i a_i x_i) \geq \sum_i a_i f(x_i)}$
* EM algorithm for incomplete data
* Returning to our EM algorithm: first, note that the ${log}$ function is concave. Now look at the (incomplete-data) log-likelihood $J$:
* ${J\ =\ log\sum_{z_i} P(X, z_i|\theta)}$
* With incomplete data, however, we have no information about ${z_i}$, so we assume an unknown distribution ${q(z_i)}$ and the expression becomes:
* ${log\sum_{z_i} q(z_i)\frac{P(X, z_i|\theta)}{q(z_i)}}$
* And because the ${log}$ function is concave, Jensen's inequality gives:
* ${\begin{split}&log\left(\Sigma q(z_i)\frac{P(X, z_i|\theta)}{q(z_i)}\right) \geq \sum q(z_i)log\left(\frac{P(X, z_i|\theta)}{q(z_i)}\right) \\
=\ &\sum q(z_i)log\left(\frac{P(z_i|X, \theta)P(X|\theta)}{q(z_i)}\right) \\
=\ &\sum q(z_i)log\left(P(X|\theta)\right)\
+\ \sum q(z_i)log\left(\frac{P(z_i|X, \theta)}{q(z_i)}\right) \\
=\ &\sum q(z_i)log\left(P(X|\theta)\right)\
-\ \sum q(z_i)log\left(\frac{q(z_i)}{P(z_i|X, \theta)}\right)\end{split}}$
* Notice that ${\sum q(z_i)log\left(\frac{q(z_i)}{P(z_i|X, \theta)}\right)}$ is exactly ${KL(q||p)}$, and ${q(z_i)}$ plays the role of the weights ${w_i}$.
* Therefore, every E step is actually minimizing ${KL(q||p)}$ (recall that ${KL(q||p)\ =\ 0}$ exactly when ${q}$ and ${p}$ are the same distribution).
* So in the E step, computing ${w_i}$ means making ${w_i}$ match the coin (posterior) distribution induced by the current ${P_0,\ P_1,\ \lambda}$.
* In the M step, since the KL term is now zero the lower bound equals ${log P(X|\theta)}$, and we find the MLE of ${P_0,\ P_1,\ \lambda}$ by maximizing the weighted (expected complete-data) log-likelihood ${\sum q(z_i)log P(X, z_i|\theta)}$; a sketch of both steps follows this list.
* After the parameters change, the distributions ${q}$ and ${p}$ no longer match, so we return to the E step to find new ${w_i}$ with ${KL\ =\ 0}$, and repeat.
* Each update of ${P_0,\ P_1,\ \lambda}$ can only raise the lower bound, so things only get better, never worse, until convergence.
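* A minimal sketch of these two steps for the coin example that ${P_0,\ P_1,\ \lambda}$ refer to, assuming a setup where each of $N$ sessions tosses one hidden coin $m$ times (coin 0 is chosen with probability $\lambda$, and the head probabilities are $P_0$ and $P_1$); all numbers and names below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(3)

# Toy data: N sessions, each recording the number of heads in m tosses of a hidden coin.
N, m = 200, 20
true_lam, true_p0, true_p1 = 0.4, 0.3, 0.8
is_coin0 = rng.random(N) < true_lam
heads = rng.binomial(m, np.where(is_coin0, true_p0, true_p1))

lam, p0, p1 = 0.5, 0.4, 0.6                    # initial guesses
for _ in range(100):
    # E step: w_i = P(coin 0 | heads_i) under the current parameters,
    # i.e. the q(z_i) that makes KL(q||p) = 0.
    a = lam * binom.pmf(heads, m, p0)
    b = (1 - lam) * binom.pmf(heads, m, p1)
    w = a / (a + b)

    # M step: maximize the weighted (expected complete-data) log-likelihood.
    lam = w.mean()
    p0 = (w * heads).sum() / (w * m).sum()
    p1 = ((1 - w) * heads).sum() / ((1 - w) * m).sum()

print(lam, p0, p1)                             # approaches the true values (up to label swapping)
```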