# L16-GraphicalModels
> Organization contact [name= [ierosodin](ierosodin@gmail.com)]
###### tags: `deep learning` `study notes`
==[Back to Catalog](https://hackmd.io/@ierosodin/Deep_Learning)==
* http://www.deeplearningbook.org/contents/graphical_models.html
* In general, if we wish to model a distribution over a random vector $x$ containing $n$ discrete variables capable of taking on $k$ values each, then the naive approach of representing $P(x)$ by storing a lookup table with one probability value per possible outcome requires $k^n$ parameters!
* The Challenge of Unstructured Modeling
* Memory
* the cost of storing the representation
* Statistical efficiency
* As the number of parameters in a model increases, so does the amount of training data needed to choose the values of those parameters using a statistical estimator.
* Runtime
* the cost of inference
* Runtime
* the cost of sampling
* Directed Models
* directed graphical model, directed acyclic graph (DAG), belief network or Bayesian network
* A directed graphical model defined on variables $x$ is defined by a directed acyclic graph $\mathcal{G}$ whose vertices are the random variables in the model, and a set of local conditional probability distributions $p(x_i| Pa_{\mathcal{G}}(x_i))$, where $Pa_{\mathcal{G}}(x_i)$ gives the parents of $x_i$ in $\mathcal{G}$. The probability distribution over $x$ is given by
* $p(x) = \prod_i p(x_i| Pa_{\mathcal{G}}(x_i))$
* Now suppose we build a directed graphical model over these variables. If $m$ is the maximum number of variables appearing (on either side of the conditioning bar) in a single conditional probability distribution, then the cost of the tables for the directed model scales like $O(k^m)$.
* In other words, as long as each variable has few parents in the graph, the distribution can be represented with very few parameters.
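* A quick parameter-count comparison (a minimal sketch; the chain structure and the values of $n$ and $k$ below are illustrative assumptions): the naive lookup table needs $k^n$ entries, while a chain $x_1 \rightarrow x_2 \rightarrow \dots \rightarrow x_n$ (so $m = 2$) needs one table of $k$ entries plus $n-1$ tables of $k^2$ entries.
```python
# Parameter counts: naive joint table vs. a directed chain x1 -> x2 -> ... -> xn.
# n and k are arbitrary illustrative choices.
n, k = 10, 4

naive = k ** n                    # one entry per joint configuration of all n variables
chain = k + (n - 1) * k ** 2      # p(x1) plus one p(x_i | x_{i-1}) table per remaining variable

print(f"naive lookup table: {naive:,} entries")   # 1,048,576
print(f"directed chain:     {chain:,} entries")   # 148
```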
* D-separation
* dependence separation
* two variables are dependent if there is an active path between them and d-separated if no such path exists.
* Head-to-tail: $a \rightarrow s \rightarrow b$
* $a$ and $b$ are independent (d-separated) given $s$
* $p(a, b|s) = \frac{p(a)p(s|a)p(b|s)}{p(s)} = p(a|s)p(b|s)$
* Tail-to-tail: $a \leftarrow s \rightarrow b$
* $a$ and $b$ are independent (d-separated) given $s$
* $p(a, b|s) = \frac{p(s)p(a|s)p(b|s)}{p(s)} = p(a|s)p(b|s)$
* Head-to-head: $a \rightarrow s \leftarrow b$
* $a$ and $b$ are dependent given $s$
* The head-to-head rule generalizes to the case where a descendant of $s$ is observed
* A path between two variables is blocked with respect to the conditioning set C if it includes a node
* meeting either head-to-tail or tail-to-tail at the node, with that node in C, or
* meeting head-to-head at the node, with neither the node nor any of its descendants in C.
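* A numerical check of the head-to-head ("explaining away") case, as a minimal NumPy sketch; all probability values below are arbitrary assumptions. With $a \rightarrow s \leftarrow b$, $a$ and $b$ are independent marginally but become dependent once $s$ is observed.
```python
import numpy as np

# Toy head-to-head structure a -> s <- b with binary variables (values are assumptions).
p_a = np.array([0.6, 0.4])                       # p(a)
p_b = np.array([0.7, 0.3])                       # p(b)
p_s_ab = np.zeros((2, 2, 2))                     # p(s | a, b), indexed [a, b, s]
p_s_ab[..., 1] = np.array([[0.1, 0.8],
                           [0.7, 0.99]])         # p(s = 1 | a, b)
p_s_ab[..., 0] = 1.0 - p_s_ab[..., 1]

# Joint p(a, b, s) = p(a) p(b) p(s | a, b)
joint = p_a[:, None, None] * p_b[None, :, None] * p_s_ab

# Marginally, a and b are independent: p(a, b) = p(a) p(b)
p_ab = joint.sum(axis=2)
print(np.allclose(p_ab, np.outer(p_a, p_b)))     # True

# Conditioning on s = 1 makes them dependent: p(a, b | s) != p(a | s) p(b | s)
p_ab_s1 = joint[..., 1] / joint[..., 1].sum()
print(np.allclose(p_ab_s1, np.outer(p_ab_s1.sum(1), p_ab_s1.sum(0))))   # False
```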
* Undirected Models
* Markov random field (MRFs) or Markov networks
* Not all situations we might want to model have such a clear direction to their interactions. When the interactions seem to have no intrinsic direction, or to operate in both directions, it may be more appropriate to use an undirected model.
* An undirected graphical model is defined on an undirected graph $\mathcal{G}$ and factorizes the joint distribution of its node variables as a product of potential functions $\phi(C)$ over the maximal cliques $C$ of the graph
* $p(x) = \frac{1}{Z} \prod_{C} \phi(C)$, where the partition function $Z$ normalizes the product
* A clique is a subset of the nodes in a graph $\mathcal{G}$ in which there exists a link between every pair of nodes in the subset.
* A maximal clique $C$ is a clique to which no other node of the graph can be added without it ceasing to be a clique.
* The clique potential $\phi$ measures the affinity of its member variables in each of their possible joint states
* One choice for $\phi$ is the energy-based model (Boltzmann distribution)
* $\phi(C) = e^{-E(x_C)}$
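* A small worked factorization (a minimal sketch; the chain $a - b - c$ and the energy values are illustrative assumptions): the maximal cliques are $\{a, b\}$ and $\{b, c\}$, and the joint is the product of $\phi = e^{-E}$ over those cliques divided by the partition function $Z$.
```python
import numpy as np
from itertools import product

# Toy energies for the maximal cliques {a, b} and {b, c} of the chain a - b - c.
# Lower energy means higher affinity; the numbers are arbitrary assumptions.
E_ab = np.array([[0.0, 2.0],
                 [2.0, 0.0]])     # E(a, b): prefers a == b
E_bc = np.array([[0.0, 1.0],
                 [1.0, 0.0]])     # E(b, c): prefers b == c

def unnormalized(a, b, c):
    # Product of clique potentials phi(C) = exp(-E(x_C))
    return np.exp(-E_ab[a, b]) * np.exp(-E_bc[b, c])

# Partition function Z sums the unnormalized measure over every joint state.
states = list(product([0, 1], repeat=3))
Z = sum(unnormalized(*s) for s in states)

# Normalized joint p(a, b, c) = (1/Z) * prod_C phi(C)
p = {s: unnormalized(*s) / Z for s in states}
print(f"Z = {Z:.4f}")
print("most probable states:", sorted(p, key=p.get, reverse=True)[:2])   # (0,0,0) and (1,1,1)
```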
* Separation
* Separation refers to the conditional independencies implied by the undirected graph
* Given three non-intersecting sets of nodes A, B, and C, A and B are conditionally independent (separated) given C if every path from any node in A to any node in B passes through one or more nodes in C
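* Separation can be checked with a plain graph search (a minimal sketch, not a library API): delete the nodes in C and test whether any node in A can still reach a node in B.
```python
def separated(adj, A, B, C):
    """Return True if A and B are separated given C in the undirected graph `adj`,
    i.e. removing the nodes in C disconnects every node in A from every node in B."""
    blocked = set(C)
    frontier = list(set(A) - blocked)
    seen = set(frontier)
    while frontier:
        node = frontier.pop()
        if node in B:
            return False                          # found a path that avoids C
        for nbr in adj.get(node, ()):
            if nbr not in blocked and nbr not in seen:
                seen.add(nbr)
                frontier.append(nbr)
    return True

# Chain a - c - b: a and b are separated by {c}, but not by the empty set.
adj = {"a": ["c"], "c": ["a", "b"], "b": ["c"]}
print(separated(adj, {"a"}, {"b"}, {"c"}))        # True
print(separated(adj, {"a"}, {"b"}, set()))        # False
```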
* Conversion between Directed and Undirected Models
* (figure: the directed graph $a \rightarrow c \leftarrow b$ and its moralized undirected counterpart, a fully connected triangle over $a$, $b$, $c$)
* In the present case, the single clique potential $\phi$ is given by
* $\phi(a, b, c) = p(a)p(b)p(c|a, b)$
* Directed models are able to use one specific kind of substructure that undirected models cannot represent perfectly. This substructure is called an immorality. The structure occurs when two random variables $a$ and $b$ are both parents of a third random variable $c$, and there is no edge directly connecting $a$ and $b$ in either direction.
* A directed graph D cannot capture all the conditional independences implied by an undirected graph U if U contains a loop of length greater than three, unless that loop also contains a chord.
* If U has loops of length four or greater and does not have chords for these loops, we must add the chords before we can convert it to a directed model. Adding these chords discards some of the independence information that was encoded in U.
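* The conversion from directed to undirected ("moralization") can be sketched in a few lines (a minimal sketch; the example graph is the $a \rightarrow c \leftarrow b$ immorality discussed above): connect every pair of parents that share a child, then drop edge directions.
```python
from itertools import combinations

def moralize(parents):
    """Turn a DAG given as {node: [its parents]} into an undirected edge set:
    marry all co-parents of each node, then drop edge directions."""
    edges = set()
    for child, pars in parents.items():
        for p in pars:                             # keep parent-child edges, now undirected
            edges.add(frozenset((p, child)))
        for p, q in combinations(pars, 2):         # marry co-parents (removes immoralities)
            edges.add(frozenset((p, q)))
    return edges

# Immorality a -> c <- b: moralization adds the extra edge a - b.
print(moralize({"a": [], "b": [], "c": ["a", "b"]}))
# three undirected edges: a-c, b-c, and the added a-b
```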
* Restricted Boltzmann Machines (RBM)
* The RBM is not itself a deep model. Instead, it has a single layer of latent variables that may be used to learn a representation for the input.
* An energy-based model with binary visible and hidden units
* $E(v, h) = -b^T v - c^T h - v^T W h$, with joint distribution $P(v, h) = \frac{1}{Z} e^{-E(v, h)}$
* where b, c, and W are unconstrained, real valued, learnable parameters.
* There is no direct interaction between visible units or between hidden units (essentially, a bipartite graph)
* The restrictions on the RBM structure yield the nice properties $p(v|h) = \prod_i p(v_i | h)$ and $p(h|v) = \prod_i p(h_i | v)$
* For the binary RBM we obtain
* $p(h_i = 1|v) = \sigma(v^TW + c)_i$
* $p(h_i = 0|v) = 1 - \sigma(v^TW + c)_i$
* $p(v_j = 1|h) = \sigma(Wh + b)_j$
* $p(v_j = 0|h) = 1 - \sigma(Wh + b)_j$
* these properties allow for efficient block Gibbs sampling, which alternates between sampling all of h simultaneously and sampling all of v simultaneously.
* the RBM demonstrates the typical deep learning approach to graphical models: representation learning accomplished via layers of latent variables, combined with efficient interactions between layers parametrized by matrices.
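* A single block Gibbs step for a binary RBM (a minimal NumPy sketch with small, randomly initialized, untrained parameters): all hidden units are sampled in parallel from $p(h|v)$, then all visible units from $p(v|h)$.
```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4                                    # arbitrary sizes for illustration
W = rng.normal(scale=0.1, size=(n_v, n_h))         # weights (untrained)
b = np.zeros(n_v)                                  # visible biases
c = np.zeros(n_h)                                  # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v):
    # Sample every hidden unit at once: p(h_i = 1 | v) = sigma(v^T W + c)_i
    h = (rng.random(n_h) < sigmoid(v @ W + c)).astype(float)
    # Sample every visible unit at once: p(v_j = 1 | h) = sigma(W h + b)_j
    v_new = (rng.random(n_v) < sigmoid(W @ h + b)).astype(float)
    return v_new, h

v = rng.integers(0, 2, size=n_v).astype(float)     # random binary starting state
for _ in range(5):                                 # a short block Gibbs chain
    v, h = gibbs_step(v)
print(v, h)
```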
* Label Noise Model
* Modeling conditional distributions with deep neural networks in a graphical model that describes the generation of noisy labels
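* One way to make this concrete (a hedged sketch of a generic label-noise model, not the exact construction from any specific paper): a network predicts the clean-label distribution $p(y|x)$, a noise model gives $p(\tilde{y}|y)$, and the observed noisy label is explained by marginalizing out the unobserved clean label.
```python
import numpy as np

# Hypothetical toy setup with 3 classes; all numbers are illustrative assumptions.
p_y_given_x = np.array([0.7, 0.2, 0.1])   # clean-label posterior from a classifier network
noise = np.array([[0.8, 0.1, 0.1],        # p(noisy label | clean label), rows = clean y
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])

# Marginalize out the unobserved clean label:
# p(y_noisy | x) = sum_y p(y_noisy | y) * p(y | x)
p_noisy_given_x = noise.T @ p_y_given_x
print(p_noisy_given_x)                    # the distribution compared against the observed noisy label
```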