# L16-GraphicalModels
> Organization contact [name= [ierosodin](ierosodin@gmail.com)]
###### tags: `deep learning` `study notes`
==[Back to Catalog](https://hackmd.io/@ierosodin/Deep_Learning)==
* http://www.deeplearningbook.org/contents/graphical_models.html
* In general, if we wish to model a distribution over a random vector $x$ containing $n$ discrete variables capable of taking on $k$ values each, then the naive approach of representing $P(x)$ by storing a lookup table with one probability value per possible outcome requires $k^n$ parameters!
* The Challenge of Unstructured Modeling
* Memory
* the cost of storing the representation
* Statistical efficiency
* As the number of parameters in a model increases, so does the amount of training data needed to choose the values of those parameters using a statistical estimator.
* Runtime
* the cost of inference
* Runtime
* the cost of sampling
* Directed Models
* directed graphical model, directed acyclic graph (DAG), belief network or Bayesian network
* A directed graphical model defined on variables $x$ is defined by a directed acyclic graph $\mathcal{G}$ whose vertices are the random variables in the model, and a set of local conditional probability distributions $p(x_i| Pa_{\mathcal{G}}(x_i))$, where $Pa_{\mathcal{G}}(x_i)$ gives the parents of $x_i$ in $\mathcal{G}$. The probability distribution over $x$ is given by
* $p(x) = \prod_i p(x_i| Pa_{\mathcal{G}}(x_i))$
* Now suppose we build a directed graphical model over these variables. If $m$ is the maximum number of variables appearing (on either side of the conditioning bar) in a single conditional probability distribution, then the cost of the tables for the directed model scales like $O(k^m)$.
* In other words, as long as each variable has few parents in the graph, the distribution can be represented with very few parameters.
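* A quick parameter-count comparison (a minimal sketch; the chain structure and the values of $n$ and $k$ below are illustrative assumptions): the naive lookup table needs $k^n$ entries, while a chain $x_1 \rightarrow x_2 \rightarrow \dots \rightarrow x_n$ (so $m = 2$) needs one table of $k$ entries plus $n-1$ tables of $k^2$ entries.
```python
# Parameter counts: naive joint table vs. a directed chain x1 -> x2 -> ... -> xn.
# n and k are arbitrary illustrative choices.
n, k = 10, 4

naive = k ** n                    # one entry per joint configuration of all n variables
chain = k + (n - 1) * k ** 2      # p(x1) plus one p(x_i | x_{i-1}) table per remaining variable

print(f"naive lookup table: {naive:,} entries")   # 1,048,576
print(f"directed chain:     {chain:,} entries")   # 148
```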
* D-separation
* dependence separation
* two variables are dependent if there is an active path between them and d-separated if no such path exists.
* Head-to-tail: $a \rightarrow s \rightarrow b$
* $a$ and $b$ are independent (d-separated) given $s$
* $p(a, b|s) = \frac{p(a)p(s|a)p(b|s)}{p(s)} = p(a|s)p(b|s)$
* Tail-to-tail: $a \leftarrow s \rightarrow b$
* $a$ and $b$ are independent (d-separated) given $s$
* $p(a, b|s) = \frac{p(s)p(a|s)p(b|s)}{p(s)} = p(a|s)p(b|s)$
* Head-to-head: $a \rightarrow s \leftarrow b$
* $a$ and $b$ are dependent given $s$
* The head-to-head rule generalizes to the case where a descendant of $s$ is observed
* A path between two variables is blocked with respect to the conditioning set C if it includes a node
* meeting either head-to-tail or tail-to-tail at the node, with that node in C, or
* meeting head-to-head at the node, with neither the node nor any of its descendants in C.
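* A numerical check of the head-to-head ("explaining away") case, as a minimal NumPy sketch; all probability values below are arbitrary assumptions. With $a \rightarrow s \leftarrow b$, $a$ and $b$ are independent marginally but become dependent once $s$ is observed.
```python
import numpy as np

# Toy head-to-head structure a -> s <- b with binary variables (values are assumptions).
p_a = np.array([0.6, 0.4])                       # p(a)
p_b = np.array([0.7, 0.3])                       # p(b)
p_s_ab = np.zeros((2, 2, 2))                     # p(s | a, b), indexed [a, b, s]
p_s_ab[..., 1] = np.array([[0.1, 0.8],
                           [0.7, 0.99]])         # p(s = 1 | a, b)
p_s_ab[..., 0] = 1.0 - p_s_ab[..., 1]

# Joint p(a, b, s) = p(a) p(b) p(s | a, b)
joint = p_a[:, None, None] * p_b[None, :, None] * p_s_ab

# Marginally, a and b are independent: p(a, b) = p(a) p(b)
p_ab = joint.sum(axis=2)
print(np.allclose(p_ab, np.outer(p_a, p_b)))     # True

# Conditioning on s = 1 makes them dependent: p(a, b | s) != p(a | s) p(b | s)
p_ab_s1 = joint[..., 1] / joint[..., 1].sum()
print(np.allclose(p_ab_s1, np.outer(p_ab_s1.sum(1), p_ab_s1.sum(0))))   # False
```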
* Undirected Models
* Markov random field (MRFs) or Markov networks
* Not all situations we might want to model have such a clear direction to their interactions. When the interactions seem to have no intrinsic direction, or to operate in both directions, it may be more appropriate to use an undirected model.
* An undirected graphical model is defined on an undirected graph $\mathcal{G}$ and factorizes the joint distribution of its node variables as a product of potential functions $\phi(C)$ over the maximal cliques $C$ of the graph
* $p(x) = \frac{1}{Z} \prod_{C} \phi(C)$, where the partition function $Z$ normalizes the product
* A clique is a subset of the nodes in a graph $\mathcal{G}$ in which there exists a link between every pair of nodes in the subset.
* A maximal clique $C$ is a clique to which no other node of the graph can be added without it ceasing to be a clique.
* The clique potential $\phi$ measures the affinity of its member variables in each of their possible joint states
* One choice for $\phi$ is the energy-based model (Boltzmann distribution)
* $\phi(C) = e^{-E(x_C)}$
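* A small worked factorization (a minimal sketch; the chain $a - b - c$ and the energy values are illustrative assumptions): the maximal cliques are $\{a, b\}$ and $\{b, c\}$, and the joint is the product of $\phi = e^{-E}$ over those cliques divided by the partition function $Z$.
```python
import numpy as np
from itertools import product

# Toy energies for the maximal cliques {a, b} and {b, c} of the chain a - b - c.
# Lower energy means higher affinity; the numbers are arbitrary assumptions.
E_ab = np.array([[0.0, 2.0],
                 [2.0, 0.0]])     # E(a, b): prefers a == b
E_bc = np.array([[0.0, 1.0],
                 [1.0, 0.0]])     # E(b, c): prefers b == c

def unnormalized(a, b, c):
    # Product of clique potentials phi(C) = exp(-E(x_C))
    return np.exp(-E_ab[a, b]) * np.exp(-E_bc[b, c])

# Partition function Z sums the unnormalized measure over every joint state.
states = list(product([0, 1], repeat=3))
Z = sum(unnormalized(*s) for s in states)

# Normalized joint p(a, b, c) = (1/Z) * prod_C phi(C)
p = {s: unnormalized(*s) / Z for s in states}
print(f"Z = {Z:.4f}")
print("most probable states:", sorted(p, key=p.get, reverse=True)[:2])   # (0,0,0) and (1,1,1)
```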
* Separation
* Separation refers to the conditional independencies implied by the undirected graph
* Given three non-intersecting sets of nodes A, B, and C, A and B are conditionally independent (separated) given C if every path from any node in A to any node in B passes through one or more nodes in C
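* Separation can be checked with a plain graph search (a minimal sketch, not a library API): delete the nodes in C and test whether any node in A can still reach a node in B.
```python
def separated(adj, A, B, C):
    """Return True if A and B are separated given C in the undirected graph `adj`,
    i.e. removing the nodes in C disconnects every node in A from every node in B."""
    blocked = set(C)
    frontier = list(set(A) - blocked)
    seen = set(frontier)
    while frontier:
        node = frontier.pop()
        if node in B:
            return False                          # found a path that avoids C
        for nbr in adj.get(node, ()):
            if nbr not in blocked and nbr not in seen:
                seen.add(nbr)
                frontier.append(nbr)
    return True

# Chain a - c - b: a and b are separated by {c}, but not by the empty set.
adj = {"a": ["c"], "c": ["a", "b"], "b": ["c"]}
print(separated(adj, {"a"}, {"b"}, {"c"}))        # True
print(separated(adj, {"a"}, {"b"}, set()))        # False
```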
* Conversion between Directed and Undirected Models
* (figure: the directed graph $a \rightarrow c \leftarrow b$ and its moralized undirected counterpart, a fully connected triangle over $a$, $b$, $c$)
* In the present case, the single clique potential $\phi$ is given by
* $\phi(a, b, c) = p(a)p(b)p(c|a, b)$
* Directed models are able to use one specific kind of substructure that undirected models cannot represent perfectly. This substructure is called an immorality. The structure occurs when two random variables $a$ and $b$ are both parents of a third random variable $c$, and there is no edge directly connecting $a$ and $b$ in either direction.
* A directed graph D cannot capture all the conditional independences implied by an undirected graph U if U contains a loop of length greater than three, unless that loop also contains a chord.
* If U has loops of length four or greater and does not have chords for these loops, we must add the chords before we can convert it to a directed model. Adding these chords discards some of the independence information that was encoded in U.
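* The conversion from directed to undirected ("moralization") can be sketched in a few lines (a minimal sketch; the example graph is the $a \rightarrow c \leftarrow b$ immorality discussed above): connect every pair of parents that share a child, then drop edge directions.
```python
from itertools import combinations

def moralize(parents):
    """Turn a DAG given as {node: [its parents]} into an undirected edge set:
    marry all co-parents of each node, then drop edge directions."""
    edges = set()
    for child, pars in parents.items():
        for p in pars:                             # keep parent-child edges, now undirected
            edges.add(frozenset((p, child)))
        for p, q in combinations(pars, 2):         # marry co-parents (removes immoralities)
            edges.add(frozenset((p, q)))
    return edges

# Immorality a -> c <- b: moralization adds the extra edge a - b.
print(moralize({"a": [], "b": [], "c": ["a", "b"]}))
# three undirected edges: a-c, b-c, and the added a-b
```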
* Restricted Boltzmann Machines (RBM)
* The RBM is not itself a deep model. Instead, it has a single layer of latent variables that may be used to learn a representation for the input.
* An energy-based model with binary visible and hidden units
* $E(v, h) = -b^T v - c^T h - v^T W h$, with joint distribution $P(v, h) = \frac{1}{Z} e^{-E(v, h)}$
* where b, c, and W are unconstrained, real valued, learnable parameters.
* There is no direct interaction between visible units or between hidden units (essentially, a bipartite graph)
* The restrictions on the RBM structure yield the nice properties $p(v|h) = \prod_i p(v_i | h)$ and $p(h|v) = \prod_i p(h_i | v)$
* For the binary RBM we obtain
* $p(h_i = 1|v) = \sigma(v^TW + c)_i$
* $p(h_i = 0|v) = 1 - \sigma(v^TW + c)_i$
* $p(v_j = 1|h) = \sigma(Wh + b)_j$
* $p(v_j = 0|h) = 1 - \sigma(Wh + b)_j$
* these properties allow for efficient block Gibbs sampling, which alternates between sampling all of h simultaneously and sampling all of v simultaneously.
* the RBM demonstrates the typical deep learning approach to graphical models: representation learning accomplished via layers of latent variables, combined with efficient interactions between layers parametrized by matrices.
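* A single block Gibbs step for a binary RBM (a minimal NumPy sketch with small, randomly initialized, untrained parameters): all hidden units are sampled in parallel from $p(h|v)$, then all visible units from $p(v|h)$.
```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4                                    # arbitrary sizes for illustration
W = rng.normal(scale=0.1, size=(n_v, n_h))         # weights (untrained)
b = np.zeros(n_v)                                  # visible biases
c = np.zeros(n_h)                                  # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v):
    # Sample every hidden unit at once: p(h_i = 1 | v) = sigma(v^T W + c)_i
    h = (rng.random(n_h) < sigmoid(v @ W + c)).astype(float)
    # Sample every visible unit at once: p(v_j = 1 | h) = sigma(W h + b)_j
    v_new = (rng.random(n_v) < sigmoid(W @ h + b)).astype(float)
    return v_new, h

v = rng.integers(0, 2, size=n_v).astype(float)     # random binary starting state
for _ in range(5):                                 # a short block Gibbs chain
    v, h = gibbs_step(v)
print(v, h)
```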
* Label Noise Model
* Modeling conditional distributions with deep neural networks in a graphical model that describes the generation of noisy labels
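* One way to make this concrete (a hedged sketch of a generic label-noise model, not the exact construction from any specific paper): a network predicts the clean-label distribution $p(y|x)$, a noise model gives $p(\tilde{y}|y)$, and the observed noisy label is explained by marginalizing out the unobserved clean label.
```python
import numpy as np

# Hypothetical toy setup with 3 classes; all numbers are illustrative assumptions.
p_y_given_x = np.array([0.7, 0.2, 0.1])   # clean-label posterior from a classifier network
noise = np.array([[0.8, 0.1, 0.1],        # p(noisy label | clean label), rows = clean y
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])

# Marginalize out the unobserved clean label:
# p(y_noisy | x) = sum_y p(y_noisy | y) * p(y | x)
p_noisy_given_x = noise.T @ p_y_given_x
print(p_noisy_given_x)                    # the distribution compared against the observed noisy label
```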