# [Scaling Up Probabilistic Circuits by Latent Variable Distillation](https://openreview.net/pdf?id=067CGykiZTS)
## Outline
* Introduction
* Latent Variable Distillation for Hidden Markov Model
* Latent Variable Distillation for Probabilistic Circuits
* Efficient Parameter Learning
* Extracting Latent Variables for Image Modeling
* Experiments
* Conclusion
## Introduction
- One key challenge is to **scale Probabilistic Circuits (PCs)** to model large, high-dimensional real-world datasets:
- We observe that as the number of parameters in PCs increases, their performance quickly plateaus.
- Existing optimizers fail to utilize the expressive power provided by large PCs.
- We propose to overcome this bottleneck with **latent variable distillation (LVD)**:
- LVD leverages deep generative models to **induce semantics-aware assignments** to the latent variables of PCs, providing extra supervision during training.
______________________
- The LVD pipeline consists of two major components:
- (i) Inducing assignments to a subset of latent variables in a PC using information obtained from a deep generative model.
- We cluster training examples based on their neural embeddings and assign the same latent variable values to all examples in the same cluster (see the sketch after this list).
- There is no constraint on how values must be assigned to the latent variables; the methodology can be engineered depending on the nature of the dataset and the architectures of the PC and the deep generative model.
- (ii) Estimating PC parameters given the latent variable assignments.
- To leverage the supervision provided by the assignments obtained in (i), we estimate PC parameters by maximizing the lower bound given by the right-hand side of the inequality below:
$$
\begin{aligned}
\sum_{i=1}^{N} \log p\left(\boldsymbol{x}^{(i)}\right):=\sum_{i=1}^{N} \log \sum_{\boldsymbol{z}} p\left(\boldsymbol{x}^{(i)}, \boldsymbol{z}\right) \geq \sum_{i=1}^{N} \log p\left(\boldsymbol{x}^{(i)}, \boldsymbol{z}^{(i)}\right)
\end{aligned}
$$
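A minimal sketch of component (i), assuming the neural embeddings have already been extracted from the deep generative model as a NumPy array; the cluster index of each example then serves as its latent variable assignment. The function name and embedding file below are illustrative, not part of the paper's code.

```python
import numpy as np
from sklearn.cluster import KMeans

def induce_latent_assignments(embeddings: np.ndarray, num_latent_states: int, seed: int = 0) -> np.ndarray:
    """Cluster neural embeddings and use cluster indices as latent variable assignments.

    embeddings: (num_examples, embed_dim) features from a deep generative model.
    num_latent_states: number of values the materialized latent variable can take.
    Returns z of shape (num_examples,) with z[i] in {0, ..., num_latent_states - 1};
    examples in the same cluster receive the same latent value.
    """
    kmeans = KMeans(n_clusters=num_latent_states, random_state=seed, n_init=10)
    return kmeans.fit_predict(embeddings)

# Hypothetical usage:
# embeddings = np.load("bert_suffix_embeddings.npy")   # placeholder file name
# z = induce_latent_assignments(embeddings, num_latent_states=128)
```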
______________________
## Latent Variable Distillation for Hidden Markov Model

- Dataset & Model.
- WikiText-2.
- We pre-process WikiText-2 by concatenating its tokens into one long token sequence.
- We then collect all subsequences of length 32 to construct the train, validation and test sets from the corresponding splits.
- The HMM is trained only on these length-32 subsequences, so when predicting the next token it conditions on at most the preceding 31 tokens.
- Latent Variable Distillation.
- If the BERT embeddings of some suffixes are close to each other, then the suffixes should be relatively similar, suggesting that they should be “generated” by the same hidden state.
- Clustering the suffix embeddings and using each training example's cluster indices as hidden-state assignments yields an “augmented” training set $\mathcal{D}_{\mathrm{aug}}=\left\{\left(\boldsymbol{x}^{(i)}, \boldsymbol{z}^{(i)}\right)\right\}_{i}$.
- After maximizing the lower bound on $\mathcal{D}_{\mathrm{aug}}$ to obtain HMM parameters $\theta^{*}$, we use $\theta^{*}$ as a starting point and fine-tune the HMM via EM to maximize the true MLE objective $\sum_i \log p\left(\boldsymbol{x}^{(i)}\right)$ (a sketch of this pipeline follows the list).
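The sketch below shows how the distilled hidden-state assignments could supervise the HMM before EM fine-tuning, under the assumption that every position of every length-32 training sequence has been assigned a hidden state by clustering its BERT suffix embedding. The counting scheme is the standard complete-data maximum-likelihood estimate for an HMM (i.e., the maximizer of the lower bound), not necessarily the authors' exact implementation.

```python
import numpy as np

def complete_data_hmm_mle(xs, zs, num_states, vocab_size, alpha=1.0):
    """Estimate HMM parameters from (token sequence, hidden-state sequence) pairs.

    xs, zs: lists of equal-length integer sequences (length 32 here).
    alpha: additive smoothing so that all probabilities stay nonzero.
    Returns (initial, transition, emission) probability tables.
    """
    init = np.full(num_states, alpha)
    trans = np.full((num_states, num_states), alpha)
    emit = np.full((num_states, vocab_size), alpha)
    for x, z in zip(xs, zs):
        init[z[0]] += 1
        for t in range(len(x)):
            emit[z[t], x[t]] += 1
            if t > 0:
                trans[z[t - 1], z[t]] += 1
    return (init / init.sum(),
            trans / trans.sum(axis=1, keepdims=True),
            emit / emit.sum(axis=1, keepdims=True))

# These counts play the role of theta*; EM on the marginal likelihood
# sum_i log p(x^(i)) then fine-tunes the HMM starting from this initialization.
```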
______________________
## Latent Variable Distillation for Probabilistic Circuits

### Probabilistic Circuits: a General TPM Framework
* Definition 1 (Probabilistic Circuits)
* A PC $p(X)$ that encodes a **distribution** over variables $X$ is defined by a parameterized directed acyclic computation graph (DAG) with a single root node $n_r$.
$$
p_{n}(\boldsymbol{x}):= \begin{cases}f_{n}(\boldsymbol{x}) & \text { if } n \text { is an input unit, } \\ \sum_{c \in \operatorname{in}(n)} \theta_{c \mid n} \cdot p_{c}(\boldsymbol{x}) & \text { if } n \text { is a sum unit, } \\ \prod_{c \in \operatorname{in}(n)} p_{c}(\boldsymbol{x}) & \text { if } n \text { is a product unit, }\end{cases}
$$
* A key property that separates PCs from many other generative models is their *tractability*.
* For example, *smoothness* and *decomposability* together guarantee linear time computation of arbitrary marginal probabilities.
* Definition 2 (Smoothness and Decomposability)
* Define the variable scope $ϕ(n)$ of a PC unit $n$ as the set of variables covered by all its descendant input units. (A code sketch illustrating Definitions 1 and 2 follows this list.)
* A PC is **smooth** if for every sum unit $n$, all its children have the same scope: $∀c_1, c_2∈in(n), ϕ(c_1)=ϕ(c_2)$.
* A PC is **decomposable** if the children of every product unit $n$ have disjoint scopes: $∀c_1, c_2∈in(n)(c_1 \ne c_2), ϕ(c_1) ∩ ϕ(c_2)=∅$.
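To make Definitions 1 and 2 concrete, here is a minimal, self-contained sketch of a PC data structure with the three unit types, the recursive evaluation of $p_n(\boldsymbol{x})$, and checks for smoothness and decomposability. The node representation is an assumption made for illustration and is not tied to any particular PC library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class PCUnit:
    kind: str                                              # "input", "sum", or "product"
    children: List["PCUnit"] = field(default_factory=list)
    weights: List[float] = field(default_factory=list)     # theta_{c|n}, sum units only
    f: Optional[Callable[[Dict], float]] = None            # input distribution f_n(x)
    scope: frozenset = frozenset()                         # variables of an input unit

def evaluate(n: PCUnit, x: Dict) -> float:
    """Compute p_n(x) bottom-up, following the three cases of Definition 1."""
    if n.kind == "input":
        return n.f(x)
    if n.kind == "sum":
        return sum(w * evaluate(c, x) for w, c in zip(n.weights, n.children))
    prod = 1.0                                             # product unit
    for c in n.children:
        prod *= evaluate(c, x)
    return prod

def var_scope(n: PCUnit) -> frozenset:
    """Variable scope phi(n): the variables of all descendant input units."""
    if n.kind == "input":
        return n.scope
    return frozenset().union(*(var_scope(c) for c in n.children))

def is_smooth(n: PCUnit) -> bool:
    """Smoothness: all children of every sum unit share the same scope."""
    if n.kind == "sum" and any(var_scope(c) != var_scope(n.children[0]) for c in n.children):
        return False
    return all(is_smooth(c) for c in n.children)

def is_decomposable(n: PCUnit) -> bool:
    """Decomposability: children of every product unit have pairwise disjoint scopes."""
    if n.kind == "product":
        seen = frozenset()
        for c in n.children:
            if seen & var_scope(c):
                return False
            seen |= var_scope(c)
    return all(is_decomposable(c) for c in n.children)
```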
______________________
### Materializing and Distilling Latent Variables in Probabilistic Circuits
- Materializing latent variables means turning a subset of the (implicit) latent variables of a PC into explicit variables, so that semantics-aware assignments induced by a deep generative model can later be attached to them.
- The result is a new PC representing the joint distribution over the observed and the materialized latent variables.
- The LVD pipeline consists of three major steps:
- **Step 1: Materializing Latent Variables.**
- Each assignment to $X, Z$ uniquely determines which input distribution is selected, since all other inputs give zero probability under this assignment; the sum unit is therefore deterministic.
- By materializing more latent variables, we make PCs “more deterministic”, pushing parameter estimation towards a closed-form solution (a code sketch of this construction appears after the summary below).
- **Step 2: Inducing Latent Variable Assignments.**
- Latent variable materialization itself cannot provide any extra supervision to the PC training pipeline.
- For example, for the HMM above, we cluster the suffix embeddings generated by the BERT model and, for each training example, assign hidden states according to its cluster indices.
- **Step 3: PC Parameter Learning.**
1. We obtain an augmented PC $p_{aug}(X, Z; θ)$ whose marginal distribution over $X$ corresponds to $p(X; θ)$.
2. By leveraging a deep generative model $\mathcal{G}$, we obtain an augmented training set $\mathcal{D}_{aug} = \{(x^{(i)}, z^{(i)})\}$ and maximize the lower bound
$$
\sum_{i=1}^{N} \log p\left(\boldsymbol{x}^{(i)} ; \theta\right)=\sum_{i=1}^{N} \log \sum_{\boldsymbol{z}} p_{\text {aug }}\left(\boldsymbol{x}^{(i)}, \boldsymbol{z} ; \theta\right) \geq \sum_{i=1}^{N} \log p_{\text {aug }}\left(\boldsymbol{x}^{(i)}, \boldsymbol{z}^{(i)} ; \theta\right) ;
$$
3. Denoting the maximizer of this lower bound by $θ^∗$, we initialize $p$ with $θ^∗$ and optimize the true MLE objective with respect to the original dataset $\mathcal{D}$: $\sum_{i=1}^{N} \log p\left(\boldsymbol{x}^{(i)} ; \theta\right)$.
- Summary: assume we are given a **PC** $p(X; θ)$ over observed variables $X$ with parameters $θ$, a training set $\mathcal{D} = \{x^{(i)}\}$, and a deep generative model $\mathcal{G}$.
1. Construct a **PC** $p_{aug}(X, Z; θ)$ by materializing a subset of latent variables $Z$ in $p(X; θ)$; note that $p$ and $p_{aug}$ share the same parameter space.
2. Use $\mathcal{G}$ to induce **semantics-aware** latent variable assignments $z^{(i)}$ for each training example $x^{(i)}$; denote the augmented dataset as $\mathcal{D}_{aug} = \{(x^{(i)}, z^{(i)})\}$.
3. Optimize the log-likelihood of $p_{\text {aug }}$ with respect to $\mathcal{D}_{\text {aug }}$, i.e., $\sum_{i} \log p_{\text {aug }}\left(\boldsymbol{x}^{(i)}, \boldsymbol{z}^{(i)} ; \theta\right)$; denote the parameters of $p_{\text {aug }}$ after optimization by $\theta^{*}$.
4. Initialize $p(X; \theta)$ with $\theta^{*}$ and then optimize the log-likelihood of $p$ with respect to the original dataset $\mathcal{D}$, i.e., $\sum_{i} \log p\left(\boldsymbol{x}^{(i)} ; \theta\right)$.
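Reusing the `PCUnit` sketch from the previous subsection, the construction below illustrates Step 1 on a single sum unit: each child is paired with an indicator over a newly materialized latent variable $Z$, so the augmented unit's marginal over $Z$ recovers the original mixture and every complete assignment to $(X, Z)$ activates exactly one child. This is an illustrative construction, not the paper's Algorithm 1.

```python
def materialize_sum_unit(n: PCUnit, z_name: str) -> PCUnit:
    """Augment a sum unit over scope W into one over W and a latent variable Z.

    The j-th child is multiplied by the indicator [Z = j]; summing the result
    over all values of Z gives back the original mixture, while any complete
    assignment (x, z) makes the sum unit deterministic (one nonzero child).
    """
    assert n.kind == "sum"
    new_children = []
    for j, c in enumerate(n.children):
        indicator = PCUnit(kind="input",
                           scope=frozenset({z_name}),
                           f=lambda x, j=j: 1.0 if x[z_name] == j else 0.0)
        new_children.append(PCUnit(kind="product", children=[c, indicator]))
    return PCUnit(kind="sum", children=new_children, weights=list(n.weights))
```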
______________________
## Efficient Parameter Learning
- Optimizing the MLE lower bound with regard to the observed data and inferred LVs **requires feeding all training samples through the whole PC**.
- By exploiting the additional conditional independence assumptions introduced by the materialized LVs, this computation can be broken into much smaller independent sub-problems.
- **Lemma 1**. *For a PC $p(X)$, denote $W$ as the scope of some units in $p$. Assume the variable scope of every PC unit is either a subset of $W$ or disjoint from $W$. Let $Z$ be the LV corresponding to $W$ created by Algorithm 1. Then the variables $W$ are conditionally independent of $X\setminus W$ given $Z$.*
- The key to speeding up LVD is the observation that the MLE lower-bound objective can be factored into independent components following the decomposition of $p(x, z)$:
$$
\begin{aligned}
& \operatorname{LL}\left(p, \mathcal{D}_{\text {aug }}\right):=\sum_{l=1}^{N} \log p\left(\boldsymbol{x}^{(l)}, \boldsymbol{z}^{(l)}\right)=\sum_{l=1}^{N} \sum_{i=1}^{k} \log p\left(\boldsymbol{x}_{i}^{(l)} \mid z_{i}^{(l)}\right)+\sum_{l=1}^{N} \log p\left(\boldsymbol{z}^{(l)}\right),
\end{aligned}
$$
- For each cluster $i$ and category $j$, we optimize the PC parameters for the distribution $p(X_i \mid Z_i = j)$ using the subset of training samples whose LV $z_i$ is assigned to category $j$.
- We optimize the sub-PC corresponding to $p(\boldsymbol{z})$ using the set of all LV assignments (see the sketch below).
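A minimal sketch of how the training data could be partitioned so that each conditional $p(X_i \mid Z_i = j)$ and the top-level $p(Z)$ can be learned independently, matching the factorization above. The data layout (each example split into $k$ segments with one LV per segment) is an assumption for illustration.

```python
from collections import defaultdict

def partition_by_latent(dataset_aug, num_latent_vars):
    """Group training samples by (latent variable index i, assigned category j).

    dataset_aug: list of (x, z) pairs, where x = (x_1, ..., x_k) are the segments
    (e.g., image patches) and z = (z_1, ..., z_k) their distilled LV assignments.
    Returns buckets[(i, j)] = list of x_i values used to fit p(X_i | Z_i = j),
    plus the list of z vectors used to fit the sub-PC for p(Z).
    """
    buckets = defaultdict(list)
    z_assignments = []
    for x, z in dataset_aug:
        for i in range(num_latent_vars):
            buckets[(i, z[i])].append(x[i])
        z_assignments.append(z)
    return buckets, z_assignments

# Each bucket trains one conditional sub-PC; z_assignments train p(Z).
# The total log-likelihood is the sum of these independent terms.
```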

______________________
## Extracting Latent Variables for Image Modeling
- We model images with a two-level hierarchy:
- the low-level models independently encode the distribution of each image patch;
- the top-level model represents the correlations between different patches.
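As a minimal illustration of the first level of this hierarchy, the sketch below splits an image into non-overlapping patches; the patch size and array layout are illustrative assumptions. Each patch would then be the input of one low-level model, while the top-level model ties the patches together.

```python
import numpy as np

def split_into_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping (patch_size, patch_size, C) patches.

    Returns an array of shape (num_patches, patch_size, patch_size, C); each entry
    is modeled by its own low-level model, and the top-level model captures the
    correlations between patches.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    return (image
            .reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
            .swapaxes(1, 2)
            .reshape(-1, patch_size, patch_size, c))
```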

- We choose to use Masked Autoencoders (MAEs) as they produce good features for image patches.
- We first compute each patch's latent features without any context from the other patches.
- We then compute the correlations between the features of all pairs of patches and construct a Maximum Spanning Tree (MST) over the patches using these pairwise correlations.
- Finally, to compute the feature of each patch $X_i$, we additionally feed in the patches corresponding to its ancestors in the MST (see the sketch below).
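A sketch of the MST construction from pairwise feature correlations, assuming the context-free MAE features are available as a matrix with one row per patch. SciPy only ships a minimum-spanning-tree routine, so the weights are negated; using absolute correlations is an illustrative choice here.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def patch_mst(features: np.ndarray) -> np.ndarray:
    """Build a Maximum Spanning Tree over image patches from feature correlations.

    features: (num_patches, feature_dim) context-free patch features (e.g., from an MAE).
    Returns a symmetric boolean adjacency matrix of the MST edges.
    """
    corr = np.abs(np.corrcoef(features))          # pairwise correlation magnitudes
    np.fill_diagonal(corr, 0.0)
    mst = minimum_spanning_tree(-corr).toarray()  # negate: max spanning tree via min-MST
    adj = mst != 0
    return adj | adj.T
```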
______________________
## Experiments

______________________
## Conclusion
- We propose to tackle the scalability challenge of PCs by **latent variable distillation**: a general framework for training probabilistic circuits that provides **extra supervision** over their latent spaces by distilling information from existing deep generative models.
- LVD boosts the performance of large probabilistic circuits on challenging benchmarks for both image and language modeling.
## Appendix

- Why optimize the lower bound?
- By optimizing the lower-bound objective, we can incorporate the guidance provided by the deep generative model and obtain better estimates of the parameters of the PC.
- Furthermore, optimizing the lower-bound objective allows us to leverage the supervision from distilled latent variable assignments, which significantly speeds up the training pipeline and opens up possibilities for further scaling up PCs.
- PC (formal definition)
- Definition 1 (Probabilistic Circuit). Given a set of random variables $\mathbf{X}$, a probabilistic circuit (PC) $\mathcal{P}$ is a tuple $(\mathcal{G}, \psi)$, where $\mathcal{G}$, denoted the computational graph, is a directed acyclic graph (DAG) $(V, E)$ and $\psi: V \mapsto 2^{\mathbf{X}}$, denoted the scope function, is a function assigning a scope, i.e. a subset of $\mathbf{X}$, to each node in $V$. For internal nodes of $\mathcal{G}$, i.e. any node $\mathrm{N} \in V$ which has children, the scope function satisfies $\psi(\mathrm{N})=\cup_{\mathrm{N}^{\prime} \in \operatorname{ch}(\mathrm{N})} \psi\left(\mathrm{N}^{\prime}\right)$. A leaf $\mathrm{L}$ of $\mathcal{G}$ computes a probability density over its scope $\psi(\mathrm{L})$. All internal nodes of $\mathcal{G}$ are either sum nodes $(\mathrm{S})$ or product nodes $(\mathrm{P})$. A sum node $\mathrm{S}$ computes a convex combination of its children, i.e. $\mathrm{S}=\sum_{\mathrm{N} \in \operatorname{ch}(\mathrm{S})} w_{\mathrm{S}, \mathrm{N}} \mathrm{N}$, where $\sum_{\mathrm{N} \in \operatorname{ch}(\mathrm{S})} w_{\mathrm{S}, \mathrm{N}}=1$ and $\forall \mathrm{N} \in \operatorname{ch}(\mathrm{S}): w_{\mathrm{S}, \mathrm{N}} \geq 0$. A product node $\mathrm{P}$ computes a product of its children, i.e. $\mathrm{P}=\prod_{\mathrm{N} \in \operatorname{ch}(\mathrm{P})} \mathrm{N}$.