# Causal disentanglement literature review

Visualization of IRS: https://github.com/google-research/disentanglement_lib/blob/master/disentanglement_lib/visualize/visualize_irs.py

IRS code: https://github.com/google-research/disentanglement_lib/tree/master/disentanglement_lib/evaluation/metrics

[VACA: Design of Variational Graph Autoencoders for Interventional and Counterfactual Queries](https://arxiv.org/pdf/2110.14690.pdf)

Similar to CausalVAE, but uses GNNs instead.

---

[**CANDLE dataset**](https://github.com/causal-disentanglement/candle-simulator)

Six data-generating factors, along with both observed and unobserved confounders.

![](https://i.imgur.com/0Nhqfl5.png)

Unobserved confounding:
1. the interaction between the artificial light source and the scene's natural lighting conditions in producing shadows;
2. the location of the object and its size, via depth.

Examples:

![](https://i.imgur.com/vPB63qa.png)
![](https://i.imgur.com/G0qeOLu.png)

Each value of a factor of variation corresponds to a separate .blend file in a hierarchy. The existing assets can also be modified or replaced with Blender.

---

[Disentangling Disentanglement in Variational Autoencoders](https://arxiv.org/pdf/1812.02833.pdf)

**Decomposition in VAEs:**
1. the latent encodings of data have an appropriate level of overlap — more complex datasets require more richly structured dependencies to encode the information needed to generate higher-dimensional data;
2. the aggregate encoding of data $q_{\phi}(z)$ conforms to a desired structure $p(z)$, represented through the prior.

**Deconstruction of $\beta$-VAE:**

$\mathcal{L}_\beta (x) = \mathbb{E}_{q_{\phi}(z|x)} [\log p_\theta(x|z)] - \beta KL(q_{\phi}(z|x) || p(z))$

The value of $\beta$ controls the overlap of encodings in the latent space.

**Facts:**
1. If the prior is Gaussian, there is a direct relationship between the ELBO of the VAE and the ELBO of the $\beta$-VAE.
2. The ELBO of the $\beta$-VAE is invariant under rotations of the latent space (of $\theta, \phi$).
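The $\beta$-VAE objective above can be written, in its equivalent minimization form, as reconstruction NLL plus a $\beta$-weighted KL term. A minimal NumPy sketch, assuming a diagonal-Gaussian encoder $q_\phi(z|x)$; the function and argument names are my own:

```python
import numpy as np

def beta_vae_loss(recon_nll, mu, logvar, beta=4.0):
    """Negated beta-VAE ELBO: reconstruction NLL + beta * KL(q(z|x) || N(0, I)).

    mu, logvar: parameters of the diagonal-Gaussian encoder q(z|x),
    shape (batch, latent_dim). recon_nll: estimate of E_q[-log p(x|z)].
    """
    # Closed-form KL(N(mu, diag(exp(logvar))) || N(0, I)), per sample
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)
    return recon_nll + beta * kl.mean()
```

With `mu = 0` and `logvar = 0` the KL term vanishes and the loss reduces to the reconstruction NLL; larger $\beta$ pulls every $q_\phi(z|x)$ toward the prior, which is what increases the overlap of encodings.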
Therefore we do not see any improvement in disentanglement simply by increasing $\beta$.

**Objective for enforcing disentanglement:**

$\mathcal{L}_{\beta,\alpha} (x) = \mathbb{E}_{q_{\phi}(z|x)} [\log p_\theta(x|z)] - \beta KL(q_{\phi}(z|x) || p(z)) - \alpha \mathbb{D}(q_\phi(z),p(z))$

---

[A Framework for the Quantitative Evaluation of Disentangled Representations](https://openreview.net/forum?id=By-7dz-AZ)

A framework for the quantitative evaluation of disentangled representations when the ground-truth latent structure is available.

Steps:
1. Train a model $M$ on a synthetic dataset with generative factors $z$.
2. Compute a compact data representation $c = M(x)$ for each sample $x$.
3. Train a regressor $f$ to predict $z$ given $c$: $\hat{z} = f(c)$. Either
   * Lasso regression, or
   * a random forest regressor, which yields a matrix of relative importances $R$ s.t. $R_{ij}$ denotes the relative importance of $c_i$ in predicting $z_j$. The number of times a tree chooses to split on a particular input variable determines its importance to the prediction.
4. Quantify the deviation of $\hat{z}$ from $z$:
   * Disentanglement score $D_i$: $P_{ij} = \frac{R_{ij}}{\sum_{k = 0}^{K-1}R_{ik}}$, $H_K(P_i) = -\sum_{k = 0}^{K-1} P_{ik} \log_K P_{ik}$, $D_i = 1 - H_K(P_i)$. To account for irrelevant units in $c$, a weight $\rho_i = \sum_j R_{ij}/\sum_{ij} R_{ij}$ is used; if $c_i$ is irrelevant for predicting $z$, its $\rho_i$ will be near zero.
   * Completeness score $C_j$: the degree to which each underlying factor is captured by a single code variable. If all components of $c$ contribute equally to predicting $z_j$, the importance distribution over codes is uniform, so its entropy (with the logarithm taken base the number of codes) is maximal at 1, and the score $C_j = 1 - H(\tilde{P}_j)$ is 0.
   * Informativeness: $\text{Error}(z_j,\hat{z}_j)$.

---

[beta-VAE](https://openreview.net/forum?id=Sy2fzU9gl)

**Definition of disentanglement:** a disentangled representation is one where single latent units are sensitive to changes in single generative factors, while being relatively invariant to changes in other factors.

Images are generated by $p(x|v,w) = \mathrm{Sim}(v,w)$.
WANT: $p(x|z) \approx p(x|v,w)$, where $z$ are the learned generative latent factors.

**Difference from the VAE loss function:**

$\mathcal{L}(\theta, \phi, \beta; x, z) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta D_{KL}(q_\phi(z|x)||p(z))$

**Disentanglement metric** (independence and interpretability): run inference on a number of images generated by fixing the value of one data-generative factor while randomly sampling all others. Encode pairs $z_{1,l} = \mu(x_{1,l})$, $z_{2,l} = \mu(x_{2,l})$ with the encoder, average their absolute differences over the batch to get $z^b_{\text{diff}}$, and train a linear classifier to predict the fixed factor, $p(y | z^b_{\text{diff}})$. Report its accuracy as the disentanglement metric score.

---

[Representation Learning: A Review and New Perspectives](https://arxiv.org/abs/1206.5538)

---

[Robustly Disentangled Causal Mechanisms](https://arxiv.org/abs/1811.00007)

1. A unifying causal framework of disentangled generative processes and the consequent feature encodings; the *interventional robustness score*.
   * **Disentangled causal process** (assuming the dimensionality $L$ of the confounders is less than the number of factors $K$):
     $C \leftarrow N_c$
     $G_i \leftarrow f_i(PA^C_i, N_i), \quad PA^C_i \subset \{C_1,...,C_L\}$
     $X \leftarrow g(G,N_x)$
   * **Properties of a disentangled causal process:**
     (i) $p(x|g)$ is invariant to changes in the distributions $p(g_i)$
     (ii) in general $G_i \not\!\perp\!\!\!\perp G_j$, but $G_i \perp\!\!\!\perp G_j \mid C$
     (iii) $G_i \not\!\perp\!\!\!\perp G_j \mid X$
     (iv) the latent factors $G$ contain all information about the confounders $C$ that is relevant to $X$
     (v) there is no total causal effect from $G_j$ to $G_i$ for $j \neq i$
     (vi) interventional effects of $G_j$ on $X$ can be estimated from observational data by adjusting for $G_{\backslash j}$
   * **Interventional robustness:** for generality, robustness is stated for groups of features $Z_L$ with respect to interventions on groups of generative factors $G_J$.
2. A new visualisation technique that provides an intuitive understanding of the dependency structure and robustness of learned encodings.
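The interventional robustness idea can be sketched numerically: if a latent feature is robust with respect to a target factor, then within each realization of that factor, interventions on the *other* factors should barely move the feature. A loose NumPy sketch of such a score for one latent column and one discrete factor (this is an illustration in the spirit of the paper's metric, not its exact EMPIDA estimator; names are mine):

```python
import numpy as np

def interventional_robustness(z, g_j):
    """Score in [0, 1] for one latent column z against one discrete factor g_j.

    Within each realization of g_j, the remaining variation in z comes from
    (interventions on) the other factors; we compare the worst within-group
    deviation from the group mean against the global deviation scale.
    """
    global_dev = np.max(np.abs(z - z.mean()))  # assumes z is not constant
    max_dev = 0.0
    for v in np.unique(g_j):
        zv = z[g_j == v]
        # largest post-interventional disagreement within this realization
        max_dev = max(max_dev, np.max(np.abs(zv - zv.mean())))
    return 1.0 - max_dev / global_dev
```

A latent fully determined by $g_j$ scores 1 (perfectly robust to interventions on other factors); a latent whose within-group variation matches its global variation scores near 0.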
---

[CausalVAE](https://arxiv.org/abs/2004.08697)

A causal layer is added to the conventional VAE:

$z = A^Tz + \epsilon = (I - A^T)^{-1}\epsilon, \quad \epsilon \sim N(0,I)$

$A$ is a parameter to be learnt. Goal: learn the distribution $q_\phi(z, \epsilon|x,u)$, where $u$ is additional information associated with the true causal concepts, used as a supervising signal.

Learning objective:

![](https://i.imgur.com/azsZa0n.png)
![](https://i.imgur.com/Dhn8u50.png)

---

[On Causally Disentangled Representations](https://ojs.aaai.org/index.php/AAAI/article/view/20781)

Much of the existing disentanglement literature relies on the assumption that the generative factors are independent of each other, and does not take a causal view of the generating process.

**Properties of causal disentanglement in latent variable models:**

![](https://i.imgur.com/qwd5ZqO.png)

1. In a disentangled causal process, $G_i$ does not causally influence $G_j$.
2. If the encoder $e$ learns a latent space $\mathbf{Z}$ such that each generative factor $G_i$ is mapped to a unique $\mathbf{Z}_I$, then the generator $g$ is a disentangled causal mechanism that models the underlying generative process (disentanglement is viewed in terms of the generator instead of the encoder).
3. The only causal feature of $\hat{x}$ w.r.t. generative factor $G_i$ is $\mathbf{Z}_I$, $\forall i$.

**Evaluation metrics and dataset:**
* Unconfoundedness: if a model is able to map each $G_i$ to a unique $\mathbf{Z}_I$, we say that the learned latent space $\mathbf{Z}$ is unconfounded.
* Counterfactual generativeness: intervening on the latents corresponding to the background should only change the background; intervening on the latents corresponding to the texture or shape of the ball should not change the background.
* Dataset: CANDLE
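The linear-SCM causal layer from CausalVAE, together with the kind of latent do-intervention that counterfactual generativeness probes, can be sketched in a few lines of NumPy (a minimal sketch under the linear-SCM assumption; the function names and intervention mechanics are my own illustration):

```python
import numpy as np

def causal_layer(eps, A):
    """CausalVAE-style linear SCM: solve z = A^T z + eps,
    i.e. z = (I - A^T)^{-1} eps, for an acyclic adjacency matrix A
    over the latent causal factors (A[i, j] != 0 means i -> j)."""
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) - A.T, eps)

def do_intervention(eps, A, idx, value):
    """do(z_idx = value): cut the incoming edges of node idx and clamp
    its exogenous term, then re-solve so the change propagates only to
    the node's descendants."""
    A = A.copy()
    A[:, idx] = 0.0   # remove parents of the intervened node
    eps = eps.copy()
    eps[idx] = value  # with no parents left, z_idx = eps_idx = value
    return causal_layer(eps, A)
```

For example, with two latents where factor 0 causes factor 1, intervening on latent 0 changes both, while intervening on latent 1 leaves latent 0 untouched — exactly the asymmetry the counterfactual generativeness metric checks on CANDLE's background/object latents.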