# Causal Reasoning using universal priors

## Causal de Finetti world

### ICM generative model

A multivariate exchangeable sequence $\{X_{i, n}\}_{i\in\{1,\ldots,D\}, n\in\mathbb{N}}$ is a collection of random variables that is infinitely exchangeable in the index $n$. An ICM (independent causal mechanism) generative model with directed acyclic graph $\mathcal{G}$ over node set $V = \{1,\ldots,D\}$ as its causal structure is defined by the following joint distribution over such variables:

$$
P(\{X_{i, n}\}_{i\in\{1,\ldots,D\}, n\in\{1\ldots N\}}) = \int \cdots \int \prod_{n=1}^{N}\prod_{i=1}^{D} P(X_{i,n}\vert X_{\text{Pa}(i), n}, \theta_i)\, d\pi(\theta_1)\cdots d\pi(\theta_D),
$$

where $\text{Pa}(i)$ denotes the parents of node $i$ in the graph $\mathcal{G}$, and $\theta_1, \ldots, \theta_D$ are latent variables (de Finetti parameters).

### Conditional independence structure and identifiability

The relevance of ICM generative models is that their conditional independence structure is richer than that of i.i.d. data-generating distributions over the same variables. The [causal de Finetti theorems](https://arxiv.org/abs/2203.15756) exploit this to show that the graph $\mathcal{G}$ is fully identifiable from the set of conditional independence relationships that hold between subsets of $\{X_{i, n}\}_{i\in\{1,\ldots,D\}, n\in\mathbb{N}}$. The same is not true if the $X_{i, n}$ were i.i.d. (i.e. if these variables were independent across different values of $n$).

### Statistical independence of Markov factors

The ICM generative model, as defined above, posits that the parameters $\theta_i$ that determine each Markov factor $P(X_{i,n}\vert X_{\text{Pa}(i), n}, \theta_i)$ are **statistically independent**. This is one operationalisation of the independence of causal mechanisms assumption: the different mechanisms that generate each variable in a multivariate system are independent and do not inform each other.
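The exchangeable structure above can be illustrated with a minimal sampler for a toy two-variable chain $X \rightarrow Y$. This is a sketch under illustrative assumptions: the specific priors (Beta for $\theta_X$, Gaussian for $\theta_Y$) and all names are chosen for concreteness and are not part of the causal de Finetti framework. The key point is that the de Finetti parameters are drawn once per dataset, independently of each other, and the $n$ samples are then conditionally i.i.d. given them.

```python
import numpy as np

def sample_icm_dataset(n_samples, rng):
    # Draw the de Finetti parameters once per dataset: theta_x governs
    # the mechanism P(X | theta_x), theta_y the mechanism P(Y | X, theta_y).
    # They are sampled independently -- the ICM assumption.
    theta_x = rng.beta(2.0, 2.0)             # P(X = 1)
    theta_y = rng.normal(0.0, 1.0, size=2)   # mean of Y for X = 0 and X = 1

    # Given the parameters, the n samples are conditionally i.i.d.
    x = rng.binomial(1, theta_x, size=n_samples)
    y = rng.normal(theta_y[x], 0.1)
    return x, y

rng = np.random.default_rng(0)
# Two datasets share the causal graph G but have independent
# realisations of the mechanism parameters.
x1, y1 = sample_icm_dataset(1000, rng)
x2, y2 = sample_icm_dataset(1000, rng)
```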
## Algorithmic Causal Inference World

In [Janzing and Schölkopf](https://arxiv.org/pdf/0804.3678.pdf), a completely different kind of causal inference approach is proposed. Typically, one draws conclusions about the causal graph $\mathcal{G}$ by first establishing which conditional independence relationships hold within the observed data, then turning these into a graph that is consistent with them (this requires many i.i.d. samples). Following Janzing and Schölkopf, however, we will attempt to draw a conclusion about which variables causally influence which from **a single sample**. In a nutshell, we will say that if the causal direction $X\rightarrow Y$ holds, then

$$
K(y\vert x) + K(x)
$$

is expected to be significantly smaller than the opposite decomposition

$$
K(x\vert y) + K(y),
$$

where $K$ denotes (conditional) Kolmogorov complexity.

### Algorithmic independence of Markov factors assumption

Janzing and Schölkopf posit algorithmic independence of Markov factors. For example, if $P(Y\vert X)$ and $P(X)$ are the two Markov factors that define a causal model, then we will assume that

$$
C[P(Y|X), P(X)] = C[P(Y|X)] + C[P(X)],
$$

that is, the shortest program that describes both factors is of the same length as the concatenation of the shortest programs that compute each factor separately. Evidently, we should expect this to hold only in the causal direction, not the other. The toy example they provide is very cute: imagine $X$ is a binary random variable, and $Y$ is a Gaussian whose mean depends on $X$. Then $P(X)$ and $P(Y\vert X)$ are each simple to describe, but $P(Y)$ is quite difficult to describe as it is a mixture of Gaussians, and $P(X\vert Y)$ shares the mixture's parameters with it, so the anticausal factors are algorithmically dependent.

## Putting the two together

I think that the causal de Finetti-style ICM formulation and the algorithmic information-theoretic formulation are related. Consider the following Solomonoff-style universal prior over data:

$$
P(x, y) \propto \sum_k P_k(y\vert x)P_k(x)\, 2^{- C[P_k(Y|X), P_k(X)]},
$$

where the sum is over all computable pairs of Markov factors $P_k(Y|X), P_k(X)$.
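Since Kolmogorov complexity is uncomputable, any practical version of the $K(y\vert x) + K(x)$ versus $K(x\vert y) + K(y)$ comparison must replace $K$ with a computable proxy. A common and very crude choice is compressed length; the sketch below uses `zlib` purely to show the mechanics of the two-sided comparison. The helper names and the proxy for conditional complexity are illustrative assumptions, not the method of the paper, and a toy compressor like this should not be expected to settle causal direction reliably.

```python
import zlib

def c(data: bytes) -> int:
    """Crude upper bound on Kolmogorov complexity: compressed length."""
    return len(zlib.compress(data, level=9))

def c_cond(target: bytes, given: bytes) -> int:
    """Toy proxy for conditional complexity K(target | given): the extra
    compressed length needed for target once `given` is available."""
    return c(given + target) - c(given)

def causal_scores(x: bytes, y: bytes) -> tuple[int, int]:
    """Compare the decompositions K(y|x)+K(x) versus K(x|y)+K(y)."""
    forward = c_cond(y, x) + c(x)   # corresponds to X -> Y
    backward = c_cond(x, y) + c(y)  # corresponds to Y -> X
    return forward, backward

# Toy data: y is a simple deterministic function of x.
x = bytes(range(256)) * 4
y = bytes((2 * b) % 256 for b in x)
fwd, bwd = causal_scores(x, y)
```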
Similarly, we can define a multi-sample prior over multiple observations as follows:

$$
P_U(\{x_n, y_n\}_{n\in\{1\ldots N\}}) \propto \sum_k \prod_n P_k(y_n\vert x_n)P_k(x_n)\, 2^{- C[P_k(Y|X), P_k(X)]}.
$$

Now, if we add in the algorithmic independence of Markov factors assumption, the prior splits up:

$$
P_U(\{x_n, y_n\}_{n\in\{1\ldots N\}}) \propto \sum_k \prod_n P_k(y_n\vert x_n)P_k(x_n)\, 2^{- C[P_k(Y|X)]} 2^{-C[P_k(X)]}.
$$

I think one can probably show some version of the following:

$$
P_U(\{x_n, y_n\}_{n\in\{1\ldots N\}}) \propto \sum_k \sum_l \prod_n P_k(y_n\vert x_n)P_l(x_n)\, 2^{- C[P_k(Y|X)]} 2^{-C[P_l(X)]}.
$$

Note that $P(Y|X)$ and $P(X)$ have now become **statistically** independent under the prior. Furthermore, I think it is likely possible to show that for a single sample this prior satisfies:

$$
-\log P_U(x, y) \hat{=} K(y\vert x) + K(x),
$$

where $K$ is Kolmogorov complexity defined by some UTM, and $\hat{=}$ denotes equivalence up to additive and multiplicative constants in the appropriate places. Therefore, the algorithmic causal inference idea of Janzing and Schölkopf can be expressed as a likelihood ratio test comparing the marginal likelihood of the data under the universal priors in the two different directions.

### Why is this interesting?

If some form of equivalence holds here, that may be theoretically interesting in its own right, connecting the algorithmic independence of mechanisms to statistical independence in the causal de Finetti framework. But it also potentially suggests that more can be done. Currently, using the AIT formulation, we can identify causal structure from a single statistical sample. With this equivalence, we may be able to extend that to causal discovery from an entire exchangeably sampled dataset. Using the ideas from the [DeepMind Solomonoff paper](https://arxiv.org/html/2401.14953v1), it may be possible to "meta-train" a neural network to perform causal inference tasks under the universal Solomonoff prior over causal models.
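The likelihood-ratio view of the factorised prior can be made concrete in a toy, fully computable setting. The sketch below replaces "all computable Markov factors" with a small finite grid of Bernoulli mechanisms and replaces $2^{-C[\cdot]}$ with a uniform description-length penalty; every name and constant is an illustrative assumption. For binary pairs the two directions may well score similarly (binary-binary direction identification is known to be hard), so the point is only to show the mechanics of computing $P_U$ in both orderings and comparing.

```python
import itertools
import math

GRID = [0.1, 0.3, 0.5, 0.7, 0.9]  # candidate Bernoulli parameters
BITS = math.log2(len(GRID))       # crude "description length" per parameter

def log_marginal(xs, ys):
    """log P_U of a binary dataset under the factorised prior
    sum_l sum_k prod_n P_l(x_n) P_k(y_n | x_n) 2^{-C_l} 2^{-C_k}."""
    # Marginal factor: complexity-weighted mixture over candidate P(X).
    marg = sum(
        2.0 ** (-BITS) * math.prod(p if x else 1 - p for x in xs)
        for p in GRID
    )
    # Conditional factor: mixture over candidate P(Y | X), one
    # parameter per parent value, costing 2 * BITS bits in total.
    cond = sum(
        2.0 ** (-2 * BITS)
        * math.prod((q1 if x else q0) if y else 1 - (q1 if x else q0)
                    for x, y in zip(xs, ys))
        for q0, q1 in itertools.product(GRID, GRID)
    )
    return math.log(marg) + math.log(cond)

def direction_score(xs, ys):
    """Log likelihood ratio of X -> Y versus Y -> X under the toy P_U."""
    return log_marginal(xs, ys) - log_marginal(ys, xs)

# Data roughly consistent with X -> Y, P(Y=1|X=1) high, P(Y=1|X=0) low.
xs = [0, 1, 0, 1, 1, 0, 0, 1, 1, 0] * 5
ys = [0, 1, 0, 1, 1, 0, 1, 1, 1, 0] * 5
score = direction_score(xs, ys)
```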
If we can train a model that performs implicit Bayesian inference under the universal prior, it may allow us to:

* infer causal direction by performing likelihood ratio tests
* predict the effects of interventions by learning to approximate $P_U(y_{n+1}\vert do(x_{n+1}), x_{1:n}, y_{1:n})$. *Conjecture:* if the exact $P_U$ is used, this predictive distribution will converge to the correct estimand (the one obtained by applying the rules of do-calculus to $\mathcal{G}$) whenever the causal query is identifiable.
* provide a Bayesian normative theory for how causal queries should be answered when those queries are not identifiable using do-calculus
* learn to make counterfactual predictions under a universal prior over SEMs (structural equation models)

So, there is potentially a whole PhD's worth of questions here. The main question is to what extent these questions have already been explored, and of course, whether any of this can be turned into practicable algorithms.
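The interventional prediction in the list above is easy to sketch when the causal direction is taken as given. Assuming $X \rightarrow Y$, intervening with $do(x_{n+1})$ leaves the mechanism $P(Y\vert X)$ untouched, so the interventional predictive just clamps $x$ and reuses the observational posterior over the $Y$-mechanism. The Beta-Bernoulli model below is an illustrative assumption, not a general algorithm.

```python
from collections import Counter

def predict_y_given_do_x(xs, ys, x_new):
    """Posterior predictive P(Y_{n+1} = 1 | do(X_{n+1} = x_new), data)
    under a Beta(1, 1)-Bernoulli model of the mechanism P(Y | X).

    Because the de Finetti parameters of P(X) and P(Y | X) are
    independent, the intervention on X carries no information about
    the Y-mechanism: we simply condition it on the observational data."""
    counts = Counter(zip(xs, ys))
    ones = counts[(x_new, 1)]
    zeros = counts[(x_new, 0)]
    return (ones + 1) / (ones + zeros + 2)  # Beta(1, 1) posterior mean

xs = [0, 1, 0, 1, 1, 0, 0, 1]
ys = [0, 1, 0, 1, 1, 0, 1, 1]
p_do1 = predict_y_given_do_x(xs, ys, 1)  # all four X=1 cases had Y=1
p_do0 = predict_y_given_do_x(xs, ys, 0)
```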