---
title: Causal de Finetti Theorems
tags: Idea
description:
---
# Causal de Finetti theorems
A sequence of random variables $X_1, X_2, \ldots$ is exchangeable if, for any finite $N$, the joint distribution of the variables is permutation invariant, that is,
$$
p(x_1, x_2, \ldots, x_N) = p(x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(N)})
$$
where $\pi$ is a permutation of the integers $1, \ldots, N$.
The de Finetti theorem states that if a stochastic process is exchangeable, the joint distribution decomposes as a mixture of i.i.d. distributions:
$$
p(x_1, \ldots, x_N) = \int \prod_n p(x_n \vert \theta) \, d\mu(\theta)
$$
where $\mu$ is a mixing measure over the parameter $\theta$.
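As a concrete illustration of this mixture structure, here is a hypothetical beta-Bernoulli-style sketch: draw a success probability once, then sample the sequence i.i.d. given it. The uniform prior and the empirical check below are illustrative assumptions, not part of the theorem.

```python
import random

def sample_exchangeable(n, rng):
    """Mixture of i.i.d. Bernoullis: theta ~ Uniform(0, 1),
    then X_1, ..., X_n i.i.d. Bernoulli(theta) given theta."""
    theta = rng.random()
    return tuple(int(rng.random() < theta) for _ in range(n))

rng = random.Random(0)
draws = [sample_exchangeable(2, rng) for _ in range(200_000)]

# Exchangeability check: the patterns (1, 0) and (0, 1) are
# permutations of each other, so their probabilities must match.
p10 = sum(d == (1, 0) for d in draws) / len(draws)
p01 = sum(d == (0, 1) for d in draws) / len(draws)
print(p10, p01)  # approximately equal (both near 1/6 under a uniform prior)
```

Note that an i.i.d. sequence is trivially exchangeable; the mixture makes $X_1$ and $X_2$ marginally dependent while keeping the joint distribution permutation invariant.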
## Exchangeable sequences of pairs of random variables
Now consider a sequence of pairs of random variables $(X_1, Y_1), (X_2, Y_2), \ldots$. Such a sequence is exchangeable if
$$
p(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = p(x_{\pi(1)}, y_{\pi(1)}, x_{\pi(2)}, y_{\pi(2)}, \ldots, x_{\pi(N)}, y_{\pi(N)})
$$
for all permutations $\pi$.
Then, by the de Finetti theorem, the joint distribution decomposes as:
$$
p(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \int \prod_n p(x_n, y_n \vert \theta) \, d\mu(\theta)
$$
## Decomposition in terms of Markov factorisations
We would like to find conditions under which a stronger decomposition property holds, in particular:
$$
p(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \int\!\!\int \prod_n p(y_n \vert x_n, \psi)\, p(x_n \vert \theta) \, d\mu(\theta)\, d\rho(\psi)
$$
This implies that the data-generating process admits the following graphical model, drawn in plate notation:

In causal language this means: we observe $X$ and $Y$ pairs in $K$ different environments/domains. In each environment $Y$ is caused by $X$, but the mechanisms controlling the marginal distribution of $X$ and the conditional distribution of $Y\vert X$ are independently randomly modulated across domains.
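This generative story can be sketched as follows. The uniform priors and the function name are mine, chosen purely for illustration:

```python
import random

def sample_domain(n, rng):
    """One domain/environment: draw the mechanism parameters
    independently, then sample n (x, y) pairs from them."""
    theta = rng.random()                 # P[X = 1] in this domain
    psi = (rng.random(), rng.random())   # psi[x] = P[Y = 1 | X = x]
    pairs = []
    for _ in range(n):
        x = int(rng.random() < theta)
        y = int(rng.random() < psi[x])
        pairs.append((x, y))
    return pairs

rng = random.Random(1)
domains = [sample_domain(50, rng) for _ in range(10)]  # K = 10 domains
print(domains[0][:3])
```

Within a domain the parameters are fixed, so samples are i.i.d. given $(\theta, \psi)$; across domains the marginal of $X$ and the conditional of $Y\vert X$ are re-drawn independently of each other.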
If we find a de Finetti theorem that guarantees such a decomposition, we can turn it into a way to test for independent causal structure from multi-domain observations.
### A conjecture
I think that the following two conditions are necessary and sufficient for the causal decomposition to hold:
1. the sequence $(X_1, Y_1), (X_2, Y_2), \ldots$ is infinitely exchangeable and
2. $\forall m\neq n: Y_n\perp\!\!\!\perp X_m \vert X_n$
### Proof sketch (maybe)
Let's consider the case where both $X$ and $Y$ are binary. Then any joint distribution of the pair can be described by three real-valued parameters:
1. $\theta$ defines $\mathbb{P}[X=1]$,
2. $\psi_1$ defines $\mathbb{P}[Y=1\vert X=0]$,
3. $\psi_2$ defines $\mathbb{P}[Y=1\vert X=1]$.
The de Finetti theorem implies that exchangeable sequences decompose as:
$$
p(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \int \prod_n p(x_n, y_n \vert \theta, \psi_1, \psi_2) \, d\mu(\theta, \psi_1, \psi_2)
$$
Now apply the second condition of the conjecture:
$$
Y_n\perp\!\!\!\perp X_m \vert X_n
$$
In the binary case, this conditional independence yields identities such as
$$
\mathbb{P}[Y_1=1\vert X_1=0, X_2=0] = \mathbb{P}[Y_1=1\vert X_1=0, X_2=1]
$$
Constraints of this form suggest that the mixing measure over $(\theta, \psi_1, \psi_2)$ must factorise into independent components for $\theta$ and $(\psi_1, \psi_2)$, which is exactly the causal decomposition we are after.
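The identity above can be sanity-checked numerically. A Monte Carlo sketch, with uniform priors on $\theta, \psi_1, \psi_2$ assumed purely for illustration, drawing fresh de Finetti variables for each pair of samples:

```python
import random

rng = random.Random(0)
counts = {(0, 0): [0, 0], (0, 1): [0, 0]}  # (x1, x2) -> [n, sum of y1]
for _ in range(300_000):
    # Fresh de Finetti variables for every "domain" of two samples;
    # theta and (psi1, psi2) are drawn independently of each other.
    theta, psi1, psi2 = rng.random(), rng.random(), rng.random()
    x1 = int(rng.random() < theta)
    x2 = int(rng.random() < theta)
    y1 = int(rng.random() < (psi2 if x1 == 1 else psi1))
    if x1 == 0:
        counts[(0, x2)][0] += 1
        counts[(0, x2)][1] += y1

p_x2_0 = counts[(0, 0)][1] / counts[(0, 0)][0]  # est. P[Y1=1 | X1=0, X2=0]
p_x2_1 = counts[(0, 1)][1] / counts[(0, 1)][0]  # est. P[Y1=1 | X1=0, X2=1]
print(p_x2_0, p_x2_1)  # approximately equal
```

Intuitively, observing $X_2$ only carries information about $\theta$, and since $\theta$ is independent of $(\psi_1, \psi_2)$ under this mixing measure, it cannot move the conditional of $Y_1$ given $X_1$.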
### A test
Let's say we have $K$ samples from an infinitely exchangeable process, each of length $N_k$. In other words, we observe data in $K$ domains, with $N_k$ observations in domain $k$. Let's denote the $n^{th}$ sample from the $k^{th}$ domain by $(x^{k}_n, y^{k}_n)$.
On the basis of this, sample triplets as follows: $(A = x^{k}_n, B = y^{k}_n, C = x^{k}_m)$, for randomly chosen $k$, $n$ and $m\neq n$.
We can then test for the conditional independence $B\perp\!\!\!\perp C \vert A$ in the resulting set of triplets; this is condition 2 of the conjecture, restated in triplet notation. The null hypothesis of this conditional independence test corresponds to the causal structure $X \rightarrow Y$.
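A toy end-to-end sketch of this procedure, checking the independence $Y_n\perp\!\!\!\perp X_m \vert X_n$ from condition 2 via the triplets. The sampler, the uniform priors, and the frequency comparison are all illustrative assumptions; a real application would use a calibrated conditional independence test:

```python
import random
from collections import defaultdict

def sample_domains(K, N, rng):
    """Domains under X -> Y: per-domain theta and psi drawn independently."""
    data = []
    for _ in range(K):
        theta = rng.random()
        psi = (rng.random(), rng.random())
        pairs = []
        for _ in range(N):
            x = int(rng.random() < theta)
            y = int(rng.random() < psi[x])
            pairs.append((x, y))
        data.append(pairs)
    return data

def sample_triplets(data, T, rng):
    """Triplets (A, B, C) = (x_n^k, y_n^k, x_m^k) with m != n."""
    triplets = []
    for _ in range(T):
        k = rng.randrange(len(data))
        n, m = rng.sample(range(len(data[k])), 2)
        a, b = data[k][n]
        c = data[k][m][0]
        triplets.append((a, b, c))
    return triplets

rng = random.Random(2)
triplets = sample_triplets(sample_domains(2000, 20, rng), 100_000, rng)

# Toy check of B independent of C given A: within each stratum of A,
# compare the frequencies P[C=1 | A=a, B=0] and P[C=1 | A=a, B=1].
stats = defaultdict(lambda: [0, 0])  # (a, b) -> [count, sum of c]
for a, b, c in triplets:
    stats[(a, b)][0] += 1
    stats[(a, b)][1] += c

results = {}
for a in (0, 1):
    p0 = stats[(a, 0)][1] / stats[(a, 0)][0]
    p1 = stats[(a, 1)][1] / stats[(a, 1)][0]
    results[a] = (p0, p1)
    print(a, round(p0, 3), round(p1, 3))  # roughly equal under the null
```

Because the data really is generated with $X \rightarrow Y$ here, the two conditional frequencies agree within noise in each stratum of $A$; generating data with the reversed mechanism should break this agreement.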
### Connection to causal discovery
A key concept in causal discovery is the equivalence class of DAGs: two graphical causal models are equivalent if they define the same set of conditional independence relationships among the observable variables. From observational data alone, the causal model can be inferred only up to its equivalence class.
For example, consider the figure below.

White nodes denote observable variables, shaded ones are unobserved. The causal models A and B are in the same equivalence class. It's impossible to tell which variable causes the other from just sampling data from the joint distribution. A and B are observationally equivalent.
Now consider the DAGs C and D. Notice that these DAGs are the same ones I drew earlier, except that I now draw only $N=2$ samples per domain and I have dropped the plate notation. These DAGs correspond to the following generative story:
We assume that we can observe data from different domains. There is a DAG describing the causal relationships between the variables, and it is the same across all domains. Within each domain, the data is sampled from a generative model that conforms to this DAG. However, the conditional distributions in the Markov factorisation depend on some hidden variables (the de Finetti variables, $\psi$ and $\theta$ in the examples before). Within each domain, these per-factor hidden variables take the same value for all samples, but across domains they vary randomly, so in each domain we observe data from a slightly different joint distribution, all conforming to the exact same causal DAG.
In other words, the DAG structure is invariant across domains, but the actual joint distributions are domain-dependent. The per-factor latent variables are assumed to vary independently of one another, too.
C and D show the causal model of this sampling process for two observable variables (left and right) with just two pairs of observations in each domain (top: $(X_1, Y_1)$; bottom: $(X_2, Y_2)$). The two observations in the same domain are connected by the shared value of the per-factor random variables, which are the shaded nodes.
Now my claim is that even though A and B are in the same equivalence class, C and D are not. You can tell C and D apart based on observational data alone, because in C the independence $Y_1\perp\!\!\!\perp X_2 \vert X_1$ holds but in D it does not.
So, a general question is: what are the equivalence classes for DAGs of the kind in C and D, that is, two copies of the same DAG, node-wise connected by unobserved confounders? Which edges become identifiable from data in this type of graph that are unidentifiable in the corresponding i.i.d. setup? And are there edges that are identifiable in the i.i.d. setup but, for some reason, aren't in this exchangeable setting?