# Mediation Project

**Intro**

This project is about doing mediation analysis when both the mediator and the outcome can depend nonlinearly on the treatment and confounders. The mediator could be high dimensional (e.g., epigenetic).

**Goals**

- Can we obtain quality uncertainty estimates using Bayesian approaches?
- Can we harness the representation-learning ability of neural networks? This has been studied for ATE estimation (e.g., [Dragonnet](https://arxiv.org/pdf/1906.02120.pdf)) but not for mediation analysis.
- Can we obtain rigorous theoretical guarantees (e.g., using ideas like Double Machine Learning)?

**Recent progress**

- More basic experiments using GPs.
- Visualizations of model fits.
- Basic experiment with multiple mediators.

**Next steps**

- Experiments with multi-dimensional mediators.
- Add neural linear models to experiments.
- Multi-headed architectures for binary treatments.
- Experiments with continuous treatments.

-------
[toc]
-------

# Meeting updates

## August 25th

Neural network "model types":
1. Separate models for $T=0$ and $T=1$
2. Single model that takes $T$ as input
3. Shared base layers for $T=0$ and $T=1$
4. Shared base layers for $T=0$ and $T=1$ with a propensity score head
5. Shared base layers for $T=0$ and $T=1$ with a propensity score head and targeted regularization (TMLE)

All datasets have 1000 observations with 25 covariates and a binary treatment (except IHDP, which only has 747 observations).

Datasets:
- linear: depends on 5 covariates
- nonlinear: depends on 5 covariates, some nonlinearly
- IHDP
- rkhs-same: $\mu_0$ and $\mu_1$ are each drawn from the same RKHS (with feature map given by a randomly initialized neural network)
- rkhs-diff: $\mu_0$ and $\mu_1$ are drawn from different RKHSs (the feature maps are two differently initialized neural networks)

Here are the first 3 datasets:

![](https://i.imgur.com/p4qmg7Q.png)

Errors are much worse on the RKHS datasets:

![](https://i.imgur.com/WYpYDH8.png)

## July 14th, 2022

### Bangladesh data

- T=fib (binarized)
- X=clinic, sex, prot, fat, ash, as_ln, mn_ln, pb_ln, carb (if not the mediator)
- Y=cognitive
- M=carb

![](https://i.imgur.com/54m6c2v.png)
![](https://i.imgur.com/Fhn2UoG.png)
![](https://i.imgur.com/89iqFUQ.png)

## July 8th, 2022

### Recent papers using continuous treatments

Univariate continuous treatment:
- Generalized propensity score (GPS) methods (e.g., [[Hirano and Imbens 2004](https://www.math.mcgill.ca/dstephens/PSMMA/Articles/HIrano-Imbens-2004.pdf)])
- [[Sharma 2020](http://proceedings.mlr.press/v127/sharma20a/sharma20a.pdf)] (Hi-CI): Discretizes a continuous treatment and learns embeddings for each level of the treatment.
- [[Colangelo 2022](https://arxiv.org/pdf/2004.03036.pdf)]: Extends double ML to a continuous treatment.

Multiple (but separate) continuous treatments:
- [[Bica 2020](https://proceedings.neurips.cc/paper/2020/file/bea5955b308361a1b07bc55042e25e54-Paper.pdf)] (SCIGAN): Uses a GAN to generate counterfactuals, extending a similar method used for binary treatments.
- [[Schwab 2020](https://arxiv.org/pdf/1902.00981.pdf)] (DRNet): Divides each treatment into non-overlapping intervals and uses a separate head for each interval.
- [[Nie 2021](https://arxiv.org/pdf/2103.07861.pdf)] (VCNet): Improves on DRNet by ensuring continuity in the dose-response curve.
- [[Bellot 2022](https://arxiv.org/pdf/2205.14692.pdf)]: Provides generalization bounds and adjusted training objectives for DRNet and VCNet (I think, but I am slightly confused).
Multivariate treatment:
- [[Williams 2020](https://escholarship.org/content/qt5dv328b3/qt5dv328b3_noSplash_786e0dd9004089e7ceed0864d49138bb.pdf)] (mvGPS): Extends generalized propensity score methods to a multivariate setting.
- [[Chen 2022](https://arxiv.org/pdf/2205.08730.pdf)]: Extends the entropy balancing method (?).

My observations:
- There's a resistance to using the treatment as an input (since it might get "lost" if the dimension of the learned representation is large [[Shalit 2017](https://arxiv.org/pdf/1606.03976.pdf)]).
- There is little work in the setting of multivariate treatments.

## June 30th, 2022

Today I'm computing the conditional average treatment effect (as a function of *all* covariates) and reporting the RMSE over a test set. Recall that $\mu_0(x)=\mathbb{E}[Y\mid T=0, X=x]$ and $\mu_1(x)=\mathbb{E}[Y\mid T=1, X=x]$ are the two conditional expectations.

There are two datasets:
- *Dataset 1: Shared covariates dataset*: $\mu_0(x)$ and $\mu_1(x)$ have the same functional form ($\mathbf{1}^\top x^2$) but depend on different subsets of the covariates. By increasing the number of features used only by $\mu_1(x)$, we can make $\mu_0(x)$ and $\mu_1(x)$ less similar.
- *Dataset 2: Shared feature map dataset*: $\mu_0(x)$ and $\mu_1(x)$ are each drawn from a GP with kernel $\Phi(x)\Phi(x')^\top$, where $\Phi(\cdot)$ is the last hidden layer of a randomly initialized neural network. As the width tends to infinity, this kernel converges to a fixed kernel (the NNGP kernel).

There are two models:
- *TNet*: separate neural networks trained for each of $\mu_0$ and $\mu_1$
- *SNet*: a single neural network trained with separate heads for each of $\mu_0$ and $\mu_1$

There are also two implementations of each dataset:
- My code
- Their code [[Curth 2021](https://arxiv.org/pdf/2101.10943.pdf)]

### Dataset 1: Shared covariates dataset

- $n=2000$ observations
- $D=25$ inputs total
- The $x$-axis is the number of features used by the true $\mu_1(x)$ that are not used by the true $\mu_0(x)$. For *small* values on this axis, $\mu_0(x)$ and $\mu_1(x)$ are more similar, so the SNet should do better.

Their code:
![](https://i.imgur.com/NjVNps5.png)

My code:
![](https://i.imgur.com/u8fZAEY.png)

### Dataset 2: Shared feature map dataset

- $n=1000$ observations
- $D=5$ inputs total
- The $x$-axis is the width of the neural network giving $\Phi(x)$. For *large* values on this axis, $\mu_0(x)$ and $\mu_1(x)$ are more similar, so the SNet should do better.

Their code:
![](https://i.imgur.com/yIxsEM0.png)

My code:
![](https://i.imgur.com/5LmwmYY.png)

## June 22nd, 2022

### Project direction

**Guiding questions:**
- How can neural network architectures like TARNet be applied in a mediation setting?
- When using neural networks for mediation, how should continuous inputs be handled?
- Can double ML provide any theoretical guarantees?

Additionally, we will focus on applications where the conditional mediation effect is of interest.

**Focus for today:**
- Trying to design a simulation environment where neural networks will perform well.
- Still focusing on a binary treatment since:
    - Nearly all of the literature on using neural networks for causality is in the binary setting, so if the goal is to adapt these architectures to the mediation setting, I want to start by keeping the treatment type the same.
    - We discussed some applications where the treatment is continuous (e.g., pollutant mixtures), as well as applications where the treatment may be discrete (e.g., epigenetics).
    - Even in the continuous setting, it has been argued that multi-headed neural networks perform better anyway [[Schwab 2020](https://arxiv.org/pdf/1902.00981.pdf)], so there's reason to study the two-headed network used for binary treatments.

### Experiments on ATE estimation

I've found it difficult to design appropriate simulated datasets, so I'm taking a step back and focusing on ATE estimation. Let $\mu_0$ and $\mu_1$ be the true conditional expectations of the outcome:
$$\mu_0(x)=\mathbb{E}[Y\mid T=0, X=x]\\ \mu_1(x)=\mathbb{E}[Y\mid T=1, X=x]$$

**Goal**: somehow control the similarity between $\mu_0$ and $\mu_1$, so that when $\mu_0$ and $\mu_1$ are similar, a neural network with a shared representation $\Phi(X)$ will perform better. Otherwise, separate neural networks trained on the data where $T=0$ and $T=1$ will perform better.

**Idea 1:** Instead of controlling the similarity of the *functions* $\mu_0$ and $\mu_1$, control the similarity of their *inputs* (i.e., how many covariates they both depend on). This idea is used in [[Curth 2021](https://arxiv.org/pdf/2101.10943.pdf)]. So, let $X$ denote all $D$ covariates and let $X_0$ and $X_1$ denote the subsets of the covariates on which $\mu_0$ and $\mu_1$ depend. Assume that $\mu_0=\mu_1=\mu$. At one extreme, if $X_0$ and $X_1$ completely overlap (i.e., $X_0=X_1$), then $\mu_0(X_0)$ and $\mu_1(X_1)$ are exactly the same. At the other extreme, if $X_0$ and $X_1$ do not share any covariates, then $\mu_0(X_0)$ and $\mu_1(X_1)$ can be very different. The number of overlapping covariates controls the similarity of $\mu_0(X_0)$ and $\mu_1(X_1)$. Following [[Curth 2021](https://arxiv.org/pdf/2101.10943.pdf)], I use
$$ \mu_i(X_i) = \mathbf{1}^\top X_i^2. $$
Unfortunately, I'm having trouble getting this example to work. I thought that when $X_0$ and $X_1$ shared no covariates, training separate models should work best, but this is not what I'm finding. In general there is little consistency in the results; it's possible there is a bug.

**Idea 2:** Draw $\mu_0$ and $\mu_1$ from GPs and control the similarity of their *kernels*. Specifically, let
$$ \mu_0 \sim GP(0, K_0(\cdot, \cdot)) \\ \mu_1 \sim GP(0, K_1(\cdot, \cdot)) $$
where the kernels are drawn from some shared distribution:
$$ K_0, K_1 \sim p(K). $$
In particular, I'm considering degenerate kernels $K(x, x')=\Phi(x)\Phi(x')^\top$, where $\Phi(\cdot)$ is the last hidden layer of a neural network. Drawing $K$ corresponds to randomly initializing the weights of the neural network. This lets the width of the neural network control the similarity of $K_0$ and $K_1$, and thus the similarity of $\mu_0$ and $\mu_1$, since as the width goes to infinity, $\Phi(x)\Phi(x')^\top$ converges to the NNGP kernel.

**Hypothesis:** As the width tends to infinity, the two-headed neural network should perform better than separate networks for each $T$, since the true conditional expectations $\mu_0$ and $\mu_1$ share a feature map.

*Note:* `tnet_num` refers to the architecture choice:
- `tnet_num=1`: $T$ is an input to a single model.
- `tnet_num=2`: Separate models are trained on the data where $T=0$ and $T=1$.
- `tnet_num=3`: Two-headed neural network with a shared representation.

**Comparing across models (fixed data generation process)**

I used $100$ hidden units for the random feature map used to generate the data.
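For reference, here is a minimal numpy sketch of this random-feature data-generating process: a draw from the degenerate-kernel GP is just a random linear combination of the features. The one-hidden-layer tanh network, the $1/\sqrt{\text{width}}$ scaling, the randomized treatment, and the noise level are illustrative assumptions, not necessarily the exact settings in my experiment code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim_x, width = 1000, 4, 100  # illustrative sizes; width = hidden units of the random feature map

def random_feature_map(dim_in, width, rng):
    """Last hidden layer of a randomly initialized one-hidden-layer tanh network."""
    W = rng.normal(size=(dim_in, width)) / np.sqrt(dim_in)  # 1/sqrt(dim_in) keeps pre-activations O(1)
    b = rng.normal(size=width)
    return lambda x: np.tanh(x @ W + b)

def draw_gp_function(phi, width, rng):
    """One draw from GP(0, K) with degenerate kernel K(x, x') = phi(x) phi(x')^T (scaled by 1/width)."""
    beta = rng.normal(size=width)
    return lambda x: phi(x) @ beta / np.sqrt(width)

# K_0, K_1 ~ p(K): two independently initialized feature maps, one draw of mu from each.
mu0 = draw_gp_function(random_feature_map(dim_x, width, rng), width, rng)
mu1 = draw_gp_function(random_feature_map(dim_x, width, rng), width, rng)

# Simulate covariates, a binary treatment (randomized here for simplicity), and outcomes.
X = rng.normal(size=(n, dim_x))
T = rng.integers(0, 2, size=n)
Y = T * mu1(X) + (1 - T) * mu0(X) + 0.1 * rng.normal(size=n)
```

The comparisons below use data generated in this way (with 100 hidden units for $\Phi$).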
*Rows*: `tnet_num` [1, 2, 3]. *Columns*: input dimension of $X$ [1, 2, 4].

Takeaways from the plot below:
- Error is larger for higher-dimensional covariates $X$.
- The GP does relatively the best when $dim(X)=1$ and relatively the worst when $dim(X)=4$.
- The neural net does relatively the best in the highest dimension.
- Performance is similar for the single and separate models (`tnet_num=1` and `tnet_num=2`).

![](https://i.imgur.com/qfCVpQ5.png)

**Comparing across models (fixed input dimension: $dim(X)=4$)**

*Rows*: `tnet_num` [1, 2, 3]. *Columns*: number of hidden units of the random $\Phi$ used in generating the data [1, 2, 4].

![](https://i.imgur.com/Kdm4tWB.png)

**Comparing across architectures (fixed input dimension: $dim(X)=4$)**

*Rows*: models [linear, GP, neural network]. *Columns*: hidden units of the $\Phi$ used to generate the data [1, 2, 4].

Takeaways from the plot below:
- Overall, not much difference as you change the number of hidden units in $\Phi$ used for generating the data (i.e., across columns).
- For the neural network, I was hoping to see `tnet_num=3` do best when $\Phi_0$ and $\Phi_1$ are the most similar (bottom right), but there isn't evidence of this.

![](https://i.imgur.com/wbRKVLQ.png)

## June 10th, 2022

Things that have changed:
- Neural linear models with multiple heads implemented. They work on toy examples, but I'm having trouble with harder examples (I think gradients are blowing up).
- Working on adapting a dataset from [Curth 2021](https://arxiv.org/pdf/2101.10943.pdf) that will highlight the value of sharing parameters between treatment and control. The basic idea is to generate data by sharing covariates between treatment and control. For example, the mediator can be generated as
$$ M = T \mu_1(X_1) + (1-T) \mu_0(X_0) + \epsilon. $$
In the simplest case, $\mu_0$ and $\mu_1$ could be the same function (e.g., $\mu(X) = (X^2)^\top \mathbf{1}$) where $X_1$ and $X_0$ are subsets of the covariates that partially overlap. Similarly, we can generate $Y$ as
$$ Y = T \eta_1(X_1, M_1) + (1-T) \eta_0(X_0, M_0) + \epsilon, $$
where $M_1$ and $M_0$ are subsets of the mediators that partially overlap.

Takeaways from the committee meeting:
- Focus on applications where the conditional mediation effect is of interest (since otherwise the mediation effects can be estimated accurately without estimating the functions accurately) (EH is an example).
- Look into how different subgroups are impacted.

## May 13th, 2022

### Experiments with high dimensional $M$

I tried varying the dimension of $M$ from 2 to 32. Overall the results are messy. Here's the distribution of the ground truth NDEs and NIEs for the two ground truth data generating functions.

|      | Ground truth = Linear | Ground truth = Nonlinear |
| ---- | --------------------- | ------------------------ |
| NDE  | 0.25 +/- 1.05         | -0.01 +/- 0.23           |
| NIE  | 0.63 +/- 1.90         | -0.02 +/- 0.34           |

![](https://i.imgur.com/46biF87.png)
![](https://i.imgur.com/uOoGx6K.png)

A few notes on the experiments:
- Averaged over 5 random datasets.
- The number of inducing points is set to $\sqrt{n \, d_M}$, where $n$ is the number of observations and $d_M$ is the dimensionality of $M$.
- For the multi-output GP modeling $M$, I used a separate GP for each dimension of $M$. I tried using a linear model of coregionalization so I could use fewer GPs, but I had trouble with the implementation in `gpflow`.
- For the $M$ model, I used a linear + RBF kernel and optimized the hyperparameters by maximizing the ELBO. For the $Y$ model, I used an exact GP with an RBF kernel and fixed hyperparameters (lengthscale = 1).
- The nonlinear ground truth function is a neural network with 100 hidden units and tanh activation.
- Model Structure 1 is used to generate all data and for inference.

### Experiments with continuous $T$

Due to time constraints, I only ran up to 16 mediators and 3 simulations instead of 5. Otherwise the settings are the same as above. The results seem to be dominated by a few examples with high error. Should I instead report the relative error?

|      | Ground truth = Linear | Ground truth = Nonlinear |
| ---- | --------------------- | ------------------------ |
| NDE  | -0.07 +/- 0.76        | 0.01 +/- 0.14            |
| NIE  | -0.43 +/- 1.55        | 0.07 +/- 0.27            |

![](https://i.imgur.com/z2lIm6l.png)
![](https://i.imgur.com/bxeaVHi.png)

Note: To compute the NDE and NIE, I compared the treatments at $T=0.5$ and $T=-0.5$. Although $T$ is always centered around 0, it may be on a different scale depending on the dataset.

## April 27th, 2022

This week I did some basic experiments comparing different model structures (e.g., fitting a single model that takes $T$ as an input vs. fitting separate models for each value of $T$), different models, and different datasets. The goal is to do basic sanity checks: e.g., linear models should perform poorly when the data is nonlinear, or modeling $T=0$ and $T=1$ as separate functions should perform poorly if the ground truth uses a single function.

**Model structures**

I considered three model structures for $p(M|T,X)$ and $p(Y|T,M,X)$. I always used the same model structure for modeling $p(M|T,X)$ and $p(Y|T,M,X)$.
- Model Structure 1: A single model that maps $(X,T)$ to $Y$.
- Model Structure 2: Separate models, one for each value of $T$, each mapping $X$ to $Y$.
- Model Structure 3: $X$ is taken as input and then split into separate heads, one for each value of $T$. This is essentially TARNet. I have not implemented this yet.

![](https://i.imgur.com/jtpDDWZ.png)

**Experimental setup**

~~Strikethrough~~ indicates settings I have not yet implemented.

* Model Structures (used to model $p(M|T,X)$ and $p(Y|T,M,X)$):
    1. Single model that takes $T\in\{0,1\}$ as input
    2. Separate models for $T=0$ and $T=1$
    3. ~~Hybrid model: some parameters are shared for $T=0$ and $T=1$, some are not~~
* Models:
    - GPs (Linear, RBF, ~~NTK~~, ~~NNGP~~)
    - BNNs
* Datasets:
    - 1: Model Structure 1 with linear models
    - 2: Model Structure 2 with linear models
    - 3: Model Structure 1 with neural networks
    - 4: Model Structure 2 with neural networks

When I say a "single model" I mean a single model for $p(M|T,X)$ and a (separate) single model for $p(Y|T,M,X)$. So, for example, Model Structure 1 involves fitting 2 models: $p(M|T,X)$ and $p(Y|T,M,X)$. On the other hand, Model Structure 2 involves fitting 4 models: $p(M|T=0,X)$, $p(M|T=1,X)$, $p(Y|T=0,M,X)$, and $p(Y|T=1,M,X)$.

For data generation, I used the same model structures but fixed a (random) linear or neural network function for each of the models within the overall model structure.

The plots below show the average error in computing the NDE and NIE for all of the model structures, model types, and datasets. The BNNs (and the NNs used for data generation) have only a single layer with 10 hidden units. We can see a few trends in the results. For example, the single linear model performs poorly when the ground truth is separate neural networks for each value of $T$ (i.e., top right panel). At the other extreme, the BNN does poorly when the ground truth is a single linear model (bottom left panel).
![](https://i.imgur.com/7v49QZK.png)
![](https://i.imgur.com/FPFYZeX.png)

## May 6th, 2022

This week I'm building off of the experiments from last week by adding:
- Visualizations of the model fits.
- Sanity checks of the experimental setup. I started this last week but didn't really finish.
- Higher dimensional $M$.

### Visualizations

To provide some insight into the simulation framework, here's an example in the most complicated setting I have so far:
- Data generation: Separate neural networks used for each value of $T$.
- Model: Separate GPs with RBF kernels used to model each value of $T$ (Model Structure 2). The same approach is used to model $p(M|T, X)$ and $p(Y|T, M, X)$, so 4 GPs are fit in total.

Here is the model for $p(M|T, X)$, with the colors denoting $T=0$ and $T=1$.

![](https://i.imgur.com/PayWp4a.png)

Now here is the model for $p(Y|T, M, X)$. In the first two panels, $M$ and $X$ are held fixed at their median values, respectively.

![](https://i.imgur.com/GmH5KMz.png)

### Sanity checks

I'm building off of the experiments from last week. I will just show the NIE for simplicity.

To start, here's a comparison of how a model with a **linear kernel** fits the 4 different dataset types. The colors correspond to single (Model Structure 1) and separate (Model Structure 2) linear models for inference. Uncertainty bands reflect 25 dataset simulations. Here are a few observations that all agree with what we should expect:
- When the data is generated by separate functions for each $T$ (Datasets 2 and 4), Model Structure 2 does better.
- When the ground truth is a single linear function (Dataset 1), Model Structure 1 does better.
- When the ground truth is a single nonlinear function (Dataset 3), both model structures perform about the same because neither can fit the ground truth particularly well.

![](https://i.imgur.com/3ZaX6WF.png)

Here's the same plot as above but now including a GP with an **RBF kernel** (uncertainty removed for readability). Notice:
- When the ground truth is linear (Datasets 1 and 2), the linear kernel does better (solid lines).
- When the ground truth is nonlinear (Datasets 3 and 4), the RBF kernel does better (dashed lines).

![](https://i.imgur.com/mK9Fj6m.png)

Here's a comparison between the RBF kernel and the **MLP kernel**. The MLP kernel performs worse on the single linear model dataset (Dataset 1) but otherwise performs similarly to the RBF kernel.

![](https://i.imgur.com/mtFCpDI.png)

To get a sense of scale, here's the distribution of ground truth NDEs and NIEs over all of the simulations.

![](https://i.imgur.com/ACnFIMS.png)

### Multiple mediators

I did a basic experiment with 2 mediators. This is really just to confirm the code is running. I used `gpflow` with an RBF kernel to model the multiple mediators.

Here is the fitted model plotted as a function of the confounder:

![](https://i.imgur.com/pRfmr7a.png =400x)

Now here is the model of $Y$, also as a function of the confounder.

![](https://i.imgur.com/48zXc1s.png =400x)

Finally, here is $Y$ as a function of the two mediators (i.e., in each panel, one mediator is varied and one is held constant).

![](https://i.imgur.com/malfL68.png =600x)

The estimated NDE is -0.24 (true = -0.34) and the estimated NIE is 0.092 (true = 0.051). Overall it seems like the implementation is working.
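For reference, here's a minimal sketch of how a per-dimension mediator model can be fit with `gpflow` (one independent GP per mediator dimension, as in the notes above). The toy data, the use of exact `GPR` models rather than the sparse variational setup from the May 13th experiments, and the single-model arrangement that takes $T$ as an input are illustrative assumptions.

```python
import numpy as np
import gpflow

# Toy data (illustrative): n observations, binary treatment t, one confounder x, 2 mediators m.
rng = np.random.default_rng(0)
n, d_M = 200, 2
x = rng.normal(size=(n, 1))
t = rng.integers(0, 2, size=(n, 1)).astype(float)
m = np.hstack([np.sin(x) + 0.5 * t, x**2 - 0.3 * t]) + 0.1 * rng.normal(size=(n, d_M))

# One independent RBF-kernel GP per mediator dimension, taking (T, X) as inputs.
inputs = np.hstack([t, x])
mediator_gps = []
for j in range(d_M):
    gp = gpflow.models.GPR(data=(inputs, m[:, j : j + 1]), kernel=gpflow.kernels.RBF())
    gpflow.optimizers.Scipy().minimize(gp.training_loss, gp.trainable_variables)
    mediator_gps.append(gp)

# Posterior predictive means of each mediator under T=0 and T=1 at the observed confounders,
# which is what feeds into the NDE/NIE computation.
preds = {
    t_val: [gp.predict_f(np.hstack([np.full_like(x, t_val), x]))[0] for gp in mediator_gps]
    for t_val in (0.0, 1.0)
}
```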
# Background

## Notation

- $T$: treatment (currently binary)
- $X$: confounders
- $M$: mediator
- $Y$: outcome
- $n$: number of observations

## Computing NDE and NIE

NDE:
\begin{align*}
&= \mathbb{E}[Y_{T=1, M_{T=0}}] - \mathbb{E}[Y_{T=0, M_{T=0}}] \\
&= \sum_x \sum_m \left(\mathbb{E}[Y \mid T=1, M=m, X=x, \mathcal{D}] - \mathbb{E}[Y \mid T=0, M=m, X=x, \mathcal{D}]\right) p(M=m \mid T=0, X=x) \, p(X=x) \\
&= \mathbb{E}_X \left[ \mathbb{E}_M \left[ \mathbb{E}[Y \mid T=1, M=m, X=x, \mathcal{D}] - \mathbb{E}[Y \mid T=0, M=m, X=x, \mathcal{D}] \mid T=0, X=x, \tilde{\mathcal{D}} \right] \right] \\
&\approx \frac{1}{N} \sum_{n=1}^N \mathbb{E}_M \left[ \mathbb{E}[Y \mid T=1, M=m, X=x_n, \mathcal{D}] - \mathbb{E}[Y \mid T=0, M=m, X=x_n, \mathcal{D}] \mid T=0, X=x_n, \tilde{\mathcal{D}} \right] \\
&\approx \frac{1}{N} \sum_{n=1}^N \frac{1}{S} \sum_{s=1}^S \left( \mathbb{E}[Y \mid T=1, M=m^{(ns)}, X=x_n, \mathcal{D}] - \mathbb{E}[Y \mid T=0, M=m^{(ns)}, X=x_n, \mathcal{D}] \right) \\
&\approx \frac{1}{N} \sum_{n=1}^N \frac{1}{S} \sum_{s=1}^S \frac{1}{K} \sum_{k=1}^K \left( y_1^{(nsk)} - y_0^{(nsk)} \right)
\end{align*}
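As a concrete reference, here is a minimal sketch of this nested Monte Carlo computation. The function names `sample_m_given_t0` and `predict_y_mean` are placeholders for the fitted mediator and outcome models (not names from my code), and the innermost $1/K$ average over outcome draws is replaced by plugging in the outcome model's predictive mean, which has the same expectation.

```python
import numpy as np

def estimate_nde(x_samples, sample_m_given_t0, predict_y_mean, S=100):
    """Monte Carlo estimate of the NDE following the derivation above.

    x_samples:          (N, d_X) array of confounder values (e.g., the observed X's).
    sample_m_given_t0:  placeholder sampler, (x, S) -> (S, d_M) draws from p(M | T=0, X=x).
    predict_y_mean:     placeholder model function, (t, m, x) -> E[Y | T=t, M=m, X=x].
    """
    diffs = []
    for x in x_samples:                    # outer average over X (the 1/N sum)
        m_draws = sample_m_given_t0(x, S)  # inner average over M ~ p(M | T=0, X=x) (the 1/S sum)
        for m in m_draws:
            diffs.append(predict_y_mean(1, m, x) - predict_y_mean(0, m, x))
    return np.mean(diffs)
```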