# Reproducing Learning Counterfactual Representations for Estimating Dose-Response Curves
## Reproducibility Project of Group 36
This blog was created for the reproducibility project of the course "Deep Learning" (CS4240). The paper being reproduced is "<em>Learning Counterfactual Representations for Estimating Individual Dose-Response Curves</em>" ([1]).
Group 36 consisted of the following team members with their respective contributions:
- Patrik Bartak (5086337): implemented DRNet and set up the cloud VM. Wrote the experiment setup of the blog.
- Matej Havelka (5005914): worked mainly on setting up the data connection and the Experiment class. Implemented CF and kNN. Ran all the experiments on his own machine, as Google Colab crashed due to insufficient RAM. Wrote the "Introduction to Causal ML" and "Results & Discussion" parts of the blog.
- Marco van Veen (5062322): implemented TARNet and MLP models, and PE/DPE metrics (not used in the end). Wrote blog sections "DRNet" and "Conclusion".
Code: https://github.com/MatejHav/reproducibility-project
## Introduction to Causal ML
Before describing the contents of the paper we tried to reproduce, we thought it would be fitting to provide a brief introduction to the field of Causal Machine Learning. If you are already familiar with this field, feel free to skip to the next section.
Contrary to regular Machine Learning, where an algorithm learns a function based on correlations found in the data, Causal ML tries to learn the underlying causal connection between a feature *T* and the outcome *Y*. This feature *T* is usually referred to as the treatment, as most applications study the effect of some treatment on the outcome of a patient. To further expand on this medical use, one can also include a dosage *S* representing the amount of treatment a patient receives, for example how many pills a patient should take. For more formal reasoning, researchers usually draw directed acyclic graphs (DAGs) to show how one variable influences another. An example of such a DAG is shown below: the covariates *X* influence the treatment *T* and the dosage *S*, and all of them influence the outcome *Y*.
<div style="text-align:center;">
<img src="https://i.imgur.com/yCLavYw.png" style="width:30%;" alt="Image description">
</div>
*Example of a DAG showing the causal relations between covariates X, treatment T, dosage level S, and outcome Y.*
With this, the goal is usually to calculate the **treatment effect**: the difference between the expected outcomes when a patient does and does not receive the treatment. More formally:
$$ TE = E[Y \mid X, T=1] - E[Y \mid X, T=0] $$
The problem is that we can never observe the counterfactual outcome, i.e. what the outcome of a patient would have been if their assigned *T* had been reversed. A simple solution would be to train a fully connected network with the treatment and dosage as additional inputs on the data we have. At inference time, we run the same sample through the network once with and once without the treatment to obtain both expected outcomes. This approach, however, suffers from multiple problems, mainly that it does not try to find causation, only correlation. If, for instance, mainly elderly people received a specific treatment, the model could attribute their outcomes solely to the treatment, rather than also to the fact that they are elderly.
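To make the naive approach concrete, here is a minimal sketch. It assumes a hypothetical, already-trained network `mlp` that takes the covariates with the treatment indicator appended as input; the counterfactual prediction is obtained simply by flipping the treatment bit.

```python
import torch

# Minimal sketch of the naive approach described above (not the paper's method).
# `mlp` is a hypothetical trained network mapping [X, T] -> Y;
# `x` is a single sample's covariate vector.
def naive_treatment_effect(mlp: torch.nn.Module, x: torch.Tensor) -> float:
    x_treated = torch.cat([x, torch.tensor([1.0])])  # same covariates, T = 1
    x_control = torch.cat([x, torch.tensor([0.0])])  # same covariates, T = 0
    with torch.no_grad():
        return (mlp(x_treated) - mlp(x_control)).item()  # estimated treatment effect
```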
To address this confounding problem, [2] proposed an architecture called TARNet to estimate these counterfactuals. The network consists of shared base layers that map the input features *X* into a latent space. Then, the treatment head (*t=1*) is trained only on the outcomes of the patients that received the treatment, and the other output head (*t=0*) only on those that did not. The diagram for TARNet is shown below:
<div style="text-align:center;">
<img src="https://i.imgur.com/FVLrNAw.png" style="width:50%;" alt="Image description">
</div>
*An overview of the TARNet architecture as presented in [2]. The hidden, shared base layers take in all covariates X and branch off into two different output heads. The correct output head is indexed using the binary treatment assignment t.*
This way each treatment head can be trained on the data we do have, and a sample that did not receive the treatment can then be run through the treatment head to obtain the expected outcome had the patient received it. With this estimated outcome we can compute the treatment effect by comparing it to the true, observed outcome.
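A minimal PyTorch sketch of this head-indexing idea could look as follows. This is our own illustration with placeholder layer sizes and activations, not the original implementation from [2].

```python
import torch
import torch.nn as nn

# A minimal sketch of the TARNet idea as we understand it from [2]; layer sizes
# and activations are placeholders of our own choosing.
class TARNetSketch(nn.Module):
    def __init__(self, num_features: int, hidden: int = 64):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(num_features, hidden), nn.ReLU())
        # One output head per treatment arm: heads[0] for t = 0, heads[1] for t = 1.
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(2)])

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        phi = self.base(x)  # shared representation of the covariates X
        # Route every sample through the head matching its treatment assignment.
        out = torch.stack([self.heads[int(t_i)](phi_i) for phi_i, t_i in zip(phi, t)])
        return out.squeeze(-1)
```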
DRNet, the network introduced by the paper we are reproducing, tries to improve on this further. Mainly, it focuses on adding dosages to the equation in a sensible manner. This will be discussed in more detail in the next section.
Our goal was to reproduce some of the results from the paper, specifically parts of the following table from [1]:

*Table from [1] showing the experiment results of DRNet + extensions along with alternative baseline models over different datasets. The values shown are root mean integrated squared errors (RMISE).*
After looking at the resources provided by the authors, we decided to implement everything from scratch, as the original code was written in Python 2 using TensorFlow 1. We wanted to see whether we could obtain the same results on all the News datasets for DRNet and the other deep networks mentioned in the table. We also wanted to run Causal Forest (CF) and a kNN adaptation to see how these deep networks compare against more classical ML approaches.
We also decided to run some additional experiments to further test the claims of the paper. We tested whether DRNet still manages to learn in a custom setting developed by us, and whether the hierarchy of the architecture proposed by the authors is as important as they claim it to be.
## DRNet
### Overview
A visualisation of the Dose Response Network (DRNet) architecture is shown below. First, the features go through the shared base layers L<sub>1</sub>. Next, there are multiple treatment paths to follow, one for each possible treatment. A single treatment path in L<sub>2</sub> is selected based on the treatment value assigned to each sample. Finally, there are multiple outcome heads in L<sub>3</sub>, one for each dosage level. The correct outcome head is again indexed based on the dosage level of the sample. Since dosages can be continuous values, the range of possible values is split into <em>E</em> subintervals which are used to index the dosage heads in L<sub>3</sub>, such that each outcome head models its respective subinterval of the total dosage response function.
<div style="text-align:center;">
<img src="https://i.imgur.com/Gc3VQW3.png" style="width:70%;" alt="Image description">
</div>
*Overview of DRNet architecture from [1]. The covariates always go through the shared base layers. Then, only a single treatment and dosage outcome path is followed depending on the sample's respective treatment and dosage values.*
The above architecture leads to two main advantages. Firstly, data sparsity can be an issue for dosage effect estimation, as you require many samples with similar covariates that cover all possible treatment and dosage options. DRNet tries to combat this issue through the use of many shared layers (L<sub>1</sub> and L<sub>2</sub>). Secondly, simply adding the treatment and dosage as additional input covariates can lead to losing this information throughout high-dimensional hidden layers. In DRNet this is prevented by instead indexing the specific paths using the treatments and dosages.
The architecture was tested on a few (semi-)synthetic datasets, with the results shown above in the introduction. These results are based on the root mean integrated squared error (RMISE) metric, shown below. This metric evaluates how well the model managed to capture the true dosage response function of each possible treatment.
$$RMISE^{2} = \frac{1}{N}\frac{1}{|T|}\sum_{t\in T}\sum_{n=1}^{N}\int_{s=a_{t}}^{b_{t}}(y_{n,t}(s)-\hat{y}_{n,t}(s))^{2}ds$$
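In practice the inner integral has to be approximated numerically. Below is a small sketch of one way to do this; it is our own illustration and assumes the true and predicted dose-response curves have been evaluated on a fixed dosage grid.

```python
import numpy as np

# Hedged sketch of approximating RMISE: the integral over the dosage range is
# replaced by a trapezoidal sum over a fixed dosage grid.
# `y_true` and `y_pred` are hypothetical arrays of shape (N, |T|, S) holding the
# true and predicted dose-response values at the S grid points in `dosage_grid`.
def rmise(y_true: np.ndarray, y_pred: np.ndarray, dosage_grid: np.ndarray) -> float:
    sq_err = (y_true - y_pred) ** 2                  # (N, |T|, S)
    mise = np.trapz(sq_err, dosage_grid, axis=-1)    # integrate over the dosage axis
    return float(np.sqrt(mise.mean()))               # average over N and |T|, take the root
```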
### Reproduction Challenges
The original authors link to a GitHub repository containing all the code to run the experiments. However, when inspecting this repository we noticed that everything was written in Python 2.7 using an old TensorFlow version (1.4). Furthermore, it is not very clear how everything works exactly, as there are many files and a lack of comments throughout the code. So, we decided to implement everything ourselves in Python 3 and PyTorch, using only the information provided in the paper.
This, however, led to some challenges, as some details about the implementation and experiments were not explained very clearly or were missing entirely. The main issues we faced were the following:
- It is quite unclear how exactly the News data was generated. From the description it seems that the dosages of treatments have no effect at all on the final outcome, as dosage values of samples are randomly generated and not used in any of the outcome functions. This seems to defeat the purpose of estimating dosage response outcomes.
- For some of the tested models it was not clear how to extend them to use dosage levels correctly based on the paper. This was mostly the case for the non-neural network models, such as Causal Forests or k-NN. Due to this, we mostly stuck to testing the neural network-based models (DRNet, TARNet, MLP).
- In DRNet only one outcome head is active every time a prediction is made depending on the treatment and dosage values. However, it was not clear from the paper how this affected the training process. We were not sure whether all of the heads were still trained at the same time, or one by one for example. Differences in training methods could affect the resulting hidden layers.
- Finally, neural networks and their training require a lot of hyperparameters, but most of them are not mentioned in the paper. Some versions of the paper have an additional appendix where the 4 hyperparameters and their values used in hyperparameter optimisation are mentioned, but this still does not cover many of the missing parameters.
## Setup of experiments
This section describes the setup of the experiments used to reproduce the results obtained by the paper. First, we discuss the setup of the models, and then the setup of the data.
### Models
As mentioned above, we decided to re-implement DRNet, TARNet, and the MLP from scratch to test the completeness of the paper in describing the architectures.
DRNet was implemented as a hierarchy of neural network "blocks", each block consisting of 3 neural network layers. After passing data through the base block, the treatment is used to select which of the treatment blocks the data is passed through next. After this, the dosage stratum determines which head block the data is passed through, and the output of that head is the predicted outcome. In total, each data instance passes through 3 blocks (9 layers), with the treatment and dosage determining which ones. The base block sees all of the data, while the other blocks each see only a fraction of it.
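A minimal PyTorch sketch of how we read this block hierarchy is shown below. The layer widths, the ReLU activations, and the assumption that dosages are normalised to [0, 1] are our own placeholder choices; the paper does not pin these down.

```python
import torch
import torch.nn as nn

def block(inp: int, out: int) -> nn.Sequential:
    # One "block": 3 fully connected layers. The hidden width of 64 is our own
    # placeholder, not a value taken from the paper.
    return nn.Sequential(nn.Linear(inp, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, out))

class DRNetSketch(nn.Module):
    def __init__(self, num_features: int, num_treatments: int, num_strata: int):
        super().__init__()
        self.num_strata = num_strata
        self.base = block(num_features, 64)                  # shared base block (L1)
        self.treatments = nn.ModuleList(                     # one block per treatment (L2)
            [block(64, 64) for _ in range(num_treatments)])
        self.heads = nn.ModuleList(                          # one head block per (treatment, stratum) (L3)
            [nn.ModuleList([block(64, 1) for _ in range(num_strata)])
             for _ in range(num_treatments)])

    def forward(self, x: torch.Tensor, t: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_features); t: integer treatment per sample; s: dosage in [0, 1].
        outputs = []
        for x_i, t_i, s_i in zip(x, t, s):
            stratum = min(int(s_i * self.num_strata), self.num_strata - 1)
            h = torch.relu(self.treatments[int(t_i)](torch.relu(self.base(x_i))))
            outputs.append(self.heads[int(t_i)][stratum](h))
        return torch.stack(outputs).squeeze(-1)
```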
The authors claim that a key advantage of this architecture is the parameter sharing (the DRNet "hierarchy") that results from forwarding data through the common base and treatment layers. This should enable shared representations to be learned for the different treatments and dosages. In order to test this claim, in addition to a standard reproduction, we also devised a small ablation experiment (the hierarchy experiment) in which the treatment and head blocks are flattened into an architecture similar to TARNet, and the Cartesian product of treatment and dosage stratum is used to index the head blocks. We anticipated that this would perform similarly to TARNet, as the parameters of the treatment layers are no longer shared.
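For concreteness, a sketch of this flattened variant is shown below; it is again our own illustration and reuses the `block` helper from the DRNet sketch above.

```python
import torch
import torch.nn as nn

# Sketch of the flattened ablation: the per-treatment blocks are dropped, and each
# sample goes straight from the base block to one of T * E head blocks, indexed by
# the Cartesian product of its treatment and dosage stratum.
class FlatDRNetSketch(nn.Module):
    def __init__(self, num_features: int, num_treatments: int, num_strata: int):
        super().__init__()
        self.num_strata = num_strata
        self.base = block(num_features, 64)
        self.heads = nn.ModuleList(
            [block(64, 1) for _ in range(num_treatments * num_strata)])

    def forward(self, x: torch.Tensor, t: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        outputs = []
        for x_i, t_i, s_i in zip(x, t, s):
            stratum = min(int(s_i * self.num_strata), self.num_strata - 1)
            head = self.heads[int(t_i) * self.num_strata + stratum]  # Cartesian-product index
            outputs.append(head(torch.relu(self.base(x_i))))
        return torch.stack(outputs).squeeze(-1)
```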
TARNet was implemented similarly, but the dosage was not used for indexing, as TARNet was created only for the categorical treatment case. This means the dosage was simply added as a feature to the network input. This input passes through a base block, a categorical treatment is used to index the treatment block, and the output of this is the outcome.
For the MLP, both the treatment and the dosage were simply added as features to the network input. No indexing is used.
### Data
The models were run on the data described in the paper. Only the NEWS dataset was considered, as the MVICU dataset required special medical clearance to access, and the TCGA dataset was prohibitively large.
We additionally tested a modified dataset in which the dosage has an explicit effect on the outcome, in order to examine how the different networks respond to different data.
An interesting observation was that the authors made use of PCA as a dimensionality reduction technique, but only described this in the appendix:
> "For datasets of dimensionality higher than 200, we matched on a low-dimensional representation obtained via Principal Components Analysis (PCA) using 50 principal components in order to reduce the computational requirements." [1]
The appendix was not present in some published versions. This would likely have made reproduction efforts much more difficult if we had not found it, as the NEWS dataset has 2870 features.
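For reference, a minimal sketch of this reduction step using scikit-learn is shown below; the placeholder data is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Sketch of the dimensionality reduction step quoted above: the 2870 NEWS
# features are projected onto 50 principal components.
X_raw = np.random.rand(1000, 2870)     # placeholder feature matrix for illustration only
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X_raw)   # shape: (1000, 50)
```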
There was also significant confusion amongst the team regarding the description of the actual data generating process, which was described in an unclear way and using difficult terminology. No causal DAG was included, making it impossible to verify that our understanding of the data generating process is correct. We found this odd, given that using the correct data is critical for reproducibility.
Our understanding of the data generating process was as follows:
1. We first generate a vector of random means drawn from $N(0.45, 0.15)$ and standard deviations drawn from $N(0.1, 0.05)$. For each treatment we also pick 2 completely random samples that represent the samples with minimum and maximum dosage.
2. We then generate the potential treatment outcome $y_t \sim N(\mu_t, \sigma_t) + \epsilon$ and the potential dosage outcome $y_{s,t} \sim d_0 N(\mu_{s_t,0}, \sigma_{s_t,0}) + d_1 N(\mu_{s_t,1}, \sigma_{s_t,1})$, where $d_0$ is the distance to the minimum-dosage sample and $d_1$ the distance to the maximum-dosage sample.
3. We then assign the treatment as $t \sim \text{Multinomial}(1, \text{softmax}(D(k \cdot y_t \cdot y_{s_t,i})))$, which essentially means that we pick a treatment with probability based on its potential treatment outcome and potential dosage outcome.
4. We then draw the dosage $s \sim \text{Exp}(0.25)$.
5. The final outcome is $y = 50 \cdot y_t \cdot y_{s_t,i}$.
During testing we generate the true outcome by finding, in a pre-existing pool of 50,000 samples of the dataset, the sample closest to the test sample. We then use its pre-PCA feature values to generate the true outcome.
<div style="text-align:center;">
<img src="https://i.imgur.com/wl5RsNl.png" style="width:30%;" alt="Image description">
</div>
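To make our interpretation explicit, here is a hedged sketch of the final assignment steps. The value of the factor $k$, the interpretation of $\text{Exp}(0.25)$ as a rate parameter, and the omission of the (to us unclear) function $D$ are assumptions on our part.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hedged sketch of the last steps of our reading of the data generating process.
# `y_t` and `y_s` are hypothetical per-treatment arrays of the potential treatment
# and dosage outcomes produced by the steps above.
def assign_and_label(y_t: np.ndarray, y_s: np.ndarray, k: float = 10.0):
    logits = k * y_t * y_s                  # favour treatments with larger potential outcomes
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                    # softmax over treatments
    t = rng.choice(len(y_t), p=probs)       # t ~ Multinomial(1, softmax(k * y_t * y_s))
    s = rng.exponential(scale=1 / 0.25)     # s ~ Exp(0.25), assuming 0.25 is the rate
    y = 50.0 * y_t[t] * y_s[t]              # observed outcome
    return t, s, y
```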
Given this understanding, we generated the NEWS-t datasets - where t is the number of treatments - for the same values of t as the authors. This means NEWS-2, NEWS-4, NEWS-8, and NEWS-16.
## Results & Discussion
### Reproduction of the paper results
We ran the experiment with the data generation described in the paper and obtained the following results in terms of RMISE (lower is better):
| Model | News-2 | News-4 | News-8 | News-16 |
|---|---|---|---|---|
| DRNet | 1.68 | 8.99 | 5.71 | 7.30 |
| MLP | 2.95 | 10.99 | 8.98 | 10.73 |
| TARNet | **1.55** | 9.47 | **5.28** | 7.36 |
| CF | 9.44 | 8.64 | 11.82 | 13.05 |
| kNN | 4.50 | **4.67** | 6.07 | **6.48** |
These results come from a single run, as running all models including CF and kNN already took 32 hours in total. We acknowledge that a single run does not support strong conclusions, which is why we also ran the deep models over 5 runs. We found the following results:
| Model | News-2 | News-4 | News-8 | News-16 |
|---|---|---|---|---|
| DRNet | **6.34** | 2.42 | **8.05** | 7.84 |
| MLP | 8.50 | 4.34 | 10.21 | 15.17 |
| TARNet | 6.51 |**2.35** | 8.13 | **7.76** |
You can see that although DRNet did outperform the other models in 2 scenarios, it performed more or less the same as TARNet. We believe this can be due to the following reasons:
- We only used around 33% of the available data for training, as training on more would have taken significantly longer.
- We only used 5000 samples to test the models, which might result in the test set not including some outliers, thus giving a more average performance.
- We used a different Python version and deep learning library (PyTorch). TensorFlow 1, which was used in the original paper, has different ways of retaining the derivatives used in backpropagation than PyTorch. This and other framework-specific differences could have had an effect on the results.
- We used different hyperparameters, as not all of them were mentioned in the paper. Those that were mentioned are the same, but we used the default values for those that were not.
- Other small differences between our model and the authors' model could have been introduced as a result of us reproducing purely based on the description of the model.
In the following figures you can see the training loss on the four News datasets. This is mainly to confirm that the models indeed converged during our experiments. It can also be seen that the DRNet loss takes longer to go down compared to the other models, which is due to its additional layers adding extra complexity to the model.
 
 
### Hierarchy
As described previously, we also tried removing the hierarchy from DRNet and again ran the experiment on News-2 over 5 runs. The results were the following:
| Model | RMISE |
|---|---|
| DRNet | 5.18 |
| MLP | 4.54 |
| TARNet | **4.11** |
We saw that DRNet's performance degraded considerably, which led us to conclude that the hierarchy does indeed matter. We believe this is mainly due to not enough data being passed through each of the heads, leading to a worse estimate of the expected outcome for each head.
### Custom Data Generation
Finally, we ran the experiment using our custom dataset, in which the dosage also affects the outcome function. From the table of results below, we observed that none of the models seemed able to produce good results, which may be due to the complexity of our custom dataset. However, we can see that DRNet still performs similarly to TARNet in this case.
| Model | RMISE |
|---|---|
| DRNet | 83.17 |
| MLP | **80.41** |
| TARNet | 84.91 |
As can be seen in the following loss plot, both MLP and TARNet managed to converge, while DRNet did not. We also observed that the variance of the DRNet loss is very high. This leads us to believe that DRNet is much less sample efficient than TARNet, even though they end up with a similar RMISE, as shown in the table above. This is probably because the larger model overfits the data more easily, which makes it harder to train.

## Conclusion
The goal of this project was to reproduce the original results on the News datasets by implementing the models and experiments ourselves using only the information provided in the paper. Our results showed that DRNet did not perform as well as presented in the paper, as it barely beats TARNet throughout the different News instances and seems to be quite sample inefficient compared to the other models. When tested on our custom News dataset which has explicit dosage effects, it even performs worse than TARNet and MLP. However, the hierarchical structure of DRNet with shared treatment layers does seem to lead to a better performance compared to removing this hierarchical structure, likely due to the decreased sample efficiency in that case.
The differences between our and the originally reported results are to some degree likely due to the confusing, or even missing, explanations regarding the data generation process or model implementations, such as hyperparameters or specific training details. This made it hard for us to accurately replicate the tested models and experiments. Additionally, some of the differences in results may also come from the fact that we were only able to use about 33% of the News data and used PyTorch instead of TensorFlow 1.
## References
[1] Schwab, P., Linhardt, L., Bauer, S., Buhmann, J. M., & Karlen, W. (2020, April). Learning counterfactual representations for estimating individual dose-response curves. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 04, pp. 5612-5619).
[2] Shalit, U., Johansson, F. D., & Sontag, D. (2017, July). Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning (pp. 3076-3085). PMLR.