**Common Response**

We thank all the reviewers for their constructive and actionable feedback on the paper. Two common concerns are: (i) the need for additional experimentation, e.g., on much larger graph structures, in multivariate settings, and with randomly sampled structural equations; and (ii) whether it is possible to have an encoding independent of the parent variables. We address these two concerns in this common part.

>**Additional Multivariate Experimentation in a Larger Graph**

We carried out supplementary experiments, as recommended by the reviewers, in which we examined a graph comprising 10 nodes, with each node featuring three dimensions, for a total of 30 dimensions to learn in the graph. This graph has a larger number of upstream parent sets, along with randomly sampled structural equations. Specifically, we consider the following **ladder graph** (https://imgur.com/a/V3OSjvU), where each $X_i$ is a three-dimensional vector with ground-truth structural equation $X_i = f_i(X_{\text{pa}_i}, N_i)$, where $X_{\text{pa}_i}$ are the parents of $X_i$. Each functional relationship $f_i$ is randomly generated by a neural network with a single hidden layer of 16 nodes and weights randomly sampled between $-1$ and $1$. We consider interventional and counterfactual queries on $X_{10}$ (see the referenced image), the furthest downstream node, using interventions on $X_2$ and $X_3$. We choose these interventions because they are the most challenging: the numerous intermediate nodes can introduce compounding error. Note that we know the ground-truth models $f_i$ and, thus, the ground-truth counterfactual values, and we can sample from the ground-truth interventional distribution.

The table below reports the performance of the models in the same format as Table 1. Note that (as in Table 1) all entries are scaled up by $10^2$ for ease of reading.

$$\begin{array}{l r r r r}
& \text{DCM} & \text{ANM} & \text{VACA} & \text{CAREFL} \\
\text{Metric} & (\times 10^{-2}) & (\times 10^{-2}) & (\times 10^{-2}) & (\times 10^{-2})\\
\hline
\text{Obs. MMD} & \mathbf{0.26\pm0.08} & 0.26\pm0.11 & 33.32\pm2.94 & 20.04\pm1.84\\
\text{Int. MMD} & \mathbf{1.49\pm0.27} & 1.53\pm0.37 & 29.83\pm4.22 & 20.46\pm3.45\\
\text{CF. MSE} & \mathbf{176.09\pm225.09} & 318.83\pm464.89 & 2240.19\pm3902.72 & 1507.75\pm2390.64\\
\end{array}$$

We see that DCM still outperforms the compared schemes under a much larger graph structure, in the multivariate setting, and with randomly sampled structural equations. Here, we particularly highlight the counterfactual (CF) performance. These results corroborate our prior findings.

>**Independence between Encoding and Parents**

While this may intuitively seem unrealistic, independence can in fact arise in quite simple settings. For example, consider independent random variables $X$ and $\varepsilon$ and let $Y=f(X)+\varepsilon$. Then the deterministic function $g(X,Y)=Y-f(X)=\varepsilon$ is statistically independent of the input $X$. The better an estimate $\hat f$ approximates $f$, the weaker the dependence between the residual $Y-\hat f(X)$ and $X$ [1]. Another example arises naturally in linear regression: it is well known that the fitted values $\hat y$ are independent of the fitted residuals $y-\hat y$, another deterministic function of the data.
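To make the toy example concrete, here is a minimal sketch (illustrative only, not our experiment code; the regressor and the permutation-based HSIC test are stand-ins for the tools used in our experiments) of fitting $\hat f$ and testing the residual $Y-\hat f(X)$ for independence from $X$:

```python
# Minimal sketch: with Y = f(X) + eps, the residual of a well-fit regression
# is nearly independent of X. We check this with an RBF-kernel HSIC
# permutation test. All modeling choices here are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def centered_gram(z):
    n = len(z)
    d2 = (z[:, None] - z[None, :]) ** 2
    bw = np.sqrt(np.median(d2) / 2) + 1e-8          # median-heuristic bandwidth
    H = np.eye(n) - np.ones((n, n)) / n             # centering matrix
    return H @ np.exp(-d2 / (2 * bw ** 2)) @ H

def hsic(x, y):
    # Biased HSIC estimate: (1/n^2) * trace(K_c @ L_c) for centered Gram matrices.
    return float(np.sum(centered_gram(x) * centered_gram(y))) / len(x) ** 2

# Simulate Y = f(X) + eps and fit f_hat.
n = 300
x = rng.uniform(-2, 2, n)
y = np.sin(2 * x) + x ** 2 + 0.3 * rng.standard_normal(n)
f_hat = GradientBoostingRegressor().fit(x[:, None], y)
resid = y - f_hat.predict(x[:, None])

# Permutation p-value: shuffling the residuals simulates the null.
stat = hsic(x, resid)
null = [hsic(x, rng.permutation(resid)) for _ in range(100)]
print("p-value:", np.mean([s >= stat for s in null]))  # large => no dependence detected
```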
To further illustrate this point in our proposed model, we empirically evaluate the dependence between the encoding and the parent values. We performed an additional experiment in which we consider the bivariate nonlinear SCM $X \rightarrow Y$ from our paper and evaluate the HSIC between $X$ and the encoding of $Y$. We fit our model on $n=5000$ samples and evaluate the HSIC score on $1000$ test samples from the same distribution. We compute a p-value using a kernel-based independence test [2] and compare our performance to ANM, a correctly specified model in this setting. We repeat this experiment $100$ times. Given true independence, we expect the p-values to follow a uniform distribution. The table below shows summary statistics of the p-values from the $100$ trials, with the last row giving the expected values under truly uniform p-values (which arise under the null hypothesis).

$$\begin{array}{ l r r r r r r }
& \text{Mean} & \text{Std. Dev} & 10\% \text{ Quantile} & 90\% \text{ Quantile} & \text{Min} & \text{Max} \\
\hline
\text{DCM} & 0.196 & 0.207 & 0.004 & 0.515 & 6\text{e-}6 & 0.947 \\
\text{ANM} & 0.419 & 0.255 & 0.092 & 0.774 & 3\text{e-}5 & 0.894\\
\text{True Uniform (null)} & 0.500 & 0.288 & 0.100 & 0.900 & 1\text{e-}2 & 0.99\\
\end{array}$$

The p-values from the correctly specified ANM approach are close to uniformly distributed, demonstrating that it is possible to have encodings that are close to independent. Although the p-values produced by our DCM approach are not perfectly uniform, the tests do not consistently reject the null hypothesis of independence. These results demonstrate that it is empirically possible to obtain encodings independent of the parent variables.

[1] "Nonlinear causal discovery with additive noise models", Hoyer et al. (2008)
[2] "A Kernel Statistical Test of Independence", Gretton et al. (2007)

**Reviewer ikzS:**

We thank the reviewer for their constructive feedback, and respond to each point below.

> Theoretical Contributions Unclear + Identifiability

It is true, as the reviewer mentions, that some counterfactual queries are not identifiable from observational data without assumptions on the functional relationships, even under causal sufficiency. To make our theoretical contributions clearer, let us restate the implications of Theorem 1, because, while not stated explicitly, under the conditions of Theorem 1 we obtain identifiable counterfactual estimates. Note that here we only care about the identifiability of the counterfactual estimates, which does not require identifiability of the SCM. There are two parts to this:

(a) Condition 3, that the structural equation $f$ is invertible, differentiable, and increasing with respect to the exogenous variable $U$, implies that the counterfactual outcome is identifiable from observational data; this is a well-known result that has appeared multiple times in the literature, e.g., see Theorem 1 of [1].

(b) The technical core of Theorem 1 is in showing that Conditions 1 and 4 together imply that we recover the exogenous noise up to a nonlinear invertible transformation (see Eq. (9)). This, along with (a), implies that our encoder/decoder approach leads to identifiability of the counterfactual estimates. This identifiability part is again known, e.g., see Theorem 4 of [2].
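To make this two-part argument concrete before summarizing, here is a schematic of the encoder/decoder counterfactual procedure that Theorem 1 concerns (a sketch, not our implementation; `encode` and `decode` are placeholders for the learned model of a node given its parents, which for DCM would be the deterministic forward and reverse diffusion passes):

```python
# Schematic abduction-action-prediction with a learned encoder/decoder pair.
from typing import Callable
import numpy as np

def counterfactual(x_pa: np.ndarray, x: np.ndarray, x_pa_cf: np.ndarray,
                   encode: Callable, decode: Callable) -> np.ndarray:
    z = encode(x_pa, x)         # abduction: recover noise up to an invertible map
    return decode(x_pa_cf, z)   # action + prediction: decode under the new parents

# Under a perfectly fit additive-noise model, encode/decode reduce to
# residual extraction and re-addition (stand-in mechanism for illustration):
f_hat = lambda x_pa: np.sin(x_pa)
enc = lambda x_pa, x: x - f_hat(x_pa)
dec = lambda x_pa, z: f_hat(x_pa) + z
print(counterfactual(np.array([1.0]), np.array([1.2]), np.array([2.0]), enc, dec))
```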
In summary, Theorem 1 (without Condition 3) implies our encoder/decoder approach leads to identifiability of the counterfactual estimates. In Theorem 1, we focused on estimation error, which is a stronger statement, but now requires another assumption on encoder/decoder performance expressed in Condition 3.

> I would like to see experiments on graphs with more than 4 nodes (to analyse, e.g., the influence of larger parent sets), randomly sampled structural equations, discrete variables, multi-variate exogenous variables, etc; as well ablations on the different design choices in DCM.

See the additional experiments reported in the common response. We hope these results address the reviewer's concerns.

> Poor performance of VACA and CAREFL in Table 1

To ensure correctness, we used the code provided by the respective authors and the exact hyperparameters used in the original experiments (see our attached code). As for the nonadditive noise settings, it is possible that VACA and CAREFL overfit and struggle to generalize, whereas ANM may be interpreted as a first-order linear approximation and may therefore still obtain competitive results.

> Standard errors for Table 2

We would be happy to add them. For example, for DCM, we compute the standard error over 10 random initializations and find that the standard deviation of the median absolute error (0.0062) is quite small.

> Can you comment on whether/how DCMs could handle more general types of interventions, e.g., stochastic, imperfect, soft ones?

While we focused on the popular do-interventions in the paper, our current framework also applies to conditional interventions as defined in [6], where we replace $X_i$ with a deterministic function of some observables $X_{\text{pa}_i}$ (a toy sketch contrasting the two intervention types follows the reference list below). Extending to the other suggested intervention types is an open problem.

> How can epistemic uncertainty (from lack of data) be taken into account in DCMs in a principled way?

One could potentially use ensembles of diffusion models to approximate epistemic uncertainty, an idea that was explored in [7].

> The paper would benefit from a more extensive discussion of limitations... In particular, I would like to hear more about the restrictiveness of the assumptions of Thm.1:

Thank you for raising these concerns; we will expand the discussion of limitations along the lines above. As for the restriction of Theorem 1 to one-dimensional exogenous noise, we provide the multivariate extension in Theorem 3. Also, as mentioned above, the assumption that the structural equation is continuous, invertible, and monotonic in the noise (Condition 3) is standard in the literature for identifiability, e.g., Theorem 1 of [1], Theorem 5 of [2]. Similar assumptions are also utilized for identifiability results in related models, such as ANMs (see Section 3 in [3]), post-nonlinear models (see Section 2 in [4]), and heteroscedastic noise models (see Section 4 in [5]).

[1] "Sample-Efficient Reinforcement Learning via Counterfactual-Based Data Augmentation", Lu et al. (2020)
[2] "Counterfactual (Non-)identifiability of Learned SCMs", Nasr-Esfahany and Kıcıman (2023)
[3] "Nonlinear causal discovery with additive noise models", Hoyer et al. (2008)
[4] "On the Identifiability of the Post-Nonlinear Causal Model", Zhang et al. (2009)
[5] "Identifying Patient-Specific Root Causes with the Heteroscedastic Noise Model", Strobl et al. (2022)
[6] "A Calculus for Stochastic Interventions: Causal Effect Identification and Surrogate Experiments", Correa and Bareinboim (2020)
[7] "Uncertainty in Neural Networks: Approximately Bayesian Ensembling", Pearce et al. (2020)
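As referenced above, here is a toy sketch (hypothetical three-node chain, not our experiment code) contrasting a hard do-intervention with a conditional intervention in the sense of [6]:

```python
# Toy chain X1 -> X2 -> X3: compare do(X2 := 1) with do(X2 := g(X1)).
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(n, intervention=None):
    x1 = rng.standard_normal(n)
    if intervention == "do":              # hard intervention: do(X2 := 1.0)
        x2 = np.full(n, 1.0)
    elif intervention == "conditional":   # conditional: do(X2 := 2 * X1)
        x2 = 2.0 * x1
    else:                                 # observational mechanism
        x2 = np.tanh(x1) + 0.1 * rng.standard_normal(n)
    x3 = x2 ** 2 + 0.1 * rng.standard_normal(n)
    return x1, x2, x3

print(np.mean(sample_chain(10_000, "do")[2]))           # approx E[X3 | do(X2=1)] = 1
print(np.mean(sample_chain(10_000, "conditional")[2]))  # approx E[4 * X1^2] = 4
```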
**Reviewer u3or:**

We thank the reviewer for constructive feedback on the paper, and address each of the raised points below.

> I do not think it is reasonable to claim that the proposed model can perform counterfactual sampling. Authors implicitly assume that evidence contains all the nodes. In a general counterfactual query, evidence is typically a (small) subset of the observed variables. Thus the title itself is not very precise about the contribution of the paper.

We believe these comments may be indicative of a fundamental misunderstanding of our assumptions in the paper. As stated explicitly, we assume "causal sufficiency" (line 9, right column), and our current setup precludes unobserved confounding and requires all variables to be observed when training and when computing a counterfactual. While this may limit applicability in certain scenarios, we disagree with the reviewer's assertion that the work is limited in scope. In fact, the long list of cited literature in Section 1.1 relies on the same assumption, including methods that we closely compare against: VACA (Sanchez-Martin et al., AAAI 2022) and CAREFL (Khemakhem et al., AISTATS 2021). We believe that all these publications reflect the scientific community's view that developing techniques for counterfactual estimation in the full-observability setting is still interesting.

> The experimental evaluation is a bit weak and does not fully utilize the representative power of diffusion models, taking away from the contribution of the paper and the value of the proposed method compared to the classical causal reasoning algorithms. These algorithms as baselines are also missing.

Could you clarify which baselines you would like us to include? We currently include the SOTA techniques for interventional and counterfactual inference in Pearl's graphical causal model framework in a non-additive noise setting, along with an additive noise model that employs model selection across $10$ popular machine learning algorithms.

> "For root nodes, we sample Xi from the empirical distribution" Are diffusion models not trained for root nodes? If they are, why are they not used? If they are not, how are you doing counterfactual sampling?

We have the functionality to train diffusion models on root nodes but choose not to, since interventional and counterfactual queries are equivalent to observational sampling for root nodes, and also to reduce computational cost. For example, in the graph $X \rightarrow Z \leftarrow Y$, if we would like to intervene on $Y$, we only need to sample $X$ from the marginal distribution, for which we do not necessarily need a diffusion model. In settings where root nodes are high dimensional, sampling from the empirical distribution may be less effective, in which case we could fit a diffusion model. We will further clarify this in the paper.
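To illustrate, a minimal sketch (the mechanism for $Z$ is a placeholder, not our implementation) of interventional sampling in the graph $X \rightarrow Z \leftarrow Y$ under $do(Y)$, where the root $X$ is drawn from its empirical distribution:

```python
# Interventional sampling in X -> Z <- Y under do(Y := y): the root X is
# bootstrap-resampled from training data; only Z needs a learned model.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.standard_normal(1000)                # stand-in training data for X
z_mechanism = lambda x, y, u: x * y + u            # placeholder for a learned model

def sample_do_y(y_value, n):
    x = rng.choice(x_train, size=n, replace=True)  # empirical marginal of root X
    y = np.full(n, y_value)                        # do(Y := y_value)
    return z_mechanism(x, y, 0.1 * rng.standard_normal(n))

print(np.mean(sample_do_y(2.0, 10_000)))           # approx E[Z | do(Y=2)]
```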
> "The encoding is independent from the parent values" This does not seem like a realistic assumption. Clearly, the encoding/decoding functions learned by a diffusion model will be dependent with data.

See the discussion on independence between the encoding and parents in the common response. We hope that this addresses the reviewer's concerns.

> "The structural equation f is invertible, differentiable, and increasing with respect to U" Could you compare this assumption with Pearl's monotonicity assumption for counterfactual identifiability?

Pearl's monotonicity assumption was introduced in a setting with only binary variables. We state a standard generalization of it, see e.g., Theorem 1 of [1] or Theorem 5 of [2]. Similar assumptions are also utilized for identifiability results in related models, such as ANMs (see Section 3 in [3]), post-nonlinear models (see Section 2 in [4]), and heteroscedastic noise models (see Section 4 in [5]).

> I am not sure if I understand the statement "for a sample u". U is not observed. Do you mean that the reconstructed counterfactual for each u is bounded, so the counterfactual random variable also will be "close" in some sense?

As you mention, we do not observe $U$, but due to the invertibility of $f$, the observed pair $(x_{\text{pa}},x)$ uniquely characterizes $u$. Our theorems hold for all triples $(x_{\text{pa}},x,u)$, which implies that the true counterfactual random variable $X^{CF}$ and our estimate ${\hat X}^{CF}$ are close with respect to the metric $d$ almost surely. We have modified our text to emphasize that $U$ is not observed.

> Why is Theorem 3 in appendix not referenced in the main paper?

The fourth paragraph of the discussion of Theorem 1 in the main paper discusses our multivariate generalization, Theorem 3. Specifically, we reference Theorem 3 on lines 241 and 251 (right column).

> Is it possible to add baseline experiments with discrete variables with known SCMs where the exact interventional and counterfactual queries can be computed and compared with the ones obtained via diffusion models?

While in theory we can handle different data types, we see this extension of the implementation as future work. Note, however, that we do obtain exact ground-truth values for the interventional and counterfactual samples in the artificial-data experiments, as we have access to the true data-generating process.

> Similarly, I think it would be valuable to add high-dimensional experiments since that is what diffusion models and generative models in general are good at and useful for.

See the additional experiments reported in the common response. We hope these results address the reviewer's concerns.

> I believe currently the authors use empirical interventional and counterfactual distributions as baselines. But this adds error to the baseline due to the very few number of samples (100 or 1000 used). It would be better if the authors could compute the closed-form expression of the interventional and counterfactual distributions as the baseline and compare the empirical distribution obtained from the diffusion model with these.

For counterfactual queries, we compute and use the exact closed-form values to compare against the estimated values, so there is no error in the baselines. For interventional queries, we sample values from the exact closed-form distribution to compute the MMD (an illustrative sketch of such an MMD estimate follows the reference list below). Since these are univariate distributions, we believe 100 and 1000 samples are sufficiently large. Furthermore, the MMD is dominated by the actual discrepancy between the distributions, not by sample variability.

[1] "Sample-Efficient Reinforcement Learning via Counterfactual-Based Data Augmentation", Lu et al. (2020)
[2] "Counterfactual (Non-)identifiability of Learned SCMs", Nasr-Esfahany and Kıcıman (2023)
[3] "Nonlinear causal discovery with additive noise models", Hoyer et al. (2008)
[4] "On the Identifiability of the Post-Nonlinear Causal Model", Zhang et al. (2009)
[5] "Identifying Patient-Specific Root Causes with the Heteroscedastic Noise Model", Strobl et al. (2022)
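As referenced above, an illustrative sketch (toy data; a simple biased RBF-kernel estimator, not our exact evaluation code) of computing the MMD between ground-truth and model samples:

```python
# Biased squared-MMD estimate with an RBF kernel between two 1-D sample sets.
import numpy as np

def mmd2(x, y, bandwidth=1.0):
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
truth = rng.standard_normal(1000)        # samples from the closed-form target
model = rng.standard_normal(1000) + 0.1  # stand-in for model samples
print(f"MMD^2: {mmd2(truth, model):.4f}")
```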
**Reviewer by9v:**

We thank the reviewer for their constructive feedback, and respond to each point below.

> The writing, in some places, is not mathematically rigorous...

Thank you, we have edited the notation to be more specific, so that lowercase letters refer to values and uppercase letters to random variables.

> Some of the assumptions of Theorem 1 seem hard to satisfy and the paper does not provide enough empirical support on how sensitive the performance is to those assumptions.

We agree that some of the assumptions are hard to test, even though they are necessary for achieving identifiability. In practice, even under violations of these assumptions, our DCM approach provides good empirical performance. We will quantify this better in the experimental section.

> In line 204 (left column), what do you mean by equivalent random variables?

We mean equality up to an invertible transformation, i.e., $Z_i=g(U_i)$ for an invertible function $g$.

> In line 258 (left column), it is assumed that the encoding is independent of the parent variables. In the proposed diffusion process (eq. 3) the mapping from to is an ODE which is consequently deterministic. How can the output of a deterministic function be statistically independent of its input?

We answer this question in the common response.

> How sensitive is the performance of the proposed method to this independence assumption? A quantitative study that e.g. plots a performance metric against the mutual information of and should help.

See the additional experiments reported in the common response. The independence assumption is mostly satisfied in our experiments.

> Due to the nature of diffusion models, the latent variable is of the same dimension as the data. This might be limiting for cases where the dimension of is not the same as the dimension of in a DAG, especially when identifiability is important.

We address this in the third paragraph of our discussion of Theorem 1. Our multivariate theorem suggests that the dimension of the latent variable should equal the dimension of the node we are modeling, supporting the use of models that retain the latent dimension, such as diffusion models.

> The assumptions of Theorem 1 put constraints on the causal mechanisms (e.g. assumption 3) of the DAG.

Yes, they do, but these assumptions are not strong and are in fact standard for achieving counterfactual identifiability, see e.g., Theorem 1 of [1] or Theorem 5 of [2].

[1] "Sample-Efficient Reinforcement Learning via Counterfactual-Based Data Augmentation", Lu et al. (2020)
[2] "Counterfactual (Non-)identifiability of Learned SCMs", Nasr-Esfahany and Kıcıman (2023)

**Reviewer hH5j:**

We thank the reviewer for their constructive feedback, and respond to each point below.

> Experimental results sometimes show that a simple additive noise model implementation from the popular DoWhy library can outperform this method when the true data-generating process follows the said additive noise model.

This behavior is to be expected. If the ANM is the correctly specified model, then the ANM encoding should be close to the true encoding, assuming the regression model fits the data well. In this case, we should not expect DCM to outperform the ANM. In the nonadditive settings, DCM outperforms the other methods, although the ANM remains relatively competitive. This may be because the ANM can be interpreted as a first-order approximation of the structural equation.
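To make this concrete, here is a minimal sketch (illustrative only, not the DoWhy implementation) of why a correctly specified ANM does well: on additive-noise data, abduction reduces to residual extraction.

```python
# ANM counterfactual on additive-noise data: fit f_hat, abduct u = y - f_hat(x),
# then predict y_cf = f_hat(x_cf) + u. The ground truth here is a toy example.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 2000)
u = 0.2 * rng.standard_normal(2000)
y = np.exp(-x ** 2) + u                        # additive-noise ground truth

f_hat = GradientBoostingRegressor().fit(x[:, None], y)
u_hat = y - f_hat.predict(x[:, None])          # abduction: recover the noise

x_cf = x + 1.0                                 # counterfactual: do(X := x + 1)
y_cf_hat = f_hat.predict(x_cf[:, None]) + u_hat
y_cf_true = np.exp(-x_cf ** 2) + u
print(f"CF MSE: {np.mean((y_cf_hat - y_cf_true) ** 2):.5f}")  # small if well fit
```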
> A fully specified causal graph is needed for this technique to work, which may limit its use in several real-world applications where this full information is not always available.

While we assume causal sufficiency for the theoretical results, the method may be applied when the causal graph is not fully known, although then there are no guarantees on performance.

> While the theoretical result on counterfactual error is shown to extend to higher dimensional covariates under stronger conditions, there are no experiments with high-dimensional covariates to demonstrate whether this method is useful under these settings.

See the additional experiments reported in the common response. We hope these results address the reviewer's concerns.

> Were any other experiments on real-world data tried (especially any experiments with higher dimensionality of input nodes)?

We did not run other experiments on real-world data due to the inherent difficulty of obtaining ground-truth counterfactual and interventional values.

> Is there a reason why the Diamond configuration of the causal graph is particularly hard for the diffusion causal model to perform sampling (as compared to the simple ANM baseline)?

We do not believe it is particularly hard for DCM, as DCM still outperforms VACA and CAREFL; rather, the specific choice of structural equations may be particularly easy for the ANM.

**This comment is for Reviewer ikzS**

In the rebuttal (https://openreview.net/forum?id=iYpJyfCMT3&noteId=4a9h2kWvN4), we made a small typo. The sentence "In summary, Theorem 1 (without **Condition 3**) implies our encoder/decoder approach leads to identifiability of the counterfactual estimates. In Theorem 1, we focused on estimation error, which is a stronger statement, but now requires another assumption on encoder/decoder performance expressed in **Condition 3**." should be replaced with "In summary, Theorem 1 (without **Condition 2**) implies our encoder/decoder approach leads to identifiability of the counterfactual estimates. In Theorem 1, we focused on estimation error, which is a stronger statement, but now requires another assumption on encoder/decoder performance expressed in **Condition 2**." We apologize for this oversight on our part.

**Note to AC**

We thank the reviewers for their efforts. Unfortunately, there were a few basic misunderstandings/erroneous comments that we wanted to bring to your attention.

* Reviewer u3or mentions that we *implicitly* assume that all variables are observed. This is not correct. We were actually quite *explicit* about it (starting from the Abstract, Introduction, and Preliminaries) that we assume causal sufficiency and that our scheme requires all variables to be observed when training and computing a counterfactual. We also strongly disagree with the reviewer's comment that this somehow leads to "very very specialized counterfactual queries". Causal sufficiency is a well-accepted assumption in the causal inference literature, and in fact underlies all the related work that we cite.

* Reviewer ikzS asks about the identifiability of counterfactual estimates (Main Weakness 1), which is already covered by Theorem 1, as we explain in the rebuttal. The conditions in the theorem are set precisely for the purpose of identifiability.

Additionally, as requested by the reviewers, we have now provided additional experiments that cover a larger graph structure, the multivariate setting, and randomly sampled structural equations. Also, to convince the reviewers that the encodings can be independent of the parent variables, we have provided both intuitive explanations and empirical evidence.

**Response to Reviewer u3or**

Dear Reviewer, thanks for your clarification.
We also apologize for the misunderstanding on our part. Actually, we *do not* need to observe all variables for answering a counterfactual query. The standard formal criteria within the framework of graphical models ("backdoor", "frontdoor", etc.) that indicate which set of variables is sufficient hold for us as well (as defined in Chapter 4 of "Causal Inference in Statistics: A Primer" by Pearl). Please note that our work proposes a new model class for structural causal models that is more flexible than existing ones; we do not introduce a new technique for estimating counterfactuals in structural causal models, where we follow standard ideas from the literature.

We assume that the reviewer's confusion stems from Algorithm 1, where we include all variables for training. We do this because, a priori, we do not assume anything about the queries, i.e., we allow for all possible target variables and intervened variables. To achieve this goal, we train a DCM model on all the nodes to handle any possible counterfactual query. Even with this general goal in mind, we do not need all variables to be observed jointly at all times. For example, if you examine Algorithm 1, it can use partial observations, as we only need a child and its parents to be observed jointly.

Alternatively, if we know beforehand that our primary interest lies in the counterfactual outcome of a particular variable $Y$ subject to an intervention $do(X:=x)$ on $X$, then the problem becomes simpler. For example, one idea would be to take any backdoor-admissible set $Z$ relative to $(X,Y)$ and simply reduce the graph to these variables. This means that we only need to observe $(X,Y,Z)$. Taking your DAG $X\rightarrow Z \leftarrow Y \leftarrow T$, if we only observe $x, z, t$, and if we are interested in the counterfactual outcome of $Z$ given an intervention on $X$ and/or $T$, then it suffices to fit a DCM model to approximate the SCM $Z=f(X,T,U_Z)$, omitting $Y$ completely (where $U_Z$ is the exogenous noise variable associated with $Z$); a toy sketch of this reduction follows below. We again thank the reviewer for pointing out this confusion, and we will emphasize this in the paper along with a rephrasing of Algorithm 1. If you have any additional questions, please reach out to us.
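As mentioned above, here is a toy sketch of the reduction (using an additive-noise model as a stand-in for a DCM, on hypothetical structural equations): we fit $Z$ directly from $(X,T)$ with $Y$ never observed, folding $Y$'s randomness into $Z$'s noise, and answer a counterfactual under $do(X)$:

```python
# DAG X -> Z <- Y <- T with Y unobserved: fit Z = f(X, T, U~) directly,
# then answer a counterfactual on Z under do(X := 1) via residual abduction.
# (Additive-noise stand-in for a DCM; toy structural equations.)
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal(n)
t = rng.standard_normal(n)
y = np.tanh(t) + 0.1 * rng.standard_normal(n)      # never observed below
z = np.sin(x) + y + 0.1 * rng.standard_normal(n)

feats = np.column_stack([x, t])
f_hat = GradientBoostingRegressor().fit(feats, z)  # approximates Z = f(X, T, U~)
u_hat = z - f_hat.predict(feats)                   # abduction of the folded noise

x_cf = np.full(n, 1.0)                             # do(X := 1)
z_cf_hat = f_hat.predict(np.column_stack([x_cf, t])) + u_hat
print(f"mean counterfactual Z under do(X=1): {z_cf_hat.mean():.3f}")
```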
**Response to Reviewer u3or**

We thank the reviewer for the active engagement. Firstly, in the graph $Z\rightarrow X, Z\rightarrow Y, X\rightarrow Y$, that is right: we need $Z$ to be observed, because if we did not observe $Z$, it would be a hidden confounder of $X$ and $Y$, which we exclude by the causal sufficiency assumption. So this is not an applicable example. An applicable example would be $Z\rightarrow Y \leftarrow X$, where we do not need to observe $Z$ for the counterfactual query $\text{p}(Y_x|x')$ (as $Z$ becomes part of the noise $U_Y$).

In general, **any** counterfactual query that is identifiable from observational data in Pearl's graphical causal model framework is also identifiable in our setting (as we simply propose a new model class, which does not change how interventions and counterfactuals are estimated in structural causal models). So, yes, the reviewer is right that there are counterfactuals that are not estimable in Pearl's framework, and this also applies to us, but that **cannot** be regarded as a shortcoming of our proposed model class.

**Response to Reviewer u3or**

We apologize for this confusion. We misunderstood the previous comment about the graph $Z\rightarrow X, Z\rightarrow Y, X\rightarrow Y$: we thought the reviewer meant that data from $Z$ was unobserved during both training **and** counterfactual estimation. We now realize that the comment referred only to the counterfactual query. Since our DCM model approximates the true SCM $Y=f(X,Z,U_Y)$, if the value of $Z$ is missing from the query, our current approach cannot compute the counterfactual $\text{p}(Y_x|x')$, because our DCM model expects an input for $Z$. We agree that this is a reasonable concern, and we thank the reviewer for their input. We solely consider counterfactuals where all nodes are observed as evidence.

However, we would like to emphasize that this shortcoming is not unique to DCM; it also underlies VACA [1], CAREFL [2], the heteroscedastic noise model [3], etc. Note that the methodology of counterfactual estimation following Section 3.3 of [4], using functional causal models such as [5, 6, 7], likewise requires full observability of the variables for a counterfactual point estimate. Similarly, "Deep Structural Causal Models for Tractable Counterfactual Inference" [8] and "Sample-Efficient Reinforcement Learning via Counterfactual-Based Data Augmentation" [9] claim counterfactual inference but operate assuming the evidence is the entire graph. These methods all make the same assumption on the evidence of the counterfactual queries, so we would like to point out that **this is the norm, not the exception**. Although we understand the reviewer's concern, we want to emphasize that we compare DCM to other SOTA models in this realm, and DCM should not be penalized for this limitation. We are grateful for this discussion and will add explicit remarks about the assumption of observability during inference.

[1] "VACA: Design of Variational Graph Autoencoders for Interventional and Counterfactual Queries", Sanchez-Martin et al. (2021)
[2] "Causal Autoregressive Flows", Khemakhem et al. (2020)
[3] "Identifying Patient-Specific Root Causes with the Heteroscedastic Noise Model", Strobl et al. (2022)
[4] "Elements of Causal Inference", Peters et al. (2017)
[5] "Nonlinear causal discovery with additive noise models", Hoyer et al. (2008)
[6] "Causal Inference on Discrete Data using Additive Noise Models", Peters et al. (2009)
[7] "On the Identifiability of the Post-Nonlinear Causal Model", Zhang et al. (2009)
[8] "Deep Structural Causal Models for Tractable Counterfactual Inference", Pawlowski et al. (2020)
[9] "Sample-Efficient Reinforcement Learning via Counterfactual-Based Data Augmentation", Lu et al. (2020)