# CEO rebuttal
# Link to reviews (for convenience): https://notability.com/n/2nhH40_AxA7t2FXI~dF4dW
Thanks for reading our response.
Firstly, we are glad that many of the points in your original review have now been clarified. For example, as we clarified: we do combine interventional and observational data; CBO does not naturally extend to discrete or mixed variables; the objectives in Table 1 do not directly apply to our setting; and we have indeed compared to CBO in all our experiments. We hope these clarifications will be taken into consideration.
We now address the points raised here.
* > "The claim that CEO “extends” CBO is (technically) inaccurate [...] "
* You maintain that the assumption of causal sufficiency specifically prevents our work from extending CBO. We have motivated causal sufficiency extensively in our previous response and in the paper, providing substantial evidence that causal discovery (CD) itself (let alone causal global optimization under an unknown graph, unexplored before our work) still largely resorts to this assumption. The assumption is clearly stated in the paper, and we welcome any concrete suggestions to make it clearer that it is used in our work. Finally, we note that causal sufficiency (CS) is not a stronger assumption than assuming knowledge of the true graph. If it were, there would not be so much interest in the CD literature in developing methods that assume CS in order to discover the true graph. You also do not (at least, explicitly) acknowledge the important contribution (in theory and practice) that CEO's acquisition naturally handles noisy observations of the causal effect, while CBO's acquisition does not (noise is directly implied by the SCM formulation).
* It is unclear to us what
* > "utilize this additional assumption coherently? Systematically? Elegantly? "
* precisely means. We exploit causal sufficiency to define a tractable (closed-form) likelihood of the graph given observational and interventional data. In the presence of unobserved confounders, it would be complicated to define a likelihood of the graph; one could, e.g., model latent confounders as latent variables, which in general introduces an intractable marginalization into the problem. We hope this further clarifies its role; see also the sketch below.
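To make the role of causal sufficiency concrete, below is a minimal sketch (in Python; illustrative names, not our exact implementation) of a closed-form graph score: under the additive Gaussian noise SCM with GP priors (Eq. 5), the log-likelihood of a graph factorizes over nodes, and each factor is a GP marginal likelihood of a node given its parents. With unobserved confounders, no such per-node factorization would be available.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def log_graph_score(graph, data):
    """Closed-form log-likelihood of a candidate graph under causal sufficiency.

    graph: dict mapping node name -> tuple of parent names.
    data:  dict mapping node name -> (n,) array of samples.
    """
    total = 0.0
    for node, parents in graph.items():
        y = data[node]
        if not parents:
            # Root node: Gaussian log-likelihood with empirical mean/variance.
            mu, var = y.mean(), y.var() + 1e-8
            total += -0.5 * np.sum((y - mu) ** 2 / var + np.log(2 * np.pi * var))
        else:
            X = np.column_stack([data[p] for p in parents])
            gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(X, y)
            # The GP marginal likelihood integrates out the structural function.
            total += gp.log_marginal_likelihood_value_
    return total

def graph_posterior(graphs, data):
    """Posterior over an enumerable space of candidate graphs (uniform prior)."""
    scores = np.array([log_graph_score(g, data) for g in graphs])
    w = np.exp(scores - scores.max())  # numerically stable normalization
    return w / w.sum()
```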
* We first address your quote:
* > "Moreover, you should notice that CD-CBO is an actual extension of CBO, without any additional assumptions. On the other hand, CEO needs causal sufficiency as well as various other assumptions/heuristics such as equation 8 "
* ***This is incorrect***. **CD-CBO uses CD in its first part. Doing this as in CEO requires causal sufficiency (CS) for exactly the same reasons**. In case you refer to using other CD methods that are not applicable to our setting (these also employ CS) for a comparison with CEO (recall, CEO operates in the active learning setting, selecting sets *and* values), see lines 265-269 in the original paper. Also, as we said in our response, we believe it is misleading to portray CEO as making heuristic assumptions in the acquisition function that CBO allegedly does not make. We addressed this and improved the paper with a further explanation. Again, CBO uses CEI as its acquisition, a completely different objective. Note that CEI also does not have theoretical guarantees (regret bounds). This discussion is orthogonal to how CEO approximates its own objective, namely CES. Also, CBO's acquisition does not handle noise. Finally, you continue to deem the empirical comparison with CD-CBO inappropriate. However, you do not propose any specific or concrete *details* for an experiment that would resolve any *specific* doubt, as opposed to a generic "not being convinced". We stand by the original experimental setting. Further to your review, we provided additional evidence. We are always open to suggestions for additional experiments, as long as they are concrete, specific, and seek to answer a research question that has not been addressed.
* To address your quote:
* > "As background, note that the method proposed in the paper relies on tuning of the hyperparameters, which is standard in the literature"
* It is unclear what hyperparameters you refer to here. Regarding the threshold, you quote:
* > "It is not evident that this choice for the threshold and other hyperparameters favor CD-CBO"
* Firstly, again it is not clear what you mean by "other hyperparameters" - this is vague. Secondly, you mention "it is not evident", but provide no guidance on precisely *what evidence* you would like to see, beyond a generic (quote) "different sample size, graph size, and hyperparameters". We are making our best effort to interpret what you mean and to clarify. Any *settings* mentioned in your original review, such as initial data sizes, graph sizes, kernels, and number of acquisition points, are **the same across algorithms**, as mentioned in the paper and our response. It should be clear that, e.g., the initial data size is not a "hyperparameter" of the algorithm. There are no hyperparameters to tune that would somehow favour CEO against CD-CBO, and you mention none except the threshold. The threshold is also not a "hyperparameter"; rather, it is an external setting. For example, when solving a real-world problem, one would not attempt to "tune" the threshold to obtain better performance. Setting the threshold in CD-CBO to something lower (e.g. 80, 70 or 60 percent) defeats the purpose expressed in the algorithm's very name: it is supposed to *find the graph* first, then optimize. If it is not allowed to find the graph because it is stopped unreasonably early, it is not CD-CBO. Again, the reason for the existence of the threshold is simple: any practical method that uses a posterior which needs to concentrate around a true solution needs to decide when the true solution is found (see the sketch of this switching rule below). In practice, for numerical reasons one cannot get exactly to $1$; hence the threshold is needed. We will report results with other reasonable choices of threshold (greater than 90) in the final version of the paper, as they do not qualitatively change performance.
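For concreteness, here is a hypothetical sketch (illustrative names, not CD-CBO's actual code) of the switching rule under discussion; the threshold only decides when the posterior is considered concentrated enough to hand over to optimization:

```python
THRESHOLD = 0.90  # the conservative setting used in our experiments

def cd_cbo_step(posterior, discovery_step, cbo_step):
    """posterior: array of P(G|D) over the enumerated candidate graphs."""
    if posterior.max() < THRESHOLD:
        # Still in the discovery phase: pick interventions to identify the graph.
        return discovery_step()
    # Graph found for practical purposes: run CBO on the MAP graph.
    return cbo_step(posterior.argmax())
```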
To address your final paragraph:
> It could be extremely insightful, if the authors provided a thorough discussion supplemented by evidence that the graphs with larger posterior mass which are selected by CEO are the useful graphs for identifying the query of interest
In fact, **we have provided** a discussion and experimental evidence on this already in the original paper. When the true graph makes CBO perform significantly better than the wrong ones (as in Fig. 3 (a),(b)), this is, as mentioned in lines 308-309, evidence that the graph is useful for optimization. Therefore, since our acquisition balances finding the true graph and optimizing the objective jointly, CEO indeed focuses on the most useful graph for the query of interest. When the graph does not affect the causal prior significantly, as in Fig. 3 (c) where CBO on the true and wrong graphs performs similarly, CEO focuses on optimizing the objective. In Fig. 2, we also provided further intuition, on a simpler concrete example, of when a graph can be useful for causal optimization, and of how some graphs can be equivalent for this purpose.
> Given the current state of the work, even if it was convincing that CEO outperforms CD-CBO, the reason behind it is not clear/evident
While we gave explanations in the original paper for why this is the case in Section 5.2, we are happy to expand further on this in the final version with the additional space. The purpose of our paper is to study the problem of causal global optimization under an unknown graph. While we never found that CEO is outperformed by CD-CBO in any case, even if this happened in some settings, our contributions would still stand. If a modified version of our method such as CD-CBO (note, CD-CBO has not been proposed before) were better in some circumstances, this would be an additional interesting finding to analyse further. Outperforming all baselines in all settings, no matter the circumstances, is not the aim of our work. For example, as we clearly discussed and pointed out, it is expected that in many cases CBO on the true graph performs better than CEO. Again, the aim of our paper is to study the problem and provide understanding around this previously unaddressed setting.
## ORIGINAL REBUTTAL: OLD, IGNORE FROM HERE
## Reviewer m9mV
Thank you for your constructive feedback. We hope that our replies clarify all your points. We address each of your comments and suggestions below, point by point and in order. Any line numbers refer to the original paper, not the now revised version.
* **Extension of CBO**: "*\[...\] claims to be an extension of the CBO algorithm, while it fails to achieve this goal*". We respectfully disagree and clarify: CEO accounts for uncertainty over the causal graph structure, and additionally for uncertainty from observation noise, which is implied by the SCM. CBO takes neither of these into account. None of the works in [4,13,14] attempt to study what happens to the optimization when the graph is wrong (as we did, both in our experiments and in Section 3.2 and Figure 2). Further, the acquisition used by CBO, CEI [4], is based on the expected improvement, which is well known to be ill-defined in noisy settings [9,21]. In addition, note that CBO's prior only depends on observational data, while CEO's prior can handle both interventional and observational data. Finally, our acquisition function is of independent interest, and could be used e.g. for joint causal effect estimation and graph learning.
1. **Common confounders/Causal sufficiency**: We assume you mean unobserved confounders when mentioning "common confounders". We believe that a further methodological extension of CBO to handle **both** an unknown graph **and** unobserved confounders at the same time is out of scope for this paper. We exploit causal sufficiency (no unobserved confounders) to define a tractable likelihood of the graph given observational and interventional data. This assumption is widespread both in the broader causal discovery (CD) literature [1], and more specifically in Bayesian approaches to CD and inference [2,3,5]. While CBO can optimize the causal effect under unobserved confounders, it still cannot handle an unknown graph. Removing causal sufficiency when the graph is unknown is an ongoing research avenue in CD, with a large number of recent state-of-the-art CD methods **still** resorting to this assumption (e.g. [6,7,8,11,12,19]).
2. **Combination of interventional and observational data**: "*CEO can not handle more than one type of input data*". This is incorrect; we remark that CEO ***does*** allow for the combination of observational and interventional data. In fact, all our experiments use a combination of observational and interventional data. We now mention this explicitly just below Eqs. 3 and 4. Recall that $\mathcal{D}$ is defined (lines 85-86) as the union of interventional and observational data. The posterior $P(G|\mathcal{D})$ therefore depends on both data types (see the schematic form below) and enters the definition of both the surrogate models and the acquisition function. Finally, note that CBO's prior instead only depends on observational data.
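Schematically, writing (with illustrative notation) $\mathcal{D} = \mathcal{D}^O \cup \mathcal{D}^I$ for the union of observational and interventional data, and assuming, as in our model, that the two are conditionally independent given the graph, the posterior reads

$$P(G \mid \mathcal{D}) \;\propto\; P(G)\, p(\mathcal{D}^O \mid G)\, p(\mathcal{D}^I \mid G),$$

so both data types enter the graph posterior symmetrically.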
3. **Discrete or mixed continuous/discrete variables**: This is incorrect; CBO does not naturally extend to discrete variables, or to mixed continuous/discrete variables. Quoting CBO [4]: "Indeed, when all intervention variables $\mathbf{X}$ are binary, the CGO objective reduces to the causal MAB setting", and quoting the follow-up [13]: "Differently from MAB we consider interventions on continuous variables [...]". When the set of manipulative variables under consideration is discrete-valued, the problem targeted by CBO becomes that targeted by the literature on Causal Bandits (e.g. [18]). An in-depth comparison between the setting considered by CBO and CEO (i.e., causal global optimization) and Causal Bandits has been provided in previous works on CBO [4 Sec. 1.2, 2.3; 14 Sec. 1.1; 13 Sec. 1.2]. Finally, we are not aware of works handling a mix of discrete/continuous manipulative variables in the causal global optimization setting.
4. **Assumptions**: "*CEO makes assumptions about certain functional forms (e.g., Eq. 8) to make the inference step easier, while CBO offers a more natural inference flow.*". It is unclear to us what "natural inference flow" means, but we do our best to respond below. **Comparing CBO and CEO**: CEO's acquisition function does not make more restrictive functional-form assumptions than CBO's acquisition; they target different objectives. CBO uses CEI as its acquisition, which is based on the expected improvement (EI). In fact, it approximates the true CEI, which is given by an integral. CEO uses CES, which is inspired by Max-Value Entropy Search and also requires approximation. **Motivation behind Eq. 8**: Our Eq. 8 is an approximation to the true $p(y^\star | \mathcal{D})$, which appears in the definition of CES. The true $p(y^\star | \mathcal{D})$ is highly intractable and has no closed form. We employ a mixture of $|\mathbf{ES}|$ components, where each component is the local distribution of $y_{I}^{\star}$ (see the schematic form below). Recalling the definition of $y^\star$ as the global optimum (line 176) and that $y_{I}^\star=\max_{\mathbf{x}_I} f_{I}(\mathbf{x}_I)$ (where $I$ denotes the index of $\mathbf{X}_{I}$), the motivation behind this approximation is that (in the limit) the weights of the mixture distribution in Eq. 8 gradually concentrate around the optimal intervention set $\mathbf{X}^{\star}$, and $p(y^\star \mid \mathcal{D})$ turns into $p(y_{I}^{\star} \mid \mathcal{D})$ for the $\mathbf{X}_I$ s.t. $\mathbf{X}_I =\mathbf{X}^{\star}$, when using an Upper Confidence Bound (UCB) policy for $P(\mathbf{X}^\star)$. UCB has theoretical guarantees that can be found e.g. in [9]. We have now included a motivational paragraph on this below Eq. 8.
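Schematically, the mixture approximation described above reads

$$p(y^\star \mid \mathcal{D}) \;\approx\; \sum_{\mathbf{X}_I \in \mathbf{ES}} P(\mathbf{X}_I = \mathbf{X}^{\star})\; p(y_{I}^{\star} \mid \mathcal{D}),$$

with the weights $P(\mathbf{X}_I = \mathbf{X}^{\star})$ given by the UCB policy, so that in the limit the mixture collapses onto $p(y_{I}^{\star} \mid \mathcal{D})$ for the set attaining the global optimum.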
* **Theoretical claims/justifications**: We added an explicit list of the assumptions needed to ensure convergence of our $P(G| \mathcal{D})$ to a Dirac mass on the true graph $\mathcal{G}$ in Supplement Section B, while also mentioning them in Section 3.3 of the main paper. Briefly, these are: causal sufficiency; that all variables can be manipulated; that infinite samples can be obtained from each node; "minimality" (i.e. the joint in line 74 does not Markov-factorize w.r.t. any *sub-graph* of $\mathcal{G}$); and that the support of $P(G|\mathcal{D})$ includes $\mathcal{G}$. Regarding the acquisition, guarantees are harder to characterize and depend directly on the BO literature. For example, very common acquisition functions in BO such as Expected Improvement (on which CBO's acquisition is based), Entropy Search (ES) and Max-Value Entropy Search (MES) (from which CEO is partially inspired) have little or no theoretical guarantees (regret bounds); a very recent paper provides some for EI [17]. Note also that CBO targets a more challenging problem than BO, and our CEO targets *an even more challenging* one: guarantees for our setting are therefore even harder to obtain than for either CBO or BO.
* **Empirical comparison with CBO**: "*It is not evident from the experiments that CEO has significant superiority over CBO given its more restrictive assumptions*". We disagree, and the evidence is: CEO significantly outperforms CBO run on wrong graphs in Fig. 3(a,c,d) and Table 1. It is only outperformed by CBO *when run on the true graph*; this is acceptable and expected, as a major point of our work is that one can still solve the causal optimization problem *even if the graph is unknown*. It is also our contribution to show with experiments, and explain with intuition (Fig. 2), why exact knowledge of the graph is not always needed for good convergence to the optimum.
* **Computational complexity**: We have now added a paragraph in Section 3.3 about complexity and a new Section in the Supplement with more details. The complexity of CEO is driven by the computation of CES. Our implementation indeed uses KDEs, but only to estimate *univariate* marginal distributions, which is not very expensive (no curse of dimensionality). **Graphs**: We have stated clearly in lines 138-143 that in our setting, since the number of nodes is not too large, there is a tractable number of graphs that can be enumerated (similarly to related work [16]); the number of graphs therefore enters the complexity only as a constant factor. As mentioned in lines 138-139, larger spaces of graphs could be explored in future work by sampling with MCMC, as is common in the structure learning literature, at the price of losing our closed-form expectations over the graph. Work that explores the full space of DAGs with many nodes will necessarily introduce additional approximations. We also discussed scalability in the limitations in Section 6. It is a general limitation, not restricted to CEO, that in causal global optimization problems one needs to train $|\mathbf{ES}|$ GPs (in general $|\mathbf{ES}|$ will scale as $2^{|\mathbf{X}|}$), which does not scale to large graphs.
* **Additional hyperparameters**: our settings are consistent with those in all previous works on CBO [4,13,14] in number of initial samples, iterations, graph size, number of replicates, and choice of GP kernels. We are updating the Supplementary material to provide more details on these, and as mentioned in Section 5 we will share code. We provide some intuition regarding the main parameters. ***Initial data size***: one needs enough observational samples for the SCM functions to be well fitted (this is around $150$ to $200$ in our settings). Note all algorithms share the same fitted SCM functions. Second, one cannot have too many initial interventions, otherwise all algorithms converge too fast and differences cannot be detected. Having few initial interventions helps detect differences between algorithms. **Graph size**: first, we used all possible graphs of size $3$. Then, we explored SCMs corresponding to graphs of size $5, 6, 10$. These sizes are consistent with previous work on CBO, and comparable to the sizes of graphs used in real-world applications of CBO (econometrics [22; page 3, line 13]; healthcare, where e.g. drug cocktails for curing diseases ranging from cancer [23] to viral infections [24] include a very small number of drugs). As mentioned in our limitations in Section 6 and previously, scalability to larger graphs is challenging. This limitation is *directly inherited from CBO*. Reducing the search space for interventions, *with their cost taken into account*, is a challenging area of future research in causal global optimization. Regarding the total number of graphs, indeed in our setting we maintain a list of all possible graphs under consideration (we describe this in Sec. 3.2, lines 138-143), analogously to related work on causal discovery [16]. Because of this, we can compute $P(G|\mathcal{D})$ *in closed form*. This has the important benefit of reducing the overall degree of approximation in computing the joint MI on $G$ and $y^\star$. These advantages are complementary to the setting with many graphs, as previously discussed in related work with a similar setting (concerned only with structure learning) [16].
* **Sequential approach**: We stand by our experimental setting and believe it answered the specific research questions we posed in the paper. *The paper claims that it is possible to perform causal optimization when the graph is unknown, and that exact knowledge of the causal structure is not always needed for efficient causal global optimization*. We provide methodological reasoning, experimental evidence on 4 sets of SCMs, and intuition (Fig. 2) to back up these statements, and additionally for why, in important scenarios, first collecting interventions to discover the graph and only afterwards selecting them to optimize the causal effect can lead to slow convergence. **To provide additional evidence, we added an experiment with a comparison to CD-CBO on the Epidemiology example**. In order to compare here, since in the original settings CD-CBO coincides with CBO because the graph is found at initialization (as mentioned in lines 300-302), we updated the graph posteriors of all methods with interventions only. In this way, the graph is harder to find and we could indeed compare to CD-CBO.
* **Questions**:
* We ***have compared*** CEO to CBO in all our experiments. The main novelties are (a) the novel surrogate model definition with the graph integrated out, using a posterior that combines interventional and observational data, (b) a new acquisition function that performs causal optimization and graph learning jointly, and (c) an empirical study where we compare to CBO run on true and wrong graphs. Note that none of the previous works on CBO even studied its behaviour when simply run under wrong graphs, as we did in our work.
* The objectives mentioned in Table 1 **do not apply** to our setting of joint optimization and structure learning. They cannot be used as alternative acquisition functions in CEO. Indeed, we show them because they are either for BO or for structure learning only. The purpose of Table 1 is to relate our new acquisition to existing works in the related fields of BO and structure learning. No previous entropy-based acquisitions have been proposed for the causal global optimization problem of CBO.
* We have elaborated on our choice for the approximation of the optimal, intractable $p(y^\star | \mathcal{D})$ in point 4 above.
* We respectfully disagree with the statement that our comparison to CD-CBO is not fair. We are using established settings as used in all previous work on CBO [4,13,14], as also mentioned in the previous discussion on hyperparameters, and we have made every effort to put our algorithm in the fairest setting possible against both CBO and CD-CBO. In fact, regarding the threshold, our choice of $90\%$ is very conservative and favours CD-CBO. At $90\%$ mass, the algorithm already realizes the graph has been found and switches to optimizing. With a higher threshold, CD-CBO would spend more time just discovering the graph (already found for practical purposes), which is suboptimal for optimizing the causal effect. We did our best to explain all parameters involved in the previous bullet point and in the Supplementary, where we added more discussion. Our code will be public, and our CBO implementation for comparison is directly based on publicly available code from [14], which can be found at https://openreview.net/forum?id=VhMwt_GhDy9. We stand by the fairness and completeness of the original results and are providing additional evidence for a comparison to CD-CBO as mentioned above.
* As mentioned in point 3. above, previous works on CBO [4,13,14] discussed at length the relationships between the causal global optimization problem and that of causal bandits (e.g. [18]). As mentioned in our related work Section, a few recent studies attempted to perform optimization in the bandit setting when the graph is unknown. In particular, in [20] they also find that full knowledge of the graph is not always needed to efficiently perform optimization, supporting our findings but in the related bandit setting.
* We have elaborated on our comparison with a sequential approach in one of the bullet points above. **To provide additional evidence, we also added an experiment with a comparison to CD-CBO on the Epidemiology example**.
[1] Glymour, C., Zhang, K. and Spirtes, P., 2019. Review of causal discovery methods based on graphical models. Frontiers in genetics, 10, p.524.
[2] Gregory F. Cooper and Changwon Yoo. 1999. Causal discovery from a mixture of experimental and observational data. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence (UAI'99). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 116–125.
[3] Friedman, N. and Koller, D., 2003. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50, pp.95-125.
[4] Aglietti, V., Lu, X., Paleyes, A. and González, J., 2020, June. Causal bayesian optimization. In International Conference on Artificial Intelligence and Statistics (pp. 3155-3164). PMLR.
[5] Murphy, K.P., 2001. Active learning of causal Bayes net structure. Technical report, Department of Computer Science, University of California, Berkeley.
[6] Wang, B., Wicker, M.R. and Kwiatkowska, M., 2022, June. Tractable Uncertainty for Structure Learning. In International Conference on Machine Learning (pp. 23131-23150). PMLR.
[7] Cundy, C., Grover, A. and Ermon, S., 2021. Bcd nets: Scalable variational approaches for bayesian causal discovery. Advances in Neural Information Processing Systems, 34, pp.7095-7110.
[8] Lorch, L., Rothfuss, J., Schölkopf, B. and Krause, A., 2021. Dibs: Differentiable bayesian structure learning. Advances in Neural Information Processing Systems, 34, pp.24111-24123.
[9] Garnett, R., 2022. Bayesian Optimization. Cambridge University Press.
[10] Owen, A.B., 2019. Comment: Unreasonable Effectiveness of Monte Carlo. _Statistical Science_, _34_(1), pp.29-33.
[11] Toth, C., Lorch, L., Knoll, C., Krause, A., Pernkopf, F., Peharz, R. and von Kügelgen, J., 2022. Active Bayesian Causal Inference. arXiv preprint arXiv:2206.02063.
[12] Hägele, A., Rothfuss, J., Lorch, L., Somnath, V.R., Schölkopf, B. and Krause, A., 2022. BaCaDI: Bayesian Causal Discovery with Unknown Interventions. arXiv preprint arXiv:2206.01665.
[13] Aglietti, V., Damoulas, T., Álvarez, M. and González, J., 2020. Multi-task causal learning with gaussian processes. Advances in Neural Information Processing Systems, 33, pp.6293-6304.
[14] Aglietti, V., Dhir, N., González, J. and Damoulas, T., 2021. Dynamic Causal Bayesian Optimization. Advances in Neural Information Processing Systems, 34, pp.10549-10560.
[15] Hoyer, P., Janzing, D., Mooij, J.M., Peters, J. and Schölkopf, B., 2008. Nonlinear causal discovery with additive noise models. Advances in neural information processing systems, 21.
[16] von Kügelgen, J. et al., 2019. Optimal experimental design via Bayesian optimization: active causal structure learning for Gaussian process networks. NeurIPS 2019 Workshop "Do the right thing": machine learning and causal inference for improved decision making.
[17] Gupta, S., Rana, S. and Venkatesh, S., 2022, May. Regret Bounds for Expected Improvement Algorithms in Gaussian Process Bandit Optimization. In International Conference on Artificial Intelligence and Statistics (pp. 8715-8737).
[18] S. Lee, E. Bareinboim. Structural Causal Bandits: Where to Intervene? In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems, 2018.
[19] Wang, B., Wicker, M.R. and Kwiatkowska, M., 2022, June. Tractable Uncertainty for Structure Learning. In International Conference on Machine Learning (pp. 23131-23150). PMLR.
[20] Lu, Y., Meisami, A. and Tewari, A., 2021. Causal bandits with unknown graph structure. Advances in Neural Information Processing Systems, 34, pp.24817-24828.
[21] Frazier, P.I., 2018. Bayesian optimization. In Recent advances in optimization and modeling of contemporary problems (pp. 255-278). Informs.
[22] Imbens, G.W., 2020. Potential outcome and directed acyclic graph approaches to causality: Relevance for empirical practice in economics. Journal of Economic Literature, 58(4), pp.1129-79.
[23] Gilad, Y., Gellerman, G., Lonard, D.M. and O’Malley, B.W., 2021. Drug Combination in Cancer Treatment—From Cocktails to Conjugated Combinations. Cancers, 13(4), p.669.
[24] Shyr, Z.A., Cheng, Y.S., Lo, D.C. and Zheng, W., 2021. Drug combination therapy for emerging viral diseases. Drug discovery today, 26(10), pp.2367-2376.
## Reviewer sQxa
Thank you for appreciating the motivation, impact and presentation of our work, as well as for the constructive feedback. We address each of your comments and suggestions below, point by point and in order. Any line numbers refer to the original paper, not the now revised version.
* **Causal sufficiency**: We agree that lifting this assumption is a key direction for methodological improvement, not only in our context but also more generally in causal discovery (with no effect optimization involved) and in causal effect estimation. As mentioned in the response to Reviewer m9mV, the challenge of devising practical methods that do not give up other useful properties (in our case, e.g., a closed-form graph likelihood) is still open, as evidenced for example by many recent state-of-the-art causal discovery methods still resorting to causal sufficiency [1,2,3,4,7]. In the context of treatment effect estimation, recent works have attempted to deal with unobserved confounders by modelling them as latent variables [5,6]. It will be a challenging but interesting future direction to attempt to combine some of the ideas in these works with the causal global optimization setting under unknown graphs.
* **Graph posterior properties as scoring method**: Indeed, it is common to use some feature of the graph posterior as a scoring method in causal discovery works. Note that in our setting, our objective does not perform causal discovery only; the entropy is joint between $y^\star$ and $G$. Previous works analyse the properties of active learning based on entropy reduction for structure learning only [8] (other references in the paper). The submodularity of the entropy can in some settings give certain guarantees on rates of convergence when greedily maximizing it. However, this has not been analysed for a joint entropy between the graph and $y^\star$.
* **Computational complexity**: We have now added a paragraph in Section 3.3 about complexity and a new Section in the Supplement with more details. The complexity of CEO is driven by the computation of CES. The parameters influencing it are: $|\mathbf{ES}|$, $\max_{I} |\mathbf{X}_I|$, and the number of acquisition points $N$ (i.e., how many values of the intervened variable to consider, assuming the same number for all $\mathbf{X}_I$). Therefore, the total complexity of the acquisition is $\mathcal{O}(N \cdot \text{CES}(\mathbf{x}) \cdot \sum_{\mathbf{X}_I \in \mathbf{ES}} |\mathbf{X}_I| )$. Here, $\text{CES}(\mathbf{x})$ denotes the time needed to compute CES for a specific value $\mathbf{x}$, regardless of its corresponding $\mathbf{X}_I$. Note this is valid for CBO as well, replacing $\text{CES}(\mathbf{x})$ with $\text{CEI}(\mathbf{x})$, the acquisition used by CBO. However, $\text{CEI}(\mathbf{x})$ is cheaper than $\text{CES}(\mathbf{x})$; the biggest reason in practice is that CEI operations can be more easily vectorized. CES operations, however, can be parallelized over both $\mathbf{ES}$ and values of $\mathbf{x}_I$. Our implementation indeed uses KDEs, but only to estimate *univariate* marginal distributions, which is not very expensive (no curse of dimensionality). **Graphs**: We have stated clearly in lines 138-143 that in our setting, since the number of nodes is not too large, there is a tractable number of graphs that can be enumerated (similarly to related work [16]); this enumeration is therefore omitted in the above Big-O notation. As mentioned in lines 138-139, larger spaces of graphs could be explored in future work by sampling with MCMC, as is common in the structure learning literature, at the price of losing our closed-form expectations over the graph. Work that explores the full space of DAGs with many nodes will necessarily introduce additional approximations. We also discussed scalability in the limitations in Section 6. It is a general limitation, not restricted to CEO, that in causal global optimization problems one needs to train $|\mathbf{ES}|$ GPs (in general $|\mathbf{ES}|$ will scale as $2^{|\mathbf{X}|}$), which does not scale to large graphs (here exact GP inference is cubic in the number of collected interventions).
* **Including graph entropy in the objective**: We define the CES objective as the *joint* entropy of $y^\star$ and $G$ so that the mechanism by which interventions are selected *directly* depends on the graph posterior (see the schematic form below). You are correct in noticing that one use of the graph posterior is to provide more robust uncertainty quantification in the surrogate model. However, with the CES objective we exploit the graph posterior further by making our intervention selection mechanism directly dependent on it. This means that interventions will be selected not only to minimize the causal effect, but also to (jointly) reduce the graph posterior entropy.
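In schematic form (modulo the approximations discussed in the paper), CES scores a candidate intervention by the expected reduction in the joint entropy of $(y^\star, G)$, i.e. the mutual information between the intervention outcome and the pair $(y^\star, G)$:

$$\mathrm{CES}(\mathbf{X}_I, \mathbf{x}_I) = H\big(y^\star, G \mid \mathcal{D}\big) - \mathbb{E}_{y}\Big[ H\big(y^\star, G \mid \mathcal{D} \cup \{(\mathbf{X}_I, \mathbf{x}_I, y)\}\big)\Big],$$

where the expectation is over the predicted outcome $y$ of the candidate intervention.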
* **Less central comments / questions**:
* **Line 21**: We have rephrased the sentence to reflect this.
* **Line 28**: We will rephrase this to reflect our aim.
* **Line 76**: Yes, here "interventions" should say more specifically "intervention sets", as we call them, or "intervenable sets". We have fixed this.
* **Line 96**: The problem in CEO is still to minimize the true causal effect, which is conditioned on the true graph $\mathcal{G}$. The method to solve this (the CEO algorithm), uses a posterior over graphs to take into account its uncertainty in the surrogate models. However, the true quantity to be minimized is still the true $\mathbb{E}[Y | \text{do}(\mathbf{X}_I = \mathbf{x}_I), \mathcal{G}]$. We do not have direct *access* to this quantity ((1) because the graph is unknown, but also (2) because we only get samples from $p(Y | \text{do}(\mathbf{X}_I = \mathbf{x}_I), \mathcal{G})$; note (2) is true for CBO as well), but it is still the quantity that the method aims to minimize.
* **Line 142**: thanks, we will rephrase this.
* **Line 186**: this is a good point; in real examples, this will be given by domain knowledge. In our setting, we used the convention of CBO where intervening on $1$ node has cost $1$, while on two nodes has cost $2$, etc.
* **Line 200**: indeed, it refers only to the structure. We will explicitly mention that parameters (i.e. SCM functions in our case) can be integrated out in closed form thanks to GPs.
* **Line 250**: thanks, we will add this in the main text.
* **Line 273**: We omitted the mathematical expression due to space restrictions, but will include it (with some intuition) once the 9-page limit is extended for the camera-ready.
* **Line 317**: We highlight the amount of data at the beginning of the experiments in Section 5 and will highlight the variables too.
* **Limitations discussion**: we fully agree on this and will extend the Section in the main paper once the 9-page limit is extended for the camera-ready.
[1] Cundy, C., Grover, A. and Ermon, S., 2021. Bcd nets: Scalable variational approaches for bayesian causal discovery. Advances in Neural Information Processing Systems, 34, pp.7095-7110.
[2] Lorch, L., Rothfuss, J., Schölkopf, B. and Krause, A., 2021. Dibs: Differentiable bayesian structure learning. Advances in Neural Information Processing Systems, 34, pp.24111-24123.
[3] Toth, C., Lorch, L., Knoll, C., Krause, A., Pernkopf, F., Peharz, R. and von Kügelgen, J., 2022. Active Bayesian Causal Inference. arXiv preprint arXiv:2206.02063.
[4] Hägele, A., Rothfuss, J., Lorch, L., Somnath, V.R., Schölkopf, B. and Krause, A., 2022. BaCaDI: Bayesian Causal Discovery with Unknown Interventions. arXiv preprint arXiv:2206.01665.
[5] Louizos, C., Shalit, U., Mooij, J.M., Sontag, D., Zemel, R. and Welling, M., 2017. Causal effect inference with deep latent-variable models. Advances in neural information processing systems, 30.
[6] Wu, P. and Fukumizu, K., 2021. Intact-VAE: Estimating treatment effects under unobserved confounding. arXiv preprint arXiv:2101.06662.
[7] Wang, B., Wicker, M.R. and Kwiatkowska, M., 2022, June. Tractable Uncertainty for Structure Learning. In International Conference on Machine Learning (pp. 23131-23150). PMLR.
[8] Murphy, K.P., 2001. Active learning of causal Bayes net structure. Technical report, Department of Computer Science, University of California, Berkeley.
## Reviewer Lswg
Thank you for your constructive feedback, the interesting suggestions of related works, and the positive comments. Thank you also for explicitly proposing to increase your score. We address each of your comments and suggestions below, point by point and in order. Any line numbers refer to the original paper, not the now revised version.
* **Consistency of posterior updates**: We are writing an explicit list of the assumptions needed to ensure convergence of our $P(G)$ to a Dirac mass on the true graph $\mathcal{G}$ in Supplement Section B, while also mentioning them in Section 3.3 of the main paper. Briefly, these are: causal sufficiency; that all variables can be manipulated; that infinite samples can be obtained from each node; "minimality" (i.e. the joint in line 74 does not Markov-factorize w.r.t. any *sub-graph* of $\mathcal{G}$); and that the support of $P(G)$ includes $\mathcal{G}$. Being able to intervene on all nodes assures us that the interventional Markov Equivalence Class (MEC) can be discerned [3]. As you noticed, when some variables are non-manipulable, it is not guaranteed that the graph can be discovered. However, a key contribution of our work is precisely to show that exact knowledge of the causal graph is not always needed to perform efficient optimization. Related work using active learning methods for causal discovery based on mutual information provides some finite-sample guarantees. Note that there are certain challenges in our setting compared to previous related work. For example, [1] does not deal with nonparametric SCMs, and assumes infinite observational samples. Similarly, [2] considers information-theoretic objectives and proves a certain kind of sub-optimality, which however is not directly applicable to our setting (to the best of our understanding), since they also assume infinite observational samples.
* **Paragraph on the other uses of entropy in causality**: thank you for the suggestion; we added a paragraph in Related Work (main paper, Section 4) discussing your recommended papers and others, and we will expand more thoroughly in a new Section in the Supplement. Indeed, entropy and information-theoretic objectives have a large literature in all aspects of causality. With our work in particular, we connect two areas which have both used information-theoretic objectives previously: Bayesian Optimization (BO) and Causal Discovery (CD) (see our Table 1). In [5], the framework of Entropic Causal Inference aims at causal discovery between two categorical variables from observational data. It exploits an interesting assumption on the entropy of the exogenous variable in the (single) SCM function, which differs from much of the literature on bivariate discovery. A related notion called *common entropy* for discrete random variables is discussed in [7], which generalizes the original formulation with Rényi divergences. They show how to use this notion to improve classic algorithms such as PC and FCI. The works [4,6] discuss the use of entropy in larger Causal Bayesian Networks. A challenge in applying the insights of [4,6] to our setting is that we require the entropy of $p(y^\star | \mathcal{D})$, which has no simple relation to the causal factors of the SCM. However, if the local $p(y_{I}^\star | \mathcal{D})$ could be expressed as a tractable function of the SCM factors, this would greatly simplify the computation of the entropy.
* **Further clarifications**:
* **Notation**: Thanks for noticing, we had mixed up old and new notations. Eq. 3 and 4 now use $\mathbb{E}[ \mathbb{E}[Y | \text{do}(\mathbf{X}_I = \mathbf{x}_I), G]]$ and $\mathbb{V}[ \mathbb{E}[Y | \text{do}(\mathbf{X}_I = \mathbf{x}_I), G]]$.
* **Optimal** $y^\star$: In words: first, $y_{I}^\star$ is the best value of the causal effect that can be achieved in the surrogate model corresponding to intervention set $\mathbf{X}_I$. The *global* optimum $y^\star$ is the best value among all such $y_{I}^\star$'s. Recall from Eq. 1 that our goal is to *minimize* (it could be maximize) the causal effect; even when we know the true graph, there is one causal effect per $\mathbf{X}_I$, as Eq. 1 indicates. Thus, we want the minimum across all of these. Formally, the "optimal" $y^\star$ is defined as (line 176) $y^\star = \min_{I} y^{\star}_I$. Further, $y_{I}^\star=\min_{\mathbf{x}_I} \mathbb{E}[Y \mid \text{do} ({\mathbf{X}_I} = {\mathbf{x}_{I}}), G]$ (throughout, $I$ denotes the index of a selected intervention set, recalling $\mathbf{X}_I \in \mathbf{ES}$). We hope this clarifies this point.
* See discussion on posterior above.
* **On joint entropy**: indeed, without restrictions on the densities, even differential entropy alone is unbounded in general, and so is our joint entropy definition (see its form spelled out below). In our context, the densities of interest are the univariate marginals $p(y_{I}^\star | \mathcal{D})$, one per surrogate model. These marginals (and their differential entropy) have been considered in the context of BO and used in Max-Value Entropy Search (MES) and its further extensions [10,11,12]. The true density here is intractable; it is defined implicitly by taking the maximum over sampled GP paths. These types of distributions have been studied rigorously in [13]. Further, note that the distribution $p(y^\star | \mathcal{D})$ is even more intractable, as it is defined implicitly in terms of the individual r.v.'s $y_{I}^\star$. We are not aware of previous work in the BO literature that provides guarantees on the boundedness of the differential entropy of $p(y_{I}^\star | \mathcal{D})$ when estimated with KDEs (or other estimators). Characterizing $p(y^\star | \mathcal{D})$ is even harder. The book [17] provides general conditions for checking the existence and boundedness of a differential entropy. The work in [19, Lemma 2.3 and Eq. 2.2] shows that the standard definition of joint entropy for discrete or continuous variables extends analogously to the mixed case, as we now point out in our revised version of the paper below Eq. 7.
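For reference, the mixed discrete-continuous joint entropy in the sense of [19] takes the form

$$H\big(y^\star, G \mid \mathcal{D}\big) = -\sum_{g} \int p\big(y^\star, G = g \mid \mathcal{D}\big)\, \log p\big(y^\star, G = g \mid \mathcal{D}\big)\, \mathrm{d}y^\star,$$

i.e., a sum over the discrete support of $G$ combined with an integral over the continuous $y^\star$.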
* **Additive noise**: thank you for raising this; indeed, it deserves more discussion. We added a paragraph on "parametric assumptions" in Supplement Section H. Firstly, the additive Gaussian noise (AGN) assumption combined with GPs on the structural equations allows us to define the graph likelihood in closed form. It also implies that GP posteriors on the structural functions are available in closed form. When AGN does not hold, one would need to resort to the literature on variational GPs [14,15], which makes it possible to perform inference with GPs under non-Gaussian likelihoods. Secondly, the AGN assumption also allows us to deal with noisy Bayesian optimization. Recall that, even if we knew the graph, we do not get to observe $\mathbb{E}[Y \mid \text{do} ({\mathbf{X}_I} = {\mathbf{x}_{I}}), G]$, but can only get samples from $p(Y \mid \text{do} ({\mathbf{X}_I} = {\mathbf{x}_{I}}), G)$. BO with non-Gaussian noise is an active area of research; see e.g. [16], who allow for sub-Gaussian noise. The optimization becomes more sophisticated, and note that CEO (and CBO) needs to deal with multiple surrogate models (one for each $\mathbf{X}_I$). Extending these methods from the BO setting to CBO and CEO is therefore a future area of research.
* **Other parametric assumptions**: Beyond the previously discussed additive Gaussian noise assumption, in our experiments we used radial basis function kernels, consistent with previous works on CBO. These were appropriate kernels for the SCMs we studied in the experiments; however, for nonstationary causal effect functions one could use nonstationary GP kernels.
* **Line 201**: indeed, the factorization $P(G \mid y^{\star}, \mathcal{D})\, p\left(y^{\star} \mid \mathcal{D}\right)$ in our line 201 is without loss of generality (see the identity spelled out below). However, one could consider approximating the alternative factorization $p\left(y^{\star} \mid G, \mathcal{D}\right) P(G \mid \mathcal{D})$ instead. We discussed this in Supplement Section H.
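Spelled out, both factorizations are exact instances of the chain rule, applied in either order:

$$p\big(y^{\star}, G \mid \mathcal{D}\big) = P\big(G \mid y^{\star}, \mathcal{D}\big)\, p\big(y^{\star} \mid \mathcal{D}\big) = p\big(y^{\star} \mid G, \mathcal{D}\big)\, P\big(G \mid \mathcal{D}\big);$$

the choice only affects which conditional must be approximated.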
* **Assumption in line 211**: This is about our proposed approximation in Eq. 8 to the true, intractable $p(y^\star | \mathcal{D})$. While we described it as a heuristic, in the sense that it does not lead to a guarantee on the number of interventional samples needed to reach $y^\star$, it is consistent: as explained in lines 217-220, as interventions are collected the mixture concentrates on $p(y_{I^{\text{best}}}^\star | \mathcal{D})$, where $I^{\text{best}}$ is the index of the surrogate model that actually contains the global optimum. This is because $P(\mathbf{X}_I = \mathbf{X}^{\star})$ is modelled via an Upper Confidence Bound (UCB) policy, which has known theoretical guarantees (see e.g. [18]). Then, recall that we obtain approximations to the individual $p(y_{I}^\star | \mathcal{D})$ with KDEs, which make no parametric assumptions and are generally consistent under mild conditions. The entropy of a KDE is then computed accurately with numerical quadrature (see the sketch below). Other approximations to the true, intractable $p(y^\star | \mathcal{D})$ could certainly be proposed. Our approximation is a natural choice as it is based directly on approximations of the local $p(y_{I}^\star | \mathcal{D})$, which have been studied in the BO literature on MES [10,11,12]. We have now expanded on this in Supplement Section M.
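As a minimal, self-contained sketch of the KDE-plus-quadrature step described above (assuming SciPy; not our exact implementation):

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.integrate import quad

def kde_differential_entropy(y_star_samples):
    """Differential entropy of a univariate KDE fitted to samples of y_I^*
    (e.g. optima of sampled GP posterior paths), via numerical quadrature."""
    kde = gaussian_kde(y_star_samples)
    lo = y_star_samples.min() - 3 * y_star_samples.std()
    hi = y_star_samples.max() + 3 * y_star_samples.std()

    def integrand(y):
        p = kde(y)[0]
        return -p * np.log(p) if p > 0 else 0.0

    h, _ = quad(integrand, lo, hi, limit=200)
    return h

# Sanity check: for near-Gaussian samples the result should be close to the
# analytical value 0.5 * log(2 * pi * e * sigma^2).
rng = np.random.default_rng(0)
print(kde_differential_entropy(rng.normal(0.0, 1.0, size=500)))
```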
[1] Agrawal, R., Squires, C., Yang, K., Shanmugam, K. and Uhler, C., 2019, April. Abcd-strategy: Budgeted experimental design for targeted causal structure discovery. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 3400-3409). PMLR.
[2] Greenewald, K., Katz, D., Shanmugam, K., Magliacane, S., Kocaoglu, M., Boix Adsera, E. and Bresler, G., 2019. Sample efficient active learning of causal trees. Advances in Neural Information Processing Systems, 32.
[3] Hauser, A. and Bühlmann, P., 2015. Jointly interventional and observational data: estimation of interventional Markov equivalence classes of directed acyclic graphs. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(1), pp.291-318.
[4] Steudel, B. and Ay, N., 2015. Information-theoretic inference of common ancestors. Entropy, 17(4), pp.2304-2327.
[5] Kocaoglu, M., Dimakis, A.G., Vishwanath, S. and Hassibi, B., 2017, February. Entropic causal inference. In Thirty-First AAAI Conference on Artificial Intelligence.
[6] Miklin, N., Abbott, A.A., Branciard, C., Chaves, R. and Budroni, C., 2017. The entropic approach to causal correlations. New Journal of Physics, 19(11), p.113041.
[7] Kocaoglu, M., Shakkottai, S., Dimakis, A.G., Caramanis, C. and Vishwanath, S., 2020. Applications of common entropy for causal inference. Advances in neural information processing systems, 33, pp.17514-17525.
[8] Janzing, D., 2021. Causal versions of maximum entropy and principle of insufficient reason. Journal of Causal Inference, 9(1), pp.285-301.
[9] Shanmugam, K., Kocaoglu, M., Dimakis, A.G. and Vishwanath, S., 2015. Learning causal graphs with small interventions. Advances in Neural Information Processing Systems, 28.
[10] Wang, Z. and Jegelka, S., 2017, July. Max-value entropy search for efficient Bayesian optimization. In International Conference on Machine Learning (pp. 3627-3635). PMLR.
[11] Belakaria, S., Deshwal, A. and Doppa, J.R., 2019. Max-value entropy search for multi-objective bayesian optimization. Advances in Neural Information Processing Systems, 32.
[12] Moss, H.B., Leslie, D.S., Gonzalez, J. and Rayson, P. GIBBON: General-purpose Information-Based Bayesian Optimisation. Journal of Machine Learning Research, 2022.
[13] Adler, R.J. and Taylor, J.E., 2007. Random fields and geometry (Vol. 80). New York: Springer.
[14] Titsias, M., 2009, April. Variational learning of inducing variables in sparse Gaussian processes. In Artificial intelligence and statistics (pp. 567-574). PMLR.
[15] Hensman, J., Fusi, N. and Lawrence, N.D., 2013. Gaussian processes for big data. In Uncertainty in Artificial Intelligence (UAI).
[16] Kirschner, J., Bogunovic, I., Jegelka, S. and Krause, A., 2020, June. Distributionally robust Bayesian optimization. In International Conference on Artificial Intelligence and Statistics (pp. 2174-2184). PMLR.
[17] Cover, T.M. and Thomas, J.A., 1999. Elements of Information Theory. John Wiley & Sons.
[18] Garnett, R., 2022. Bayesian Optimization. Cambridge University Press.
[19] Marx, A., Yang, L. and van Leeuwen, M., 2021. Estimating conditional mutual information for discrete-continuous mixtures using multi-dimensional adaptive histograms. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM) (pp. 387-395). Society for Industrial and Applied Mathematics.
## Reviewer yY4K
Thank you for your constructive feedback and for appreciating the contributions of our work. We appreciate that you found the work original and relevant, and the experiments well designed. We address each of your comments and suggestions below, point by point and in order. Any line numbers refer to the original paper, not the now revised version.
* **Problem statement**: indeed, our problem statement generalizes that of CBO. We revised the sentence in the problem statement to clarify that either the graph needs to be estimated directly, or, as in our work, one can maintain a posterior over graphs and take the uncertainty into account when making decisions that require graph knowledge.
* **Notation**: Apologies for the confusion. We have now clarified this by specifying it right before Eqs. 3 and 4. The symbols $G$ and $\mathcal{G}$ denote two different things: the former denotes the *random variable* with support, say, in the set $\{g_1,g_2,\mathcal{G} \}$ and probability mass function $P(G)$; the latter, $\mathcal{G}$, denotes the true graph. Hence, in this example $g_1$ and $g_2$ would be two alternative wrong graphs. Further, note that we sometimes write $\mathbb{E}[Y | \text{do}(\mathbf{X}_I = \mathbf{x}_I), \mathcal{G}]$ as a shorthand for $\mathbb{E}[Y | \text{do}(\mathbf{X}_I = \mathbf{x}_I), G = \mathcal{G}]$. Also, note that we have to show the dependence on $G$ in Eq. 1 before introducing our latent variable formulation, so there we also write this dependence as $\mathbb{E}[Y | \text{do}(\mathbf{X}_I = \mathbf{x}_I), \mathcal{G}]$.
* **Space of DAGs**: As we mention in lines 134-139, in our setting, since the number of nodes is not too large, there is a tractable number of graphs that can be enumerated. In this setting, we represent each graph as an adjacency list in our implementation. As mentioned in lines 138-139, larger spaces of graphs could be explored in future work by sampling with MCMC, as is common in the structure learning literature [2,3], at the price of losing our closed-form expectations over the graph in the surrogate model and in the acquisition. Work that explores the full space of DAGs with many nodes will necessarily introduce additional approximations. We also discussed scalability in the limitations in Section 6. It is a general limitation, not restricted to CEO, that in causal global optimization problems one needs to train $|\mathbf{ES}|$ GPs (in general $|\mathbf{ES}|$ will scale as $2^{|\mathbf{X}|}$), which does not scale to large graphs. We also note that many applied problems of interest for CBO only admit graphs with very few manipulative variables, for example in econometrics [1; Line 13, page 3] and healthcare (e.g. drug cocktails for curing diseases ranging from cancer [4] to viral infections [5] include a very small number of drugs); in these applications, the cost $\text{Co}(\mathbf{X}_I,\mathbf{x}_I)$ of performing a real intervention (e.g. administering a drug) will also matter more than the cost of running simulations.
* **Recent related work on mixtures of interventional/observational data**: Thanks for the reference recommendations. The second reference was already cited in our related work, and we have now added the first one.
[1] Imbens, G.W., 2020. Potential outcome and directed acyclic graph approaches to causality: Relevance for empirical practice in economics. Journal of Economic Literature, 58(4), pp.1129-79.
[2] Eaton, Daniel, and Kevin Murphy. "Bayesian structure learning using dynamic programming and MCMC." arXiv preprint arXiv:1206.5247 (2012).
[3] Kuipers, J., Suter, P. and Moffa, G., 2022. Efficient sampling and structure learning of Bayesian networks. Journal of Computational and Graphical Statistics, pp.1-12.
[4] Gilad, Y., Gellerman, G., Lonard, D.M. and O’Malley, B.W., 2021. Drug Combination in Cancer Treatment—From Cocktails to Conjugated Combinations. Cancers, 13(4), p.669.
[5] Shyr, Z.A., Cheng, Y.S., Lo, D.C. and Zheng, W., 2021. Drug combination therapy for emerging viral diseases. Drug discovery today, 26(10), pp.2367-2376.
## Reviewer dJyF
Thank you for your constructive feedback and for appreciating the contributions of our work. We address each of your comments and suggestions below, point by point and in order. Any line numbers refer to the original paper, not the now revised version.
* **Sample efficient active learning of causal trees**: We agree that finding an optimal or near-optimal experimental design criterion is an interesting direction for future work. We have added this to Section 6, where we discuss future work. To the best of our understanding, the results in [1] do not directly apply to our setting, because they assume infinite observational data (as mentioned in their abstract) and do not need to optimize a causal effect. It would indeed be very interesting if the techniques could be extended to our setting.
* **Line 101**: We agree, we have corrected this statement to "unified" solution.
* **Line 253**: We apologise for the confusion here. What we meant was simply that no previous work has explicitly attempted to perform CD and then CBO in sequence (to the best of our knowledge, this is the first work to consider an unknown graph at all *in the context of CBO*), although it is a very natural baseline, which we indeed used. We have removed the confusing sentence, as you suggested.
* **Equation (2)**: Apologies for the confusion. We have now fixed the notation (we had mixed old and new notation between Eq. 2 and Eqs. 3, 4). To be clear: $\widehat{\mathbb{E}}\left[Y \mid \operatorname{do}\left(\mathbf{X}_{I}=\mathbf{x}_{I}\right)\right]$ is the object we use as our mean function, and it is defined as the marginal expectation of $Y$ with the graph integrated out. The "hat" denotes that it is an approximation, since observational data with the do-calculus is used in the prior to approximate these terms. Therefore, it can be rewritten as Eq. 3. Secondly, $\widehat{\mathbb{V}}\left[Y \mid \operatorname{do}\left(\mathbf{X}_{I}=\mathbf{x}_{I}\right)\right]$ is the marginal variance of $Y$ with the graph integrated out. Therefore, it can be rewritten *as the sum of the second and third terms* of Eq. 4 by using the law of total variance (the first term is a chosen base kernel function); see the decomposition spelled out below. We hope this clarifies.
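Spelled out, the law of total variance decomposition used above reads, with $G \sim P(G \mid \mathcal{D})$,

$$\widehat{\mathbb{V}}\big[Y \mid \operatorname{do}(\mathbf{X}_{I}=\mathbf{x}_{I})\big] = \mathbb{E}_{G}\Big[\mathbb{V}\big[Y \mid \operatorname{do}(\mathbf{X}_{I}=\mathbf{x}_{I}), G\big]\Big] + \mathbb{V}_{G}\Big[\mathbb{E}\big[Y \mid \operatorname{do}(\mathbf{X}_{I}=\mathbf{x}_{I}), G\big]\Big],$$

whose two terms are the second and third terms of Eq. 4, added to the chosen base kernel.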
* **Line 128**: We note that this noise term is not only good for robustness; it is directly implied by the assumptions in Eq. 5. Indeed, given these assumptions, any interventional density $p(Y | \text{do}(\mathbf{X}_{I} = \mathbf{x}_I), G = \mathcal{G})$ will be a Gaussian (being a product of Gaussian densities). However, one cannot assume that the mean of this distribution, $\mathbb{E}[Y | \text{do}(\mathbf{X}_{I} = \mathbf{x}_I), G = \mathcal{G}]$ (the causal effect), is known and can be observed. By intervening, we only get a sample from $p(Y | \text{do}(\mathbf{X}_{I} = \mathbf{x}_I), G = \mathcal{G})$. In short, because we know $p(Y | \text{do}(\mathbf{X}_{I} = \mathbf{x}_I), G = \mathcal{G})$ is a Gaussian (again, assuming Eq. 5), we know $Y$ can be written in the form of line 128, and therefore we can estimate the noise during optimization. To our understanding, this cannot easily be done by modifying the original kernel (as in noisy BO). To see this, notice for example that when we intervene on the parents of $Y$, $p(Y | \text{do}(\mathbf{X}_{I} = \mathbf{x}_I), G = \mathcal{G})$ is just one of the Gaussians from Eq. 5. When one intervenes more generally on a set of variables that does not coincide with the parents, the resulting distribution will not be one of the Gaussians from Eq. 5, but will be given by a product of those (as the truncated factorization applies; see below).
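Schematically, with $\mathbf{V}$ denoting the non-intervened variables, the truncated factorization behind this observation reads

$$p\big(Y \mid \operatorname{do}(\mathbf{X}_{I}=\mathbf{x}_{I}), G = \mathcal{G}\big) = \int \prod_{V_j \in \mathbf{V}} p\big(V_j \mid \mathrm{pa}_{\mathcal{G}}(V_j)\big)\Big|_{\mathbf{X}_{I}=\mathbf{x}_{I}} \, \mathrm{d}\big(\mathbf{V} \setminus \{Y\}\big),$$

where each factor is Gaussian under Eq. 5; when $\mathbf{X}_I$ coincides with the parents of $Y$, the product collapses to the single factor for $Y$.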
[1] Greenewald, K., Katz, D., Shanmugam, K., Magliacane, S., Kocaoglu, M., Boix Adsera, E. and Bresler, G., 2019. Sample efficient active learning of causal trees. Advances in Neural Information Processing Systems, 32.