## Abstract (pro)

The predictive power of machine learning (ML) methods is increasingly being harnessed to estimate personalized (i.e., heterogeneous) treatment effects. However, ML methods can predict poorly for atypical or underrepresented groups of inputs even when they predict well on average, which is especially concerning when equity considerations are at stake. In this work, through a variety of datasets and estimation techniques (including different "metamodels"), we demonstrate that the same issues arise in practice when estimating heterogeneous treatment effects. We then evaluate the ability of group robustness methods designed for the prediction case (e.g., Group Distributionally Robust Optimization) to mitigate the discrepancies between groups in the causal case. Throughout, we pay particular attention to uncertainty quantification, which has received less attention in the prediction case.

## Critical review (con)

This work rests on two contributions: (1) showing (empirically) that poor group predictions translate to poor causal effect estimates, and (2) evaluating (empirically) the improvement from using a group-robust loss (i.e., for the prediction function plugged into the causal effect estimator). The emphasis on uncertainty quantification is appreciated. While this issue has seemingly not received significant attention from the causal ML community, I claim that the conclusions in (1) should be expected and that the methods evaluated in (2) are not really novel. Therefore, this paper should be judged on the quality of its experiments, essentially as a "bake sale" paper (which has value, but requires a high standard of quality and may not be appropriate for some venues).

To see why (1) is expected, consider that in the setting the authors study, the potential outcome distribution $P(Y^t|X)$ can be replaced with the conditional probability distribution $P(Y|t, X)$ and thus estimated from data. This implies that any errors in $P(Y|t, X)$ will show up directly as errors in $P(Y^t|X)$. For example, with a T-learner, $\hat\tau(x) = \hat\mu_1(x) - \hat\mu_0(x)$, so the error in the effect estimate is bounded by the sum of the two outcome-model errors: $|\hat\tau(x) - \tau(x)| \le |\hat\mu_1(x) - \mu_1(x)| + |\hat\mu_0(x) - \mu_0(x)|$. In fact, the authors' own data reveal this result. Here is a scatter plot of the prediction error vs. the causal effect error:

<img src="https://i.imgur.com/Hzmc8iQ.png" width="50%" height="50%">

(The colors correspond to different random datasets and the symbols to different random initializations.) Notice there is a strong correlation between prediction errors and heterogeneous causal effect errors. In case the authors counter that they have discovered an interesting relationship, I want to emphasize that it should not be a surprise that this relationship exists.

To see why (2) is not novel, consider that the whole point of the "metamodel" framework the authors use is that *any* model (i.e., prediction function) can be plugged into the same causal estimator (i.e., the metamodel). By the logic in the previous paragraph, if one cares about the group robustness of the causal effect, then one should plug in a model that cares about the group robustness of the prediction. The same holds for any other desirable criterion, such as interpretability or sparsity. I argue this is not a novel idea because it is the whole point of metamodels: to separate the model from the causal effect estimation procedure. While the approach is novel in the narrow sense that it has not been evaluated in the context of group robustness, the idea itself is not; it is just an application of metamodels and group robustness as they are meant to be applied.
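To make the plug-in point concrete, here is a minimal sketch of a T-learner metamodel. This is illustrative only: the `TLearner` class and the `GroupDRORegressor` name are hypothetical, the latter standing in for any group-robust learner exposing a standard fit/predict interface; none of this is the paper's actual code.

```python
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingRegressor

class TLearner:
    """CATE estimator tau(x) = mu_1(x) - mu_0(x).

    `base_model` can be any regressor; the estimator never inspects it.
    """
    def __init__(self, base_model):
        self.mu0 = clone(base_model)  # outcome model for the control arm
        self.mu1 = clone(base_model)  # outcome model for the treated arm

    def fit(self, X, t, y):
        # Fit each outcome model on its own treatment arm (t is a 0/1 array).
        self.mu0.fit(X[t == 0], y[t == 0])
        self.mu1.fit(X[t == 1], y[t == 1])
        return self

    def predict(self, X):
        # Estimated heterogeneous effect: difference of predicted outcomes.
        return self.mu1.predict(X) - self.mu0.predict(X)

# Plugging in a standard ERM model:
#   cate = TLearner(GradientBoostingRegressor()).fit(X, t, y).predict(X)
# Plugging in a group-robust model (hypothetical class implementing,
# e.g., Group DRO with the same fit/predict interface) is the same call:
#   cate = TLearner(GroupDRORegressor()).fit(X, t, y).predict(X)
```

The design point is that group robustness (or sparsity, interpretability, etc.) is obtained purely by swapping the plug-in model, which is exactly what the metamodel framework is for.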
If the authors had brought attention to an important issue and done something novel about it, that would be a strong paper. However, bringing attention to an issue that should not come as a surprise to the community, and evaluating only off-the-shelf solutions, qualifies this paper as a "bake sale" paper only.

## My take

It's possible that in the course of doing experiments we'll discover something interesting, and that will become the focus of the paper. If that doesn't happen, I think this project is missing the answer to this question: *what's different about the causal case?* I think we need to show that something is different to motivate the paper. Because if you really can take off-the-shelf group-robustness approaches and plug them into standard causal effect estimators, then it doesn't seem like we're doing something new. Rather, it seems more like we're using existing tools as they're meant to be used, and the only contribution is a careful empirical evaluation.

Regarding the plot, the relationship is not 1-to-1, although there is a strong correlation.
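One way to make the "strong correlation but not 1-to-1" observation precise would be a summary like the sketch below. This assumes per-run error arrays (e.g., RMSE of the outcome predictions and RMSE of the effect estimates for each dataset/seed pair in the scatter plot); the function name and inputs are hypothetical, not taken from the authors' code.

```python
import numpy as np
from scipy.stats import pearsonr

def summarize_error_relationship(pred_rmse, cate_rmse):
    """Relate prediction error to causal effect error across runs.

    pred_rmse, cate_rmse: 1-D arrays of per-(dataset, seed) errors.
    """
    r, _ = pearsonr(pred_rmse, cate_rmse)                   # correlation strength
    slope, intercept = np.polyfit(pred_rmse, cate_rmse, 1)  # deviation from 1-to-1
    return {"pearson_r": r, "slope": slope, "intercept": intercept}
```

A slope far from 1 (or a nonzero intercept) would quantify the "not 1-to-1" part while the Pearson r captures the correlation itself.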