# Group CATE error project proposal

In this project we investigate whether the *conditional average treatment effect* (CATE) can be poorly estimated for some groups of inputs (i.e., some $x$'s) that are underrepresented in the data, even if the CATE is estimated well on average. This is analogous to a well-studied problem in ML: flexible methods can *predict* poorly for some groups even if they predict well on average. The difference is that we study the quality of CATEs rather than the quality of predictions. However, as ML methods are increasingly used in CATE estimation, it seems likely that the same issues (and possibly solutions) will arise.

So, I will refer to "performing poorly on groups while performing well on average" as ***"the problem"***, and I argue it exists in both prediction and CATE estimation tasks. I aim to analyze the problem in the CATE case in a few steps, ordered by highest priority.

(Just to be clear, $\text{CATE}(x)$ is the conditional (on $x$) average treatment effect. It is sometimes called a heterogeneous effect.)

## Step 1: Existing methods

I do not believe there are existing methods that explicitly try to minimize CATE error across groups **(please let me know if I'm missing any)**. Of course, methods exist to minimize prediction error across groups, and estimating the CATE involves estimating a prediction function, but to my knowledge no one has proposed using such methods as part of CATE estimation. I plan to try them as "new methods" in Step 2. So the "existing methods" are just standard CATE estimation methods.

### Step 1(a): Empirical analysis of existing methods

Although the problem is well-studied in the prediction case, to my knowledge it is not well-studied in the CATE case. So, although I don't think it would surprise anyone that the problem can exist in the CATE case, I would need to show it in a paper.
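To make this concrete, here is a minimal sketch of the kind of demonstration I have in mind. Everything here is invented for illustration (the data-generating process, the group sizes, the choice of random forests): simulate data where the true CATE is known and differs for an underrepresented group, fit a standard T-learner, and compare CATE error overall vs. within each group. One would expect the underrepresented group's error to be larger, though whether it is depends on the simulation design.

```python
# Illustrative sketch only: synthetic data with an underrepresented group
# (g = 1), a randomized binary treatment, and a known true CATE, so that
# per-group CATE error can be computed exactly.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_big, n_small = 2000, 100                    # group 1 is underrepresented
g = np.repeat([0, 1], [n_big, n_small])
x = rng.normal(size=(n_big + n_small, 5))
t = rng.integers(0, 2, size=n_big + n_small)  # randomized binary treatment

# The true CATE differs by group, so estimating group 1's effect poorly
# is costly; both groups' CATEs also vary with x[:, 0].
true_cate = np.where(g == 0, 1.0, 3.0) + x[:, 0]
y = x[:, 1] + t * true_cate + rng.normal(scale=0.5, size=g.shape)

X = np.column_stack([x, g])                   # group is an observed feature

# T-learner: one outcome regression per treatment arm.
m1 = RandomForestRegressor(random_state=0).fit(X[t == 1], y[t == 1])
m0 = RandomForestRegressor(random_state=0).fit(X[t == 0], y[t == 0])
cate_hat = m1.predict(X) - m0.predict(X)

sq_err = (cate_hat - true_cate) ** 2
print("overall CATE MSE:", sq_err.mean())
print("group 0 CATE MSE:", sq_err[g == 0].mean())
print("group 1 CATE MSE:", sq_err[g == 1].mean())
```

In a real version of this experiment, the error would be evaluated on held-out data rather than in-sample, and the simulation design would need to survive the "only exists in your simulation" critique raised under Challenge 2 below.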
For datasets, ideally I will use high-dimensional datasets that require nonlinear modeling, although most of the datasets I can find have only 20-30 features (see challenges below). For now I am assuming a binary treatment, but this isn't necessary (although it is necessary for many of the methods).

For methods, I will use the suite of CATE "[metalearners](https://arxiv.org/pdf/2009.06472.pdf)" (e.g., T-learners, S-learners) with GPs and neural nets as the underlying models. This includes multi-headed neural network models like TARNet and Dragonnet. I plan to pay particular attention to the role of *cross-fitting* (i.e., estimating prediction functions on part of the data and then estimating the CATE on the remaining data). My understanding is that cross-fitting can mitigate the impact of overfitting in the prediction function, which could be especially impactful for underrepresented groups.

There are a couple of challenges in the empirical analysis:

- *Challenge 1*: In the prediction case, the problem is typically demonstrated on high-dimensional image data. Unfortunately, there are not many applications where estimating a treatment effect conditional on an image makes sense. So, I think we'll need to use tabular datasets, which are often lower-dimensional and may not exhibit the problem as strongly. Here's more on images since it's come up in meetings [skip this if you think tabular data is fine]:
  - For example, "Waterbirds" is a popular dataset in the prediction case. The idea is that there are few examples in the dataset of waterbirds on land and landbirds on water, so a bird-type classification model (waterbird vs. landbird) will not perform well on these examples. What's the analogous application in the CATE case (i.e., where you estimate $\text{CATE}(x)$ where $x$ is an image)? We need a treatment and outcome such that the image of the bird could plausibly modify the effect of the treatment on the outcome *while also not containing the treatment and/or outcome*.
  - For example, if you studied the causal effect of the color of the bird (treatment) on the type of the bird (outcome), then conditioning on the entire image implies also conditioning on the treatment and outcome, since they are observable from the image. So the problem doesn't make sense.
  - I have only found one paper that studies a CATE conditional on an image. In their case, the treatment is an economic development program, the outcome is an economic indicator, and the image is a satellite photo of the area where the treatment is applied. The idea is that there might be information in the image (e.g., existing infrastructure) that could impact the effect of the treatment. The key is that the image is taken pre-treatment, so you can condition on it without conditioning on the treatment or outcome.
  - Perhaps you could condition on, e.g., a chest X-ray taken pre-treatment? But I have not actually seen someone do this, and the issue (of course) is we want to critique existing applications, not new ones we invented. **Are there any papers that do something like this?**
- *Challenge 2*: In the prediction case, we can use held-out test data to assess the quality of the predictions. In the CATE case, we don't know the ground truth, so we need to simulate outcomes. This is a challenge in any causal inference paper, but it seems especially challenging here: since we're trying to show a problem exists in the first place, a critic might argue the problem only exists in our simulated dataset and not on real datasets. I have ideas for simulated datasets, but I don't know if they are any good.

### Step 1(b): Theoretical analysis of existing methods

For theory, I am thinking along the lines of [Rolf 2020](http://proceedings.mlr.press/v139/rolf21a/rolf21a.pdf) but adapted to the CATE case. They modeled the generalization error of predictions as a power law depending on the number of observations in total and in the underrepresented group.
They also studied the role of the weight on observations in and outside of the group (as many solutions to the problem involve re-weighting the observations in the loss function). There are some differences with Rolf 2020:

- *Different quantities of interest*: This should go without saying, but we are estimating the generalization of $\text{CATE}(x)$ rather than $f(x)$.
- *Different estimation techniques*: In Rolf 2020's case, the parameters are estimated with MLE. In our case, the parameters might be estimated with a causal technique, e.g., double ML.

Rolf 2020 included an example of a linear model with group-specific intercepts, where the generalization error can be derived in closed form. I can think of extensions to this example (which could be made in the prediction case, but I'll make them in the CATE case):

- *Replacing the features $x$ by a nonlinear basis $\phi(x)$.* Results in random matrix theory could provide closed-form solutions.
- *Including interactions between the treatment and features.* Without this, the CATE is just an ATE that differs by group.

**Do you think this is the right thing to study theoretically, something like $\text{Error}(\text{CATE}_{\text{group }g}) \propto a n^{-q} + b n_g^{-p}$?** (Here $n$ and $n_g$ are the total and within-group numbers of observations, and $(a, b, p, q)$ are constants.)

## Step 2: New methods

### Step 2(a): Empirical analysis of new methods

Here are a few categories of new methods I could propose and empirically analyze using the same framework as in Step 1(a):

- *Re-weighting observations*: In the prediction case, it seems the most popular approaches involve assigning higher weight in the loss function to observations from the underrepresented group. This includes [group DRO](https://arxiv.org/pdf/1911.08731.pdf), which defines the loss function as the worst-case loss over the groups.
  **Can you just apply the same group robustness approaches in the prediction component of CATE (i.e., estimating $f(t, x) = E[Y \mid T=t, X=x]$, where $t$ is the treatment), or is there something different about the CATE case? If you don't know, is this a good research question?**
- *Sharing information between groups*: I would think a classical approach is to use a random effects model, where the group-specific effects are assumed to come from the same distribution, thus allowing the underrepresented group's effect to differ from the larger group's while also borrowing information from it. **I'm familiar with approaches like this for random slopes or intercepts, but what's the analog for the more flexible ML methods we study in this project (e.g., neural nets)? To me, a multi-headed neural net (one head per group) seems like a natural approach, but has anyone actually done this in the prediction case? If not, why? If it doesn't work in the prediction case, I'm not sure why it would work in the CATE case. Basically, is a multi-headed net like this worth studying?**
- *Cross-fitting*: This approach is known to mitigate overfitting when ML methods are used for causal effect estimation. Could it be especially impactful when there are few observations for some groups, so the model might overfit on these observations?

### Step 2(b): Theoretical analysis of new methods

This is just the analysis in Step 1(b) but applied to the new methods discussed in Step 2(a).
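As a starting point for the re-weighting category, here is a rough numpy sketch of what a group-DRO-style procedure might look like when applied to the outcome-regression component $f(t, x)$ of CATE estimation. The toy data, the linear outcome model, and the step sizes are all my own choices for illustration; the update rule (multiplicatively up-weighting the worst-off group while the model takes weighted gradient steps) mirrors the online algorithm in the group DRO paper, but this is a sketch, not their implementation.

```python
# Hypothetical sketch: group-DRO-style reweighting for the outcome model
# f(t, x) used inside CATE estimation. Linear model for concreteness.
import numpy as np

rng = np.random.default_rng(1)
n0, n1 = 1000, 50                            # group 1 underrepresented
g = np.repeat([0, 1], [n0, n1])
x = rng.normal(size=(n0 + n1, 3))
t = rng.integers(0, 2, size=n0 + n1)
y = (x @ np.array([1.0, -1.0, 0.5]) + t * (1.0 + 2.0 * g)
     + rng.normal(scale=0.3, size=g.shape))

# Outcome-model features: intercept, x, t, and a t-by-group interaction so
# the model can represent group-specific treatment effects.
Z = np.column_stack([np.ones_like(g), x, t, t * g])

w = np.zeros(Z.shape[1])                     # model parameters
q = np.array([0.5, 0.5])                     # group weights on the simplex
eta_w, eta_q = 0.05, 0.1                     # step sizes (chosen by hand)
sizes = np.array([n0, n1])

for _ in range(2000):
    resid = Z @ w - y
    L = np.array([(resid[g == k] ** 2).mean() for k in (0, 1)])
    q = q * np.exp(eta_q * L)                # up-weight the worse-off group
    q = q / q.sum()
    unit_w = q[g] / sizes[g]                 # per-unit weights sum to 1
    w = w - eta_w * 2.0 * Z.T @ (unit_w * resid)

print("per-group MSE after training:", L)
```

A CATE estimate would then be read off the fitted outcome model as $\hat{f}(1, x) - \hat{f}(0, x)$; whether group-robust training of $f$ actually translates into group-robust CATE estimates is exactly the open question above. Cross-fitting could be layered on top by running this fit on one fold and forming CATE estimates on the other.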