## 11/22

### Summary

- I'm still trying to figure out a good motivating example
- I'm confused about how to use the "residual on residual" estimator in the partially linear case with different treatment effects by group
    - Naive application seems wrong
- I could instead use the AIPW estimator, which is what the paper uses in the interactive case, but this has two drawbacks:
    - Does not handle continuous treatments
    - Requires jointly fitting $(\hat{\theta}_0, \hat{\theta}_1, \hat{u}(X))$ in the model $Y=\hat{\theta}_0TG_0 + \hat{\theta}_1TG_1 + \hat{u}(X)$ (i.e., instead of just fitting $Y=\hat{l}(X)$ and doing the "residual on residual" trick)

### Updated fake abstract

:::info
There are a host of representation learning techniques, architecture designs, and regularization techniques designed to help (?). Unfortunately, as in prediction tasks, we show there can be substantial variation in the quality of estimation for subgroups that are not well-represented in the training data. In this paper, we evaluate how the three axes listed above impact three performance metrics across subgroups in the data: test performance, effect estimation, and uncertainty quantification.
:::

Still a work in progress.

### Reminder: using groups in model

- Currently I'm assuming the group labels are used by the model.
- Alternatively, I could compute $\text{CATE}(X)$ without group information and then study how the test accuracy varies across $X$'s from different groups.
    - Note: This approach does not make sense in the case of the partially linear model $Y=\theta T + u(X)$, since the CATE is constant: $\text{CATE}(X)=\text{ATE}=\theta$.

### Motivating examples

| Application | $Y$ | $X$ | $T$ | $G$ |
|:-------------------:|:-----------------------:|:------------------------:|:------------------:|:----------:|
| Facial recognition | Hair color | Image of person | ? | Sex |
| Bird classifier | Type of bird | Image of bird | ? | Background |
| Bangladesh data | Neurodevelopment scores | Demographics | Chemical mixture | ? |
| Spatial env. health | Health outcome | Geography, demographics? | Chemical mixture | ? |
| NHANES | Triglycerides | Demographics | Volatile compounds | ? |
| IHDP | Cognitive scores | Demographics | Specialist visit | ? |
| NEWS | Reader opinion | Words | Device type (and reading time) | ? |

### Additive models

*Without groups:*

$$
Y = v(T) + u(X)
$$

In the partially linear case, $v(T) = \theta T$.

*With two groups:*

$$
Y = (v_0(T) + u_0(X)) G_0 + (v_1(T) + u_1(X)) G_1
$$

where $G_0$ and $G_1$ are group indicators. There are four cases, depending on whether the two functions differ by group:

| | **$u_0(X)=u_1(X)$** | **$u_0(X)\neq u_1(X)$** |
|:----------------------- | -----------------------------------:|:---------------------------------------------------:|
| **$v_0(T)=v_1(T)$** | $Y = v(T) + u(X)$ | $Y = v(T) + u_0(X) G_0 + u_1(X)G_1$ |
| **$v_0(T)\neq v_1(T)$** | $Y = v_0(T) G_0 + v_1(T)G_1 + u(X)$ | $Y = (v_0(T) + u_0(X)) G_0 + (v_1(T) + u_1(X)) G_1$ |

The top-left is the standard double ML case. I'm trying to figure out the two off-diagonal cases (which will give me the lower-right). Below is a data-generating sketch for the lower-right case; after that, the general algorithm.
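So I have something concrete to run estimators on later, here's a minimal simulation sketch of the lower-right case (group-specific $v_g$ and $u_g$, with partially linear $v_g(T)=\theta_g T$). All functional forms, constants, and the group imbalance are placeholders I made up, not anything from the papers:

```python
import numpy as np

def simulate(n=2000, p_group1=0.2, theta=(1.0, 3.0), seed=0):
    """Simulate the lower-right case Y = v_g(T) + u_g(X) for group g,
    with v_g(T) = theta_g * T and group-specific placeholder u_g."""
    rng = np.random.default_rng(seed)
    G = rng.binomial(1, p_group1, size=n)           # group 1 is the smaller group
    X = rng.normal(size=(n, 5)) + 0.5 * G[:, None]  # covariate shift by group
    T = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)  # T = m(X) + noise
    u0 = np.sin(X[:, 0]) + X[:, 1] ** 2             # placeholder u_0(X)
    u1 = np.cos(X[:, 0]) - X[:, 2]                  # placeholder u_1(X)
    Y = np.where(G == 0, theta[0] * T + u0, theta[1] * T + u1)
    Y = Y + rng.normal(scale=0.1, size=n)
    return X, T, Y, G
```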
#### Nonlinear "residual on residual" estimator

Suppose you want to estimate $\theta\in \mathbb{R}^{d_T}$:

$$
Y = T \theta + u(X)
$$

Analogous to the Frisch–Waugh–Lovell theorem for linear regression, the procedure of [Robinson 1988](https://www.jstor.org/stable/1912705) is to:

- Fit $\hat{Y}=\hat{l}(X)$, call the residuals $\hat{R}_Y \in \mathbb{R}^n$
- Fit $\hat{T}=\hat{m}(X)$, call the residuals $\hat{R}_T \in \mathbb{R}^{n\times d_T}$
- Linearly regress $\hat{R}_Y$ on $\hat{R}_T$ to get an estimate of $\theta$:

$$
\hat{\theta} = (\hat{R}_T^\top \hat{R}_T)^{-1}\hat{R}_T^\top\hat{R}_Y
$$

(A cross-fitted sketch of this procedure is at the end of this entry.)

*Note: The Double ML paper also uses a different estimator. I'll write it here but otherwise I am not using it in this post:*

$$
\hat{\theta} = (\hat{R}_T^\top T)^{-1}\hat{R}_T^\top(Y-\hat{u}(X))
$$

#### Different treatment effects by group ($v_0(T) \neq v_1(T)$)

I'll assume the following partially linear model:

$$
Y = \theta_0 TG_0 + \theta_1 TG_1 + u(X) \\
T = m(X)
$$

where $T\in\mathbb{R}^{n\times 1}$. Rewrite the $Y$ model as:

$$
Y = \tilde{T}\theta + u(X)
$$

where $\tilde{T} = [TG_0\;\;TG_1]\in\mathbb{R}^{n\times2}$. What's the right way to apply the "residual on residual" algorithm to estimate $\theta\in\mathbb{R}^2$?

- Fit $\hat{Y}=\hat{l}(X)$, call the residuals $\hat{R}_Y \in \mathbb{R}^n$
- Fit $\hat{\tilde{T}}=\hat{m}(X)$, call the residuals $\hat{R}_{\tilde{T}} \in \mathbb{R}^{n\times 2}$
- Linearly regress $\hat{R}_Y$ on $\hat{R}_{\tilde{T}}$ to get an estimate of $\theta$:

$$
\hat{\theta} = (\hat{R}_{\tilde{T}}^\top \hat{R}_{\tilde{T}})^{-1}\hat{R}_{\tilde{T}}^\top\hat{R}_Y
$$

But the second step seems weird: you're predicting a sparse response like

$$
\tilde{T}:=[TG_0\;\;TG_1]=
\begin{bmatrix}
t_0 & 0\\
0 & t_1\\
0 & t_2\\
t_3 & 0\\
0 & t_4\\
\end{bmatrix}
$$

This is what you get by mechanically applying the "residual on residual" approach, but it doesn't seem to agree with the model structure, which assumes $T=m(X)$, not $\tilde{T}=m(X)$.

**So what's the right answer?** Here's what I think the answer should be:

- Fit $\hat{Y}=\hat{l}(X)$, call the residuals $\hat{R}_Y \in \mathbb{R}^n$
- Fit $\hat{T}=\hat{m}(X)$, define the residuals $\hat{R}'_{\tilde{T}} \in \mathbb{R}^{n\times 2}$ as

$$
\hat{R}'_{\tilde{T}} := \tilde{T} - \hat{T} = [TG_0-\hat{T}\;\;TG_1-\hat{T}] =
\begin{bmatrix}
t_0 - \hat{t}_0 & -\hat{t}_0\\
-\hat{t}_1 & t_1 - \hat{t}_1\\
-\hat{t}_2 & t_2 - \hat{t}_2\\
t_3 - \hat{t}_3 & -\hat{t}_3\\
-\hat{t}_4 & t_4 - \hat{t}_4\\
\end{bmatrix}
$$

- Linearly regress $\hat{R}_Y$ on $\hat{R}'_{\tilde{T}}$ to get an estimate of $\theta$:

$$
\hat{\theta} = (\hat{R}_{\tilde{T}}'^\top \hat{R}_{\tilde{T}}')^{-1}\hat{R}_{\tilde{T}}'^\top\hat{R}_Y
$$

---

*Notice this is different from before:*

$$
\hat{R}_{\tilde{T}} := \tilde{T} - \hat{\tilde{T}} = [TG_0-\hat{T}_0\;\;TG_1-\hat{T}_1]
$$

*where $\hat{T}_0$ is the prediction of $TG_0$ and $\hat{T}_1$ is the prediction of $TG_1$. Which one is correct? (The second sketch at the end of this entry compares the two numerically on simulated data.)*

---

#### Different outcome functions by group ($u_0(X) \neq u_1(X)$)

Same as above except for the $Y$ model:

$$
Y = v(T) + u_0(X)G_0 + u_1(X)G_1
$$

Inference is:

- Predict $Y=\hat{u}_0(X)G_0 + \hat{u}_1(X)G_1$, call the residuals $\hat{R}_Y \in \mathbb{R}^n$
    - Is this the same as fitting $Y=\hat{u}_0(X)$ on the observations where $G_0=1$ (and similarly for $\hat{u}_1$)?
- Predict $T$ from $X$, call the residuals $\hat{R}_T \in \mathbb{R}^{n}$?
- Linearly regress $\hat{R}_Y$ on $\hat{R}_T$ to get an estimate of $\theta$:

$$
\hat{\theta} = (\hat{R}_T^\top \hat{R}_T)^{-1}\hat{R}_T^\top\hat{R}_Y
$$
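As promised above, here's a minimal sketch of the Robinson-style "residual on residual" procedure with Double ML-style two-fold cross-fitting, for the no-groups model $Y = \theta T + u(X)$ with scalar $T$. The random-forest nuisance models are arbitrary placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def residual_on_residual(X, T, Y, n_splits=2, seed=0):
    """Cross-fitted Robinson-style estimator for theta in Y = theta*T + u(X)."""
    R_Y, R_T = np.zeros_like(Y), np.zeros_like(T)
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        l_hat = RandomForestRegressor(random_state=seed).fit(X[train], Y[train])
        m_hat = RandomForestRegressor(random_state=seed).fit(X[train], T[train])
        R_Y[test] = Y[test] - l_hat.predict(X[test])  # outcome residuals
        R_T[test] = T[test] - m_hat.predict(X[test])  # treatment residuals
    # "residual on residual": OLS slope of R_Y on R_T (a ratio for scalar T)
    return (R_T @ R_Y) / (R_T @ R_T)
```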
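And here's a sketch of the two candidate residualizations for $\tilde{T}$, so the "Which one is correct?" question can at least be probed numerically on simulated data. Cross-fitting is omitted for brevity, and the nuisance models are again placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def group_theta(X, T, Y, G, per_column=True, seed=0):
    """Estimate (theta_0, theta_1) in Y = theta_0*T*G_0 + theta_1*T*G_1 + u(X).

    per_column=True : R_Ttilde = Ttilde - m_hat(X), where m_hat predicts each
                      sparse column of Ttilde = [T*G_0, T*G_1] separately.
    per_column=False: R'_Ttilde = Ttilde - T_hat, subtracting the single
                      prediction of T from both columns.
    """
    G0, G1 = (G == 0).astype(float), (G == 1).astype(float)
    Ttilde = np.column_stack([T * G0, T * G1])
    R_Y = Y - RandomForestRegressor(random_state=seed).fit(X, Y).predict(X)
    if per_column:
        preds = np.column_stack([
            RandomForestRegressor(random_state=seed).fit(X, col).predict(X)
            for col in Ttilde.T])            # predict T*G_0 and T*G_1 directly
    else:
        t_hat = RandomForestRegressor(random_state=seed).fit(X, T).predict(X)
        preds = np.column_stack([t_hat, t_hat])  # same T prediction, both columns
    R = Ttilde - preds
    return np.linalg.solve(R.T @ R, R.T @ R_Y)   # (theta_0_hat, theta_1_hat)
```

Running both variants on the simulated data above and comparing against the true $(\theta_0, \theta_1)$ should at least show whether the two disagree in practice.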
## 11/11

### Fake abstract

:::info
In recent years, machine learning methods have been used to estimate complex, heterogeneous causal effects. Naturally, a host of regularization techniques have been designed to control the flexibility of such models. Unfortunately, as in prediction tasks, we show there can be substantial variation in the quality of estimation for subgroups that are not well-represented in the training data. In this paper, we evaluate two solutions in terms of accuracy and uncertainty quantification: cross-fitting and learning a shared representation across subgroups in the data.
:::

### Notation

- $G$: group indicator
- $X$: covariates (could be confounders)
- $T$: treatment
- $Y$: outcome
- $Y^t$: potential outcome for treatment $T=t$

### Decisions to be made

#### Are the group assignments used in the model?

- Yes, i.e., we can treat the group assignment $G$ like any other observed covariate :heavy_check_mark:
- No, we only use the covariates $X$ and then study the behavior on groups after inference (in which case the groups are effectively subsets of the covariates)

#### What quantity do we want to compute?

If the groups are part of the model, we could estimate either of the following:

- $\text{CATE}_t(g) = \mathbb{E}[Y^{t} - Y^{t_0} \mid G=g]$ :heavy_check_mark:
- $\text{CATE}_t(g,x) = \mathbb{E}[Y^{t} - Y^{t_0} \mid G=g, X=x]$

where $t_0$ is a baseline treatment. Notice $\text{CATE}_t(g) = \mathbb{E}_{X \mid G=g}[\text{CATE}_t(g,X)]$.

If the groups are not part of the model, we would estimate $\text{CATE}_t(x) = \mathbb{E}[Y^{t} - Y^{t_0} \mid X=x]$ and then compare its performance on different $x$'s.

#### How do the groups impact the covariates?

I would think the distribution of covariates differs by group.

#### How do the groups impact the treatment?

That is, which of the following is the model for $T$:

- $T = m(X)$
- $T = m(X, G)$
- $T = m(X) + r(G)$

#### How do the groups impact the outcome?

One extreme is fully interactive:

- $Y = f(T,G,X)$

Another extreme is additive:

- $Y = f(T) + s(X) + v(G)$

There are many combinations in between.

The last three questions can be rephrased as: Do the groups $G$ impact the <span style="color:#ff7f0e">covariates $X$</span>, <span style="color:#1f77b4">treatment $T$</span>, or <span style="color:#d62728">outcome $Y$</span>, and for the last two, do the groups interact with the covariates or is it an additive effect?

![](https://i.imgur.com/efnReLP.png)

### Simple example: partially linear model with two groups

$$
Y = \theta_0TG_0 + \theta_1TG_1 + f(X)\\
T = m(X)
$$

where $\theta_0\neq \theta_1$ are the treatment effects that differ by group. WLOG, assume group 1 is the smaller group. I'm assuming the groups have different covariate distributions. This corresponds to the general outcome model $Y = s(G,T) + f(X)$.

The idea is that the $f$ or $m$ model could overfit for group 1, and so the estimate of $\theta_1$ would be biased. But I'm confused about how to do this... The following doesn't make much sense...

Notice we can write:

$$
Y = Z \theta + f(X) \\
T = m(X)
$$

where $Z = [TG_0\;\; TG_1]$ and $\theta\in\mathbb{R}^2$.
Then to do double ML:

- Predict $T$ with $\hat{m}(X)$ (or is it predict $Z$ with $\hat{m}(X)$?)
- Predict $Y$ with $\hat{f}(X)$
- Regress $(Y-\hat{f}(X))$ on $(T-\hat{m}(X))$ (or is it on $(Z-\hat{m}(X))$?)

Are the moment conditions:

$$
\mathbb{E}\big[\big((Y-\hat{f}(X)) - (Z-\hat{m}(X))\theta\big)\big(Z-\hat{m}(X)\big)\big] = \mathbf{0}
$$

where $\hat{m}(X)$ predicts $Z$, or

$$
\mathbb{E}\big[\big((Y-\hat{f}(X)) - (Z-\hat{m}(X))\theta\big)\big(T-\hat{m}(X)\big)\big] = \mathbf{0}
$$

where $\hat{m}(X)$ predicts $T$?

### What's the right shared architecture?

![](https://i.imgur.com/TdGH3zJ.png)
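For reference, here's my current guess at one candidate for the shared architecture: a shared representation of $X$ across groups, with group-specific heads producing $u_g(X)$ and group-specific scalar effects $\theta_g$ for the partially linear case. This is a sketch of one possible layout, not a claim about the figure above:

```python
import torch
import torch.nn as nn

class SharedTrunkCATE(nn.Module):
    """Shared representation h = phi(X); per-group nuisance heads u_g(X) and
    per-group scalar effects theta_g, combined as Y_hat = theta_g*T + u_g(X)."""

    def __init__(self, d_x, d_hidden=64, n_groups=2):
        super().__init__()
        self.trunk = nn.Sequential(              # shared across groups
            nn.Linear(d_x, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU())
        self.u_heads = nn.ModuleList(            # group-specific u_g(X)
            nn.Linear(d_hidden, 1) for _ in range(n_groups))
        self.theta = nn.Parameter(torch.zeros(n_groups))  # group-specific theta_g

    def forward(self, X, T, G):
        # X: (n, d_x) float, T: (n,) float, G: (n,) long group labels
        h = self.trunk(X)
        u = torch.stack([head(h).squeeze(-1) for head in self.u_heads], dim=1)
        u_g = u.gather(1, G.unsqueeze(1)).squeeze(1)  # pick each row's group head
        return self.theta[G] * T + u_g
```

The design choice this encodes: the smaller group borrows statistical strength through the shared trunk, while the head and $\theta_g$ stay group-specific.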