# Double ML
## and an application for causal feature selection in time series data
Emmanouil Angelis
---
## A motivating example
Consider multiple linear regression $Y= \alpha_1X_1 + \alpha_2X_2 + \epsilon$
How should we interpret the coefficients?
----
| | | | | | | | | |
| -| -| -| -| - | - |-|-| -|
| X1 | 0|1| 2 |3 |0 |1 |2 |3 |
| X2 | −1| 0| 1 |2 |1 |2 |3 |4 |
| Y | 1| 2 |3 |4 |−1 |0 |1 |2 |

----
| | | | | | | | | |
| -| -| -| -| - | - |-|-| -|
| $X_1$ | 0|1| 2 |3 |0 |1 |2 |3 |
| $X_2$ | −1| 0| 1 |2 |1 |2 |3 |4 |
| Y | 1| 2 |3 |4 |−1 |0 |1 |2 |
The multiple regression OLS solution fits the data points exactly:
$Y_i = 2X_{1,i} - X_{2,i} \quad \forall i \quad (\hat{\sigma} = 0)$
The single regression OLS solution (regressing $Y$ on $X_2$ alone) yields
$Y = \frac{1}{9} X_2 + \frac{4}{3} \quad (\hat{\sigma} = 1.72)$
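A quick numerical check of both fits (a minimal NumPy sketch; the arrays are just the table above):

```python
import numpy as np

X1 = np.array([0, 1, 2, 3, 0, 1, 2, 3], dtype=float)
X2 = np.array([-1, 0, 1, 2, 1, 2, 3, 4], dtype=float)
Y  = np.array([1, 2, 3, 4, -1, 0, 1, 2], dtype=float)

# Multiple regression (no intercept): Y ~ X1 + X2
coef, *_ = np.linalg.lstsq(np.column_stack([X1, X2]), Y, rcond=None)
print(coef)        # [ 2. -1.]  ->  Y = 2*X1 - X2, zero residual

# Single regression (with intercept): Y ~ X2
b, *_ = np.linalg.lstsq(np.column_stack([X2, np.ones_like(X2)]), Y, rcond=None)
print(b)           # [0.111..., 1.333...]  ->  Y ≈ X2/9 + 4/3
```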
----
### Multiple regression: $Y = 2X_1 - X_2$
* describes how $Y$ changes when varying one predictor while keeping the other **fixed**
* $Y$ decreases when $X_2$ increases (holding $X_1$ fixed).
### Single regression: $Y = \frac{1}{9} X_2 + \frac{4}{3}$
* describes how $Y$ changes when varying one predictor while **ignoring** the other
* $Y$ increases when $X_2$ increases.
---
## How should we interpret coefficients?
Coefficient $\alpha_2$ quantifies the influence of $X_2$ on $Y$ after having subtracted the effect of $X_1$ on $Y$
* Predict $Y$ using only $X_1$ in an optimal way: $Y \approx \beta_1X_1$
* Now use $X_2$ (and $X_1$) to predict the residual $Y-\beta_1X_1$ in an optimal way: $Y-\beta_1X_1 \approx \alpha_2X_2^{'}$, where $X_2^{'}$ is $X_2$ with its $X_1$-component removed (made precise below)
----
## Coefficient Interpretation

Residual $Y-\beta_1X_1$ (orange)
is predicted by $X_2^{'}$ (green)
and we get $\alpha_2$
----
### Orthogonalization (QR Decomposition)
Coefficient $\alpha_2$ quantifies the influence of $X_2$ on $Y$ after having subtracted the effect of $X_1$ on $Y$
* Predict $Y$ using only $X_1$ in an optimal way: $Y \approx \pi_{X_1}(Y)$, where $\pi_{X_1}$ denotes orthogonal projection onto $X_1$
* Now use $X_2 - \pi_{X_1}(X_2)$ to predict the residual $Y- \pi_{X_1}(Y)$ in an optimal way: $Y- \pi_{X_1}(Y) \approx \alpha_2(X_2 - \pi_{X_1}(X_2))$
And we get $\alpha_2$
Note: $X_2 - \pi_{X_1}(X_2)$ is also a residual
---
## In summary: $Y= \alpha_1X_1 + \alpha_2X_2$
To infer coefficients we can use:
* Classic multiple regression
* Orthogonalization: use one residual to predict the other
They are equivalent (the Frisch–Waugh–Lovell theorem)
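A minimal NumPy check of this equivalence on the data above (the helper `proj` is just OLS projection onto $X_1$ without an intercept):

```python
import numpy as np

X1 = np.array([0, 1, 2, 3, 0, 1, 2, 3], dtype=float)
X2 = np.array([-1, 0, 1, 2, 1, 2, 3, 4], dtype=float)
Y  = np.array([1, 2, 3, 4, -1, 0, 1, 2], dtype=float)

def proj(target, onto):
    """Orthogonal projection of `target` onto span{onto}."""
    return onto * (onto @ target) / (onto @ onto)

# Route 1: coefficient of X2 in the multiple regression
alpha_multiple = np.linalg.lstsq(np.column_stack([X1, X2]), Y, rcond=None)[0][1]

# Route 2: regress the residual of Y on the residual of X2 (orthogonalization)
Y_res  = Y  - proj(Y,  X1)
X2_res = X2 - proj(X2, X1)
alpha_orth = (X2_res @ Y_res) / (X2_res @ X2_res)

print(alpha_multiple, alpha_orth)   # both equal -1.0
```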
---
## Double/Orthogonal ML

Consider the following partially linear regression (PLR) setting:
$Y =D\theta_0 +g_0(X)+U, \quad \mathbb{E}[U |X,D]=0$
<!---$D=m_0(X)+V, \quad \mathbb{E}[V |X]=0$-->
Want to infer $\theta_0$
----
### Straightforward way
$Y =D\theta_0 +g_0(X)+U, \quad \mathbb{E}[U |X,D]=0$
$\theta_0$: target parameter, $g_0$: nuisance parameter
The “naive”, prediction-based ML approach is bad:
* Predict $Y$ using $D$ and $X$ to obtain $D \hat{\theta_0} + \hat{g_0}(X)$
* For example, estimate by alternating minimization: given initial guesses, run a Random Forest of $Y - D \hat{\theta_0}$ on $X$ to fit $\hat{g_0}(X)$, then Ordinary Least Squares of $Y-\hat{g_0}(X)$ on $D$ to fit $\hat{\theta_0}$; repeat until convergence (a sketch follows)
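A rough sketch of this alternating scheme on simulated data (the data-generating process, learners, and all names below are my own illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
D = np.sin(X[:, 0]) + rng.normal(size=n)       # treatment depends on X
g0 = np.cos(X[:, 1]) + X[:, 2] ** 2            # nuisance g_0(X)
theta0 = 0.5
Y = theta0 * D + g0 + rng.normal(size=n)

theta_hat = 0.0                                # initial guess
for _ in range(10):
    # Random Forest of Y - D*theta_hat on X to fit g_hat
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    g_hat = rf.fit(X, Y - D * theta_hat).predict(X)
    # OLS of Y - g_hat on D to update theta_hat
    theta_hat = (D @ (Y - g_hat)) / (D @ D)

print(theta_hat)   # compare with theta0 = 0.5: typically visibly biased
```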
----
### “Naive” ML Approach is Bad
Excellent prediction performance! BUT the distribution of $\hat{\theta_0} - \theta_0$ looks like this:

----
### Orthogonalization way
$Y =D\theta_0 +g_0(X)+U, \quad \mathbb{E}[U |X,D]=0$
* Optimally predict (=project) $Y$ and $D$ using $X$ by $\hat{\mathbb{E}}[Y|X]$ and $\hat{\mathbb{E}}[D|X]$
* Regress $Y - \hat{\mathbb{E}}[Y|X]$ on $D - \hat{\mathbb{E}}[D|X]$
and infer $\theta_0$
As before, we regress one residual on the other
Original nuisance parameter: $g_0(X)$
Now there are two nuisances, $\hat{\mathbb{E}}[Y|X]$ and $\hat{\mathbb{E}}[D|X]$ (hence **double** ML); a sketch follows
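Continuing the simulated example above, a minimal sketch of the orthogonalization recipe (out-of-fold predictions via `cross_val_predict` are used here, anticipating the sample splitting discussed next):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# E_hat[Y|X] and E_hat[D|X], predicted out of fold to avoid overfitting
Y_hat = cross_val_predict(RandomForestRegressor(n_estimators=100, random_state=0), X, Y, cv=5)
D_hat = cross_val_predict(RandomForestRegressor(n_estimators=100, random_state=0), X, D, cv=5)

# regress one residual on the other
theta_hat = ((D - D_hat) @ (Y - Y_hat)) / ((D - D_hat) @ (D - D_hat))
print(theta_hat)   # should be much closer to theta0 = 0.5
```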
----
### Double/Orthogonal ML is Good
Now the distribution of $\hat{\theta_0} - \theta_0$ looks like this:

---
### Sample Splitting
Split the original sample S into <span><!-- .element: class="fragment highlight-red" -->S1</span> and <span><!-- .element: class="fragment highlight-blue" -->S2</span>
$Y =D\theta_0 +g_0(X)+U, \quad \mathbb{E}[U |X,D]=0$
* Optimally predict (=project) $Y$ and $D$ using $X$ by <span><!-- .element: class="fragment highlight-red" -->$\hat{\mathbb{E}}[Y|X]$ and $\hat{\mathbb{E}}[D|X]$, fitted on S1</span>
* Regress <span><!-- .element: class="fragment highlight-blue" -->$Y - \hat{\mathbb{E}}[Y|X]$ on $D - \hat{\mathbb{E}}[D|X]$, within S2</span>
and infer $\theta_0 \approx$ <span><!-- .element: class="fragment highlight-blue" -->$\hat{\theta_0}(S2)$</span>
----
### Sample Splitting
Split the original sample S into <span><!-- .element: class="fragment highlight-red" -->S1</span> and <span><!-- .element: class="fragment highlight-blue" -->S2</span>
and infer $\theta_0 \approx$ <span><!-- .element: class="fragment highlight-blue" -->$\hat{\theta_0}(S2)$</span>
Now interchange the roles of S1 and S2 and infer
$\theta_0 \approx$ <span><!-- .element: class="fragment highlight-red" -->$\hat{\theta_0}(S1)$</span>
Final estimate $\hat{\theta_0} = \frac{1}{2}$(<span><!-- .element: class="fragment highlight-red" -->$\hat{\theta_0}(S1)$</span>+<span><!-- .element: class="fragment highlight-blue" -->$\hat{\theta_0}(S2)$</span>)
----
### Cross Fitting
This leads to more accurate results but needs more work
In general, we can partition the sample $S$ into $K$ subsets $$S=\cup_{k=1}^KS_k$$ and estimate $$\hat{\theta_0} = \frac{1}{K}\sum_{k=1}^{K}\hat{\theta_0}(S_k),$$ where the nuisances used for $\hat{\theta_0}(S_k)$ are fitted on $S \setminus S_k$ (a sketch follows)
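A minimal sketch of this $K$-fold cross-fitting scheme for the PLR model (function name and learners are my choices; packages such as DoubleML provide full implementations):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_plr_cross_fitting(X, D, Y, K=5):
    """Cross-fitted partialling-out estimate of theta_0 (illustrative sketch)."""
    thetas = []
    for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
        # nuisance regressions fitted on S \ S_k ...
        rf_y = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train_idx], Y[train_idx])
        rf_d = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train_idx], D[train_idx])
        # ... residuals and theta_hat(S_k) computed on the held-out fold S_k
        y_res = Y[test_idx] - rf_y.predict(X[test_idx])
        d_res = D[test_idx] - rf_d.predict(X[test_idx])
        thetas.append((d_res @ y_res) / (d_res @ d_res))
    return np.mean(thetas)          # theta_hat = (1/K) * sum_k theta_hat(S_k)
```

On the simulated data above, `dml_plr_cross_fitting(X, D, Y)` should again return a value close to $\theta_0 = 0.5$.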
---
### Remarks
* Orthogonalization goes back to 1959 (Neyman)
* General formulation: distribution $P(W;\theta_0,\eta_0)$
* Want to infer the target parameter $\theta_0$
* But we should cope with the nuisance parameter $\eta_0$
* Naive way: estimate $\hat{\eta_0}$ in a separate sample
Now plug $\hat{\eta_0}$ into $P(W;\theta_0,\hat{\eta_0})$ and estimate $\hat{\theta_0}$
This is too sensitive to misspecification of $\hat{\eta_0}$
(for us: high bias)
---
### Math Formulation
We have a model $P(\,\cdot\,;\theta,\eta)$ for the data distribution
We observe data $W=(W_i)_{i=1}^N$ coming from $P(\,\cdot\,;\theta_0,\eta_0)$
Want to maximize $\log P(W;\theta,\eta)$ w.r.t. $\theta$
Take the derivative w.r.t. $\theta$ and set it to 0
Score function: $\psi(W,\theta,\eta):=\frac{d}{d\theta}\log P(W;\theta,\eta)$
Maximum likelihood: solve $\mathbb{E}_N[\psi(W,\theta,\eta_0)]=0$ w.r.t. $\theta$, where $\mathbb{E}_N$ denotes the empirical average over the $N$ observations
----
### Math Formulation example
$Y=\alpha_0 X + \epsilon$, $\epsilon \sim \mathcal{N}(0,\,\sigma^{2})$
* Regress $Y$ on $X$
* log-likelihood: $-\frac{1}{2\sigma^2}||y-\alpha x||^2 + C$
* score: $\mathbb{E}_N[\psi(W,\alpha)] \propto \langle x,\, y-\alpha x\rangle$
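Setting the empirical score to zero recovers the familiar OLS estimator:
$$\langle x,\, y-\hat{\alpha} x\rangle = 0 \;\Rightarrow\; \hat{\alpha} = \frac{\langle x, y\rangle}{||x||^2}$$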
----
$Y =D\theta_0 +g_0(X)+U, \quad \mathbb{E}[U |X,D]=0$
Naive way:
* estimate $\hat{g_0}$ in a separate sample
* Regress $Y - \hat{g_0}(X)$ on $D$ in another sample
Orthogonal way:
* estimate $\hat{\mathbb{E}}[Y|X]$ and $\hat{\mathbb{E}}[D|X]$ in separate sample
* Regress $Y - \hat{\mathbb{E}}[Y|X]$ on $D-\hat{\mathbb{E}}[D|X]$ in another sample
Log-likelihood and score as above: just substitute
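Written out (with $\ell(X) := \mathbb{E}[Y\mid X]$ and $m(X) := \mathbb{E}[D\mid X]$), the two per-observation scores read:
* Naive (plug-in): $\psi(W,\theta,\eta) = \big(Y - D\theta - g(X)\big)\,D$, with nuisance $\eta = g$
* Orthogonal (partialled-out): $\psi(W,\theta,\eta) = \big(Y - \ell(X) - \theta\,(D - m(X))\big)\big(D - m(X)\big)$, with nuisance $\eta = (\ell, m)$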
---
### General Procedure
Estimate $\hat{\eta_0}$ in a separate sample
Plug in: solve $\mathbb{E}_N[\psi(W,\theta,\hat{\eta_0})]=0$ wrt $\theta$
We want $\mathbb{E}[\psi(W,\theta,\eta)]$ to be insensitive to misspecification of $\eta$ around $\eta_0$, i.e.
$$\frac{d}{d\eta}\mathbb{E}[\psi(W,\theta_0,\eta)]\Big|_{\eta=\eta_0} =0 $$
We achieve this with the orthogonalized approach but not with the naive one
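Indeed, perturbing the orthogonal score's nuisances in any direction $(\delta_\ell, \delta_m)$ gives, at the true $(\theta_0, \eta_0)$,
$$\frac{d}{dt}\,\mathbb{E}\Big[\psi\big(W,\theta_0,\eta_0 + t(\delta_\ell,\delta_m)\big)\Big]\Big|_{t=0} = \mathbb{E}\Big[\big(\theta_0\,\delta_m(X) - \delta_\ell(X)\big)V - U\,\delta_m(X)\Big] = 0,$$
since $U$ and $V := D - \mathbb{E}[D|X]$ have zero conditional mean given $X$. For the naive score the same derivative is $-\mathbb{E}[\delta_g(X)\,D]$, which is nonzero in general.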
---
### Combine Double ML and Cross Fitting
The estimator $\hat{\theta_0}$ of $\theta_0$ achieves the $\frac{1}{\sqrt{n}}$ convergence rate
when the **product** of the convergence rates of $\hat{\mathbb{E}}[Y|X]$ and $\hat{\mathbb{E}}[D|X]$ is faster than $\frac{1}{\sqrt{n}}$ (e.g. each nuisance converges faster than $n^{-1/4}$)
This is a rate condition, **NOT** the Double Robustness property
---
### Double ML and Causality
We applied Double ML to a specific problem:
$Y =D\theta_0 +g_0(X)+U, \quad \mathbb{E}[U |X,D]=0$
It turns out that it can be applied to other causal inference problems as well
(possibly without the residual-on-residual orthogonalization interpretation)
e.g. ATE estimation
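One standard example: for the ATE with binary treatment $D \in \{0,1\}$ and $Y = g_0(D,X) + U$, the (doubly robust, AIPW-type) orthogonal score is
$$\psi = g(1,X) - g(0,X) + \frac{D\,(Y - g(1,X))}{m(X)} - \frac{(1-D)\,(Y - g(0,X))}{1 - m(X)} - \theta,$$
with nuisances $g(d,X) = \mathbb{E}[Y\mid D=d, X]$ and propensity score $m(X) = \Pr(D=1\mid X)$.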
---
### Application: Causal Feature Selection
Consider a set of (observed) features $\mathbf{X} = \{X_1, \dots, X_m\}$ and outcome $Y$
$Y = f(PA(Y)) + \epsilon$, where $PA(Y) \subseteq \mathbf{X}$
Assumptions:
* $\epsilon$ is exogenous noise, independent of $\mathbf{X}$
* Y has no direct causal effect on any of the features
Goal: determine $PA(Y)$
---
### ACDE
We want to determine whether $X_j \in PA(Y)$. This holds iff the **Average Controlled Direct Effect** is nonzero for some triplet; equivalently, $X_j \notin PA(Y)$ iff
$$ACDE(x_j, x'_j\mid \boldsymbol{x}_j^c)=0 \quad \forall (x_j, x'_j, \boldsymbol{x}_j^c),$$ where $ACDE(x_j, x'_j\mid \boldsymbol{x}_j^c) :=$
$$ \mathbb{E} [Y\mid do(x_j,\boldsymbol{x}_j^c)] - \mathbb{E}[Y\mid do(x'_j,\boldsymbol{x}_j^c)] $$
---
### Double ML for ACDE
Instead of trying all triplets $(x_j, x'_j, \boldsymbol{x}_j^c)$, simply estimate
$\chi_j := \mathbb{E}_{(x_j, \boldsymbol{x}_j^c) \sim (X_j, \boldsymbol{X}_j^c)}\left [\left ( \mathbb{E}[Y | x_j, \boldsymbol{x}_j^c] - \mathbb{E}[Y | \boldsymbol{x}_j^c] \right )^2\right]$
and check with a paired t-test whether $\chi_j = 0$ or not
We can estimate $\chi_j$ with **Double ML**
and achieve **Double Robustness**
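For intuition only, a naive plug-in sketch of $\chi_j$ and the test (this is **not** the doubly robust Double-ML estimator; learners, fold scheme, and all names are illustrative assumptions):

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def plug_in_chi_test(X, Y, j, cv=5):
    """Plug-in estimate of chi_j = E[(E[Y|X_j, X_j^c] - E[Y|X_j^c])^2] plus a t-test."""
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    full    = cross_val_predict(rf, X, Y, cv=cv)                        # E_hat[Y | x_j, x_j^c]
    reduced = cross_val_predict(rf, np.delete(X, j, axis=1), Y, cv=cv)  # E_hat[Y | x_j^c]
    diffs = (full - reduced) ** 2                                       # pointwise squared contrasts
    res = stats.ttest_1samp(diffs, popmean=0.0)                         # test whether chi_j = 0
    return diffs.mean(), res.pvalue
```

A small estimate with a large p-value then suggests $X_j \notin PA(Y)$.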
---
### Causal Feature Selection for Time Series
Given: $\boldsymbol{Y} := \{Y_t\}_{t \in \mathbb{Z}}$ and $\boldsymbol{X} := \{X_t^1, \dots, X_t^m\}_{t \in \mathbb{Z}}$
$Y_T = f(\mathsf{pa}_T(\boldsymbol{Y}), T) + \varepsilon_T$,
where $\mathsf{pa}_T(\boldsymbol{Y}) \subseteq \{X_t^1, \dots, X_t^m\}_{t \in \mathbb{Z}}$
Goal: identify the components $\boldsymbol{X}^i$ s.t. $X^i_{t} \in \mathsf{pa}_T(\boldsymbol{Y})$ for some time steps $t, T$
{"title":"slides","slideOptions":"{\"transition\":\"slide\"}","contributors":"[{\"id\":\"232d51e1-b411-4cd1-8efc-cac979c0bfb5\",\"add\":10948,\"del\":1088}]","description":"Emmanouil Angelis"}