# Double ML
## and an application for causal feature selection in time series data
Emmanouil Angelis
---
## A motivating example
Consider multiple linear regression $Y= \alpha_1X_1 + \alpha_2X_2 + \epsilon$
How should we interpret the coefficients?
----
| | | | | | | | | |
| -| -| -| -| - | - |-|-| -|
| X1 | 0|1| 2 |3 |0 |1 |2 |3 |
| X2 | −1| 0| 1 |2 |1 |2 |3 |4 |
| Y | 1| 2 |3 |4 |−1 |0 |1 |2 |

----
| | | | | | | | | |
| -| -| -| -| - | - |-|-| -|
| $X_1$ | 0|1| 2 |3 |0 |1 |2 |3 |
| $X_2$ | −1| 0| 1 |2 |1 |2 |3 |4 |
| Y | 1| 2 |3 |4 |−1 |0 |1 |2 |
The multiple regression OLS solution fits the data points exactly:
$Y_i = 2X_{1,i} - X_{2,i} \quad \forall i \quad (\hat{\sigma} = 0)$
The single regression OLS solution (regressing $Y$ on $X_2$ alone) yields
$Y = \frac{1}{9} X_2 + \frac{4}{3} \quad (\hat{\sigma} = 1.72)$
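A quick numerical check of both fits (a minimal NumPy sketch; the arrays are just the table above):

```python
import numpy as np

X1 = np.array([0, 1, 2, 3, 0, 1, 2, 3], dtype=float)
X2 = np.array([-1, 0, 1, 2, 1, 2, 3, 4], dtype=float)
Y  = np.array([1, 2, 3, 4, -1, 0, 1, 2], dtype=float)

# Multiple regression (no intercept): Y ~ X1 + X2
coef, *_ = np.linalg.lstsq(np.column_stack([X1, X2]), Y, rcond=None)
print(coef)        # [ 2. -1.]  ->  Y = 2*X1 - X2, zero residual

# Single regression (with intercept): Y ~ X2
b, *_ = np.linalg.lstsq(np.column_stack([X2, np.ones_like(X2)]), Y, rcond=None)
print(b)           # [0.111..., 1.333...]  ->  Y ≈ X2/9 + 4/3
```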
----
### Multiple regression: $Y = 2X_1 - X_2$
* describes how $Y$ changes when varying one predictor while keeping the other **fixed**
* $Y$ decreases when $X_2$ increases (holding $X_1$ fixed).
### Single regression: $Y = \frac{1}{9} X_2 + \frac{4}{3}$
* describes how $Y$ changes when varying one predictor while **ignoring** the other
* $Y$ increases when $X_2$ increases.
---
## How should we interpret coefficients?
Coefficient $\alpha_2$ quantifies the influence of $X_2$ on $Y$ after having subtracted the effect of $X_1$ on $Y$
* Predict $Y$ using only $X_1$ in an optimal way: $Y \approx \beta_1X_1$
* Now use $X_2$ (and $X_1$) to predict the residual $Y-\beta_1X_1$ in an optimal way: $Y-\beta_1X_1 \approx \alpha_2X_2^{'}$, where $X_2^{'}$ is $X_2$ with its $X_1$-component removed (made precise below)
----
## Coefficient Interpretation

Residual $Y-\beta_1X_1$ (orange)
is predicted by $X_2^{'}$ (green)
and we get $\alpha_2$
----
### Orthogonalization (QR Decomposition)
Coefficient $\alpha_2$ quantifies the influence of $X_2$ on $Y$ after having subtracted the effect of $X_1$ on $Y$
* Predict $Y$ using only $X_1$ in an optimal way: $Y \approx \pi_{X_1}(Y)$, where $\pi_{X_1}$ denotes orthogonal projection onto $X_1$
* Now use $X_2 - \pi_{X_1}(X_2)$ to predict the residual $Y- \pi_{X_1}(Y)$ in an optimal way: $Y- \pi_{X_1}(Y) \approx \alpha_2(X_2 - \pi_{X_1}(X_2))$
And we get $\alpha_2$
Note: $X_2 - \pi_{X_1}(X_2)$ is also a residual
---
## In summary: $Y= \alpha_1X_1 + \alpha_2X_2$
To infer coefficients we can use:
* Classic multiple regression
* Orthogonalization: use one residual to predict the other
They are equivalent (the Frisch–Waugh–Lovell theorem)
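A minimal NumPy check of this equivalence on the data above (the helper `proj` is just OLS projection onto $X_1$ without an intercept):

```python
import numpy as np

X1 = np.array([0, 1, 2, 3, 0, 1, 2, 3], dtype=float)
X2 = np.array([-1, 0, 1, 2, 1, 2, 3, 4], dtype=float)
Y  = np.array([1, 2, 3, 4, -1, 0, 1, 2], dtype=float)

def proj(target, onto):
    """Orthogonal projection of `target` onto span{onto}."""
    return onto * (onto @ target) / (onto @ onto)

# Route 1: coefficient of X2 in the multiple regression
alpha_multiple = np.linalg.lstsq(np.column_stack([X1, X2]), Y, rcond=None)[0][1]

# Route 2: regress the residual of Y on the residual of X2 (orthogonalization)
Y_res  = Y  - proj(Y,  X1)
X2_res = X2 - proj(X2, X1)
alpha_orth = (X2_res @ Y_res) / (X2_res @ X2_res)

print(alpha_multiple, alpha_orth)   # both equal -1.0
```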
---
## Double/Orthogonal ML

Consider the following partially linear regression (PLR) setting:
$Y =D\theta_0 +g_0(X)+U, \quad \mathbb{E}[U |X,D]=0$
<!---$D=m_0(X)+V, \quad \mathbb{E}[V |X]=0$-->
Want to infer $\theta_0$
----
### Straightforward way
$Y =D\theta_0 +g_0(X)+U, \quad \mathbb{E}[U |X,D]=0$
$\theta_0$: target parameter, $g_0$: nuisance parameter
The “naive”, prediction-based ML approach is bad:
* Predict $Y$ using $D$ and $X$ to obtain $D \hat{\theta_0} + \hat{g_0}(X)$
* For example, estimate by alternating minimization: given initial guesses, run a Random Forest of $Y - D \hat{\theta_0}$ on $X$ to fit $\hat{g_0}(X)$, then Ordinary Least Squares of $Y-\hat{g_0}(X)$ on $D$ to fit $\hat{\theta_0}$; repeat until convergence (a sketch follows)
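A rough sketch of this alternating scheme on simulated data (the data-generating process, learners, and all names below are my own illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
D = np.sin(X[:, 0]) + rng.normal(size=n)       # treatment depends on X
g0 = np.cos(X[:, 1]) + X[:, 2] ** 2            # nuisance g_0(X)
theta0 = 0.5
Y = theta0 * D + g0 + rng.normal(size=n)

theta_hat = 0.0                                # initial guess
for _ in range(10):
    # Random Forest of Y - D*theta_hat on X to fit g_hat
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    g_hat = rf.fit(X, Y - D * theta_hat).predict(X)
    # OLS of Y - g_hat on D to update theta_hat
    theta_hat = (D @ (Y - g_hat)) / (D @ D)

print(theta_hat)   # compare with theta0 = 0.5: typically visibly biased
```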
----
### “Naive” ML Approach is Bad
Excellent prediction performance! BUT the distribution of $\hat{\theta_0} - \theta_0$ looks like this:

----
### Orthogonalization way
$Y =D\theta_0 +g_0(X)+U, \quad \mathbb{E}[U |X,D]=0$
* Optimally predict (=project) $Y$ and $D$ using $X$ by $\hat{\mathbb{E}}[Y|X]$ and $\hat{\mathbb{E}}[D|X]$
* Regress $Y - \hat{\mathbb{E}}[Y|X]$ on $D - \hat{\mathbb{E}}[D|X]$
and infer $\theta_0$
As before, we regress one residual on the other
Original nuisance parameter: $g_0(X)$
Now there are two nuisances, $\hat{\mathbb{E}}[Y|X]$ and $\hat{\mathbb{E}}[D|X]$ (hence **double** ML); a sketch follows
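Continuing the simulated example above, a minimal sketch of the orthogonalization recipe (out-of-fold predictions via `cross_val_predict` are used here, anticipating the sample splitting discussed next):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# E_hat[Y|X] and E_hat[D|X], predicted out of fold to avoid overfitting
Y_hat = cross_val_predict(RandomForestRegressor(n_estimators=100, random_state=0), X, Y, cv=5)
D_hat = cross_val_predict(RandomForestRegressor(n_estimators=100, random_state=0), X, D, cv=5)

# regress one residual on the other
theta_hat = ((D - D_hat) @ (Y - Y_hat)) / ((D - D_hat) @ (D - D_hat))
print(theta_hat)   # should be much closer to theta0 = 0.5
```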
----
### Double/Orthogonal ML is Good
Now the distribution of $\hat{\theta_0} - \theta_0$ looks like this:

---
### Sample Splitting
Split the original sample S into <span><!-- .element: class="fragment highlight-red" -->S1</span> and <span><!-- .element: class="fragment highlight-blue" -->S2</span>
$Y =D\theta_0 +g_0(X)+U, \quad \mathbb{E}[U |X,D]=0$
* Optimally predict (=project) $Y$ and $D$ using $X$ by <span><!-- .element: class="fragment highlight-red" -->$\hat{\mathbb{E}}[Y|X]$ and $\hat{\mathbb{E}}[D|X]$, fitted on S1</span>
* Regress <span><!-- .element: class="fragment highlight-blue" -->$Y - \hat{\mathbb{E}}[Y|X]$ on $D - \hat{\mathbb{E}}[D|X]$, within S2</span>
and infer $\theta_0 \approx$ <span><!-- .element: class="fragment highlight-blue" -->$\hat{\theta_0}(S2)$</span>
----
### Sample Splitting
Split the original sample S into <span><!-- .element: class="fragment highlight-red" -->S1</span> and <span><!-- .element: class="fragment highlight-blue" -->S2</span>
and infer $\theta_0 \approx$ <span><!-- .element: class="fragment highlight-blue" -->$\hat{\theta_0}(S2)$</span>
Now interchange the roles of S1 and S2 and infer
$\theta_0 \approx$ <span><!-- .element: class="fragment highlight-red" -->$\hat{\theta_0}(S1)$</span>
Final estimate $\hat{\theta_0} = \frac{1}{2}$(<span><!-- .element: class="fragment highlight-red" -->$\hat{\theta_0}(S1)$</span>+<span><!-- .element: class="fragment highlight-blue" -->$\hat{\theta_0}(S2)$</span>)
----
### Cross Fitting
This leads to more accurate results but needs more work
In general, we can partition the sample $S$ into $K$ subsets $$S=\cup_{k=1}^KS_k$$ and estimate $$\hat{\theta_0} = \frac{1}{K}\sum_{k=1}^{K}\hat{\theta_0}(S_k),$$ where the nuisances used for $\hat{\theta_0}(S_k)$ are fitted on $S \setminus S_k$ (a sketch follows)
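A minimal sketch of this $K$-fold cross-fitting scheme for the PLR model (function name and learners are my choices; packages such as DoubleML provide full implementations):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_plr_cross_fitting(X, D, Y, K=5):
    """Cross-fitted partialling-out estimate of theta_0 (illustrative sketch)."""
    thetas = []
    for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
        # nuisance regressions fitted on S \ S_k ...
        rf_y = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train_idx], Y[train_idx])
        rf_d = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train_idx], D[train_idx])
        # ... residuals and theta_hat(S_k) computed on the held-out fold S_k
        y_res = Y[test_idx] - rf_y.predict(X[test_idx])
        d_res = D[test_idx] - rf_d.predict(X[test_idx])
        thetas.append((d_res @ y_res) / (d_res @ d_res))
    return np.mean(thetas)          # theta_hat = (1/K) * sum_k theta_hat(S_k)
```

On the simulated data above, `dml_plr_cross_fitting(X, D, Y)` should again return a value close to $\theta_0 = 0.5$.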
---
### Remarks
* Orthogonalization goes back to 1959 (Neyman)
* General formulation: distribution $P(W;\theta_0,\eta_0)$
* Want to infer the target parameter $\theta_0$
* But we should cope with the nuisance parameter $\eta_0$
* Naive way: estimate $\hat{\eta_0}$ in a separate sample
Now plug $\hat{\eta_0}$ into $P(W;\theta_0,\hat{\eta_0})$ and estimate $\hat{\theta_0}$
This is too sensitive to misspecification of $\hat{\eta_0}$
(for us: high bias)
---
### Math Formulation
We have a model $P(\,\cdot\,;\theta,\eta)$ for the data distribution
We observe data $W=(W_i)_{i=1}^N$ coming from $P(\,\cdot\,;\theta_0,\eta_0)$
Want to maximize $\log P(W;\theta,\eta)$ w.r.t. $\theta$
Take the derivative w.r.t. $\theta$ and set it to 0
Score function: $\psi(W,\theta,\eta):=\frac{d}{d\theta}\log P(W;\theta,\eta)$
Maximum likelihood: solve $\mathbb{E}_N[\psi(W,\theta,\eta_0)]=0$ w.r.t. $\theta$, where $\mathbb{E}_N$ denotes the empirical average over the $N$ observations
----
### Math Formulation example
$Y=\alpha_0 X + \epsilon$, $\epsilon \sim \mathcal{N}(0,\,\sigma^{2})$
* Regress $Y$ on $X$
* log-likelihood: $-\frac{1}{2\sigma^2}||y-\alpha x||^2 + C$
* score: $\mathbb{E}_N[\psi(W,\alpha)] \propto \langle x,\, y-\alpha x\rangle$
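Setting the empirical score to zero recovers the familiar OLS estimator:
$$\langle x,\, y-\hat{\alpha} x\rangle = 0 \;\Rightarrow\; \hat{\alpha} = \frac{\langle x, y\rangle}{||x||^2}$$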
----
$Y =D\theta_0 +g_0(X)+U, \quad \mathbb{E}[U |X,D]=0$
Naive way:
* estimate $\hat{g_0}$ in a separate sample
* Regress $Y - \hat{g_0}(X)$ on $D$ in another sample
Orthogonal way:
* estimate $\hat{\mathbb{E}}[Y|X]$ and $\hat{\mathbb{E}}[D|X]$ in separate sample
* Regress $Y - \hat{\mathbb{E}}[Y|X]$ on $D-\hat{\mathbb{E}}[D|X]$ in another sample
Log-likelihood and score as above: just substitute
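Written out (with $\ell(X) := \mathbb{E}[Y\mid X]$ and $m(X) := \mathbb{E}[D\mid X]$), the two per-observation scores read:
* Naive (plug-in): $\psi(W,\theta,\eta) = \big(Y - D\theta - g(X)\big)\,D$, with nuisance $\eta = g$
* Orthogonal (partialled-out): $\psi(W,\theta,\eta) = \big(Y - \ell(X) - \theta\,(D - m(X))\big)\big(D - m(X)\big)$, with nuisance $\eta = (\ell, m)$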
---
### General Procedure
Estimate $\hat{\eta_0}$ in a separate sample
Plug in: solve $\mathbb{E}_N[\psi(W,\theta,\hat{\eta_0})]=0$ wrt $\theta$
We want $\mathbb{E}[\psi(W,\theta,\eta)]$ to be insensitive to misspecification of $\eta$ around $\eta_0$, i.e.
$$\frac{d}{d\eta}\mathbb{E}[\psi(W,\theta_0,\eta)]\Big|_{\eta=\eta_0} =0 $$
We achieve this with the orthogonalized approach but not with the naive one
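Indeed, perturbing the orthogonal score's nuisances in any direction $(\delta_\ell, \delta_m)$ gives, at the true $(\theta_0, \eta_0)$,
$$\frac{d}{dt}\,\mathbb{E}\Big[\psi\big(W,\theta_0,\eta_0 + t(\delta_\ell,\delta_m)\big)\Big]\Big|_{t=0} = \mathbb{E}\Big[\big(\theta_0\,\delta_m(X) - \delta_\ell(X)\big)V - U\,\delta_m(X)\Big] = 0,$$
since $U$ and $V := D - \mathbb{E}[D|X]$ have zero conditional mean given $X$. For the naive score the same derivative is $-\mathbb{E}[\delta_g(X)\,D]$, which is nonzero in general.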
---
### Combine Double ML and Cross Fitting
The estimator $\hat{\theta_0}$ of $\theta_0$ achieves the $\frac{1}{\sqrt{n}}$ convergence rate
when the **product** of the convergence rates of $\hat{\mathbb{E}}[Y|X]$ and $\hat{\mathbb{E}}[D|X]$ is faster than $\frac{1}{\sqrt{n}}$ (e.g. each nuisance converges faster than $n^{-1/4}$)
This is a rate condition, **NOT** the Double Robustness property
---
### Double ML and Causality
We applied Double ML to a specific problem:
$Y =D\theta_0 +g_0(X)+U, \quad \mathbb{E}[U |X,D]=0$
It turns out that it can be applied to other causal inference problems as well
(possibly without the residual-on-residual orthogonalization interpretation)
e.g. ATE estimation
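One standard example: for the ATE with binary treatment $D \in \{0,1\}$ and $Y = g_0(D,X) + U$, the (doubly robust, AIPW-type) orthogonal score is
$$\psi = g(1,X) - g(0,X) + \frac{D\,(Y - g(1,X))}{m(X)} - \frac{(1-D)\,(Y - g(0,X))}{1 - m(X)} - \theta,$$
with nuisances $g(d,X) = \mathbb{E}[Y\mid D=d, X]$ and propensity score $m(X) = \Pr(D=1\mid X)$.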
---
### Application: Causal Feature Selection
Consider a set of (observed) features $\mathbf{X} = \{X_1, \dots, X_m\}$ and outcome $Y$
$Y = f(PA(Y)) + \epsilon$, where $PA(Y) \subseteq \mathbf{X}$
Assumptions:
* $\epsilon$ is exogenous noise, independent of $\mathbf{X}$
* Y has no direct causal effect on any of the features
Goal: determine $PA(Y)$
---
### ACDE
We want to determine whether $X_j \in PA(Y)$. This holds iff the **Average Controlled Direct Effect** is nonzero for some triplet; equivalently, $X_j \notin PA(Y)$ iff
$$ACDE(x_j, x'_j\mid \boldsymbol{x}_j^c)=0 \quad \forall (x_j, x'_j, \boldsymbol{x}_j^c),$$ where $ACDE(x_j, x'_j\mid \boldsymbol{x}_j^c) :=$
$$ \mathbb{E} [Y\mid do(x_j,\boldsymbol{x}_j^c)] - \mathbb{E}[Y\mid do(x'_j,\boldsymbol{x}_j^c)] $$
---
### Double ML for ACDE
Instead of trying all triplets $(x_j, x'_j, \boldsymbol{x}_j^c)$, simply estimate
$\chi_j := \mathbb{E}_{(x_j, \boldsymbol{x}_j^c) \sim (X_j, \boldsymbol{X}_j^c)}\left [\left ( \mathbb{E}[Y | x_j, \boldsymbol{x}_j^c] - \mathbb{E}[Y | \boldsymbol{x}_j^c] \right )^2\right]$
and check with a paired t-test whether $\chi_j = 0$ or not
We can estimate $\chi_j$ with **Double ML**
and achieve **Double Robustness**
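For intuition only, a naive plug-in sketch of $\chi_j$ and the test (this is **not** the doubly robust Double-ML estimator; learners, fold scheme, and all names are illustrative assumptions):

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def plug_in_chi_test(X, Y, j, cv=5):
    """Plug-in estimate of chi_j = E[(E[Y|X_j, X_j^c] - E[Y|X_j^c])^2] plus a t-test."""
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    full    = cross_val_predict(rf, X, Y, cv=cv)                        # E_hat[Y | x_j, x_j^c]
    reduced = cross_val_predict(rf, np.delete(X, j, axis=1), Y, cv=cv)  # E_hat[Y | x_j^c]
    diffs = (full - reduced) ** 2                                       # pointwise squared contrasts
    res = stats.ttest_1samp(diffs, popmean=0.0)                         # test whether chi_j = 0
    return diffs.mean(), res.pvalue
```

A small estimate with a large p-value then suggests $X_j \notin PA(Y)$.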
---
### Causal Feature Selection for Time Series
Given: $\boldsymbol{Y} := \{Y_t\}_{t \in \mathbb{Z}}$ and $\boldsymbol{X} := \{X_t^1, \dots, X_t^m\}_{t \in \mathbb{Z}}$
$Y_T = f(\mathsf{pa}_T(\boldsymbol{Y}), T) + \varepsilon_T$,
where $\mathsf{pa}_T(\boldsymbol{Y}) \subseteq \{X_t^1, \dots, X_t^m\}_{t \in \mathbb{Z}}$
Goal: identify the components $\boldsymbol{X}^i$ s.t. $X^i_{t} \in \mathsf{pa}_T(\boldsymbol{Y})$ for some time steps $t, T$
{"title":"slides","slideOptions":"{\"transition\":\"slide\"}","contributors":"[{\"id\":\"232d51e1-b411-4cd1-8efc-cac979c0bfb5\",\"add\":10948,\"del\":1088}]","description":"Emmanouil Angelis"}