# Double ML, Dragonnet, and RieszNet
This is a short background on Double ML and how it is used in a couple of estimation procedures involving neural nets.
---
Notation:
- $Y$: outcome
- $T$: treatment
- $X$: confounders
- $\theta$: treatment effect of interest (e.g., the effect of a binary treatment, or a derivative w.r.t. a continuous treatment)
---
[Double ML](https://arxiv.org/pdf/1608.00060.pdf) is an approach for estimating a treatment effect of interest, $\theta$, that has two key components:
- *Component 1*: Estimate the outcome model $g(T,X) = \mathbb{E}[Y|T,X]$ *and* the propensity model $m(X) = \mathbb{E}[T|X]$ (this is where "double" comes from), then plug both into an estimator for $\theta$. This deals with "regularization bias."
- *Component 2*: Use "cross-fitting" for estimation (i.e., fit $g$ and $m$ on $K-1$ folds of data, estimate $\theta$ on the held-out fold, then repeat; average the $K$ estimates of $\theta$ to get the final estimate). This deals with "overfitting bias." A minimal sketch of the procedure follows after this list.
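To make the two components concrete, here is a minimal sketch of cross-fitted estimation of the effect of a binary treatment using the standard doubly robust (AIPW-style) score. The learners, function names, and propensity clipping are illustrative choices on my part, not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold

def cross_fit_ate(X, T, Y, n_folds=5):
    """Cross-fitted ATE of a binary treatment via the doubly robust score."""
    scores = np.zeros(len(Y))
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        # Component 1: fit outcome model g(T, X) and propensity model m(X) on K-1 folds.
        g = GradientBoostingRegressor().fit(
            np.column_stack([T[train_idx], X[train_idx]]), Y[train_idx])
        m = GradientBoostingClassifier().fit(X[train_idx], T[train_idx])

        # Component 2: evaluate the doubly robust score on the held-out fold only.
        Xte, Tte, Yte = X[test_idx], T[test_idx], Y[test_idx]
        g1 = g.predict(np.column_stack([np.ones(len(Xte)), Xte]))
        g0 = g.predict(np.column_stack([np.zeros(len(Xte)), Xte]))
        p = np.clip(m.predict_proba(Xte)[:, 1], 0.01, 0.99)
        scores[test_idx] = (g1 - g0
                            + Tte * (Yte - g1) / p
                            - (1 - Tte) * (Yte - g0) / (1 - p))
    # Averaging per-observation scores is equivalent to averaging the K fold-level
    # estimates of theta, since each fold's estimate is just the mean of its scores.
    return scores.mean()
```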
Component 1 is fully incorporated into the neural net models I've been discussing. Component 2 isn't, but it easily could be. I believe it has been tried but then removed due to poor empirical performance.
Last week I showed results for [Dragonnet](https://arxiv.org/pdf/1906.02120.pdf) (among other methods), which incorporates Component 1 and can incorporate Component 2:
- Incorporates Component 1 by estimating $g(T,X)$ and $m(X)$ with a single neural network with multiple heads (a rough sketch of this architecture follows below).
- Can incorporate Component 2, and to "some extent" it was incorporated, but it hurt performance.
By "some extent" I mean only a single split and a single estimation of $\theta$ was made on the held out data. In other words, they estimated $g$ and $m$ on a train set and estimated $\theta$ on a test set. This really isn't cross-fitting (since they didn't flip the train/test split to get a second estimate of $\theta$, which they would then average with the first estimate of $\theta$). Interestingly, the authors of Riesznet (see below) implied it was.
Last week we also discussed [RieszNet](https://arxiv.org/pdf/2110.03031.pdf) and the accompanying tree-based ForestRiesz, which are also close to Double ML:
- Incorporate something similar to Component 1 (basically, instead of estimating $m(X) = \mathbb{E}[T|X]$, they estimate the "Riesz representer" of the treatment effect of interest, which I won't fully explain here; a rough sketch of the representer loss follows after this list).
- Like Dragonnet, RieszNet can incorporate Component 2, but they don't discuss it or show results. However, they do implement it for ForestRiesz and show that it helps performance. My guess is that Component 2 did not help RieszNet, or else they would have shown the results. They claim cross-fitting hurt the performance of Dragonnet, although, as discussed above, Dragonnet did not really implement cross-fitting.
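To give a flavor of what "estimating the Riesz representer directly" means, here is a sketch of the representer loss for the effect of a binary treatment, $\mathbb{E}[\alpha(T,X)^2 - 2(\alpha(1,X) - \alpha(0,X))]$, whose minimizer over a flexible function class is the Riesz representer. This shows only the representer-loss piece in isolation (RieszNet trains it jointly with the outcome model and other terms), and the names and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

def riesz_loss(alpha_net, t, x):
    """Representer loss for the ATE: E[alpha(T,X)^2 - 2*(alpha(1,X) - alpha(0,X))].

    Its population minimizer is the Riesz representer, which for the ATE
    equals T/m(X) - (1-T)/(1-m(X)).
    """
    ones, zeros = torch.ones(len(x), 1), torch.zeros(len(x), 1)
    a_obs = alpha_net(torch.cat([t.float().unsqueeze(-1), x], dim=1))  # alpha(T, X)
    a_1 = alpha_net(torch.cat([ones, x], dim=1))                       # alpha(1, X)
    a_0 = alpha_net(torch.cat([zeros, x], dim=1))                      # alpha(0, X)
    return (a_obs ** 2 - 2 * (a_1 - a_0)).mean()

# Example: a small network taking the concatenated (T, X) as input (10 features assumed).
alpha_net = nn.Sequential(nn.Linear(1 + 10, 100), nn.ELU(), nn.Linear(100, 1))
```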
Note that RieszNet differs from Dragonnet in that:
- RieszNet can estimate more general quantities (e.g. derivative w.r.t. continuous treatment).
- In the case of a binary treatment, the estimation procedures differ slightly. Dragonnet estimates the propensity score $m(X)$ and then plugs it into the weighting term $T/m(X) - (1-T)/(1-m(X))$, whereas RieszNet estimates this term (which is the "Riesz representer") directly; see the display below for where it enters the final estimate.
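To spell that out, both approaches can feed their nuisance estimates into a doubly robust estimate of the same form; the difference is only in how $\alpha$ is obtained. (This is my summary of the common structure, not an equation copied from either paper.)

$$
\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n}\Big[g(1,X_i) - g(0,X_i) + \alpha(T_i,X_i)\big(Y_i - g(T_i,X_i)\big)\Big],
\qquad
\alpha(T,X) = \frac{T}{m(X)} - \frac{1-T}{1-m(X)} \ \text{(plug-in form)}.
$$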