# Mismatched tails as NN inputs
We want to consider target data of the form:
$$\begin{align}
X_1 &\sim \mathcal{N}(0, 1) \\
X_2 &\sim \text{extreme} \\
X_3 | X_1, X_2 &\sim \mathcal{N}(f(X_1, X_2), 1)
\end{align}$$
Even when $f(X_1, X_2)$ is simple, we expect it to be hard to fit a good model, because the large values produced by $X_2$ make optimisation difficult.
To isolate this problem, we can consider a few different cases for $X_2$.
Experiment (a code sketch of this setup follows the list):
- Sample $X_1$ and $X_2$, target $Y = X_1 + X_2$
- Fit a 2-layer, width-10 MLP with ReLU activations
- I run the Adam optimizer for a fixed number of iterations with batch size 100 and learning rate 1e-3
- I train on 1000 samples, with 100 used for validation
- Loss is MSE
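For concreteness, here is a minimal PyTorch sketch of this setup. The interpretation of "2 layer" as two hidden layers of width 10, the iteration count (10,000), the random seed, and drawing the 100 validation points separately from the 1000 training points are all assumptions rather than details taken from the note.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Data: X1 ~ N(0, 1), X2 ~ "extreme", target Y = X1 + X2.
# Target 1 uses X2 ~ N(0, 1e7^2); for Targets 2-4 swap in e.g.
# torch.distributions.StudentT(df=0.2).sample((n,)).
n_train, n_val = 1000, 100
n = n_train + n_val
x1 = torch.randn(n)
x2 = 1e7 * torch.randn(n)
X = torch.stack([x1, x2], dim=1)
Y = (x1 + x2).unsqueeze(1)
X_train, Y_train = X[:n_train], Y[:n_train]
X_val, Y_val = X[n_train:], Y[n_train:]

# Two hidden layers of width 10 with ReLU activations.
model = nn.Sequential(
    nn.Linear(2, 10), nn.ReLU(),
    nn.Linear(10, 10), nn.ReLU(),
    nn.Linear(10, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

best_val, best_state = float("inf"), None
n_iters, batch_size = 10_000, 100  # iteration count is an assumption
for it in range(n_iters):
    idx = torch.randint(0, n_train, (batch_size,))
    opt.zero_grad()
    loss = loss_fn(model(X_train[idx]), Y_train[idx])
    loss.backward()
    opt.step()

    # Keep the parameter set with the lowest validation loss.
    with torch.no_grad():
        val = loss_fn(model(X_val), Y_val).item()
    if val < best_val:
        best_val = val
        best_state = {k: v.clone() for k, v in model.state_dict().items()}

model.load_state_dict(best_state)
```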
**Target 1** $X_2 \sim \mathcal{N}(0, (10^7)^2)$
The plot of optimisation loss over iterations is given below:

We can see that the optimisation starts doing something sensible, but eventually encounters problems.
Despite this, if we take the parameter set with the lowest validation loss and fix particular values of $X_2$, we obtain fits which, although not perfect, are not catastrophic:

**Target 2** $X_2 \sim \text{StudentT}(0.2)$
In this very extreme case (with 0.2 degrees of freedom the distribution does not even have a finite mean), the NN cannot learn anything sensible.


**Target 3** $X_2 \sim \text{StudentT}(0.5)$
This is a less extreme case; we see a similar pattern to the first target, with a highly noisy optimisation path.

As in that example, the NN's predictions are not catastrophic.

**Target 4** $X_2 \sim \text{StudentT}(0.2)$ (clipped gradients)
In this experiment I clip the norm of the gradient before each update.
This obviously restricts the magnitude of the updates, but it does appear to stabilise the optimisation somewhat.

The resulting function appears to be OK.
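A minimal sketch of where the clipping step would sit in the training loop above, using PyTorch's built-in gradient-norm clipping; the max-norm threshold of 1.0 is an assumption, since the note does not state the value used.

```python
# Inside the training loop, between loss.backward() and opt.step():
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # threshold is an assumption
opt.step()
```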

### Thoughts
- From this it appears that the Jaini approach would definitely struggle in the case $X_2 \sim \text{StudentT}(0.2)$, but in the others the optimisation finds solutions which I expect would be OK (not sure how true that is)
- Even though the optimisations are noisy, if we pick the model with the lowest validation loss it seems to provide OK fits
- My suspicion is that gradient clipping would help stabilise optimisation in these cases
### Our approach
It is not clear to me that our approach will solve this issue.
In our approach, $Z_1, Z_2, Z_3$ are all standard normal.
For now, we can ignore the spline transformation.
We can find optimal tail transformations for $X_1 = T^*_1(Z_1; \lambda=-1)$ and $X_2 = T^*_2(Z_2; \lambda\approx 2)$, with no affine component.
To generate $X_3$ we perform a two-stage transformation: first a tail transformation, then an affine one.
Both of these transformations depend on $Z_1, Z_2$.
The tail transformation will optimally be the identity, $T(Z_3; \lambda=-1) = Z_3$ (Normal asymptotics).
Then we will have the affine transformation $T_A(Z_3; \mu(Z_1, Z_2)) = Z_3 + \mu(Z_1, Z_2)$.
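Written out, the two stages compose to
$$X_3 = T_A\big(T(Z_3; \lambda=-1);\, \mu(Z_1, Z_2)\big) = Z_3 + \mu(Z_1, Z_2), \qquad Z_3 \sim \mathcal{N}(0, 1),$$
so $X_3 \mid X_1, X_2 \sim \mathcal{N}(\mu(Z_1, Z_2), 1)$, and matching the target model requires $\mu(Z_1, Z_2) = f(X_1, X_2)$.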
Optimally, we want $\mu(Z_1, Z_2) = X_1 + X_2 = Z_1 + T^*_2(Z_2; \lambda\approx 2)$.
Learning this function with a neural network does not appear to be an easier prospect, even though both inputs are on the same scale.
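To make this concrete, the sketch below uses a hypothetical stand-in for the tail transformation, not the $T$ used in this note: standard-normal noise is pushed through a Student-t quantile with $df = 1/\lambda$ (for $\lambda > 0$), which has the relevant qualitative behaviour of producing a very heavy tail when $\lambda \approx 2$. It illustrates that the conditioner's regression target $\mu(Z_1, Z_2) = Z_1 + T^*_2(Z_2)$ remains heavy-tailed even though both of its inputs are standard normal.

```python
import numpy as np
from scipy import stats

def tail_transform(z, lam):
    # Hypothetical placeholder for T(.; lambda), *not* the transformation in the note:
    # push standard-normal noise through a Student-t quantile with df = 1/lambda,
    # so lambda ~ 2 gives a very heavy (df = 0.5) tail.
    u = stats.norm.cdf(z)
    return stats.t.ppf(u, df=1.0 / lam)

rng = np.random.default_rng(0)
z1 = rng.standard_normal(100_000)
z2 = rng.standard_normal(100_000)

x1 = z1                           # lambda = -1: identity, as stated above
x2 = tail_transform(z2, lam=2.0)  # lambda ~ 2: heavy tail

# The conditioner's regression target is still heavy-tailed, even though the
# inputs Z1, Z2 it sees are both standard normal:
mu_target = x1 + x2
print(np.quantile(np.abs(mu_target), [0.5, 0.99, 0.9999]))
```

Under this parameterisation the extreme values have moved from the inputs to the regression targets, which is consistent with the concern above.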
## Variation
Instead of trying to fit $\mu(Z_1, Z_2) = X_1 + X_2 = Z_1 + T^*_2(Z_2; \lambda\approx 2)$, we can think of fitting a problem where.
# Density estimation
As described above, the data have the form:
$$\begin{align}
X_1 &\sim \mathcal{N}(0, 1) \\
X_2 &\sim \text{extreme} \\
X_3 | X_1, X_2 &\sim \mathcal{N}(f(X_1, X_2), 1)
\end{align}$$