# Mismatched tails as NN inputs

We want to consider target data of the form:
$$\begin{align}
X_1 &\sim \mathcal{N}(0, 1) \\
X_2 &\sim \text{extreme} \\
X_3 \mid X_1, X_2 &\sim \mathcal{N}(f(X_1, X_2), 1)
\end{align}$$

Even when $f(X_1, X_2)$ is simple, we expect it to be hard to fit a good model, because the large values of $X_2$ make optimisation difficult. To isolate this problem, we can consider a few different cases for $X_2$.

Experiment (a code sketch of this setup is given at the end of this section):
- Sample $X_1$ and $X_2$, target $Y = X_1 + X_2$
- Fit a 2-layer, width-10 MLP with ReLU activations
- Run an Adam optimiser for a fixed number of iterations, with batch size 100 and learning rate 1e-3
- Train on 1000 samples, with a further 100 used for validation
- Loss is MSE

**Target 1:** $X_2 \sim \mathcal{N}(0, (10^7)^2)$

The training loss over iterations is plotted below:

![](https://i.imgur.com/ATFLefB.png)

The optimisation starts doing something sensible, but eventually encounters problems. Despite this, if we take the parameter set with the lowest validation loss and fix particular values of $X_2$, we obtain fits which, although not perfect, are not catastrophic:

![](https://i.imgur.com/iqGoyo7.png)

**Target 2:** $X_2 \sim \text{StudentT}(0.2)$

In this very extreme case, the NN cannot learn anything sensible.

![](https://i.imgur.com/InPjFPe.png)
![](https://i.imgur.com/vjoDzsQ.png)

**Target 3:** $X_2 \sim \text{StudentT}(0.5)$

This is a less extreme case; we see a similar pattern to the first target, with a highly noisy optimisation path.

![](https://i.imgur.com/kcfC3Q8.png)

As in that example, the NN's predictions are not catastrophic.

![](https://i.imgur.com/6tkoY88.png)

**Target 4:** $X_2 \sim \text{StudentT}(0.2)$ (clipped gradients)

In this experiment I clip the norm of the gradient update. This obviously restricts the magnitude of the updates, but it does appear to stabilise the optimisation somewhat.

![](https://i.imgur.com/hefd6WS.png)

The resulting function appears to be OK.

![](https://i.imgur.com/JANGSw3.png)

### Thoughts

- From this it appears that the Jaini approach would definitely struggle in the case $X_2 \sim \text{StudentT}(0.2)$, but in the other cases the optimisation finds solutions which I expect would be acceptable (I am not sure how true that is)
- Even though the optimisations are noisy, picking the model with the lowest validation loss seems to provide reasonable fits
- My suspicion is that gradient clipping would help stabilise optimisation in these cases

### Our approach

It is not clear to me that our approach will solve this issue. In our approach, $Z_1, Z_2, Z_3$ are all standard normal. For now, we can ignore the spline transformation. We can find optimal tail transformations $X_1 = T^*_1(Z_1; \lambda=-1)$ and $X_2 = T^*_2(Z_2; \lambda \approx 2)$, with no affine component. To generate $X_3$ we perform a two-stage transformation: first tail, then affine (written out as a single display below). Both of these transformations depend on $Z_1, Z_2$. The tail transformation will optimally be the identity, $Z_3 = T(Z_3; \lambda=-1)$ (Normal asymptotics). Then we have the affine transformation $T_A(Z_3; \mu(Z_1, Z_2)) = Z_3 + \mu(Z_1, Z_2)$. Optimally, we want $\mu(Z_1, Z_2) = X_1 + X_2 = Z_1 + T^*_2(Z_2; \lambda \approx 2)$. Learning this function with a neural network does not appear to be any easier a prospect, even though both inputs are on the same scale.

## Variation

Instead of trying to fit $\mu(Z_1, Z_2) = X_1 + X_2 = Z_1 + T^*_2(Z_2; \lambda \approx 2)$, we can think of fitting a problem where.
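To make the two-stage construction in the "Our approach" section explicit, composing the (identity) tail stage with the affine stage under the optimal parameters stated above gives

$$X_3 = T_A\big(T(Z_3; \lambda = -1); \mu(Z_1, Z_2)\big) = Z_3 + \mu(Z_1, Z_2), \qquad \mu(Z_1, Z_2) = Z_1 + T^*_2(Z_2; \lambda \approx 2),$$

so although the inputs $Z_1, Z_2$ are standard normal, the regression target for $\mu$ is still heavy-tailed through $T^*_2$.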
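For completeness, a minimal sketch of the training setup described in the experiment list above (this is the sketch referenced there). It assumes PyTorch; the helper names `make_data` and `train` are my own, the "2-layer" MLP is interpreted as two hidden layers of width 10, and the iteration count and clipping threshold are placeholders since the note does not record them.

```python
import torch
import torch.nn as nn


def make_data(n, x2_dist):
    """Sample (X1, X2) and the target Y = X1 + X2."""
    x1 = torch.randn(n)
    x2 = x2_dist.sample((n,))
    return torch.stack([x1, x2], dim=1), (x1 + x2).unsqueeze(1)


def train(x2_dist, n_iters=5000, clip_norm=None):
    x_train, y_train = make_data(1000, x2_dist)  # 1000 training samples
    x_val, y_val = make_data(100, x2_dist)       # 100 validation samples

    # "2-layer, width-10" MLP, read here as two hidden layers of width 10.
    model = nn.Sequential(
        nn.Linear(2, 10), nn.ReLU(),
        nn.Linear(10, 10), nn.ReLU(),
        nn.Linear(10, 1),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    best_val, best_state = float("inf"), None
    for _ in range(n_iters):
        idx = torch.randint(0, x_train.shape[0], (100,))  # batch size 100
        opt.zero_grad()
        loss_fn(model(x_train[idx]), y_train[idx]).backward()
        if clip_norm is not None:
            # Target 4: clip the norm of the gradient update.
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        opt.step()

        # Keep the parameter set with the lowest validation loss.
        with torch.no_grad():
            val_loss = loss_fn(model(x_val), y_val).item()
        if val_loss < best_val:
            best_val = val_loss
            best_state = {k: v.clone() for k, v in model.state_dict().items()}

    model.load_state_dict(best_state)
    return model, best_val


# Target 1: X2 ~ N(0, 1e7^2); Targets 2/3: X2 ~ StudentT(0.2) / StudentT(0.5).
model_t1, _ = train(torch.distributions.Normal(0.0, 1e7))
model_t2, _ = train(torch.distributions.StudentT(df=0.2))
model_t4, _ = train(torch.distributions.StudentT(df=0.2), clip_norm=1.0)  # placeholder threshold
```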
# Density estimation

As described, the data are generated as
$$\begin{align}
X_1 &\sim \mathcal{N}(0, 1) \\
X_2 &\sim \text{extreme} \\
X_3 \mid X_1, X_2 &\sim \mathcal{N}(f(X_1, X_2), 1)
\end{align}$$
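For reference, the joint log-density implied by this generative model (writing $\phi$ for the standard normal density and $p_{X_2}$ for the density of the extreme component) factorises as

$$\log p(x_1, x_2, x_3) = \log \phi(x_1) + \log p_{X_2}(x_2) + \log \phi\big(x_3 - f(x_1, x_2)\big).$$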