# Notes 2023-02-16

## <u>Predicting the change in sigmoid-outputs</u>

<!-- ![](https://i.imgur.com/r66d7Ez.png =2000x) -->
![](https://i.imgur.com/cdBaAGM.png =2000x)

## <u>Predicting the change in logits</u>

For the logits, I am plotting:

$|f_{w_{*}} - f_{w_{-i}}| = |\alpha_{-i} \cdot v_i|$

Note that $\alpha_{-i}$ are the leave-one-out residuals of the neural network, unlike in the equation for the sigmoid-outputs!

See also the old writeup:

![](https://i.imgur.com/zoBQQRM.png)

## <u>New result (australian-scale)</u>

Single hidden layer MLP, 100 neurons, ~2000 parameters
Train acc 0.875, test acc 0.878

train_config = {'delta': 1e1, 'n_epochs': 1000, 'lr': 1e-2, 'bs_train': 32}
retrain_config = {'n_epochs': 5, 'lr': 1e-2, 'bs_retrain': 32}

Changes compared to before:
- $w_*$ trained for more epochs
- larger regularizer ("banana-shaped" memory maps)
- nothing else changed in the code

### Memory map

![](https://i.imgur.com/NlNy51L.png)

### Change in logits

![](https://i.imgur.com/ufvn00l.png)

### Change in sigmoid-outputs: Equations 13, 15/16, 17

**Equation 13**
![](https://i.imgur.com/Q9OLCyJ.png)

**Equation 15 = Equation 16**
![](https://i.imgur.com/ZgmFdyj.png)

**Equation 17**
![](https://i.imgur.com/wKdfLQB.png)

### Change in sigmoid-outputs: Approximations using only components of Eq. 17

**lambda**
![](https://i.imgur.com/7aCjW1T.png)

**alpha**
![](https://i.imgur.com/o1Ec0rm.png)

**lambda x alpha**
![](https://i.imgur.com/RW4qdTU.png)

<!-- ## <u>New result (USPS)</u>
binary USPS CNN, ~5000 parameters -->

## <u>Conclusion & To Do's</u>

**Conclusions about the old results**
- some of my old experiments may have been "degenerate": too small a regularizer, L-shaped memory maps, overfitting
- the base model $w_{*}$ was not trained for long enough

**Conclusions about the new results**
- $|h(f_{w_{*}}) - h(f_{w_{-i}})| = |\alpha_i \cdot \lambda_i \cdot v_i|$ can also hold for a large regularizer
- $|f_{w_{*}} - f_{w_{-i}}| = |\alpha_{-i} \cdot v_i|$ also seems to be a valid linear relationship
- the base model $w_{*}$ needs to be optimized until convergence!
- approximations (aim: predicting $|h(f_{w_{*}}) - h(f_{w_{-i}})|$): for a well-regularized and properly optimized model, $|\alpha_i \cdot \lambda_i \cdot v_i|$ > $\lambda_i \cdot \alpha_i$ > $\lambda_i$ or $\alpha_i$ alone (see the sketch at the end of these notes)
- the approximation error due to the GGN (for larger residuals) does not seem to be that large in this case?

**To Do's**
- show this on more and larger models/datasets
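
**Sketch (predicted vs. measured changes)**

To make the two predicted quantities above concrete, here is a minimal numpy sketch. The array names `alpha`, `alpha_loo`, `lam`, and `v` are my own placeholders for the per-example residuals $\alpha_i$, leave-one-out residuals $\alpha_{-i}$, sigmoid-derivative terms $\lambda_i$, and factors $v_i$; how they are computed is assumed to happen elsewhere, so this is not the actual experiment code.

```python
import numpy as np

# Hypothetical per-example quantities (names are placeholders, not from the
# code base): alpha = residuals alpha_i, alpha_loo = leave-one-out residuals
# alpha_{-i}, lam = sigmoid-derivative terms lambda_i, v = factors v_i.

def predicted_sigmoid_change(alpha, lam, v):
    # predicted |h(f_{w*}) - h(f_{w_{-i}})| = |alpha_i * lambda_i * v_i|
    return np.abs(alpha * lam * v)

def predicted_logit_change(alpha_loo, v):
    # predicted |f_{w*} - f_{w_{-i}}| = |alpha_{-i} * v_i|
    return np.abs(alpha_loo * v)

def agreement(predicted, measured):
    # Pearson correlation between predicted and retraining-based changes
    return np.corrcoef(predicted, measured)[0, 1]

# Usage with dummy numbers (the real values come from the trained model and
# from the leave-one-out retraining runs):
rng = np.random.default_rng(0)
n = 690  # roughly australian-scale (assumption)
alpha, alpha_loo, lam, v = rng.normal(size=(4, n))
measured = np.abs(alpha * lam * v) * (1.0 + 0.05 * rng.normal(size=n))
print(agreement(predicted_sigmoid_change(alpha, lam, v), measured))
print(agreement(predicted_logit_change(alpha_loo, v), measured))
```

In the real experiment, `measured` would be the per-example sigmoid-output (or logit) difference obtained from the 5-epoch retraining runs; the dummy numbers in the usage part are only there so the snippet runs on its own.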