# Paper Outline
###### general
**Question we are trying to answer**
- covariance structure
- size of the data with respect to the width
- MAP training (determinant)
- claim that only a small number of bases fit the data?
**Motivation**
-
**Current literature on NLM**
**Current literature on XX**
**Hypothesis/Facts that we'd like to be true:**
- For a given number of bases and a given number of data points in the underparameterized regime, NLMs have poorer uncertainty than models using other bases.
- When the true bases are included, they should be selected with higher precision and a higher posterior mean.
- Transfer learning (no experiment yet): a diverse set of basis functions does better in transfer learning / gives better uncertainty than a similar set of basis functions.
- Posterior contraction allows us to measure uncertainty quality - at least in the 1D case, it correlates well with uncertainty (see the formalization after this list).
- The higher the precision, the higher the posterior contraction (considering a single basis).
- The higher the posterior contraction, the more useful the bases and the better the uncertainty (we identify the correct basis - or several bases are "correct" but disagree with one another).
- The lower the posterior contraction, the more extreme the diversity of the basis functions and the worse the fit (less useful).
- Posterior contraction is low -> we get similar bases -> no uncertainty??
- Does MAP training give us the same bases?
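
One way to formalize the posterior contraction referred to above (an assumed formalization, to be pinned down in the paper): run a Bayesian linear regression on the $M$ learned bases with prior $w \sim \mathcal{N}(0, \alpha^{-1} I_M)$ and noise precision $\beta$, and compare the prior and posterior covariances.

```latex
% Posterior covariance of the basis weights, given the design matrix \Phi \in \mathbb{R}^{N \times M}:
\Sigma_N = \left(\alpha I_M + \beta \Phi^\top \Phi\right)^{-1}
% Assumed measure of posterior contraction (normalized by the prior covariance):
\Delta = 1 - \frac{\mathrm{tr}(\Sigma_N)}{\mathrm{tr}(\alpha^{-1} I_M)}
       = 1 - \frac{\alpha}{M}\,\mathrm{tr}(\Sigma_N)
```

Under this definition, larger posterior precisions shrink $\mathrm{tr}(\Sigma_N)$ and so increase $\Delta$, which is the sense of the "mathematical fact" in item 1 of the results below.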
# RESULTS and SUPPORTING EXPERIMENTS:
- For each experiment, possibly test different activation functions, different numbers of bases, and different datasets
- Should be trained to convergence
- Fix 32 data points, cubic dataset; M = under: (10, 20, 30), over: (50, 100, 150); activation functions: LeakyReLU, Tanh (maybe ReLU)
- Epochs: to convergence (10000+)
- Layers: {[1, 50, numbases, 1], [1, 20, numbases, 1], [1, 20, 20, numbases, 1], [1, 50, 20, numbases, 1], [1, 20, 50, numbases, 1], [1, 50, 50, numbases, 1]}
- MLE training (except where noted otherwise); a sketch of this setup follows below
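
A minimal sketch of this setup, assuming PyTorch; the function names and defaults are illustrative, the activations of the last hidden layer are taken as the bases, and the linear head is only used for MLE/MAP training of the feature map:

```python
import torch
import torch.nn as nn

def make_nlm(hidden_sizes=(50,), num_bases=10, activation=nn.LeakyReLU):
    """Build the feature map and linear head, e.g. layers [1, 50, num_bases, 1]."""
    layers, in_dim = [], 1
    for width in list(hidden_sizes) + [num_bases]:
        layers += [nn.Linear(in_dim, width), activation()]
        in_dim = width
    features = nn.Sequential(*layers)   # outputs the num_bases basis functions
    head = nn.Linear(num_bases, 1)      # discarded once the bases are extracted
    return features, head

def train_mle(features, head, x, y, epochs=10000, lr=1e-3, weight_decay=0.0):
    """MLE training via MSE; setting weight_decay > 0 gives MAP-style training."""
    params = list(features.parameters()) + list(head.parameters())
    opt = torch.optim.Adam(params, lr=lr, weight_decay=weight_decay)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(head(features(x)), y)
        loss.backward()
        opt.step()
    return features
```

After training, `Phi = features(x_train).detach()` gives the design matrix for the Bayesian linear regressions sketched after the numbered list below.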
1. The higher the average precision, the higher the posterior contraction for a group of bases.
- (mathematical fact; see the formalization of posterior contraction above)
2. The higher the posterior contraction, the more useful the bases
- **EXPERIMENT**: Fit an NLM using M bases (try different values of M). Then run a Bayesian linear regression on each basis individually; the NLL should increase as you go from higher-precision bases to lower-precision bases (see the shared pipeline sketch after this list).
- **EXPERIMENT**: Fit an NLM using M bases (try different values of M). Then run Bayesian linear regressions on groups of bases that differ in precision; the higher-precision group should have better (lower) NLL than the lower-precision group.
3. The higher the posterior contraction, the better the uncertainty.
- **EXPERIMENT**: Fit an NLM using M bases (try different values of M). Then run Bayesian linear regressions on groups of bases that differ in precision; the higher-precision group should have better (higher) uncertainty and a higher variance of the variance of its uncertainty than the lower-precision group.
4. This is because several bases are good at explaining the data, but they disagree with one another, so in aggregate we get both a good fit and good uncertainty.
- **EXPERIMENT**: (Testing the disagreement part) Check the (cosine?) similarity of the functions.
5. The lower the posterior contraction, the more similar the bases are.
- **EXPERIMENT**: Fit an NLM using M bases (try different values of M). Compare the cosine similarity matrix of the lower-precision bases with that of the higher-precision bases; the latter should on average have lower values (less correlated) than the former.
6. The more similar the bases are, the less uncertainty we get.
- **EXPERIMENT**: Hypotheses: MAP training produces more similar bases, poorer uncertainty, and consistently lower posterior contraction. Test by training several NLMs with MAP vs. MLE, for several random initializations each, and compare the resulting uncertainty.
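
A sketch of the shared machinery for experiments 2-5 above (and the comparison in 6), referenced from item 2: Bayesian linear regression on a chosen subset of bases, per-basis posterior precision, posterior contraction, test NLL, and cosine similarity between bases. NumPy only; the fixed prior precision `alpha` and noise precision `beta` are assumptions rather than learned quantities.

```python
import numpy as np

def blr_posterior(Phi, y, alpha=1.0, beta=25.0):
    """Posterior over weights for y = Phi @ w + noise, with prior w ~ N(0, I / alpha)."""
    M = Phi.shape[1]
    S_inv = alpha * np.eye(M) + beta * Phi.T @ Phi   # posterior precision matrix
    S = np.linalg.inv(S_inv)                         # posterior covariance
    m = beta * S @ Phi.T @ y                         # posterior mean
    return m, S

def per_basis_precision(S):
    """Marginal posterior precision (1 / marginal variance) of each basis weight."""
    return 1.0 / np.diag(S)

def posterior_contraction(S, alpha=1.0):
    """1 - tr(posterior cov) / tr(prior cov); larger means more contraction."""
    return 1.0 - alpha * np.trace(S) / S.shape[0]

def test_nll(Phi_test, y_test, m, S, beta=25.0):
    """Average Gaussian negative log-likelihood of the posterior predictive."""
    mean = Phi_test @ m
    var = 1.0 / beta + np.sum((Phi_test @ S) * Phi_test, axis=1)
    return np.mean(0.5 * np.log(2.0 * np.pi * var) + 0.5 * (y_test - mean) ** 2 / var)

def cosine_similarity_matrix(Phi_grid):
    """Pairwise cosine similarity between bases evaluated on a dense input grid."""
    Z = Phi_grid / np.linalg.norm(Phi_grid, axis=0, keepdims=True)
    return Z.T @ Z
```

Running `blr_posterior` on individual columns of `Phi`, or on higher- vs. lower-precision groups of columns, and comparing `test_nll`, `posterior_contraction`, and `cosine_similarity_matrix` covers the comparisons above; MAP vs. MLE (item 6) only changes how the bases themselves are trained.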
<!-- Imaginary Experimental Results:
- Max: test NLL with the bases of an NLM divided into two groups (higher posterior contraction, lower posterior contraction) and train from each group - expect the first group to have better uncertainty and fit, and the latter group to have more extreme diversity and worse fits.
- If we fit using bases of higher precision, then we should expect lower NLL (better fits) (MLE training)
- Fit using bases of lower precision (this results in lower posterior contraction); then the variance of the variance of the uncertainty is lower and the uncertainty in general is lower -->
**Interesting Results**
**Experiments to Do**
For different ratios of network width to number of data points:
- when we get bad uncertainty, what do the bases look like?
- With regularization
- when we get good uncertainty, what do the bases look like?
- look at prior predictive samples for each of these bases (see the sketch below)
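
A small sketch of the last item, under the same assumptions as the NumPy code above (prior w ~ N(0, I / alpha), noise precision beta); pass a single column of the design matrix to inspect one basis at a time:

```python
import numpy as np

def prior_predictive_samples(Phi_grid, alpha=1.0, beta=25.0, n_samples=20, seed=0):
    """Sample functions f(x) = Phi(x) @ w + noise with w ~ N(0, I / alpha) on a grid."""
    rng = np.random.default_rng(seed)
    M = Phi_grid.shape[1]
    W = rng.normal(0.0, alpha ** -0.5, size=(n_samples, M))   # prior weight samples
    f = W @ Phi_grid.T                                        # (n_samples, n_grid)
    return f + rng.normal(0.0, beta ** -0.5, size=f.shape)    # add observation noise
```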