# Unfiltered Research Notes
###### tags: `general`
## Important Links
- [Papers Drive](https://drive.google.com/drive/u/0/folders/1fd3ncgjI5V0e1xs_ztfLsyHMKwicZF2l)
## Resource Links:
- http://cs229.stanford.edu/section/cs229-gaussian_processes.pdf
- https://gregorygundersen.com/blog/2019/12/23/random-fourier-features/#a1-gaussian-kernel-derivation
### 5/9
- Using one basis at a time, add bases and track how quickly the MSE drops. If it drops quickly at first and then slowly, we have an argument for our selection strategy (see the sketch below).
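A minimal sketch of this check, assuming a fixed dictionary of candidate bases and a toy 1-D dataset (both are placeholders, not anything fixed by these notes): fit by least squares with the first k bases and record the MSE as k grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2 * x**3 + x**2 + 0.05 * rng.standard_normal(200)   # toy target

# Placeholder dictionary of candidate bases
bases = [lambda t: t, lambda t: t**2, lambda t: t**3,
         np.exp, np.sin, np.cos, np.tanh]

mse = []
for k in range(1, len(bases) + 1):
    Phi = np.column_stack([b(x) for b in bases[:k]])    # first k bases
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)         # least-squares fit
    mse.append(np.mean((Phi @ w - y) ** 2))

for k, m in enumerate(mse, 1):
    print(f"{k:2d} bases -> MSE {m:.5f}")
```

A steep drop for the first few bases followed by a plateau would support the selection strategy.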
### 4/10
- Try: take the bases with the lowest posterior covariance and rerun Bayesian regression on those (see the sketch below).
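A hedged sketch of this idea, using standard conjugate Bayesian linear regression over a fixed feature matrix (prior N(0, alpha^-1 I), noise variance 1/beta); the basis dictionary, hyperparameters, and the cutoff of 2 kept bases are placeholders.

```python
import numpy as np

def bayes_linreg(Phi, y, alpha=1.0, beta=100.0):
    """Posterior N(m, S) over weights: prior N(0, 1/alpha I), noise variance 1/beta."""
    S = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    m = beta * S @ Phi.T @ y
    return m, S

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = x**3 + 0.05 * rng.standard_normal(100)

# Placeholder dictionary of bases
Phi = np.column_stack([x, x**2, x**3, np.sin(3 * x), np.cos(3 * x), np.tanh(x)])

m, S = bayes_linreg(Phi, y)
post_var = np.diag(S)

# Keep the bases with the lowest posterior variance, then rerun the regression on them
keep = np.argsort(post_var)[:2]                 # cutoff of 2 is arbitrary
m_sub, S_sub = bayes_linreg(Phi[:, keep], y)
print("kept columns:", keep, "posterior means:", np.round(m_sub, 3))
```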
### Mon. 4/4
- TODOS:
- Test negative log likelihood (test a function that fits and a function that doesn't; the former should have much lower loss than the latter)
- Max
- debug GPs and RFFs (the posterior predictives from the two formulations should match; the two equations are written out at the end of this section)
- Max
- Hypotheses:
1. If the function is specified by the sum of two bases, then the posterior variances of these bases are lower and their posterior means are higher than those of the rest of the bases, regardless of the number of bases. [Done, kinda - needs extra verification]
- Lucy, will add relevant code to codebase
2. If the function is specified by the _weighted_ sum of two bases, then the posterior variances of these bases are lower and their posterior means are higher than those of the rest of the bases, regardless of the number of bases. (E.g., do the posterior means become roughly w=2 for x^3 and w=1 for x^2 if the function is 2x^3 + x^2?)
- Lucy
3. If the function is specified by a sum of n bases (n >> 2), then this phenomenon occurs for those bases (test e.g. x^3 + 2e^x + log x + arctan(x))
- Lucy, Max
4. How does the number of basis functions being "chosen" relate to the uncertainty?
- Geoffrey
5. If the function is *approximated* by a sum of two bases (say, the true function is $x^3$ and the basis is $(x-0.001)^3$), then the posterior variances of these bases are lower (but not as low) and their posterior means are lower
- Max
6. Train an NLM on a function, take the bases with the highest weights/lowest covariances, rerun the Bayesian regression, and see how the fit (negative log likelihood) and the uncertainties look. (See if it approximates the function.)
- Geoffrey
7. Relate the uncertainty to the weights/variances of the posterior over functions. We can do this by placing different weights on the prior over functions and then just looking at the prior predictive (as opposed to looking at the posterior predictive after training).
- Geoffrey
- Overall Hypothesis:
- If an NLM has poor uncertainty after training, it is because some of the bases are being "chosen": they have lower variance in the posterior and their (potentially weighted) sum approximates the function, whereas the other bases have lower mean and lower variance. (Still unverified.)
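For reference, these are the two posterior-predictive expressions that should agree (standard conjugate Bayesian linear regression over bases $\phi(x)$ with prior $w \sim \mathcal{N}(0, \alpha^{-1}I)$ and noise variance $\sigma^2 = 1/\beta$; the notation is ours, not fixed elsewhere in these notes). The per-basis posterior means $m_j$ and variances $S_{jj}$ are the quantities the hypotheses above refer to.

$$
\begin{aligned}
&\textbf{Weight space: } S^{-1} = \alpha I + \beta\,\Phi^\top \Phi,\qquad m = \beta\, S\, \Phi^\top y,\\
&\qquad p(y_* \mid x_*, \mathcal{D}) = \mathcal{N}\!\left(\phi(x_*)^\top m,\;\; \phi(x_*)^\top S\,\phi(x_*) + \sigma^2\right),\\[4pt]
&\textbf{Function space (GP), } k(x,x') = \alpha^{-1}\phi(x)^\top\phi(x'):\\
&\qquad p(y_* \mid x_*, \mathcal{D}) = \mathcal{N}\!\left(k_*^\top (K+\sigma^2 I)^{-1} y,\;\; k(x_*,x_*) - k_*^\top (K+\sigma^2 I)^{-1} k_* + \sigma^2\right).
\end{aligned}
$$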
### Wed. 3/30
- Test posterior contraction when you duplicate features (hypothesis: it doesn't increase when you do this)
- Current ideas about project direction
- It'd be nice to have diversity -> good uncertainty, where diversity is roughly how many bases in the posterior are being used and how diverse that set of bases is. If we can correlate this metric with good uncertainty, then given a new model we can compute the metric and be somewhat sure of how good the uncertainty is
- We need bases that don't fit the data well, maybe this means the covariance in the posterior is high?
Weiwei:
- RFF:
- plot the posterior predictive vs. the number of bases as the number of bases increases
- put a prior on the likely bases, see how the posterior changes / favors these bases
- changing noise
- histogram of weights (with and without one true basis)
- Potential hypothesis on why the NLM gives bad uncertainty: low variance on the chosen bases, or the weighted bases add up to a function that looks like the real data.
- Add one true basis to Legendre experiment (as well as other bases we tried)
- figure out the length scale of GP
- RFF with weights drawn from Normal(0,1) (see the sketch below)
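A minimal sketch of the RFF prior draws noted above, assuming the standard RBF-kernel RFF construction (normally distributed frequencies, uniform phases); the length scale and feature count are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_features, lengthscale = 50, 1.0
x = np.linspace(-2, 2, 200)

# Random Fourier features cos(w x + b) approximating an RBF kernel
W = rng.normal(0.0, 1.0 / lengthscale, n_features)             # frequencies
b = rng.uniform(0.0, 2 * np.pi, n_features)                    # phases
Phi = np.sqrt(2.0 / n_features) * np.cos(np.outer(x, W) + b)   # (200, n_features)

# Prior predictive: weights drawn from N(0, 1)
for _ in range(10):
    w = rng.standard_normal(n_features)
    plt.plot(x, Phi @ w, alpha=0.6)
plt.title("Prior draws: RFF basis, N(0,1) weights")
plt.show()
```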
### Mon. 3/28
POTENTIAL METRICS:
- Posterior Contraction (for NLMs; see the sketch after this list)
- Uncertainty Area
- Variance of Uncertainty (diversity)
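A hedged sketch of the posterior contraction metric, under one common definition (average relative shrinkage of the weight variances from prior to posterior); the exact definition we end up using may differ, and the conjugate-regression example below is only for illustration.

```python
import numpy as np

def posterior_contraction(prior_var, post_cov):
    """Mean of 1 - posterior variance / prior variance over the weights."""
    return np.mean(1.0 - np.diag(post_cov) / prior_var)

# Illustration with conjugate Bayesian linear regression (prior N(0, 1/alpha I), noise 1/beta)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = x**3 + 0.05 * rng.standard_normal(100)
Phi = np.column_stack([x, x**2, x**3])

alpha, beta = 1.0, 100.0
S = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
print("posterior contraction:", posterior_contraction(1.0 / alpha, S))
```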
EXPERIMENT 1: Investigating how additional RFFs affect the posterior predictive
- Implementations needed:
- A function that, given posterior means, plots a histogram of the means
- A function that, given posterior covariances, plots a histogram of the variances
EXPERIMENT 2: Bakeoff of how different models change as we increase the number of bases
- Sets of models:
- NLM with TanH
- NLM with LeakyReLU
- RandomFourierFeatures
- Legendre Polynomials
- Implementations needed:
- Need to have potential metrics implemented well first
- Investigate how the following bases
MISCELLANEOUS:
- Add uncertainty area, uncertainty variance to the bottom of the posterior predictive plot
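A minimal sketch of the two annotations above, assuming the posterior predictive standard deviation `pred_std` is already available on a grid (how it is computed depends on the model and is not shown here); the definitions below are our working assumptions.

```python
import numpy as np

def uncertainty_area(x_grid, pred_std):
    """Area under the predictive-std curve over the plotted range."""
    return np.trapz(pred_std, x_grid)

def uncertainty_variance(pred_std):
    """Variance of the predictive std across the grid (a rough 'diversity' of the uncertainty)."""
    return np.var(pred_std)

# Toy usage: a predictive std that grows away from x = 0
x_grid = np.linspace(-2, 2, 200)
pred_std = 0.1 + 0.5 * np.abs(x_grid)
print(uncertainty_area(x_grid, pred_std), uncertainty_variance(pred_std))
```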
### Mon. 3/21
- M(B,D), doesn't really make sense to include ID and OOD because they are open problems of their own
- Instead of looking at good uncertainty vs. bad uncertainty, look at good diversity vs. bad diversity (eg. more variance in the middle area in the posterior predictive)
- Experiment: one basis is correct, the others are RFFs (cos(wx+b)). As the number of random bases increases, test which bases are selected. It might be that when the number of random bases increases, the posterior chooses the random ones as well.
- what does the posterior do? does it collapse on the correct one?
- complexity of models relative to the data (how complicated is the dataset?)
- scaling the bases? different training regimes
- Ideas:
- test "indicator functions"
## Week 7
- Other people Weiwei works with working on:
- Bayes Factor + Marginal Log likelihood to determine fit of the model
- but these only work with non-nested models (e.g. not for NLMs - two NLMs could have the same number of parameters but completely different parameters)
- approximate the bases with polynomials?
- Let's not use ReLU networks, so let's choose a bounded activation
- reach out to Yaniv for legacy codebase
- random initializations correspond to a random Fourier feature basis (the latter is one layer with randomly sampled weights and cosine activations)
- training has a smaller effect than initialization -> some random bases are better than others
- From infinite basis team - maybe the larger the width, the more similar training is to initialization?
- from weiwei: ntk theory tells us that the weights don’t change much during training for large networks
## Week 6
### Wed. 3/3
- using kernel approximation (e.g. we know infinitely many random bases form a GP - maybe the closer we get to these bases, the better the uncertainty?)
- How many bases do we need to differentiate between random and trained bases?
- So if initializations mostly determine the answer, we still need a metric for a good basis.
- Relationship to BIC/AIC metrics (model fit - model degrees of freedom)
- Conservative Uncertainty Estimates... paper
## Week 5
### Sat. 2/25
1. Seems to have found a correlation between the uncertainty area obtained from random bases and from trained bases
- hypothesis: only dependent on initialization. Maybe pivot the problem to studying how to get a good initialization?
2. Dropping out bases one by one - it seems like non-smooth functions do affect the prior predictive more
- maybe get a subset of the top n (e.g. 3) good bases?
3. BNN, ANN -> choose good points in the prior; can we get good uncertainties?
4. structural and parametric uncertainty in epistemic uncertainty
### Wed. 2/23
1. Need to care about both data fit and uncertainty.
2. Maybe fix the number of bases and see what types of bases lie on the Pareto-optimal frontier for data fit and uncertainty
3. Look at the linear combination of bases with the weights instead of just looking at the bases
4. TODO: How bases affect prior expressivity (fix the number of basis functions)
- 100 datapoints and 15-20 bases
- draw random weights and plot linear combinations of basis functions
5. TODO: Connect prior uncertainty with posterior uncertainty
6. Span of basis functions, but with N(0,1) as weights (see the sketch below)
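A minimal sketch of the prior-draw TODO above (draw random weights and plot linear combinations of the basis functions, with N(0,1) weights). A Legendre basis is used purely as a stand-in here, echoing the Legendre experiments mentioned elsewhere in these notes.

```python
import numpy as np
import matplotlib.pyplot as plt
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)       # ~100 datapoints, as above
deg = 15                          # 15-20 bases, as above

Phi = legendre.legvander(x, deg)  # (100, deg + 1) Legendre basis matrix

# Prior predictive: linear combinations with weights drawn from N(0, 1)
for _ in range(10):
    w = rng.standard_normal(Phi.shape[1])
    plt.plot(x, Phi @ w, alpha=0.6)
plt.title("Prior draws: Legendre basis, N(0,1) weights")
plt.show()
```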
https://www.when2meet.com/?14724590-3OHsf
## Week 4
### Fri. 2/18
Things to investigate/topics to consider:
MEASURING UNCERTAINTY:
1) GP uncertainty? (Use GPyTorch and train on the cubic gap dataset; see the sketch after this list)
2) Measuring 2nd derivative of the 95% confidence intervals?
TESTING BASES FOR PERFORMANCE
3) Random bases performance (calculate negative log likelihood of the data under the random bases)
4) Craft bases by hand and create hypotheses as to which ones will give good uncertainty + test those hypotheses
4a) e.g. Polynomial/legendre polynomial/random Fourier Features
TRAINING DATA
5) Currently using (-1, -0.2) U (0.2, 1). Can we do other intervals and see how this affects the uncertainty? (Create hypotheses and test)
5a) Also how do random draws of the data affect the same model? (Create testing datasets and test negative log likelihood)
MODEL INFORMATION
6) How do activation functions affect negative log likelihood and uncertainty?
7) Using MAP training?
MEASURING DIVERSITY OF BASES
8) Using cosine similarity of the gradients of the functions?
9) Test how the magnitude of the bases affects each quantity (effective dimensionality, cosine similarity, prediction uncertainty)
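A hedged GPyTorch sketch for item 1, following the Simple GP Regression example linked later in these notes; the cubic gap dataset construction, kernel choice, and training settings below are placeholders.

```python
import torch
import gpytorch

# Cubic "gap" dataset: train on (-1, -0.2) U (0.2, 1), leave the middle empty
train_x = torch.cat([torch.linspace(-1, -0.2, 40), torch.linspace(0.2, 1, 40)])
train_y = train_x**3 + 0.05 * torch.randn_like(train_x)

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, x, y, likelihood):
        super().__init__(x, y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

# Type-II MLE of the kernel hyperparameters (length scale, output scale, noise)
model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
for _ in range(100):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()

# Posterior predictive over the full range, including the gap
model.eval(); likelihood.eval()
test_x = torch.linspace(-1, 1, 200)
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    pred = likelihood(model(test_x))
    lower, upper = pred.confidence_region()   # roughly mean +/- 2 std
print("learned length scale:", model.covar_module.base_kernel.lengthscale.item())
```

The printed length scale also touches the "figure out the length scale of GP" item from 3/30.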
### Sun. 2/13
- How to measure uncertainty better than integrating the area?
- visualizing - bases that have good and bad bubbles
- perturbing the function along eigenvector directions with large eigenvalues should lead to a poor fit of the data, while perturbing it along directions with small eigenvalues should keep the data fit but change the fit in the uncertain region (see the sketch at the end of this list)
- functional analysis? frequency analysis?
- need a lot of bad and good basis examples
- hypothesis about the number of zero functions depending on the data
- maybe it's useful in transfer learning
- fix a large number of nodes, retrain and obtain effective dimensionality vs uncertainty.
- visualize the uncertainty/basis functions
- see which basis functions are picked (check out linear combinations for functions that are outliers)
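A hedged sketch of the perturbation idea above, interpreting the "directions" as eigenvectors of the least-squares Gram matrix Phi^T Phi (this interpretation is our assumption): perturbing the fitted weights along a large-eigenvalue direction should hurt the data fit, while a small-eigenvalue direction should barely change it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(-1, -0.2, 50), rng.uniform(0.2, 1, 50)])  # gap data
y = x**3 + 0.05 * rng.standard_normal(len(x))
Phi = np.column_stack([x**d for d in range(6)])                           # polynomial basis

w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
evals, evecs = np.linalg.eigh(Phi.T @ Phi)        # eigenvalues in ascending order

def mse(w_pert):
    return np.mean((Phi @ w_pert - y) ** 2)

eps = 0.5
print("unperturbed MSE:          ", mse(w))
print("perturbed (small-eig dir):", mse(w + eps * evecs[:, 0]))   # fit barely changes
print("perturbed (large-eig dir):", mse(w + eps * evecs[:, -1]))  # fit degrades
```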
## Week 3
### Wed. 2/9
- Task: come up with metric on diversity of bases (Lucy, Max)
- effective dimensionality of the correlation matrix? (try different correlation matrices, e.g. outlier-removing matrices; see the sketch after this list)
- try wrt data and wrt entire basis
- cosine similarity?
- metric on final function (Geoffrey)
- total variation?
- area in between curve? (Max)
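A hedged sketch of the two candidate diversity metrics above, evaluated on the basis functions sampled at the inputs. The effective-dimensionality formula sum_i lambda_i / (lambda_i + alpha) borrows the functional form from the parameter counting paper, here applied to the correlation matrix of the bases; alpha and the toy basis matrix are placeholders.

```python
import numpy as np

def effective_dimensionality(C, alpha=1.0):
    """sum_i lambda_i / (lambda_i + alpha) over the eigenvalues of C."""
    lam = np.linalg.eigvalsh(C)
    return float(np.sum(lam / (lam + alpha)))

def mean_abs_cosine_similarity(Phi):
    """Average absolute pairwise cosine similarity between basis columns."""
    U = Phi / np.linalg.norm(Phi, axis=0, keepdims=True)
    G = np.abs(U.T @ U)
    d = G.shape[0]
    return float((G.sum() - d) / (d * (d - 1)))   # exclude the diagonal

# Toy basis matrix: rows = data points, columns = basis functions
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
Phi = np.column_stack([x, x**2, x**3, np.tanh(x), np.sin(3 * x)])

C = np.corrcoef(Phi, rowvar=False)   # correlation matrix of the basis functions
print("effective dimensionality:", effective_dimensionality(C))
print("mean |cosine similarity| :", mean_abs_cosine_similarity(Phi))
```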
Max's resources on GP's/Kernels/Feature Maps
- Kernels and Feature maps: Theory and intuition:
https://xavierbourretsicotte.github.io/Kernel_feature_map.html
- GPs section notes by CS229 Stanford: http://cs229.stanford.edu/section/cs229-gaussian_processes.pdf
- Understanding the NTK Blog: https://rajatvd.github.io/NTK/
- NTK Paper: http://papers.nips.cc/paper/8076-neural-tangent-kernel-convergence-and-generalization-in-neural-networks.pdf
- Kernel Regression: https://towardsdatascience.com/kernel-regression-made-easy-to-understand-86caf2d2b844
- GP's for ML Chapter 2 on Regression: http://www.gaussianprocess.org/gpml/chapters/RW2.pdf
### Mon. 2/7
- Weiwei suggests plotting prior predictives: After training the first step, sample from prior and plot prior predictives
- metric to evaluate complexity of sets of bases (metric to compare two sets of bases)
- quantify complexity
- After having that metric, vary width etc. to see how it gets affected
- Regarding the architecture + basis question, we could look into hypotheses about what representations deep NNs and wide NNs learn
- What part of the data does each type of basis function capture?
Workflow layout:
1. Sets of different bases and the prior + posterior predictives they make. See if they're good or bad bases; **generalize that to get a metric of complexity (M)**
- see if using the parameter counting paper's metric is sufficient to tell them apart. Weiwei thinks it's not really the largest-eigenvector metric
- eigenvalues and determinants
- counting which function contributes to your data
- Can also try posterior contraction metric?
2. Use M to relate to all the different things like quality of uncertainty, transfer learning/representation learning, the architecture question / width-goes-to-infinity question.
### Sun. 2/6
Basis correlation:
- Since it's second to last layer, measuring linear correlation between the basis functions should be enough?
- effective dimensionality of correlation matrix of functions
- test metric on different data sets later
- does a diverse set of basis functions do better in transfer learning versus a similar set of basis functions?
- questions for weiwei
- measures of diversity between functions? (1)
- measures for quantifying uncertainty (2)
- what to do if we find a measure (1) that correlates with (2)?
- UNA paper reproducing results?
- simplest examples for testing transfer learning?
## Week 2
### Fri. 2/4
measure diversity of functions:
- variance at datapoints / linear correlations between functions
quantify uncertainty
- as you increase the number of basis functions, maybe some basis functions become useless (redundant) - prune?
- keep going with current directions, sunday put together presentation
### Wed. 2/3
1) Geoff
- Architecture (depth, width, size) affect uncertainty?
- Prior (weight initialization and prior on the Bayesian part) affect uncertainty?
- Hypothesis 1: Deeper -> less variation in the functions -> less good uncertainty
- Hypothesis 2: Wider -> more variation in the functions (??) -> better uncertainties
- Papers about Bayesian Deep Ensembles:
- Bayesian Deep Ensembles via the Neural Tangent Kernel
2) Lucy
- Regularization (MAP) - code up MAP and Maximum Marginal Likelihood from UNA
- Papers about GP's, joint learning
3) Max
- Visualizing these bases (?)
- Compare bases for regimes of good/poor uncertainties
- Good uncertainties -> more variety in the basis
- Bad uncertainties -> every basis function is similar
- Paper: An Empirical Study on The Properties of Random Bases for Kernel Methods
### Tues. 2/1
- [Deepnote](https://deepnote.com/project/NLM-xCFXvPK4QyqurzE66hiavA/%2Fnotebook.ipynb).
- Quick question: for Step 1 in NLM I thought it's also treated as a BNN (for all weights) and we take the MAP. But the code looks like it's just a fully-connected NN?
- Random idea: Apply expressiveness/ effective dimensionality criteria to [Subnetwork inference](https://arxiv.org/abs/2010.14689).
### Mon. 1/31
- **Research Meeting** 3:00pm
- given hypothesis, try to test it with quick fails (try hard to disprove?)
- Gaussian workshop from last semester (watch recording)
- A finite NLM is always a GP (the NN part is just a feature transformation; see the numerical check at the end of this section)
- should the kernel still be the NTK kernel?
- (Watch GP Workshop -> NLM Kernel/Feature map equivalence to GP)
- NNGP paper: if joint training, proven not GP in infinite case
- replicate the experiment in the UNA paper
- NLM/NN - which ones are more complex?
- typical basis is not complex enough to express uncertainty (UNA)
- metrics / better/worse representations
- uncertainties, representations, for meta learning etc.
- transfer learning / meta learning
- random idea - using neural network to output uncertainty
- quantifying a good basis?
- look at large/small NLM's and draw from prior dist.
- Do the complexity measures from ED also tell us about the complexity of bases and measures of uncertainty? We can run experiments to test.
- Todo: Bayesian deep ensemble paper
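A small numerical check of the "finite NLM is a GP" point above: treating the last-layer features as a fixed feature map phi, Bayesian linear regression with prior N(0, 1/alpha I) should give the same posterior predictive mean as a GP with kernel k(x, x') = phi(x)^T phi(x') / alpha. A random cosine feature map stands in for a trained network here (an assumption for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, noise_var = 1.0, 0.01

# Stand-in "last layer" feature map: random cosine features
W = rng.normal(size=20)
b = rng.uniform(0, 2 * np.pi, 20)
phi = lambda x: np.cos(np.outer(x, W) + b)

x_train = rng.uniform(-1, 1, 30)
y_train = x_train**3 + 0.1 * rng.standard_normal(30)
x_test = np.linspace(-1, 1, 5)

# Weight-space (Bayesian linear regression) posterior predictive mean
Phi = phi(x_train)
S = np.linalg.inv(alpha * np.eye(20) + Phi.T @ Phi / noise_var)
m = S @ Phi.T @ y_train / noise_var
mean_weight_space = phi(x_test) @ m

# Function-space (GP) posterior predictive mean with k(x, x') = phi(x) phi(x')^T / alpha
K = Phi @ Phi.T / alpha
k_star = phi(x_test) @ Phi.T / alpha
mean_function_space = k_star @ np.linalg.solve(K + noise_var * np.eye(30), y_train)

print(np.allclose(mean_weight_space, mean_function_space))   # expect True
```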
Todo before Wednesday:
## Week 1
### High-level Weekly Summary
Reading papers
Setting up research
### Sat. 1/29
- Created Glossary file - put related/common terminology in there!
- look into code for [GPytorch](https://github.com/cornellius-gp/gpytorch/blob/master/examples/01_Exact_GPs/Simple_GP_Regression.ipynb)
### Fri. 1/28
Notes and resources from Fundamentals of Research with Weiwei:
- [Slides](https://docs.google.com/presentation/d/1EQIupyrH7z2sUUH99CMQQJc9JzV8reTzQ6RYOeWQ1tE/edit#slide=id.g1108058f42b_0_30)
- [Example Literature Review](https://hackmd.io/@onefishy/BJcPrhRLF)
- [Example Weekly Presentation](https://docs.google.com/presentation/d/1q965TWgSV19v1rOOxtD4h6FoWNswtwBynDhxn-N6dUo/edit#slide=id.g7e13ba820d_0_55)
- [Example Weekly Research Notes](https://hackmd.io/@onefishy/SkT7VZAat)
- Ph.D. students: Yaniv, Jiayu, Beau (https://dtak.github.io/people/)
- Moving this over to hackMD
**Meeting Lucy/Geoffrey/Max 10:30am**
Potential research directions we talked about:
- architecture reasoning for NLM bad uncertainty in in-between region
- linking UNA work to effective dimensionality
Todo:
- Look into motivations for NLM (Benchmarking NLM’s paper - active learning, RL, AutoML)
- NLM can be interpreted as a GP with some covariance kernel? (BNN -> GP, other group’s paper)
- Uncertainty tradeoff with speed
- UNA and this paper on GPP
- Relationship between parameterization and uncertainty
- Potential reason for a functional prior: if we just put a prior on the weights, algorithms converge to the same area -> functionally equivalent
- The paper addresses why the NLM does not have nice uncertainty: is it because of the architecture or the way we fit it?
- UNA’s answer is training objective
- We can look into the architecture reasoning for uncertainty
UNA:
We can look into other ways of forcing functional diversity (related to effective dimensionality), e.g. forcing orthogonal bases (the most important eigenfunctions capture the important part of the data) or forcing high effective dimensionality
- Eigenfunctions of GP and uncertainty of bases. For GPs basis functions are always orthogonal
- Explicitly force orthogonal bases (but also looks like what LUNA is doing)
Potential direction- how it connects to effective dimensionality
Topics to read up on:
- Eigenfunctions and GP (4.3 http://www.gaussianprocess.org/gpml/chapters/RW4.pdf)
- GPytorch
- Karhunen-Loève decomposition
- MacKay's paper on parameter counting (referenced in Rethinking Parameter Counting)
- Take another look at BBVI
- Training of NLM’s
- Why do we have to learn mean and uncertainty together?
### Thurs. 1/27
Max
- Read through some parts of the following papers:
- Snoek (Scalable Bayesian Optimization)
- Rethinking Parameter Counting
- NLM
- Bayesian Neural Networks from a Gaussian Process Perspective
- https://www.youtube.com/watch?v=osEWq2sBobc
### Wed. 1/26
Max: put all papers into a papers folder in the drive
Meeting Geoffrey/Max 1:00pm
TODOs:
Papers to check out/read:
- Scalable Bayesian Optimization using Deep Neural Networks
(This is the paper that introduces NLM’s, 2015)
- Rethinking Parameter Counting
- Openreview: https://openreview.net/forum?id=eBHq5irt-tk
- Look into NLM’s performances on other tasks, potentially
- Understanding BBVI/BNN’s
- PAC-Bayes
- Neurips Competition https://izmailovpavel.github.io/neurips_bdl_competition/
### Tues. 1/25
Papers recommended by Geoffrey:
- Bayesian Deep Learning and a Probabilistic Perspective of Generalization
- Surprises in High-Dimensional Ridgeless Least Squares Interpolation
- Double-descent curves in neural networks: a new perspective using Gaussian processes
Other Papers found on the way:
- Benchmarking the Neural Linear Model for Regression
- Kernel Methods in Machine Learning
### Mon. 1/24
Papers to read:
- Uncertainty Aware Bases
- Rethinking Parameter Counting
Other papers recommended by Weiwei:
- Latent Derivative Bayesian Last Layer Networks
- An Empirical Study on The Properties of Random Bases for Kernel Methods
- Do Wide and Deep Neural Networks learn the same things?
**Research Meeting Notes**:
NLM
- Bayesian Kernel Regression with kernel learned by neural network
- Because we are learning instead of choosing the basis
- What kinds of basis are good for which tasks? Why?
- ex tasks: OOD detection, transfer learning, generalization, etc…
- metric that measures "goodness" of basis for a task?
- Quantification of expressiveness of the model given the basis
- "# of parameters" in model is not easy to think about in Bayesian
- papers: Rethinking parameter counting in deep models
- LUNA paper shows not all bases are equal