# S22 Model Selection Summary
TL;DR: Given two models that aren't fully nested, things like the Bayes Factor behave weirdly. How do we (1) examine this weird behavior experimentally (2) quantify the "nestedness" of models (3) get better selection criteria?
## Background
Consider two competing models, $M_1$ and $M_2$, belonging to different model classes. We want to use them to describe data $D$.
The **Bayes Factor** is $BF=\frac{P(D|M_1)}{P(D|M_2)}=\frac{P(M_1|D)}{P(M_2|D)}\cdot \frac{P(M_2)}{P(M_1)}$, where each marginal likelihood $P(D|M_i)$ integrates over all possible parameters of that model. If equal prior probabilities are assigned to both models, the BF equals $\frac{P(M_1|D)}{P(M_2|D)}$, the **posterior odds ratio**.
Closely related metrics include the Bayesian Information Criterion and Akaike Information Criterion (**BIC**, **AIC**), which assign a score to each model (choose the model with the smaller score). If there are $n$ data points and $k$ estimated parameters, BIC is $k\ln n-2\ln P(D|M)$ and AIC is $2k-2\ln P(D|M)$, where $P(D|M)$ is evaluated at the MLE.
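For reference, a minimal sketch of these scores (function names and arguments are ours, not the project's code; `log_lik_mle` is the log-likelihood evaluated at the MLE):

```python
import math

def bic(log_lik_mle: float, k: int, n: int) -> float:
    """Bayesian Information Criterion: k * ln(n) - 2 * ln L(MLE)."""
    return k * math.log(n) - 2.0 * log_lik_mle

def aic(log_lik_mle: float, k: int) -> float:
    """Akaike Information Criterion: 2k - 2 * ln L(MLE)."""
    return 2.0 * k - 2.0 * log_lik_mle

# Lower is better: e.g. compare bic(ll_1, k_1, n) against bic(ll_2, k_2, n).
```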
We want to use these metrics to choose the "better" model. That is, when one model better describes the data-generating process (DGP), e.g. when the data *are* generated from that model, our metric should choose it; and when both models describe the data equally well, we should pick the simpler one, which uses fewer parameters.
$M_1$ and $M_2$ are **nested** if one can be obtained from the other by restricting parameters. Chib and Kuffner (2016) show that the BF is consistent in this case, but that BIC is in general inconsistent in non-nested settings. Casella et al. (2009) report no general consistency result in non-nested settings. Hong and Preston (2005) show that the BF can fail to be consistent. Giron et al. (2010) show consistency, but only for linear models under technical restrictions and using *intrinsic priors*.
An additional question: does the BF still work well when the models only *partially* overlap (e.g. share part of their function basis)? And how can we quantify the degree of overlap?
## Setup
We use PyTorch for all code. All data is of the form $(x,y)$, and we try to predict $y$ from $x$. All $x$ lie in $[-1, 1]$, either sampled uniformly at random or evenly spaced. We model $y$ as generated via a noisy basis transformation $y=w^T \phi(x)+\epsilon$, $\epsilon\sim \mathcal{N}(0,\sigma_{out}^2)$, with $\sigma_{out}^2$ fixed. When calculating the BF, our model priors are equal (so the BF is the same as the posterior odds ratio).
Different models use different feature transformations $\phi$, which induce different functional bases. We also experiment with the function space created by a neural linear model (NLM) trained on the data, taking all but the last layer as the feature representation.
All our models are based on Gaussian linear regression, just with different feature representations/parameter ranges: $w\sim \mathcal{N}(\mu_w,\Sigma_w)$ with conjugate update $w|D\sim \mathcal{N}(\hat{\mu}_w, \hat{\Sigma}_w)$, where $\hat{\Sigma}_w=\left(\sigma_{out}^{-2}\,\phi(x)^T\phi(x)+\Sigma_w^{-1}\right)^{-1}$ and $\hat{\mu}_w=\hat{\Sigma}_w\left(\sigma_{out}^{-2}\,\phi(x)^T y+\Sigma_w^{-1}\mu_w\right)$. This gives a simple way to calculate the marginal likelihood $P(D|M_i)$.
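A minimal PyTorch sketch of this update and of the resulting log marginal likelihood (the function names are illustrative, not the project's actual code; it assumes a precomputed feature matrix `Phi` $=\phi(x)$):

```python
import torch

def posterior(Phi, y, mu_w, Sigma_w, sigma_out):
    """Conjugate update for Bayesian linear regression y = Phi @ w + eps.

    Phi: (n, d) feature matrix, y: (n,) targets. Returns the posterior
    mean and covariance of the weights w.
    """
    prec_prior = torch.linalg.inv(Sigma_w)                    # Sigma_w^{-1}
    prec_post = Phi.T @ Phi / sigma_out**2 + prec_prior       # posterior precision
    Sigma_post = torch.linalg.inv(prec_post)
    mu_post = Sigma_post @ (Phi.T @ y / sigma_out**2 + prec_prior @ mu_w)
    return mu_post, Sigma_post

def log_evidence(Phi, y, mu_w, Sigma_w, sigma_out):
    """log P(D | M): y ~ N(Phi mu_w, Phi Sigma_w Phi^T + sigma_out^2 I)."""
    n = y.shape[0]
    mean = Phi @ mu_w
    cov = Phi @ Sigma_w @ Phi.T + sigma_out**2 * torch.eye(n)
    return torch.distributions.MultivariateNormal(mean, cov).log_prob(y)
```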
We believe that there *should* exist a closed-form expression for the expected (log) BF. If $M_1$ and $M_2$ induce probability distributions $p_1$ and $p_2$ over $y$ given $x$, and an unknown model $M^*$ induces $p^*$, then $E_{y\sim P(y|x,M^*)}(\log BF(M_1, M_2))=D_{KL}(p^*\|p_2)-D_{KL}(p^*\|p_1)$, directly from the definitions of cross-entropy and KL divergence. If $M^*=M_1$, this reduces to $D_{KL}(p_1\|p_2)$. The "Expected Bayes Factor" diagrams below use this calculation.
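This follows by writing each expected log-density as a (negative) entropy minus a KL term:

$$E_{y\sim p^*}\left[\log \tfrac{p_1(y)}{p_2(y)}\right] = \left(-H(p^*) - D_{KL}(p^*\|p_1)\right) - \left(-H(p^*) - D_{KL}(p^*\|p_2)\right) = D_{KL}(p^*\|p_2) - D_{KL}(p^*\|p_1).$$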
The 2D setting (scalar $x$ and $y$) allows us to visualize the **functional diversity** of a model. Below are diversity visualizations for:
* an NLM feature basis;
* a series of linear bases ($w_0+w_1x$), where the prior for $w_1$ is centered over certain intervals, or equivalently angle ranges (scenario A);
* polynomial bases of increasing degree ranges with standardized variance over $[-1, 1]$ (scenario B);
* sinusoidal bases formed from random Fourier samples with normed variance, where we increase the number of frequencies sampled (scenario C).
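As a rough sketch, the feature maps behind scenarios B and C look something like the following (the variance standardization/norming steps are omitted, and the function names are ours):

```python
import torch

def poly_basis(x, degree):
    """Scenario B: polynomial features 1, x, x^2, ..., x^degree."""
    return torch.stack([x**k for k in range(degree + 1)], dim=-1)

def fourier_basis(x, freqs):
    """Scenario C: sinusoidal features for a set of sampled frequencies."""
    return torch.cat([torch.sin(torch.outer(x, freqs)),
                      torch.cos(torch.outer(x, freqs))], dim=-1)

# Scenario A uses the plain linear basis [1, x], with the prior on w_1
# centered over a chosen interval (equivalently, a range of angles).
```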




To minimize numerical instability we work with $\log BF$ instead of $BF$. One way to empirically measure the quality of the Bayes Factor is to **record its log values** over several simulations and average them. We can also look at its **success rate**: in each simulation, generate data from one instantiation of $M_1$ (i.e. choose $w\sim p(w|M_1)$ and generate $y\sim p(y|x,w,M_1)$), calculate the BF, and record the *proportion* of trials in which it is greater than $1$ (equivalently, its log is positive). Note that the success rate ignores the *magnitude* of the BF.
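A sketch of this simulation loop, assuming hypothetical `sample_data` and `log_evidence` methods on each model (draw $w$ and then $y$ from the model's prior, and compute $\log P(D|M)$, respectively):

```python
import torch

def bf_success_rate(model1, model2, x, n_trials=500):
    """Estimate the proportion of simulations with log BF > 0 when the data
    is generated from model1. The method names here are illustrative."""
    log_bfs = []
    for _ in range(n_trials):
        y = model1.sample_data(x)            # w ~ p(w|M1), y ~ p(y|x,w,M1)
        log_bf = model1.log_evidence(x, y) - model2.log_evidence(x, y)
        log_bfs.append(log_bf)
    log_bfs = torch.stack(log_bfs)
    return (log_bfs > 0).float().mean(), log_bfs.mean()  # success rate, avg log BF
```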
## Quantifying Model Overlap
How do we measure the degree to which two models overlap? In scenario A from the Setup section, where the prior for $w_1$ is centered over different ranges, we would expect two models whose ranges are closer together to overlap more.
Start by considering the **prior predictive cross-entropy** (PPCE), $-E_{y\sim p(y|x,M_1)}\left(\log p(y|x,M_2)\right)$. (The **posterior predictive cross-entropy** instead takes the expectation of $\log p(y|x,D,M_2)$, where $M_2$ has been updated with the data.) This is nice because the prior predictives are multivariate Gaussians, so the cross-entropy has a closed form (as do related quantities, e.g. $E_{x\sim \mathcal{N}(\mu_1, \Sigma_1)}(\mathcal{N}(x; \mu_2, \Sigma_2))=\mathcal{N}(\mu_2;\mu_1, \Sigma_1+\Sigma_2)$); we can similarly derive a closed form for the Bayes factor. Calculating the PPCE on the scenario A examples mentioned gives a promising plot:
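For reference, the standard closed form for the cross-entropy between two multivariate Gaussians, which we assume is what gets evaluated on the two prior predictives (the function name is ours):

```python
import math
import torch

def gaussian_cross_entropy(mu1, Sigma1, mu2, Sigma2):
    """H(p1, p2) = -E_{x ~ N(mu1, Sigma1)}[log N(x; mu2, Sigma2)], in closed form."""
    d = mu1.shape[0]
    diff = mu1 - mu2
    Sigma2_inv = torch.linalg.inv(Sigma2)
    return 0.5 * (d * math.log(2 * math.pi)
                  + torch.logdet(Sigma2)
                  + torch.trace(Sigma2_inv @ Sigma1)
                  + diff @ Sigma2_inv @ diff)
```

For the PPCE, the two prior predictives from the setup above are $\mathcal{N}\!\left(\phi_i(x)\mu_{w,i},\ \phi_i(x)\Sigma_{w,i}\phi_i(x)^T+\sigma_{out}^2 I\right)$ for $i=1,2$.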

We also considered analyzing the similarity of variances in Taylor and Fourier decompositions. The former works well, as expected, for polynomial feature transformations (scenario B), but behaves very erratically for NLMs. Fourier decompositions have not yet been explored in depth.

Recently we turned our attention to testing other metrics that might also convey overlap. Consider a simple setting with two one-dimensional probability distributions $p$ and $q$, where we want to determine the extent to which $q$ is nested inside $p$. Intuitively, if the mass of one PDF is visually contained mostly within the other, we should say that the models have higher overlap, i.e. some function $H(q,p)$ should be maximized (or minimized, depending on its orientation).
To test this, we generate a few Gaussian distributions with different means and variances and calculate (a) the prior predictive cross-entropies between $p$ and $q$ in both directions; (b) the proportion of $p$'s mass that falls where $q$ has non-negligible density, $E_{x\sim p}(I(q(x)>\epsilon))$; and (c) the **normed prior predictive cross-entropy**: the PPCE with $q$ scaled by the ratio of the two distributions' modes (intuitively, we "rescale" $q$ so that the maximum heights of the two densities match), $-E_{x\sim p} \left(\log\left(q(x)\cdot \frac{p(\mu_p)}{q(\mu_q)}\right)\right)$.
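A minimal sketch of these three metrics for 1D Gaussians, estimated by Monte Carlo (the threshold `eps`, the sample count, and the example distributions are illustrative):

```python
import torch

def overlap_metrics(p, q, eps=1e-3, n_samples=10_000):
    """Monte Carlo estimates of the overlap metrics for 1D torch.distributions."""
    x = p.sample((n_samples,))
    ppce = -q.log_prob(x).mean()                            # H(p, q)
    coverage = (q.log_prob(x).exp() > eps).float().mean()   # E_{x~p}[I(q(x) > eps)]
    # Normed PPCE: rescale q so its mode height matches p's.
    log_scale = p.log_prob(p.mean) - q.log_prob(q.mean)     # log(p(mu_p) / q(mu_q))
    normed_ppce = -(q.log_prob(x) + log_scale).mean()
    return ppce, coverage, normed_ppce

p = torch.distributions.Normal(0.0, 1.0)
q = torch.distributions.Normal(0.5, 2.0)
print(overlap_metrics(p, q))
```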

Empirically, the plain cross-entropy best matches our intuitive sense of overlap, but the normed PPCE lines up best with the BF success rate. Notice these metrics are all directional (non-symmetric); to make them symmetric, we could simply add $H(p,q)$ and $H(q,p)$.
## Experiments with Bayes Factor
Consider five different model classes: lines with no bias ($w_1 x$), lines, cubics, and two NLMs with output layers of size $2$ and $64$. The BF success rate behaves interestingly and doesn't match up with the cross-entropy.



In the examples below, we consider polynomial bases of increasing degree (similar to scenario B). $d_1$ denotes the model the DGP is drawn from, and we measure the BF success rate when it is compared against models fit from $d_2$, when (1) both use polynomial bases up to the specified degree ($1,x,x^2,\dots$); (2) $d_1$ uses an even polynomial basis and $d_2$ an odd one; and (3) vice versa. The source models are plotted on the vertical axis and the alternative models on the horizontal axis (so the diagonal compares a model against itself; in the top left, the DGP comes from a degree-50 basis and we evaluate the BF against low-degree models).
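A sketch of the three basis configurations (the maximum degree here is illustrative):

```python
import torch

def monomial_basis(x, degrees):
    """Feature map containing the monomials x^k for k in `degrees`."""
    return torch.stack([x**k for k in degrees], dim=-1)

x = torch.linspace(-1, 1, 100)
d = 7                                                # illustrative max degree
full_basis = monomial_basis(x, range(0, d + 1))      # setting (1): 1, x, ..., x^d
even_basis = monomial_basis(x, range(0, d + 1, 2))   # even basis: 1, x^2, x^4, ...
odd_basis  = monomial_basis(x, range(1, d + 1, 2))   # odd basis:  x, x^3, x^5, ...
```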



The asymmetry of the success-rate plots runs counter to our intuition, since the latter two situations should be identical: in both we compare two non-nested models, one of which is the DGP. We believe one reason is that restricting the data to $[-1,1]$ limits the actual diversity of the functions, creating a disparity between how well odd polynomials can fit even ones on that range and vice versa.
Zooming in on a subset of the first setting, we also find that the BF success rate doesn't correlate with the cross-entropies (above), and a similar pattern holds even after we adjust the prior over the weights to have unit variance, accounting for dimension (below).


When working with linear models with priors at different angles (scenario A), we see the better-behaved results below, which seem to match up with the prior predictive cross-entropy.

## Comparing the Bayes Factor with Overlap Metrics
We dive further into how well metrics like the prior predictive cross-entropy match up with the BF success rate, working with models under scenario C and restricting our attention to the PPCE and normed PPCE. First consider the case where the models are nested (i.e. both sample the same frequencies). As in the polynomial case above, the BF success rate isn't symmetric: it still tends to prefer simpler models, even when the DGP is the more complex one. This manifests as a "smudge" of lower success rates above the diagonal. (We conjecture that this directional effect will vanish as we increase the number of data points, since simpler models can no longer fit the larger set of data points well.)
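For concreteness, one way to construct the nested, partially overlapping, and completely non-nested scenario C models used below is to control how many sampled frequencies the two models share (a hypothetical sketch; the sampling range and function name are illustrative, and the project code may draw frequencies differently):

```python
import torch

def sample_frequency_sets(n1, n2, n_shared, max_freq=10.0):
    """Draw frequency sets for two sinusoidal models with controlled overlap:
    n_shared frequencies are common to both; the rest are drawn independently."""
    shared = torch.rand(n_shared) * max_freq
    f1 = torch.cat([shared, torch.rand(n1 - n_shared) * max_freq])
    f2 = torch.cat([shared, torch.rand(n2 - n_shared) * max_freq])
    return f1, f2

# n_shared = min(n1, n2) makes one model nested in the other;
# n_shared = 0 gives completely non-nested models; values in between
# give partial overlap.
```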

This doesn't match up visually with any of the metrics discussed above. To measure this more quantitatively, we plot the BF success rates against values of the PPCE and normed PPCE, which give the following figures:


For models that are overlapping but not fully nested, and for models that are strictly non-nested, we obtain the analogous plots below. We control the degree of nestedness by changing which frequencies the two models sample from the Fourier basis. In each series of heatmaps, the first plot is the metric $H(p,q)$, the second is $H(q,p)$, and the third is the BF success rate.
* Overlapping models under prior predictive cross-entropy:


* Overlapping models under normed prior predictive cross-entropy:


* Completely non-nested models under prior predictive cross-entropy:


* Completely non-nested models under normed prior predictive cross-entropy:


A few observations emerge from looking at these plots in tandem. First, as the models become less nested, the BF success rate behaves more erratically, with the "smudge" near the diagonal disappearing completely for non-nested models. In the non-nested cases there does not seem to be a clear pattern, which shouldn't be surprising given the inconsistency results described in the literature.
In general, under both the PPCE and the normed PPCE, when the metric is not large (from inspection, roughly below $1000$) there is no clear relationship between the metric value and the BF success rate. This isn't very helpful, since the majority of our simulations fall in exactly that regime. We do, however, begin to see a linear relationship for metric values of larger magnitude, which are clearly associated with higher success rates (since if there is more nestedness, the BF behaves better).
Possible next steps to mull are:
* Experimenting with even more different model classes.
* Experimenting with data ranges beyond $[-1, 1]$, or working with multivariate data ($x\in \mathbb{R}^d$).
* Measuring the effectiveness of the BF with a metric other than just its success rate, e.g. the average log BF (which has been displayed in one or two examples above).
* Synthesizing the findings above to give recommendations or a heuristic for model selection in these ambiguous overlapping/non-nested situations, possibly by recommending an alternate metric.