# Evaluating model ensembles
Let's say we have a family of predictive distributions $p(y\vert x, \theta)$ and we have calculated a *posterior* over parameters $\rho(\theta)$. Now we have a test dataset $(x_n, y_n),\, n\leq N$, and we'd like to evaluate how good our posterior $\rho$ is.
There are three ways to do this:
### Extremely frequentist evaluation
This is perhaps the crudest possible way to evaluate our model distribution $\rho$. We calculate the average (frequentist) risk of a single $\theta$ sampled from $\rho$:
\begin{align}
\mathbb{E}_{\theta\sim\rho} \mathbb{E}_{(x, y)\sim \nu} \log p(y\vert x, \theta)
&\approx \mathbb{E}_{\theta\sim\rho} \frac{1}{N}\sum_{n=1}^N \log p(y_n \vert x_n, \theta)\\
&= \mathbb{E}_{\theta\sim\rho} \frac{1}{N} \log \prod_{n=1}^N p(y_n \vert x_n, \theta)
\end{align}
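As a minimal sketch of this estimator (assuming hypothetical helpers `sample_theta(rng)`, which draws a single $\theta$ from $\rho$, and `log_lik(theta, x, y)`, which returns the per-datapoint log-likelihoods $\log p(y_n\vert x_n, \theta)$):

```python
import numpy as np

def single_model_risk(sample_theta, log_lik, x_test, y_test, num_samples=100, seed=0):
    """Monte Carlo estimate of E_{theta ~ rho}[ (1/N) sum_n log p(y_n | x_n, theta) ]."""
    rng = np.random.default_rng(seed)
    per_model = []
    for _ in range(num_samples):
        theta = sample_theta(rng)                                # one model from the posterior
        per_model.append(np.mean(log_lik(theta, x_test, y_test)))
    # The log is taken per model *before* averaging over theta.
    return np.mean(per_model)
```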
### Frequentist ensemble evaluation
A more reasonable way to evaluate $\rho$ is to calculate the average predictions for multiple models under $\rho$ and then calculate the risk of that average prediction.
\begin{align}
\mathbb{E}_{(x, y)\sim \nu} \log \mathbb{E}_{\theta\sim\rho} p(y\vert x, \theta)
&\approx \frac{1}{N}\sum_{n=1}^N \log \mathbb{E}_{\theta\sim\rho} p(y_n \vert x_n, \theta)\\
&= \frac{1}{N} \log \prod_{n=1}^N \mathbb{E}_{\theta\sim\rho} p(y_n \vert x_n, \theta)
\end{align}
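A sketch of the corresponding estimator, assuming the per-model, per-point log-likelihoods have already been collected into a (hypothetical) matrix `log_p[m, n]` $= \log p(y_n\vert x_n, \theta_m)$ for $M$ posterior samples; `logsumexp` keeps the averaging over models numerically stable:

```python
import numpy as np
from scipy.special import logsumexp

def ensemble_risk(log_p):
    """Estimate (1/N) sum_n log( (1/M) sum_m p(y_n | x_n, theta_m) )."""
    M = log_p.shape[0]
    # Mixture (ensemble-averaged) predictive density at each test point, in log space.
    per_point = logsumexp(log_p, axis=0) - np.log(M)
    return np.mean(per_point)
```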
The main difference, of course, is that the averaging over $\theta$ now happens before the log loss is evaluated. The difference between the Masegosa posterior and the traditional variational posterior is that the former optimises a bound on this more meaningful notion of risk. It is when applying the sharper Jensen bound to the gap between this and the previous notion of risk that the diversity term comes into the picture.
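To see that this notion of risk really is the more favourable of the two, recall that since $\log$ is concave, Jensen's inequality gives, for every $(x, y)$,
$$
\log \mathbb{E}_{\theta\sim\rho}\, p(y\vert x, \theta) \;\geq\; \mathbb{E}_{\theta\sim\rho} \log p(y\vert x, \theta),
$$
so the ensemble risk is never worse than the single-model risk above; it is this gap that the sharper Jensen bound is applied to, which is where the diversity term comes from.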
### Discriminative Bayesian evaluation
If we are fully Bayesian about evaluating $\rho$, we consider the likelihood of all the test labels simultaneously (conditioned on the test input locations).
$$
\log \mathbb{E}_{\theta\sim\rho} p(y_1, \ldots y_N \vert x_1, \ldots, x_N, \theta)
$$
Note that the frequentist notions of risk start from an expectation under the true data distribution $\nu$ and then approximate it with a Monte Carlo average, which is how the test dataset comes into play. In the Bayesian evaluation there is just the one fixed test dataset, and we evaluate the probability, under the average model, of these specific labels at the specific input locations.
If the model is such that the predictions are conditionally independent given $\theta$, we have that:
$$
\log \mathbb{E}_{\theta\sim\rho} p(y_1, \ldots y_N \vert x_1, \ldots, x_N, \theta) = \log \mathbb{E}_{\theta\sim\rho} \prod_{n=1}^N p(y_n \vert x_n, \theta)
$$
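Under the same hypothetical `log_p[m, n]` convention as above, a sketch of this evaluation differs only in where the sum over test points sits relative to the average over models:

```python
import numpy as np
from scipy.special import logsumexp

def bayesian_joint_log_lik(log_p):
    """Estimate log( (1/M) sum_m prod_n p(y_n | x_n, theta_m) )."""
    M = log_p.shape[0]
    # Joint log-likelihood of all test labels under each individual model.
    per_model_joint = log_p.sum(axis=1)
    # Only then average over theta (in log space).
    return logsumexp(per_model_joint) - np.log(M)
```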
The reason why I called this discriminative Bayesian is that we're still evaluating the likelihood of labels $y$ conditioned on inputs $x$, and not the joint likelihood of $x$ and $y$ which could be called generative Bayesian evaluation.
### The difference between the three
Notice that the three evaluation metrics differ only in where the expectation with respect to $\rho$ is taken (the sketch after this list makes this concrete):
* outside the log
* inside the product
* or between the log and the product
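Computing all three from the same hypothetical `log_p[m, n]` matrix makes this purely a question of the order of operations along the model axis:

```python
import numpy as np
from scipy.special import logsumexp

def all_three_risks(log_p):
    """log_p[m, n] = log p(y_n | x_n, theta_m) for M models and N test points."""
    M = log_p.shape[0]
    single   = log_p.mean()                                    # expectation outside the log
    ensemble = np.mean(logsumexp(log_p, axis=0) - np.log(M))   # expectation inside the product
    joint    = logsumexp(log_p.sum(axis=1)) - np.log(M)        # expectation between the log and the product
    return single, ensemble, joint
```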
### Illustrating why the joint likelihood (full Bayesian) and the frequentist ensemble risk differ
Consider this posterior/ensemble:

Now let's say we have two test datapoints, one at $x_1=-6.5$ and another at $x_2=+6.5$. So we are looking at two slices of the posterior predictive:

In the hand-drawn figures below I show the marginal predictive distributions $p(y_1\vert x_1)$ and $p(y_2\vert x_2)$ (top) as well as the joint predictive distribution $p(y_1, y_2 \vert x_1, x_2)$.

With blue, I plot the ensemble members, each of which contributes an isotropic Gaussian with a relatively narrow variance. Red shows the mixture of these Gaussians. We can see that the ensemble members have been optimised so that, when predicting $y_1$ and $y_2$ separately, the Gaussians combine into a wider predictive distribution, allowing the ensemble to make better predictions than any single member. However, when we look at the joint distribution of $y_1$ and $y_2$, the members are placed such that they combine into a joint distribution with an apparent negative correlation between the two labels. This joint distribution is not actually very good, so when we evaluate the model in the fully Bayesian way, we might get a high log loss for label pairs that violate this anti-correlation.
(Sorry for the accidental teenage boy drawing in the resulting joint distribution :D)
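To put rough numbers on this picture, here is a toy sketch (not the ensemble from the figures): two members, isotropic Gaussians centred at $(+3, -3)$ and $(-3, +3)$, so both marginals are wide, but the joint predictive is anti-correlated. A label pair $(+3, +3)$, which violates the anti-correlation, looks fine under the per-point ensemble evaluation but terrible under the joint Bayesian one:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

# Two-member ensemble over (y_1, y_2): member A centred at (+3, -3), member B at (-3, +3),
# both isotropic Gaussians with std 1, so the mixture's joint predictive is anti-correlated.
means = np.array([[ 3.0, -3.0],
                  [-3.0,  3.0]])               # shape (M=2 members, N=2 test points)
y = np.array([3.0, 3.0])                       # labels violating the anti-correlation

log_p = norm.logpdf(y, loc=means, scale=1.0)   # log_p[m, n] = log p(y_n | x_n, theta_m)
M = log_p.shape[0]

# Frequentist ensemble evaluation (summed over the two points so it is comparable to the joint):
# each marginal mixture covers +3 well, so this looks fine.
ensemble = np.sum(logsumexp(log_p, axis=0) - np.log(M))

# Bayesian joint evaluation: no single member explains both labels at once, so this is terrible.
joint = logsumexp(log_p.sum(axis=1)) - np.log(M)

print(ensemble)   # ~ -3.2
print(joint)      # ~ -19.8
```

The two numbers would coincide if the mixture's joint predictive factorised into the product of its marginals; the large gap reflects exactly the anti-correlation visible in the hand-drawn joint.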