## A Research Question: Non-identifiable behaviour of meta-learned Bayesian inference

Let's say we train a transformer, or an ARLM in general, on exchangeable data (a mixture of i.i.d. distributions). This essentially meta-trains the model to perform implicit Bayesian inference. To say that the model performs implicit Bayesian inference means, roughly, that "when evaluated on data consistent with the pretraining mixture, its predictions/completions are indistinguishable from (or very close to) what exact Bayesian inference would predict". However, how should we expect a model which performs implicit Bayesian inference in a Bayesian model to behave when confronted with observations with respect to which the Bayesian model is misspecified, i.e. on prompts that are out-of-distribution? Continuing to perform (implicit) Bayesian inference in a misspecified model is, as discussed below, suboptimal. It's conceivable that the model does even better than Bayesian inference on OOD prompts. Or that it matches Bayesian behaviour in the OOD setting. Or that it does absolute garbage in the OOD setting.

## A little bit more formally

So, we train an ARLM on a mixture distribution of the form

$$
p(x_{1:T}) = \int \prod_{t=1}^T p(x_{t}\vert \theta)\, d\pi(\theta)
$$

and then we evaluate how good it is at next-token prediction on i.i.d. sequences sampled from

$$
q(x_{1:T}) = \prod_{t=1}^T q(x_t).
$$

If $q$ is such that $\exists \theta: q(x_t) = p(x_t\vert \theta)$, we say that the Bayesian model is **well specified**, and we should expect that in this case, as $T$ grows, the prediction cross entropy approaches the entropy of $q$ (which is the lower bound). Thus, the Transformer is expected to be Bayes optimal in the limit of growing $T$.

The interesting question is, perhaps, how the Transformer behaves when evaluated on a $q$ which none of the mixture components describe well, that is, $\not\exists \theta: q(\cdot) = p(\cdot \vert \theta)$. This is what we call model misspecification, inasmuch as the Bayesian prior $\pi$ places zero probability on the actual data-generating distribution.

If the Transformer actually learns to perform exact Bayesian inference, and continues to do so in this situation, it will be suboptimal under model misspecification. That is because the Bayesian posterior $p(\theta\vert x_{1:T})$ will contract to a single point, and therefore a Bayesian transformer will make predictions based on a single mixture component $p(x_t\vert \theta^\star)$, where $\theta^\star$ is the (asymptotic) maximum likelihood parameter, that is,

$$
\theta^{\star} = \operatorname{argmin}_\theta \operatorname{KL}[q(\cdot) \,\|\, p(\cdot \vert \theta)].
$$

That, however, is not necessarily the best or optimal behaviour in terms of cross entropy. Suppose there exists a distribution $\rho$ over $\theta$ values such that $q(\cdot) \approx \int p(\cdot\vert \theta)\, d\rho(\theta)$. In this case, $\rho$ is a more useful posterior than the actual Bayesian posterior we obtain by applying Bayes' rule, and a model predicting with $\rho$ should be able to achieve a cross entropy close to the entropy of $q$. So whether the Transformer works well on a $q$ with respect to which the exchangeable training distribution is misspecified is a non-identifiable or $\epsilon$-identifiable behaviour.
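To spell out why exact Bayesian inference is suboptimal here (a standard consequence of posterior contraction under misspecification; the limit below assumes the usual regularity conditions): the exact posterior predictive is

$$
p(x_{T+1} \vert x_{1:T}) = \int p(x_{T+1}\vert \theta)\, d\pi(\theta \vert x_{1:T}) \;\xrightarrow{T\to\infty}\; p(x_{T+1}\vert \theta^\star),
$$

so its per-token cross entropy under $q$ tends to

$$
\mathbb{E}_{x\sim q}\left[-\log p(x\vert \theta^\star)\right] = H(q) + \operatorname{KL}[q(\cdot)\,\|\,p(\cdot\vert \theta^\star)] \;\ge\; H(q),
$$

with a strictly positive gap whenever $q$ itself is not one of the mixture components, whereas predicting with a mixture $\int p(\cdot\vert\theta)\, d\rho(\theta) \approx q$ keeps the cross entropy near $H(q)$.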
### Concrete example

Let's say we have two tokens, 🌈 and 🦄, and two mixture components:

$$
p(x_t = 🌈 \vert \theta=1) = \frac{2}{3}
$$

and

$$
p(x_t = 🌈 \vert \theta=0) = \frac{1}{3}.
$$

We train a model on data sampled from this exchangeable data-generating process. In each training sequence, the fraction of 🌈s is therefore close to either 2/3 or 1/3. Now evaluate the trained model on a sequence in which the fraction of 🌈s is 1/2. Can the model figure out what's going on? The longer the prompt, the smaller its probability under the pretraining distribution, so the completion is actually underdetermined by the training distribution. **What will the model do?**
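For reference, here is what a model that had learned *exact* implicit Bayesian inference would do on such a prompt. This is a minimal NumPy sketch of the exact two-component Bayesian predictor, used as a stand-in for the trained model (the variable names are my own); it measures the per-token cross entropy on a long OOD sequence with 🌈-frequency 1/2 and compares it to the entropy of $q$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Bernoulli mixture components over {rainbow, unicorn}:
# p(rainbow | theta=1) = 2/3, p(rainbow | theta=0) = 1/3, uniform prior over theta.
p_rainbow = np.array([1/3, 2/3])
prior = np.array([0.5, 0.5])

T = 100_000
x = rng.random(T) < 0.5          # OOD sequence q: rainbow with probability 1/2

log_lik = np.zeros(2)            # log-likelihood of x_{1:t} under each component
total_nll = 0.0                  # accumulated next-token negative log-likelihood

for t in range(T):
    # Exact posterior over theta given x_{1:t} (log-space for numerical stability).
    log_post = np.log(prior) + log_lik
    post = np.exp(log_post - log_post.max())
    post /= post.sum()

    # Exact Bayesian posterior-predictive probability of the next token.
    pred_rainbow = post @ p_rainbow
    p_next = pred_rainbow if x[t] else 1.0 - pred_rainbow
    total_nll -= np.log(p_next)

    # Absorb the observed token into each component's log-likelihood.
    log_lik += np.log(np.where(x[t], p_rainbow, 1.0 - p_rainbow))

print(f"Bayes predictive cross entropy: {total_nll / T:.3f} nats/token")  # roughly 0.75
print(f"Entropy of q (optimal):         {np.log(2):.3f} nats/token")      # ~0.693
print(f"Best single component:          {-(np.log(2/3) + np.log(1/3)) / 2:.3f} nats/token")  # ~0.75
```

The posterior odds between the two components are $2^{d}$, where $d$ is the 🌈/🦄 count difference; on a fair-coin sequence $|d|$ drifts like $\sqrt{t}$, so the exact Bayesian predictive spends most of its time close to $2/3$ or $1/3$ and its cross entropy settles roughly at the single-component value of about $0.75$ nats/token rather than the optimal $\log 2 \approx 0.69$. A model that instead inferred the 50/50 mixture $\rho$ over the two components would predict $1/2$ and achieve $\log 2$, so the two behaviours are distinguishable from the evaluation loss alone.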