Bayesian cognitive models and Bayesian data analysis are the same thing under the hood; they differ only in how the generative model is interpreted.
If the generative model is a hypothesis about a person’s model of the world, then we have a Bayesian cognitive model – the main topic of this book. If the generative model is instead the scientist’s model of how the data are generated, then we have Bayesian data analysis.
Consider the example of a spinning coin (as opposed to a flipped coin). Suppose there are two competing hypotheses about people's mental model of such a coin: hypothesis one says people treat it as a fair coin, while hypothesis two says people treat its weight as unknown. Which hypothesis is correct? How do we design an experiment to find out? A Bayesian way of answering this is: given some data, what is the posterior probability that it was generated by hypothesis one rather than hypothesis two?
Imagine we ran the “predict the next 10” experiment with 20 participants, and observed the following responses:
var experimentalData = [9,8,7,5,4,5,6,7,9,4,8,7,8,3,9,6,5,7,8,5]
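The program whose result is discussed next is not reproduced here, so below is a minimal WebPPL sketch of what such a data analysis could look like. It assumes hypothesis one means participants model the spun coin as fair and hypothesis two means they model its weight as unknown (drawn from Uniform(0,1)); the variable names are mine.

// Sketch: predictive distribution over a participant's response
// ("how many of the next 10 land heads?") under each hypothesis.
var fairCoinPredictive = Binomial({p: 0.5, n: 10});          // hypothesis one

var unknownWeightPredictive = Infer({method: 'forward', samples: 10000}, function() {
  var weight = uniform(0, 1);                                // hypothesis two
  return sample(Binomial({p: weight, n: 10}));
});

// Bayesian data analysis: which hypothesis better explains the 20 responses?
var whichHypothesis = Infer({method: 'enumerate'}, function() {
  var hypothesis = flip(0.5) ? 'one' : 'two';
  var predictive = (hypothesis === 'one') ? fairCoinPredictive : unknownWeightPredictive;
  map(function(response) { observe(predictive, response); }, experimentalData);
  return hypothesis;
});
viz(whichHypothesis);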
Note that in this case the observed responses look more like those of people whose behavior matches hypothesis two. Hence, in the program's result, hypothesis two receives more posterior support.
For a given Bayesian model (together with data), there are four conceptually distinct distributions of interest: the prior distribution over parameters, the prior predictive distribution over data, the posterior distribution over parameters, and the posterior predictive distribution over data.
The posterior predictive distribution describes what data you should expect to see, given the model you’ve assumed and the data you’ve collected so far. If the model is a good description of the data you’ve collected, then the model shouldn’t be surprised if you got the same data by running the experiment again.
It’s natural then to use the posterior predictive distribution to examine the descriptive adequacy of a model. If these predictions do not match the data already seen (i.e., the data used to arrive at the posterior distribution over parameters), the model is descriptively inadequate.
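As a concrete illustration (a toy example of my own, not from the text above), here is a sketch of all four distributions for a coin with an unknown weight, after observing a made-up count of 15 heads in 20 flips.

// Toy example: coin with unknown weight, observed 15 heads in 20 flips (made up).
var k = 15;
var n = 20;

var prior = Infer({method: 'forward', samples: 10000}, function() {
  return uniform(0, 1);                      // prior over the coin weight
});

var priorPredictive = Infer({method: 'forward', samples: 10000}, function() {
  var p = uniform(0, 1);
  return sample(Binomial({p: p, n: n}));     // data expected before seeing anything
});

var posterior = Infer({method: 'MCMC', samples: 10000}, function() {
  var p = uniform(0, 1);
  observe(Binomial({p: p, n: n}), k);        // condition on the observed count
  return p;                                  // posterior over the coin weight
});

var posteriorPredictive = Infer({method: 'MCMC', samples: 10000}, function() {
  var p = uniform(0, 1);
  observe(Binomial({p: p, n: n}), k);
  return sample(Binomial({p: p, n: n}));     // data expected on a replication
});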
Question: we arrive at the posterior distribution using the data already seen. Under what circumstances, then, can the posterior predictive distribution differ from the distribution of the data already seen?
Here is an example of when that can happen:
You collect data from ten students in each of two schools on whether or not they are willing to help other students. Suppose that in one school hardly any students helped, while in the other nearly all did. If both schools are modeled with a single shared helping probability, the posterior distribution over that parameter will peak around 0.5. But the posterior predictive distribution generated from this posterior (roughly five helpers in each school) will look nothing like the observed data. Thus, the model is descriptively inadequate.
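Here is a minimal sketch of that check, with illustrative counts I have assumed (0 of 10 students helping in one school and 10 of 10 in the other):

var n = 10;
var observedHelpers = [0, 10];   // assumed counts: school A, school B

var posteriorPredictive = Infer({method: 'MCMC', samples: 10000}, function() {
  var p = uniform(0, 1);                               // single shared helping probability
  observe(Binomial({p: p, n: n}), observedHelpers[0]);
  observe(Binomial({p: p, n: n}), observedHelpers[1]);
  // replicate the experiment under the inferred parameter
  return [sample(Binomial({p: p, n: n})), sample(Binomial({p: p, n: n}))];
});
// p's posterior peaks near 0.5, so predictive pairs cluster around [5, 5] --
// nothing like the observed [0, 10], which is the descriptive inadequacy.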
Let's say you toss a coin 20 times and get 7 heads. You want to know whether the coin is unbiased (perfectly random) or not. Let's concretely define the terms biased and unbiased. The unbiased coin is the simple model: p(head) = 0.5. The biased coin is the more complex model: p(head) = Uniform(0,1), meaning the coin's weight could be anything between 0 and 1, with all values equally likely a priori. Note that p(head) = 0.5 is just one special case of this complex model.
Which model do you think is more likely to generate the observed data (7/20 heads)?
One might think it is the biased coin, but it is actually the unbiased one.
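To check this, here is a small sketch that computes the marginal likelihood of the observed 7-out-of-20 under each model (variable names are mine; the uniform-weight marginal is approximated by forward sampling, and is exactly 1/21 analytically):

var k = 7;
var n = 20;

// Unbiased (simple) model: p(head) = 0.5.
var simpleML = Math.exp(Binomial({p: 0.5, n: n}).score(k));    // about 0.074

// Biased (complex) model: p(head) ~ Uniform(0,1), weight marginalized out.
var complexPredictive = Infer({method: 'forward', samples: 20000}, function() {
  var p = uniform(0, 1);
  return sample(Binomial({p: p, n: n}));
});
var complexML = Math.exp(complexPredictive.score(k));          // about 1/21 = 0.048

simpleML / complexML   // roughly 1.5 in favor of the unbiased model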
Shouldn’t the more general model always be better? If we’re at a track, and you bet on horse A, and I bet on horses A and B, aren’t I strictly in a better position than you? The answer is no, and the reason has to do with our metric for winning. Intuitively, we don’t just care whether your horse won, but how much money you win. How much money you win depends on how much money you bet, and the rule is that when we go to the track, we each start with the same amount of money.
In probabilistic models, our money is probabilities. Each model must allocate its probability so that it sums to 1. So my act of betting on horse A and horse B actually requires me to split my money (say, betting 50 / 50 on each). On the other hand, you put all your money on horse A (100 on A, 0 on B). If A wins, you will gain more money because you put more money down.
This idea is called the principle of parsimony, or Occam’s razor.
The Bayes factor is simply the ratio A / B, where
A = marginal likelihood of the data under the simple hypothesis (here, the fair coin)
B = marginal likelihood of the data under the complex hypothesis (here, the coin with unknown weight)
A marginal likelihood like B requires integrating over all values of the hypothesis's parameters, so in general it can be hard to calculate.
The Savage-Dickey method comes to the rescue:
The Bayes factor can also be obtained by considering only the more complex hypothesis. What you do is look at the distribution over the parameter of interest (here, p) at the point of interest (here, p = 0.5), both before and after seeing the data. Dividing the posterior density by the prior density of the parameter at that point gives you the Bayes factor in favor of the simple hypothesis.
Note that in the sketch below the densities are approximated by the probability that p falls within 0.05 of the target value p = 0.5.
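Here is a minimal sketch of that computation (my own reconstruction; both densities are approximated by the probability mass within 0.05 of p = 0.5, so the two masses can be compared directly):

var k = 7;
var n = 20;

var prior = Infer({method: 'forward', samples: 20000}, function() {
  return uniform(0, 1);
});

var posterior = Infer({method: 'MCMC', samples: 20000}, function() {
  var p = uniform(0, 1);
  observe(Binomial({p: p, n: n}), k);
  return p;
});

var nearHalf = function(p) { return Math.abs(p - 0.5) < 0.05 ? 1 : 0; };
var priorMass = expectation(prior, nearHalf);          // about 0.10 under Uniform(0,1)
var posteriorMass = expectation(posterior, nearHalf);

posteriorMass / priorMass   // approximates the Bayes factor, again roughly 1.5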
Here we only consider the complex model (which includes the simple model as a special case). We only look at the behavior of the complex model, before and after the data is observed, at the point of interest (the value picked out by the simple model).
The concepts and ideas explained above are used in another example: Regression using Bayesian inference.
Given data from a tug-of-war tournament containing each player's number of wins and measured strength, we build a model like so:
strength = β0 + β1 * nwins
The observed strength is then generated from this predicted strength like so:
observed strength ~ N(strength, σ)
Here β0, β1 and σ are unknowns. We define a prior over these parameters, observe the data and get a posterior.
As before, we want to find out whether this model is descriptively adequate. This is done by generating posterior predictive samples from the model and comparing them with the actual data, as sketched below.
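A minimal sketch of this model and its posterior predictive check; the data arrays below are placeholders I made up, since the actual tug-of-war dataset is not reproduced here.

var nwins = [0, 1, 1, 2, 3, 4, 5, 6];                             // placeholder wins
var observedStrength = [1.0, 1.3, 1.5, 1.8, 2.3, 2.6, 3.0, 3.2];  // placeholder strengths

var model = function() {
  var b0 = sample(Gaussian({mu: 0, sigma: 1}));                   // prior on intercept
  var b1 = sample(Gaussian({mu: 0, sigma: 1}));                   // prior on slope
  var sigma = sample(Uniform({a: 0, b: 2}));                      // prior on noise
  var predicted = map(function(w) { return b0 + b1 * w; }, nwins);
  // condition on the observed strengths
  map2(function(mu, y) { observe(Gaussian({mu: mu, sigma: sigma}), y); },
       predicted, observedStrength);
  // posterior predictive: regenerate a strength for each player
  return map(function(mu) { return sample(Gaussian({mu: mu, sigma: sigma})); }, predicted);
};

var posteriorPredictive = Infer({method: 'MCMC', samples: 5000}, model);
// Compare posteriorPredictive samples with observedStrength, e.g. via correlation.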
The generated data turn out to have a correlation of about 0.91 with the observed data.