Let's say you see a series of heads when a coin is tossed. Your beliefs about the bias of the coin depend on two things: your prior belief that the coin is fair, and the evidence, i.e. how many successive heads you observe.
One can measure the rate of learning: how quickly the learner's inferred belief approaches the actual fact that the coin is biased.
If we set `fairPrior` to 0.5, equal for the two alternative hypotheses, just 5 heads in a row are sufficient to favor the trick coin by a large margin. If `fairPrior` is 99 in 100, 10 heads in a row are sufficient. We have to increase `fairPrior` quite a lot, however, before 15 heads in a row is no longer sufficient evidence for a trick coin: even at `fairPrior` = 0.9999, 15 heads without a single tail still weighs in favor of the trick coin. This is because the evidence in favor of a trick coin accumulates exponentially as the data set increases in size; each successive heads flip increases the evidence by nearly a factor of 2.
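This accumulation can be sketched numerically. For simplicity the sketch below assumes a trick coin that always lands heads; if the trick coin merely favors heads rather than guaranteeing them, each head multiplies the odds by a bit less than 2, which is the "nearly a factor of 2" in the text.

```python
# Posterior odds of a trick (always-heads) coin vs. a fair coin after
# n successive heads. fair_prior is the prior probability of fairness.

def trick_odds(n_heads, fair_prior):
    """Odds in favor of the trick coin after n_heads successive heads."""
    p_data_fair = 0.5 ** n_heads   # fair coin: each head has probability 1/2
    p_data_trick = 1.0 ** n_heads  # always-heads trick coin: heads are certain
    prior_odds = (1 - fair_prior) / fair_prior
    return prior_odds * (p_data_trick / p_data_fair)

# Each extra head doubles the odds in favor of the trick coin.
print(trick_odds(5, 0.5))      # 32.0
print(trick_odds(10, 0.99))    # ~10.3
print(trick_odds(15, 0.9999))  # ~3.3 -- still favors the trick coin
```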
Coin flips with a known bias are i.i.d. This can be seen by conditioning the next flip's result on the previous result: the conditional distribution is the same as the marginal.
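A minimal check of this claim (in Python rather than the probabilistic programming language the original presumably uses), computing P(next = heads | previous = heads) exactly for a coin of known weight:

```python
# For a coin with a KNOWN weight, flips are i.i.d.: conditioning on the
# previous outcome does not change the distribution of the next one.

weight = 0.7  # hypothetical known bias

def p_flip(outcome):
    return weight if outcome == "H" else 1 - weight

# Joint over two flips factorizes: P(f1, f2) = P(f1) * P(f2).
p_h2_given_h1 = (p_flip("H") * p_flip("H")) / p_flip("H")
p_h2 = p_flip("H")

print(p_h2_given_h1, p_h2)  # equal: 0.7 0.7
```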
Similarly, the program below samples i.i.d.:
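The original program is not reproduced here; a Python sketch of such an i.i.d. word sampler might look like this (the word list and probabilities are illustrative, not from the original):

```python
import random

# i.i.d. word sampler: the probabilities are fixed constants, so every
# draw is independent of all the others.
words = ["chef", "omelet", "soup"]  # illustrative vocabulary
probs = [0.5, 0.3, 0.2]

def sample_word():
    return random.choices(words, weights=probs, k=1)[0]

sequence = [sample_word() for _ in range(5)]
print(sequence)
```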
However, the function below does not:
This is because learning about the first word tells us something about the word probabilities, which in turn tells us about the second word.
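A Python sketch of such a non-i.i.d. sampler (the symmetric Dirichlet prior and vocabulary are illustrative stand-ins for whatever the original uses):

```python
import random

# Non-i.i.d. sampler: the word probabilities are themselves drawn once from
# a prior (here a symmetric Dirichlet, sampled via normalized exponentials).
# Observing one word is informative about probs, and hence about later words.
words = ["chef", "omelet", "soup"]

def sample_sequence(n):
    raw = [random.expovariate(1.0) for _ in words]
    probs = [r / sum(raw) for r in raw]  # latent, shared by every draw
    return [random.choices(words, weights=probs, k=1)[0] for _ in range(n)]

print(sample_sequence(5))
```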
The samples are not i.i.d. but are exchangeable: the probability of a sequence of values remains the same under any permutation of its order.
de Finetti's theorem says that, under certain technical conditions, any exchangeable sequence can be represented as follows, for some `latentPrior` distribution and observation function `f`:
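The representation itself is not shown in these notes; in Python it might be sketched as follows, where `latent_prior` and `f` are illustrative stand-ins for the `latentPrior` and `f` of the text:

```python
import random

# de Finetti representation (sketch): sample a latent quantity once, then
# generate the sequence i.i.d. given that latent quantity.

def latent_prior():  # placeholder for the latentPrior distribution
    return random.random()

def f(latent):       # placeholder observation function
    return "h" if random.random() < latent else "t"

def exchangeable_sequence(n):
    latent = latent_prior()               # shared latent quantity
    return [f(latent) for _ in range(n)]  # conditionally independent draws

print(exchangeable_sequence(10))
```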
Imagine an urn that contains some number of white and black balls. On each step we draw a random ball from the urn, note its color, and return it to the urn along with another ball of that color.
It can be shown that the distribution of samples is exchangeable: `bbw`, `bwb`, `wbb` have the same probability; `bww`, `wbw`, `wwb` as well.
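Assuming the urn starts with one ball of each color (the text leaves the initial contents unspecified), the claim can be checked exactly:

```python
from fractions import Fraction

# Polya urn: draw a ball, return it plus another ball of the same color.
# Compute the exact probability of a given color sequence.

def sequence_prob(seq, black=1, white=1):
    prob = Fraction(1)
    counts = {"b": black, "w": white}
    for color in seq:
        total = counts["b"] + counts["w"]
        prob *= Fraction(counts[color], total)
        counts[color] += 1  # add another ball of the drawn color
    return prob

# Permutations of the same multiset of colors have the same probability.
print(sequence_prob("bbw"), sequence_prob("bwb"), sequence_prob("wbb"))
# -> 1/12 1/12 1/12
```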
Because the distribution is exchangeable, we know that there must be an alternative representation in terms of a latent quantity followed by independent samples. The de Finetti representation of this model is:
We sample a shared latent parameter (in this case, a draw from a Beta distribution), then generate the sequence samples independently given this parameter.
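A Python sketch of this representation. For an urn starting with one ball of each color, the latent weight is Beta(1, 1); this Polya-urn/Beta-Bernoulli correspondence is the standard result being assumed here.

```python
import random

# de Finetti representation of the Polya urn: draw a latent weight from a
# Beta prior, then sample ball colors independently given that weight.

def urn_sequence(n, black=1, white=1):
    theta = random.betavariate(black, white)  # shared latent parameter
    return ["b" if random.random() < theta else "w" for _ in range(n)]

print(urn_sequence(3))
```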
A common pattern in building models is:
Here we see how a proper choice of prior matters if the model is to mimic how humans learn. With `uniform(0,1)` as the prior, we get 0.7 as the MLE, which is not how humans respond. We can replace `uniform(0,1)` with `beta(10,10)`. But then, even when the coin shows 100/100 heads, the model puts the bias at about 0.9 (instead of 1), whereas most humans would conclude that the coin always shows heads. A better choice is a prior that strongly favors a fair coin while allowing a small probability that the weight is arbitrary. Such a model stubbornly believes the coin is fair until around 10 successive heads have been observed. After that, it rapidly concludes that the coin can only come up heads. The shape of this learning trajectory is much closer to what we would expect for humans.
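The "stubborn" learning trajectory can be sketched with a mixture prior: with high probability the coin is fair (weight 0.5), otherwise its weight is uniform on [0, 1]. The 0.999 mixing value below is illustrative. The posterior probability of fairness after n successive heads then has a closed form:

```python
# Mixture prior: fair (weight 0.5) with probability fair_prior, otherwise
# weight ~ Uniform(0,1).
# P(n heads | fair) = 0.5**n; P(n heads | uniform weight) = 1/(n+1),
# since the integral of w**n over [0,1] is 1/(n+1).

def p_fair_given_heads(n, fair_prior=0.999):
    like_fair = (0.5 ** n) * fair_prior
    like_unfair = (1.0 / (n + 1)) * (1 - fair_prior)
    return like_fair / (like_fair + like_unfair)

# Near-certain of fairness for the first ~10 heads, then a rapid collapse.
for n in [0, 5, 10, 15, 20]:
    print(n, round(p_fair_given_heads(n), 4))
```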
An effect E can occur due to a cause C or due to a background effect. From observed evidence about the co-occurrence of events, we want to infer the causal structure relating them.
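As a sketch of what such an inference can look like, here is a standard Bayesian model comparison between two structures: E depends on C versus E independent of C. The priors and co-occurrence counts are all illustrative, not from the original.

```python
from math import factorial

def seq_marglik(k, n):
    """Marginal likelihood of a specific binary sequence with k successes in
    n trials, under a uniform Beta(1,1) prior on the success probability."""
    return factorial(k) * factorial(n - k) / factorial(n + 1)

def p_dependent(counts, prior_dep=0.5):
    """Posterior probability that E depends on C, given co-occurrence counts."""
    k_c = counts[("c", "e")]
    n_c = k_c + counts[("c", "-e")]
    k_nc = counts[("-c", "e")]
    n_nc = k_nc + counts[("-c", "-e")]
    lik_dep = seq_marglik(k_c, n_c) * seq_marglik(k_nc, n_nc)  # separate rates
    lik_indep = seq_marglik(k_c + k_nc, n_c + n_nc)            # one shared rate
    return prior_dep * lik_dep / (prior_dep * lik_dep + (1 - prior_dep) * lik_indep)

# E almost always follows C and rarely occurs otherwise (illustrative counts):
data = {("c", "e"): 9, ("c", "-e"): 1, ("-c", "e"): 1, ("-c", "-e"): 9}
print(p_dependent(data))  # high: the data favor a causal link
```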
probabilistic-models-of-cognition