Chapter 9: Learning as conditional inference

Learning and the rate of learning

Let's say you see a series of heads when a coin is tossed. Your beliefs about the bias of the coin depend on two items:

How likely it is to see a biased coin?
How much data have you seen?

One can measure the rate of learning (when the inferred belief of a learner comes close to the actual fact that the coin is biased).

var fairnessPosterior = function(observedData) {
  return Infer({method: 'enumerate'}, function() {
    var fair = flip(0.999)
    var coin = Bernoulli({p: fair ? 0.5 : 0.95})
    var obsFn = function(datum){observe(coin, datum == 'h')}
    mapData({data: observedData}, obsFn)
    return fair
  })
}

If we set fairPrior to be 0.5, equal for the two alternative hypotheses, just 5 heads in a row are sufficient to favor the trick coin by a large margin. If fairPrior is 99 in 100, 10 heads in a row are sufficient. We have to increase fairPrior quite a lot, however, before 15 heads in a row is no longer sufficient evidence for a trick coin: even at fairPrior = 0.9999, 15 heads without a single tail still weighs in favor of the trick coin. This is because the evidence in favor of a trick coin accumulates exponentially as the data set increases in size; each successive h flip increases the evidence by nearly a factor of 2.

Independent and Exchangeable Sequences

Coin flips are i.i.d. This can be seen by conditioning the next flip result on the previous result.

Similarly the below program samples i.i.d:

var words = ['chef', 'omelet', 'soup', 'eat', 'work', 'bake', 'stop']
var probs = [0.0032, 0.4863, 0.0789, 0.0675, 0.1974, 0.1387, 0.0277]
var thunk = function() {return categorical({ps: probs, vs: words})};

However the below function does not:

var words = ['chef', 'omelet', 'soup', 'eat', 'work', 'bake', 'stop']
var probs = (flip() ?
             [0.0032, 0.4863, 0.0789, 0.0675, 0.1974, 0.1387, 0.0277] :
             [0.3699, 0.1296, 0.0278, 0.4131, 0.0239, 0.0159, 0.0194])
var thunk = function() {return categorical({ps: probs, vs: words})};

This is because learning about the first word tells us something about the probs, which in turn tells us about the second word.

The samples are not i.i.d but are exchangeable: the probability of a sequence of values remains the same if permuted into any order.

de Finetti’s theorem says that, under certain technical conditions, any exchangeable sequence can be represented as follows, for some latentPrior distribution and observation function f:

var latent = sample(latentPrior)
var thunk = function() {return f(latent)}
var sequence = repeat(2,thunk)

Polya's urn

Imagine an urn that contains some number of white and black balls. On each step we draw a random ball from the urn, note its color, and return it to the urn along with another ball of that color.

It can be shown that the distribution of samples is exchangeable: bbw, bwb, wbb have the same probability; bww, wbw, wwb as well.

Because the distribution is exchangeable, we know that there must be an alterative representation in terms of a latent quantity followed by independent samples. The de Finetti representation of this model is:

var urn_deFinetti = function(urn, numsamples) {
  var numWhite = sum(map(function(b){return b=='w'},urn))
  var numBlack = urn.length - numWhite
  var latentPrior = Beta({a: numWhite, b: numBlack})
  var latent = sample(latentPrior)
  return repeat(numsamples, function() {return flip(latent) ? 'b' : 'w'}).join("")
}

var urnDist = Infer({method: 'forward', samples: 10000},
                    function(){return urn_deFinetti(['b', 'w'],3)})

viz(urnDist)

We sample a shared latent parameter – in this case, a sample from a Beta distribution – generating the sequence samples independently given this parameter.

Ideal learners

A common pattern often used in building models is:

Infer({...}, function() {
  var hypothesis = sample(prior)
  // obsFn will usually contain an observe function
  var obsFn = function(datum){...uses hypothesis...}
  mapData({data: observedData}, obsFn)
  return hypothesis
});

Learning a continuous parameter

Here we see how a proper definition of priors is important to mimic how humans learn.

When a coin shows 7/10 heads, most humans will still believe it to be fair. But if we used uniform(0,1) as prior, we get 0.7 as MLE. Not the same as humans.
We can replace uniform(0,1) with beta(10,10). But then even when the coin shows 100/100 heads, most humans will believe the coin always shows heads. But the model shows the bias to be 0.9 (instead of 1).
This can be remedied by having a small bias for a fair coin in the prior:

var weightPosterior = function(observedData){
  return Infer({method: 'MCMC', burn:1000, samples: 10000}, function() {
    var isFair = flip(0.999)
    var realWeight = isFair ? 0.5 : uniform({a:0, b:1})
    var coin = Bernoulli({p: realWeight})
    var obsFn = function(datum){observe(coin, datum=='h')}
    mapData({data: observedData}, obsFn)
    return realWeight
  })
}

This model stubbornly believes the coin is fair until around 10 successive heads have been observed. After that, it rapidly concludes that the coin can only come up heads. The shape of this learning trajectory is much closer to what we would expect for humans.

Another example: estimating causal power

An effect E can occur due to a cause C or a background effect. We want to find out, from observed evidence about the co-occurrence of events, attempt to infer the causal structure relating them.

var observedData = [{C:true, E:true}, {C:true, E:true}, {C:false, E:false}, {C:true, E:true}]

var causalPowerPost = Infer({method: 'MCMC', samples: 10000}, function() {
  // Causal power of C to cause E
  var cp = uniform(0, 1)

  // Background probability of E
  var b = uniform(0, 1)

  var obsFn = function(datum) {
    // The noisy causal relation to get E given C
    var E = (datum.C && flip(cp)) || flip(b)
    condition( E == datum.E)
  }

  mapData({data: observedData}, obsFn)

  return {causal_power: cp}
});

viz(causalPowerPost);

Chapter 9: Learning as conditional inference

Learning and the rate of learning

Independent and Exchangeable Sequences

Polya's urn

Ideal learners

Learning a continuous parameter

Another example: estimating causal power

tags: probabilistic-models-of-cognition

Read more

How to implement a GNN

Metaformer is Actually What You Need for Vision [CVPR 2022]

Patches are all you need?

MLP-Mixer: An All-MLP Architecture for Vision

tags: `probabilistic-models-of-cognition`