
Chapter 12: Occam's Razor

Humans choose the least complex hypothesis that fits the data well.

  • How is complexity measured? How is fitness measured?
    If fitness is measured semantically while complexity is measured syntactically (e.g., the description length of the hypothesis in some representation language, or the number of free parameters needed to specify it), the two are incommensurable.

In Bayesian models, both complexity and fitness are measured semantically. Complexity corresponds to flexibility: a more flexible hypothesis can generate a more diverse set of observations.

Key Idea: The Law of Conservation of Belief

Since all probabilities must sum to 1, a complex model spreads its probability mass over a larger set of possible observations, whereas a simple model concentrates high probability on a smaller set of events. Hence, for an event that both models can generate (and assuming equal priors):

P(simple hypothesis | event) > P(complex hypothesis | event)
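
To see the arithmetic, suppose the simple hypothesis spreads its belief uniformly over 3 outcomes and the complex one over 6 (the setup of the example below), with equal priors of 1/2. By Bayes' rule, after observing an outcome both can generate:

P(simple | event) ∝ 1/3 · 1/2 = 1/6
P(complex | event) ∝ 1/6 · 1/2 = 1/12

Normalizing gives P(simple | event) = 2/3 and P(complex | event) = 1/3.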


The Size Principle

Of hypotheses which generate data uniformly, the one with smallest extension that is still consistent with the data is the most probable.

Consider the two possible hypotheses:

Categorical({vs: ['a', 'b', 'c', 'd', 'e', 'f'], ps: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]})
Categorical({vs: ['a', 'b', 'c'], ps: [1/3, 1/3, 1/3]})

Say we observe a single sample, 'a'. Which hypothesis is more likely? The smaller one! It assigns the observation likelihood 1/3 versus 1/6, so with equal priors it is twice as likely as the larger one.
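
Here is a minimal runnable version of this comparison in WebPPL (the labels 'small' and 'big' are mine, chosen for illustration; the inference setup mirrors the Example 2 code below):

var hypothesisToDist = function(hypothesis) {
  return (hypothesis == 'small' ?
          Categorical({vs: ['a', 'b', 'c'], ps: [1/3, 1/3, 1/3]}) :
          Categorical({vs: ['a', 'b', 'c', 'd', 'e', 'f'],
                       ps: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]}))
}

var posterior = Infer({method: 'enumerate'}, function(){
  // Equal prior over the two hypotheses
  var hypothesis = flip() ? 'small' : 'big'
  // Condition on the single observed sample
  observe(hypothesisToDist(hypothesis), 'a')
  return hypothesis
})

viz(posterior)  // puts 2/3 on 'small' and 1/3 on 'big'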

Now suppose we observe a longer sequence:

var fullData = ['a', 'b', 'a', 'b', 'b', 'a', 'b']

As we observe more and more of this data, what does the learning curve look like? Take a look here:

https://github.com/vinsis/math-and-ml-notes/blob/master/images/size_principle.svg
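
A sketch of how such a curve can be generated: rerun the enumeration on successively longer prefixes of fullData (the helper posteriorSmall and the viz.line plotting call are my additions, assuming the webppl-viz package):

var hypothesisToDist = function(hypothesis) {
  return (hypothesis == 'small' ?
          Categorical({vs: ['a', 'b', 'c'], ps: [1/3, 1/3, 1/3]}) :
          Categorical({vs: ['a', 'b', 'c', 'd', 'e', 'f'],
                       ps: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]}))
}

var fullData = ['a', 'b', 'a', 'b', 'b', 'a', 'b']

// Posterior probability of the small hypothesis after the first n observations
var posteriorSmall = function(n) {
  var posterior = Infer({method: 'enumerate'}, function(){
    var hypothesis = flip() ? 'small' : 'big'
    mapData({data: fullData.slice(0, n)},
            function(d){ observe(hypothesisToDist(hypothesis), d) })
    return hypothesis
  })
  return Math.exp(posterior.score('small'))
}

var ns = _.range(fullData.length + 1)
viz.line(ns, map(posteriorSmall, ns))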

Example 2:

Now consider these two hypotheses:

Categorical({vs: ['a', 'b', 'c', 'd'], ps: [0.375, 0.375, 0.125, 0.125]})
Categorical({vs: ['a', 'b', 'c', 'd'], ps: [0.25, 0.25, 0.25, 0.25]})

And our observed data is:

var observedData = ['a', 'b', 'a', 'b', 'c', 'd', 'b', 'b']

The Bayesian Occam’s razor says that all else being equal the hypothesis that assigns the highest likelihood to the data will dominate the posterior. Because of the law of conservation of belief, assigning higher likelihood to the observed data requires assigning lower likelihood to other possible data.

Hence the observed data is much more likely to have come from the first hypothesis: its likelihood is 0.375^6 · 0.125^2 (six observations of 'a' or 'b', two of 'c' or 'd') versus 0.25^8, a ratio of about 2.8, so with equal priors the posterior puts roughly 0.74 on it.

var hypothesisToDist = function(hypothesis) {
  // Hypothesis A favors 'a' and 'b'; hypothesis B is uniform
  return (hypothesis == 'A' ?
          Categorical({vs: ['a', 'b', 'c', 'd'], ps: [0.375, 0.375, 0.125, 0.125]}) :
          Categorical({vs: ['a', 'b', 'c', 'd'], ps: [0.25, 0.25, 0.25, 0.25]}))
}

var observedData = ['a', 'b', 'a', 'b', 'c', 'd', 'b', 'b']

var posterior = Infer({method: 'enumerate'}, function(){
  // Equal prior over the two hypotheses
  var hypothesis = flip() ? 'A' : 'B'
  // Score every observation under the chosen hypothesis
  mapData({data: observedData}, function(d){ observe(hypothesisToDist(hypothesis), d) })
  return hypothesis
})

viz(posterior)

Example: The Rectangle Game

Given a set of points [(x,y)] uniformly sampled from a rectangle, which rectangle is most likely?

By the same size-principle argument as above, the tightest-fitting rectangle: a smaller rectangle assigns a higher density to each point it contains, while any rectangle that excludes a point has zero likelihood.
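
A minimal WebPPL sketch, assuming an axis-aligned rectangle whose corners are drawn uniformly from the unit square (the observed points and the MCMC settings are illustrative, not from the original):

var observedPoints = [{x: 0.40, y: 0.70}, {x: 0.50, y: 0.40}, {x: 0.46, y: 0.63}]

var posterior = Infer({method: 'MCMC', samples: 10000}, function(){
  // Prior: rectangle corners uniform on the unit square
  var x1 = uniform(0, 1)
  var x2 = uniform(0, 1)
  var y1 = uniform(0, 1)
  var y2 = uniform(0, 1)
  var xMin = Math.min(x1, x2)
  var xMax = Math.max(x1, x2)
  var yMin = Math.min(y1, y2)
  var yMax = Math.max(y1, y2)
  // Likelihood: each point is uniform inside the rectangle, so each
  // observation contributes density 1/(width * height) -- smaller
  // rectangles score higher, and a rectangle that excludes any point
  // gets probability zero
  mapData({data: observedPoints}, function(p){
    observe(Uniform({a: xMin, b: xMax}), p.x)
    observe(Uniform({a: yMin, b: yMax}), p.y)
  })
  return {xMin: xMin, xMax: xMax, yMin: yMin, yMax: yMax}
})

Marginal plots of the returned corners (e.g. viz.marginals(posterior)) concentrate on rectangles just large enough to contain all the observed points.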

tags: probabilistic-models-of-cognition