Humans choose the least complex hypothesis that fits the data well.
In Bayesian models both complexity and fit to the data are measured semantically. Complexity is measured by flexibility: the ability to generate a more diverse set of observations.
Since all probabilities must sum to 1, a complex model spreads its probability mass over a larger number of possible observations, whereas a simple model concentrates high probability on a smaller set of events. Hence, for an observation consistent with both:
P(simple hypothesis | event) > P(complex hypothesis | event)
Of the hypotheses that generate data uniformly, the one with the smallest extension that is still consistent with the data is the most probable.
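This follows from a one-line application of Bayes' rule, assuming equal priors (not stated explicitly above) and that each hypothesis generates observations uniformly over its extension of size |h|:

```latex
P(h \mid d) \propto P(d \mid h)\,P(h), \qquad
P(d \mid h) = \frac{1}{|h|} \;\text{for } d \in h
\;\;\Longrightarrow\;\;
\frac{P(h_{\text{simple}} \mid d)}{P(h_{\text{complex}} \mid d)}
  = \frac{|h_{\text{complex}}|}{|h_{\text{simple}}|} > 1
```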
Consider two possible hypotheses, each generating observations uniformly, where the first has an extension twice the size of the second. Suppose we observe a single sample that is consistent with both. Which hypothesis is more likely? The smaller one! In fact, the smaller hypothesis is twice as likely as the first (larger) one, as the sketch below shows.
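A minimal sketch of this comparison, with hypothetical extensions chosen only to match the 2:1 size ratio and equal priors assumed:

```python
# Size principle: posterior over two uniform hypotheses after one observation.
# The sets below are illustrative placeholders with a 2:1 size ratio.

def likelihood(observation, extension):
    """P(observation | hypothesis) for a hypothesis that samples uniformly."""
    return 1.0 / len(extension) if observation in extension else 0.0

hypotheses = {
    "big":   {"a", "b", "c", "d"},   # more flexible hypothesis
    "small": {"a", "b"},             # half the extension
}
prior = {"big": 0.5, "small": 0.5}   # equal prior belief

observation = "a"                    # consistent with both hypotheses

# Bayes' rule: posterior is proportional to likelihood x prior, then normalize.
unnormalized = {h: likelihood(observation, ext) * prior[h]
                for h, ext in hypotheses.items()}
total = sum(unnormalized.values())
posterior = {h: p / total for h, p in unnormalized.items()}

print(posterior)  # {'big': 0.333..., 'small': 0.666...} -> small is twice as likely
```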
As we observe more data, what does our learning look like? Take a look here:
https://github.com/vinsis/math-and-ml-notes/blob/master/images/size_principle.svg
Now consider two hypotheses that can both generate the observed data, but where the second spreads its probability over a much larger set of possible observations than the first.
The Bayesian Occam’s razor says that, all else being equal, the hypothesis that assigns the highest likelihood to the data will dominate the posterior. Because of the law of conservation of belief, assigning higher likelihood to the observed data requires assigning lower likelihood to other possible data.
Hence the observed data is much more likely to have come from the first hypothesis.
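A sketch of how this likelihood advantage compounds over a sequence of observations, again with hypothetical extensions (a tight hypothesis of size 3 and a flexible one of size 6) and equal priors assumed:

```python
# Bayesian Occam's razor: with repeated observations, the hypothesis that
# concentrates its probability on the observed data dominates the posterior.
# Extensions and data below are illustrative placeholders.

def sequence_likelihood(data, extension):
    """P(data | hypothesis) assuming i.i.d. uniform draws from the extension."""
    p = 1.0
    for d in data:
        p *= 1.0 / len(extension) if d in extension else 0.0
    return p

tight    = {"a", "b", "c"}                     # first hypothesis (small extension)
flexible = {"a", "b", "c", "d", "e", "f"}      # second hypothesis (large extension)

data = ["a", "b", "a", "c", "b"]               # every observation fits both hypotheses

lik_tight = sequence_likelihood(data, tight)       # (1/3)^5
lik_flex  = sequence_likelihood(data, flexible)    # (1/6)^5

# With equal priors, the posterior odds equal the likelihood ratio.
print(lik_tight / lik_flex)   # 2^5 = 32 -> the tight hypothesis dominates
```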
Given a set of points `[(x, y), ...]` uniformly sampled from an unknown rectangle, which rectangle is most likely to have generated them?
By the same argument as above, the tightest-fitting rectangle: each point contributes a likelihood factor of 1/area, so the smallest rectangle that still contains every point assigns the data the highest likelihood.
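A sketch under the assumption that each candidate hypothesis is an axis-aligned rectangle and the points are drawn i.i.d. uniformly from it; the candidate rectangles and data are hypothetical:

```python
# The tightest axis-aligned rectangle containing the points maximizes the
# likelihood: each point contributes a factor 1/area, so a smaller area wins
# as long as the rectangle still contains every point.

def rect_likelihood(points, rect):
    """P(points | rect) for i.i.d. uniform sampling from an axis-aligned rectangle.
    rect = (x_min, x_max, y_min, y_max)."""
    x_min, x_max, y_min, y_max = rect
    area = (x_max - x_min) * (y_max - y_min)
    if any(not (x_min <= x <= x_max and y_min <= y <= y_max) for x, y in points):
        return 0.0                     # rectangle inconsistent with the data
    return (1.0 / area) ** len(points)

points = [(0.2, 0.3), (0.5, 0.9), (0.8, 0.4), (0.3, 0.7)]   # illustrative data

loose = (0.0, 2.0, 0.0, 2.0)                                 # a loose hypothesis
xs, ys = zip(*points)
tight = (min(xs), max(xs), min(ys), max(ys))                 # bounding box of the data

print(rect_likelihood(points, loose))   # small likelihood (area = 4)
print(rect_likelihood(points, tight))   # much larger likelihood (area = 0.36)
```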
Source: *Probabilistic Models of Cognition*