Let's take another look at the marbles problem from earlier:
Bag1 → Unknown distribution → Sample1 → Color1
Bag2 → Unknown distribution → Sample2 → Color2
…
BagN → Unknown distribution → SampleN → ColorN
Here we know deterministically that Color1 came from Bag1 and so on. What if we remove this information?
[Bag1, Bag2, …, BagN] → Sample Bag1 → Sample Color1
[Bag1, Bag2, …, BagN] → Sample Bag2 → Sample Color2
…
[Bag1, Bag2, …, BagN] → Sample BagN → Sample ColorN
Here we need to learn two things given some observed data: the color distribution within each bag, and which bag each observation came from.
Notice that in the setup above the bag is selected uniformly at random.
Instead of assuming that a marble is equally likely to come from each bag, we could instead learn a distribution over bags where each bag has a different probability.
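As a concrete illustration, here is a minimal sketch of this generative setup in Python; the color set, the number of bags, and the symmetric Dirichlet priors are illustrative assumptions, not part of the original problem statement:

```python
import numpy as np

rng = np.random.default_rng(0)
colors = ["red", "green", "blue"]   # hypothetical color set
num_bags = 3
num_marbles = 5

# Latent quantities we would like to learn from data; here they are simply
# drawn from their priors to show the generative structure.
bag_weights = rng.dirichlet(np.ones(num_bags))                   # distribution over bags
bag_dists = rng.dirichlet(np.ones(len(colors)), size=num_bags)   # each bag's color distribution

# Generative process for each observation: a latent bag, then an observed color.
for _ in range(num_marbles):
    bag = rng.choice(num_bags, p=bag_weights)     # which bag (unobserved)
    color = rng.choice(colors, p=bag_dists[bag])  # the marble's color (observed)
    print(bag, color)
```

Inference runs this process "in reverse": condition on the observed colors and recover a posterior over both the bag assignments and the per-bag color distributions.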
Problem: Given a document (treated as a bag of words) and a set of topics, determine which topic(s) the document belongs to.
Approach:
Each topic is associated with a distribution over words, and this distribution is drawn from a Dirichlet prior.
[Vocabulary] → [Topic1]
[Vocabulary] → [Topic2]
…
[Vocabulary] → [Topic k]
For each document, mixture weights over the set of k topics are drawn from a Dirichlet prior.
[Dirichlet prior] → [Topic1, Topic2, … Topic k] for a document
For each of the N word positions in the document (where N = length of the document), a topic is drawn from the document's mixture weights, and a word is sampled from that topic's distribution over words.
For each word:
[Topic1, Topic2, … Topic k] → [Topic] → [Word] → condition on (observe) the corresponding word in the document
Intuitively, what's happening here is summarized by the standard LDA graphical model (image source: Latent Dirichlet Allocation (LDA) | NLP-guidance), where:
M denotes the number of documents
N is the number of words in a given document (document i has Ni words)
α is the parameter of the Dirichlet prior on the per-document topic distributions
β is the parameter of the Dirichlet prior on the per-topic word distributions
θi is the topic distribution for document i
zij is the topic for the j-th word in document i
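To make the generative story concrete, here is a minimal sketch of it in Python; the toy vocabulary, the number of topics, and the values of α and β below are made-up assumptions purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["ball", "game", "vote", "law", "gene", "cell"]  # toy vocabulary
K = 3                      # number of topics
alpha, beta = 0.5, 0.5     # Dirichlet hyperparameters (illustrative values)

# One distribution over words per topic, each drawn from a Dirichlet prior.
topic_word = rng.dirichlet(beta * np.ones(len(vocab)), size=K)

def generate_document(n_words=8):
    # Per-document mixture weights (theta_i) over the K topics.
    theta = rng.dirichlet(alpha * np.ones(K))
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                        # topic z_ij for this word
        words.append(rng.choice(vocab, p=topic_word[z]))  # word sampled from that topic
    return words

print(generate_document())
```

Inference then conditions this process on the words actually seen in each document and recovers the per-document topic weights θi and the topic assignments zij.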
Human perception is often skewed by our expectations. A common example of this is called categorical perception: we perceive objects as being more similar to the category prototype than they really are. In phonology this has been particularly important and is called the perceptual magnet effect: hearers regularize a speech sound into the category that they think it corresponds to. Of course this category isn't known a priori, so a hearer must simultaneously infer which category the speech sound corresponds to and what the sound itself must have been.
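Here is a minimal sketch of that simultaneous inference, assuming two Gaussian phonetic categories with known prototype means, a shared within-category spread, and Gaussian perceptual noise (all of the numbers below are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

prototypes = np.array([-1.0, 1.0])  # hypothetical category prototypes
sigma_cat = 0.5                     # within-category variability of produced sounds
sigma_noise = 0.3                   # perceptual noise on the heard signal

def perceive(obs):
    """Posterior-mean estimate of the true sound, given a noisy observation."""
    # Posterior over categories: P(cat | obs) ∝ N(obs; prototype, sigma_cat^2 + sigma_noise^2)
    cat_post = norm.pdf(obs, prototypes, np.sqrt(sigma_cat**2 + sigma_noise**2))
    cat_post /= cat_post.sum()
    # Within each category, the posterior over the true sound is again Gaussian.
    post_var = 1.0 / (1.0 / sigma_cat**2 + 1.0 / sigma_noise**2)
    post_means = post_var * (prototypes / sigma_cat**2 + obs / sigma_noise**2)
    # Perceived value = expectation over both the category and the true sound.
    return float(np.dot(cat_post, post_means))

for obs in (0.2, 0.5, 0.8):
    print(f"heard {obs:+.2f} -> perceived {perceive(obs):+.3f}")
```

With these made-up numbers, each perceived value lands closer to the nearer prototype than the raw observation does.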
The code is simple but the implications are deep.
Notice that the perceived value is the expectation, under the posterior, of the true value given the observed value. The true values consistent with a given observation tend to lie more to one side of it (the side of the nearer category prototype), so this expectation is pulled toward the prototype. This is where the skewness comes in.
In the previous examples the number of categories was fixed. But in many settings we don't know in advance how many categories there are.
The simplest way to address this problem is to place uncertainty on the number of categories itself, in the form of a hierarchical prior; we call these unbounded models.
Example: inferring whether one or two coins were responsible for a set of outcomes (imagine a friend shouting each outcome from the next room, “heads, heads, tails…”: is she using one fair coin, or two biased coins?).
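One way to pose that comparison, sketched with made-up outcome data, uniform priors on the unknown biases, and a 50/50 prior over the two hypotheses (all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
outcomes = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])  # 1 = heads (made-up data)

# Hypothesis 1: a single fair coin.
lik_one_fair = 0.5 ** len(outcomes)

# Hypothesis 2: two coins with unknown biases, each flip coming from one of them
# chosen at random. Estimate the marginal likelihood by Monte Carlo over the biases.
n_samples = 100_000
p1 = rng.uniform(size=n_samples)
p2 = rng.uniform(size=n_samples)
p_heads = 0.5 * p1 + 0.5 * p2                     # per-flip probability of heads
per_flip = np.where(outcomes == 1, p_heads[:, None], 1 - p_heads[:, None])
lik_two_biased = np.prod(per_flip, axis=1).mean()

# Posterior over the hypotheses under a 50/50 prior between them.
posterior_two = lik_two_biased / (lik_two_biased + lik_one_fair)
print(f"P(two biased coins | data) ≈ {posterior_two:.2f}")
```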
We can extend the idea to a larger number of bags by placing a Poisson prior on the number of bags:
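A sketch of that extension, reusing the marble setup from above (the colors and the Poisson rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
colors = ["red", "green", "blue"]

def sample_unbounded_marble(lam=2.0):
    # Uncertainty over how many bags exist: 1 + Poisson(lam) keeps the count >= 1.
    num_bags = 1 + rng.poisson(lam)
    # Each bag gets its own color distribution from a symmetric Dirichlet prior.
    bag_dists = rng.dirichlet(np.ones(len(colors)), size=num_bags)
    # A marble: pick a bag uniformly at random, then a color from that bag.
    bag = rng.integers(num_bags)
    color = rng.choice(colors, p=bag_dists[bag])
    return num_bags, bag, color

print(sample_unbounded_marble())
```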
But is the number of categories infinite here? No!
In an unbounded model there is always a finite number of categories; that number is simply drawn from a prior distribution with unbounded support, such as the Poisson.
An alternative is to use infinite mixture models.
Consider the discrete probability distribution:
[a,b,c,d]
where a+b+c+d=1
It can be interpreted as:
a / (a+b+c+d)
b / (b+c+d)
c / (c+d)
d / (d)
Note that the last number is always 1 and all other numbers are always between 0 and 1.
Conversely, we can convert any list whose entries are between 0 and 1, except for a final entry equal to 1, back into a distribution:
[p, q, r, 1]
means the probabilities
p, (1−p)q, (1−p)(1−q)r, (1−p)(1−q)(1−r)
Thus it can be converted into a discrete probability distribution of length 4.
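A quick sketch of the conversion in both directions (the function names are my own):

```python
def fractions_to_probs(fracs):
    """[p, q, r, ..., 1] -> a probability distribution of the same length."""
    probs, remaining = [], 1.0
    for f in fracs:
        probs.append(remaining * f)   # take fraction f of what is left
        remaining *= (1.0 - f)        # keep the rest for later entries
    return probs

def probs_to_fractions(probs):
    """A probability distribution -> residual fractions ending in (approximately) 1."""
    fracs, remaining = [], 1.0
    for p in probs:
        fracs.append(p / remaining)
        remaining -= p
    return fracs

print(fractions_to_probs([0.5, 0.4, 0.25, 1.0]))  # ≈ [0.5, 0.2, 0.075, 0.225]
print(probs_to_fractions([0.1, 0.2, 0.3, 0.4]))   # last fraction is ≈ 1.0
```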
But what if we never decide to stop? In other words, what if all the numbers [p, q, r, s, …] are between 0 and 1, with no final 1? Whenever we sample from this distribution, it will still stop at a random finite index, but that index can be anywhere between 1 and infinity.
This can be modeled as:
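One way to sketch that sampling process: walk down the (conceptually infinite) list of fractions and stop when a biased coin flip succeeds. Here the fraction list is passed in as a function, so it never has to be built in full:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_index(fraction_at):
    """Stop at index i with probability fraction_at(i), given we got that far."""
    i = 0
    while rng.random() >= fraction_at(i):
        i += 1
    return i

# Example with the same (hypothetical) fraction at every index.
print(sample_index(lambda i: 0.3))   # a geometric-style stopping index
```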
But how do we create this infinitely long array of probabilities? In other words, how does one set a prior over an infinite set of bags?
Thus we can construct an infinite mixture model like so:
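Here is a minimal sketch of the standard stick-breaking answer, assuming the residual fractions are independent Beta(1, α) draws and using memoization so that each fraction, and each bag's color distribution, is only created the first time it is needed (the colors and α are illustrative):

```python
import functools
import numpy as np

rng = np.random.default_rng(0)
colors = ["red", "green", "blue"]
alpha = 1.0   # concentration: larger alpha spreads marbles over more bags

@functools.lru_cache(maxsize=None)
def residual_fraction(i):
    # The i-th entry of the infinite fraction list, created lazily on first use.
    return rng.beta(1.0, alpha)

@functools.lru_cache(maxsize=None)
def bag_color_dist(i):
    # Each bag's own color distribution, also created lazily.
    return rng.dirichlet(np.ones(len(colors)))

def sample_bag():
    # Stick-breaking: stop at bag i with probability residual_fraction(i).
    i = 0
    while rng.random() >= residual_fraction(i):
        i += 1
    return i

def sample_marble():
    bag = sample_bag()
    return bag, rng.choice(colors, p=bag_color_dist(bag))

print([sample_marble() for _ in range(10)])
```

Only the bags that actually get used are ever instantiated, so the model behaves as if there were infinitely many bags without ever constructing them all.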
Source: Probabilistic Models of Cognition (probmods.org)