Since very early days of Computer Science, much work has been done to develop computer programs to solve problems which requires a high level of intelligence: artificial intelligence.
First, we have a formal deductive system proposed by John McCarthy (1959) like this …
Knowledge base (KB) consists a HUGE set of logical statements written in suitable logic. Inference Engine deduce conclusions based on knowledge in KB and observations. This system dominates early days of AI, focusing on KB and logic:
In everyday life, it is common to modify our believes when new observations are made. For example:
When we see a bird, we believe that it can fly. However, if we take a closer look and realize that it has a broken wing. We retract our belief and instead believes that it cannot fly.
HOWEVER, deductive logic cannot do this simple thing …
Classical logical systems can never invalidate the deduction by acquiring more knowledge due to the axiom below:
If \(\Delta\) logically implies \(\alpha\), then \(\Delta\land\Gamma\) also logically implies \(\alpha\) for any \(\Gamma\).
This difficulty leads to two main strains of development:
(1) a new generation of logic: non-monotonic logics (Gardenfors, 1988)
(2) probabilistic reasoning (Pearl, 1988), which leads to Bayesian networks
A Bayesian network consists of two parts: a qualitative structure corresponding to a directed acyclic graph (DAG), and a quantative part corresponding to the strength of relationship between variables in the DAG.
For a formal deductive system, a similar system is still adopted when Bayesian models are used. But:
Statements in Logic \(\Rightarrow\) Bayesian networks
Logical Deduction \(\Rightarrow\) laws of probability theory
I assume people here know the most basic idea of discrete probability:
For a given variable \(X\) with 4 possible values \(x_1, x_2, x_3, x_4\):
Pr\((x_1)\) = 0.31, Pr\((x_2)\) = 0.15, Pr\((x_3)\) = 0.24 ,Pr\((x_4)\) = 0.30
Joint probability is the probability of multiple events/variables/propositions etc.
Example 1:
Tossing an even die. Let event \(x\) be "the outcome is even", let event \(y\) be "the outcome is smaller than 4".
Pr\((x) = \frac{1}{2}\), Pr\((y) = \frac{1}{2}\)
Pr\((x\cap y) =\) Pr(event: the outcome is 2) = \(\frac{1}{6}\neq\) Pr\((x)*\)Pr\((y)\)
Example 2:
world | Earthquake | Burgalary | Alarm | Pr\(()\) |
---|---|---|---|---|
\(w_1\) | 1 | 1 | 1 | 0.0190 |
\(w_2\) | 1 | 1 | 0 | 0.0010 |
\(w_3\) | 1 | 0 | 1 | 0.0560 |
\(w_4\) | 1 | 0 | 0 | 0.0240 |
\(w_5\) | 0 | 1 | 1 | 0.1620 |
\(w_6\) | 0 | 1 | 0 | 0.0180 |
\(w_7\) | 0 | 0 | 1 | 0.0072 |
\(w_8\) | 0 | 0 | 0 | 0.7128 |
So based on the table above:
\(Pr(earthquake = true) = Pr(w_1) + Pr(w_2) + Pr(w_3) + Pr(w_4) = 0.1\)
\(Pr(earthquake = true\cap alarm = true) = Pr(w_1) + Pr(w_5) = 0.181\)
\(Pr(earthquake = true \cap burgalary = false) = Pr(w_3) + Pr(w_4) = 0.08\)
…
Conditional probability provides us with a way to reason about the outcome of an event based on partial information.
\[Pr(x\mid y) = \frac{Pr(x\cap y)}{Pr(y)}\]
Example 1:
Given that "the outcome is even", what's the probability that "the outcome is smaller than 4"?
Pr\((x\mid y)\)
\(= \frac{Pr(x\cap y)}{Pr(y)}\)
\(= \frac{1}{6}/\frac{1}{2}\)
\(= \frac{1}{3}\)
On the other hand, we can also ask: Given that "the outcome is smaller than 4", what's the probability that "the outcome is even"?
Pr\((y\mid x)\)
\(= \frac{Pr(y\cap x)}{Pr(x)}\)
\(= \frac{1}{6}/\frac{1}{2}\)
\(= \frac{1}{3}\)
\(Pr(x\mid y) = \frac{Pr(x\cap y)}{Pr(y)} = \frac{Pr(y\cap x)}{Pr(y)} = \frac{Pr(x)P(y\mid x)}{Pr(y)}\Rightarrow\)
\[Pr(x\mid y) = \frac{Pr(x)Pr(y\mid x)}{Pr(y)}\]
prior probability: \(Pr(x)\)
posterior probability: \(Pr(x\mid y)\)
Bayes' rule is often used for inference. Suppose we know the causal relationship between the effect \(y\) and possible causes \(x_1...x_n\), \(\textit{i.e.}\) we know the probability that \(y\) occurs when the cause, \(x_i\), happens:
\[Pr(y\mid x_i)\] is given.
When the effect is observed, we can infer the probability that the cause \(x_i\) happens, which helps identify the cause behind the effect:
\[Pr(x_i\mid y) \]
To compute Pr\((x_i\mid y)\), we need Pr\((x_i)\), Pr\((y\mid x_i)\), and Pr\((y)\).
Sometimes Pr\((y)\) is directly given, but sometimes not. Then we can compute Pr\((y)\) by:
\(P(y) = P(x_1)P(y\mid x_1) + P(x_2)P(y\mid x_2) + ... P(x_n)P(y\mid x_n)\)
An interesting and important special case arises when the occurrence of \(y\) provides no such information and does not alter the probability that \(x\) has occured.
\(x\) is independent of \(y\): \[P(x\mid y) = P(x)\]
Since \(P(x\mid y) = \frac{P(x\cap y)}{P(y)} = P(x)\), \(P(x\cap y) = P(x)P(y).\) Now we have:
\[P(x\mid y) = P(x) \Leftrightarrow P(x\cap y) = P(x)P(y)\]
We say two events \(x_1, x_2\) are conditional independent if
\[P(x_1\cap x_2\mid y) = P(x_1\mid y)P(x_2\mid y)\]
A more intuitive formula can convey conditional independence:
\[P(x_1\mid x_2\cap y) = P(x_1\mid y)\]
In words, this relation states that if \(y\) is known to have occur, the additional knowledge that \(x_2\) occurs does not change the probability about \(x_1\).
\(Q_3\) | Pr(\(Q_3\)) | \(Q_4\) | Pr(\(Q_4\)) | \(C\) | Pr(\(C\)) | \(P\) | Pr(\(P\)) |
---|---|---|---|---|---|---|---|
true | 0.5 | true | 0.5 | true | 0.001 | true | 0.001 |
false | 0.5 | false | 0.5 | false | 0.999 | false | 0.999 |
\(Q_3\) | \(Q_4\) | E | Pr(\(E\mid Q_3, Q_4\)) | \(Q_4\) | \(J\) | Pr(\(J\mid Q_4\)) |
---|---|---|---|---|---|---|
true | true | true | 1 | true | true | 0.99 |
true | false | true | 0 | false | true | 0.20 |
false | true | true | 0 | true | false | 0.01 |
false | false | true | 0 | false | false | 0.80 |
true | true | false | 0 | |||
true | false | false | 1 | |||
false | true | false | 1 | |||
false | false | false | 1 |
\(C\) | \(E\) | \(R\) | Pr(\(R\mid C, E\)) | \(P\) | \(R\) | \(O\) | Pr(\(O\mid P, R\)) |
---|---|---|---|---|---|---|---|
true | true | true | 0 | true | true | true | 0 |
true | false | true | 0 | true | false | true | 0 |
false | true | true | 1 | false | true | true | 1 |
false | false | true | 0 | false | false | true | 0 |
true | true | false | 1 | true | true | false | 1 |
true | false | false | 1 | true | false | false | 1 |
false | true | false | 0 | false | true | false | 0 |
false | false | false | 1 | false | false | false | 1 |
The structure of a Bayesian Network indicates the causal relationships between events/variables. For example:
(1) Having \(Q_4\) (question 4 being correct) correct contributes to/causes getting \(E\) (Earned A)
(2) Both \(P\) (Perception Error) and \(R\) (Reported "A") lead to \(O\) (Obersering "A")
(3) \(C\) (clerical error) and \(Q_3\) are irrelevant
(4) …
A DAG \((V, E)\) consist a set of variables and a set of edges.
variables:
Representattion of the primitive propositions that we deem relevant to our domain
edges:
Representation of the information about the dependencies between variables, which can be understood as causal relations.
In addition to the structural information, we also need quantative information about the "strength" of the relationships between events \(\textit{i.e.}\) how strongly they are related. The strength can be
For example, this is a conditional probability table for the variable \(E\).
A Bayesian Network for variables \(Z\) is a pair \((G, \Theta)\), where:
(1) \(G\) is a directed acyclic graph over variables \(Z\), called the network \(\textit{structure}\).
(2) \(\Theta\) is a set of conditional probability tables, one for each variable in \(Z\), called the network \(\textit{parametrization}\).
Bayesian network can represent different events and causal relationship between events, can they also encode which variables are independent from each other?
A Bayesian Network encodes the structure of state of knowledge. Propositions, events that are causally related are represented by an arrow between variables in the graph; those that are not related are elicited independent in the graph.
Obvious from this graph,
(1) \(O\) is dependent on both \(P\) and \(R\), \(\textit{i.e.}\) whether there is a perception error and whether "A" is reported affect the probability that "A" is observed.
(2) \(J\) is dependent on \(Q_4\), \(\textit{i.e.}\) if question 4 is answered, Jack would confirm the answer.
(3) \(Q_3\) and \(Q_4\) are independent, of course.
(4) …
What about other dependence/independence relations?
Strategy 1: calculate from the parameters \(\theta\)s
Two variables \(x, y\) are independent if and only if \(P(x\mid y) = P(x)\)
Two variables \(x, y\) are conditional independent given some other variable \(z\) if and only if \(P(x\mid y, z) = P(x\mid z)\)
Strategy 2: See from the graph
Markovian Assumtion of DAG:
\[I(V, Parents(V), Non\_Descendants(V)) for\,all\, variables\,V\,in\,DAG\]
This is saying that for any variable \(V\), given parents of \(V\), \(V\) is independent from all variables that are not its descendents.
Given \(P\) and \(R\), \(O\) is independent of all other variables since \(O\) does not have descendants.
Given parents of \(E\) (not shown in this fig), \(E\) is independent of its non-descendants such as \(C, P\) and etc.
There are theorems proving that the results from the two strategies are equivalent! Bayesian Network can represent a complicated probabilistic distribution of a bunch of variables!
SO BAYESIAN NETWORK IS GOOD!
The construction of a Bayesian network involves three major steps:
There are three types of variables: query variables, evidence variables, intermediate variables. Usually, query variables do not have parents, evidence variables do not have descendants. Intermediate variables have both parents and descendants.
Query variables:
Evidence variables:
Intermediate variables:
To determine query variables, we need to identify those events about which we need to ask questions.
To determine evidence variables, we need to identify those events about which we can collect information.
Intermediate levels are less strict, adding them when necessary and useful.
Edges in Bayesian Networks represent the causal relationshipts between variables. For each variable \(x\), to find its parents, we ask "what is the set of variables that we regard as tthe direct causes of \(x\)"? To find its descendants, we ask "what does \(x\) cause or have influence on?"
This knowledge usually comes from expert knowledge.
CPTs fall into two categories:
For first category:
For second category:
CPT knowledge is much harder to build than other knowledge. It usually comes from collecting data.
There are at least four general types of queries that can be posed with respect to Bayesian network:
This is one of the simplest queries that can be made. The computation is completely followed from probabilities and conditional probability tables given.
For example, let's ask what's the probability for an \(A\) to be observed: Pr\((A = true)\).
Capital letters are variables, non-capital letters are values, which is either \(true\) or \(false\) here.
Pr(\(A\))
= \(\Sigma_{p, r, c, e, q_3, q_4, j}\)Pr(\(A = true\mid P = p, R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j\))Pr(\(P = p, R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j\)) (Marginalization)
= \(\Sigma_{p, r, c, e, q_3, q_4, j}\)Pr(\(A = true\mid P = p, R = r\))Pr(\(P = p, R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j\)) (Markovian Assumption)
= \(\Sigma_{p, r, c, e, q_3, q_4, j}\)Pr(\(A = true\mid P = p, R = r\))Pr(\(P = p \mid R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j\))Pr(\(R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j\)) (Chain rule)
= \(\Sigma_{p, r, c, e, q_3, q_4, j}\)Pr(\(A = true\mid P = p, R = r\))Pr(\(P\))Pr(\(R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j\)) (Markovian Assumption)
= … (using chain rule in probability theory)
Computation Tools: SAMLAM
The most common query in Bayesian network: when some evidence is observed, how the prior probabilities of other variables can be changed? Bayes' Rule can be used here.
Assume we know:
(1) a prior knowledge about \(\theta\), which is a probabilistic distribution \(P(\theta)\)
(2) the "causal" relationship between \(\theta\) and the occurrence of events: \(P(x\mid\theta)\)
\(\Rightarrow P(\theta\mid x) = \frac{P(\theta)P(x\mid \theta)}{P(x)}\) where \(P(x_i) = \Sigma_{\theta'}P(\theta')P(x\mid\theta)\)
For example, it is observed that \(J\) is false, \(\textit{i.e.}\) Jack didn't confirm the answer. Then what about the probability of \(Q_4\) being true or false?
Based on Bayes' rule:
Pr(\(Q_4 = true\mid J = false\))
= Pr(\(Q_4 = true\))Pr(\(J = false\mid Q_4 = true)/\)Pr\((J = false)\)
= 0.5*0.20/0.405 (all the values are copied from a prior probability tables)
= 0.247
Given what we know about some evidence \(x\), what is most likely event to happen?
For example, if we saw that John confirmed the answer, \(\textit{i.e.}\) \(J\) is \(true\), is \(Q_4\) likely to be \(true\) or not?
Formally, given the value \(x\) of observation, we select a value \(\theta'\) of \(\theta\) that maximizes the posterior distribution \(P(\theta\mid x)\), maximizing the posterior probability:
\[\theta' = argmax_{\theta'}P(\theta\mid x)\]
In our example:
\(P(Q_4 = true\mid J = true)\)
\(= (P(Q_4= true)*P(J=true\mid Q_4 = true))/P(J=true)\)
\(= (P(Q_4= true)*P(J=true\mid Q_4 = true))/0.595\)
\(= (0.5*0.99)/1 = 0.832\)
\(P(Q_4 = false\mid J = true)\)
\(= (P(Q_4= false)*P(J=true\mid Q_4 = false))/P(J=true)\)
\(= (P(Q_4= false)*P(J=true\mid Q_4 = false))/0.595\)
\(= (0.2*0.99)/1 = 0.328\)
Since 0.832 > 0.328, \(Q_4\) is likely to be \(true\).
So given that John confirmed the answer, we can confidently infer that question 4 should be answered right.
Given the observation, what's the most likely explanation? The goal here is to identify the most probable instantiation of network variables given some evidence.
MPE is a sub-problem of MAP
MPE: given some evidence, ask for the joint probability of all other variables
MAP: given some evidence, ask for the probability of some other variables
Given observed values for some variables, what is the configuration of the whole Bayesian Network that has the highest probability?
Formally, given observed evidence \(x\), MPE refers to the instantiation of all other variables such that their joint probability is highest possible:
\[MPE(x) = argmax_{v_1,v_2...v_3}Pr(V_1=v_1, V_2=v_2...\mid E = x)\]
For example, if "A" is observed, \(\textit{i.e.}\) \(O = true\), how to assign the values for all other variables?
MPE(\(O=true\)) = \(argmax_{p, r, c, e, j, q_3, q_4}\)Pr(\(P = p, R = r, C = c, E = e, J = j, Q_3 = q_3, Q_4 = q_4\mid A = true\))
It's too hard to compute so …