changed 5 years ago
Linked with GitHub

Bayesian Network

Why Bayesian Network?

Automated reasoning in CS

Since very early days of Computer Science, much work has been done to develop computer programs to solve problems which requires a high level of intelligence: artificial intelligence.

First, we have a formal deductive system proposed by John McCarthy (1959) like this

Knowledge base (KB) consists a HUGE set of logical statements written in suitable logic. Inference Engine deduce conclusions based on knowledge in KB and observations. This system dominates early days of AI, focusing on KB and logic:

The limits of formal deduction: common sense reasoning and non-monotonicity

In everyday life, it is common to modify our believes when new observations are made. For example:

When we see a bird, we believe that it can fly. However, if we take a closer look and realize that it has a broken wing. We retract our belief and instead believes that it cannot fly.

HOWEVER, deductive logic cannot do this simple thing

Classical logical systems can never invalidate the deduction by acquiring more knowledge due to the axiom below:

If \(\Delta\) logically implies \(\alpha\), then \(\Delta\land\Gamma\) also logically implies \(\alpha\) for any \(\Gamma\).

This difficulty leads to two main strains of development:
(1) a new generation of logic: non-monotonic logics (Gardenfors, 1988)
(2) probabilistic reasoning (Pearl, 1988), which leads to Bayesian networks

Bayesian Networks

A Bayesian network consists of two parts: a qualitative structure corresponding to a directed acyclic graph (DAG), and a quantative part corresponding to the strength of relationship between variables in the DAG.

For a formal deductive system, a similar system is still adopted when Bayesian models are used. But:
Statements in Logic \(\Rightarrow\) Bayesian networks
Logical Deduction \(\Rightarrow\) laws of probability theory

Basics of Probability Theory

I assume people here know the most basic idea of discrete probability:

For a given variable \(X\) with 4 possible values \(x_1, x_2, x_3, x_4\):
Pr\((x_1)\) = 0.31, Pr\((x_2)\) = 0.15, Pr\((x_3)\) = 0.24 ,Pr\((x_4)\) = 0.30

Joint Probability Distribution

Joint probability is the probability of multiple events/variables/propositions etc.

Example 1:
Tossing an even die. Let event \(x\) be "the outcome is even", let event \(y\) be "the outcome is smaller than 4".

Pr\((x) = \frac{1}{2}\), Pr\((y) = \frac{1}{2}\)
Pr\((x\cap y) =\) Pr(event: the outcome is 2) = \(\frac{1}{6}\neq\) Pr\((x)*\)Pr\((y)\)

Example 2:

world Earthquake Burgalary Alarm Pr\(()\)
\(w_1\) 1 1 1 0.0190
\(w_2\) 1 1 0 0.0010
\(w_3\) 1 0 1 0.0560
\(w_4\) 1 0 0 0.0240
\(w_5\) 0 1 1 0.1620
\(w_6\) 0 1 0 0.0180
\(w_7\) 0 0 1 0.0072
\(w_8\) 0 0 0 0.7128

So based on the table above:

\(Pr(earthquake = true) = Pr(w_1) + Pr(w_2) + Pr(w_3) + Pr(w_4) = 0.1\)
\(Pr(earthquake = true\cap alarm = true) = Pr(w_1) + Pr(w_5) = 0.181\)
\(Pr(earthquake = true \cap burgalary = false) = Pr(w_3) + Pr(w_4) = 0.08\)

Conditional Probability

Conditional probability provides us with a way to reason about the outcome of an event based on partial information.

\[Pr(x\mid y) = \frac{Pr(x\cap y)}{Pr(y)}\]

Example 1:

Given that "the outcome is even", what's the probability that "the outcome is smaller than 4"?
Pr\((x\mid y)\)
\(= \frac{Pr(x\cap y)}{Pr(y)}\)
\(= \frac{1}{6}/\frac{1}{2}\)
\(= \frac{1}{3}\)

On the other hand, we can also ask: Given that "the outcome is smaller than 4", what's the probability that "the outcome is even"?

Pr\((y\mid x)\)
\(= \frac{Pr(y\cap x)}{Pr(x)}\)
\(= \frac{1}{6}/\frac{1}{2}\)
\(= \frac{1}{3}\)

Bayes' Rule

\(Pr(x\mid y) = \frac{Pr(x\cap y)}{Pr(y)} = \frac{Pr(y\cap x)}{Pr(y)} = \frac{Pr(x)P(y\mid x)}{Pr(y)}\Rightarrow\)
\[Pr(x\mid y) = \frac{Pr(x)Pr(y\mid x)}{Pr(y)}\]

prior probability: \(Pr(x)\)
posterior probability: \(Pr(x\mid y)\)

Bayes' rule is often used for inference. Suppose we know the causal relationship between the effect \(y\) and possible causes \(x_1...x_n\), \(\textit{i.e.}\) we know the probability that \(y\) occurs when the cause, \(x_i\), happens:
\[Pr(y\mid x_i)\] is given.

When the effect is observed, we can infer the probability that the cause \(x_i\) happens, which helps identify the cause behind the effect:
\[Pr(x_i\mid y) \]

To compute Pr\((x_i\mid y)\), we need Pr\((x_i)\), Pr\((y\mid x_i)\), and Pr\((y)\).

Sometimes Pr\((y)\) is directly given, but sometimes not. Then we can compute Pr\((y)\) by:
\(P(y) = P(x_1)P(y\mid x_1) + P(x_2)P(y\mid x_2) + ... P(x_n)P(y\mid x_n)\)

Independence

An interesting and important special case arises when the occurrence of \(y\) provides no such information and does not alter the probability that \(x\) has occured.
\(x\) is independent of \(y\): \[P(x\mid y) = P(x)\]

Since \(P(x\mid y) = \frac{P(x\cap y)}{P(y)} = P(x)\), \(P(x\cap y) = P(x)P(y).\) Now we have:
\[P(x\mid y) = P(x) \Leftrightarrow P(x\cap y) = P(x)P(y)\]

Conditional Independence

We say two events \(x_1, x_2\) are conditional independent if
\[P(x_1\cap x_2\mid y) = P(x_1\mid y)P(x_2\mid y)\]

A more intuitive formula can convey conditional independence:
\[P(x_1\mid x_2\cap y) = P(x_1\mid y)\]
In words, this relation states that if \(y\) is known to have occur, the additional knowledge that \(x_2\) occurs does not change the probability about \(x_1\).

What is a Bayesian Network?

\(Q_3\) Pr(\(Q_3\)) \(Q_4\) Pr(\(Q_4\)) \(C\) Pr(\(C\)) \(P\) Pr(\(P\))
true 0.5 true 0.5 true 0.001 true 0.001
false 0.5 false 0.5 false 0.999 false 0.999
\(Q_3\) \(Q_4\) E Pr(\(E\mid Q_3, Q_4\)) \(Q_4\) \(J\) Pr(\(J\mid Q_4\))
true true true 1 true true 0.99
true false true 0 false true 0.20
false true true 0 true false 0.01
false false true 0 false false 0.80
true true false 0
true false false 1
false true false 1
false false false 1
\(C\) \(E\) \(R\) Pr(\(R\mid C, E\)) \(P\) \(R\) \(O\) Pr(\(O\mid P, R\))
true true true 0 true true true 0
true false true 0 true false true 0
false true true 1 false true true 1
false false true 0 false false true 0
true true false 1 true true false 1
true false false 1 true false false 1
false true false 0 false true false 0
false false false 1 false false false 1

Qualitative Structure

The structure of a Bayesian Network indicates the causal relationships between events/variables. For example:
(1) Having \(Q_4\) (question 4 being correct) correct contributes to/causes getting \(E\) (Earned A)
(2) Both \(P\) (Perception Error) and \(R\) (Reported "A") lead to \(O\) (Obersering "A")
(3) \(C\) (clerical error) and \(Q_3\) are irrelevant
(4)

A DAG \((V, E)\) consist a set of variables and a set of edges.

variables:
Representattion of the primitive propositions that we deem relevant to our domain

edges:
Representation of the information about the dependencies between variables, which can be understood as causal relations.

Quantative Probabilities

In addition to the structural information, we also need quantative information about the "strength" of the relationships between events \(\textit{i.e.}\) how strongly they are related. The strength can be

For example, this is a conditional probability table for the variable \(E\).

Definition of Bayesian Network

A Bayesian Network for variables \(Z\) is a pair \((G, \Theta)\), where:
(1) \(G\) is a directed acyclic graph over variables \(Z\), called the network \(\textit{structure}\).
(2) \(\Theta\) is a set of conditional probability tables, one for each variable in \(Z\), called the network \(\textit{parametrization}\).

Independence Encoded in Bayesian Network

Bayesian network can represent different events and causal relationship between events, can they also encode which variables are independent from each other?

A Bayesian Network encodes the structure of state of knowledge. Propositions, events that are causally related are represented by an arrow between variables in the graph; those that are not related are elicited independent in the graph.

Obvious from this graph,
(1) \(O\) is dependent on both \(P\) and \(R\), \(\textit{i.e.}\) whether there is a perception error and whether "A" is reported affect the probability that "A" is observed.
(2) \(J\) is dependent on \(Q_4\), \(\textit{i.e.}\) if question 4 is answered, Jack would confirm the answer.
(3) \(Q_3\) and \(Q_4\) are independent, of course.
(4)

What about other dependence/independence relations?
Strategy 1: calculate from the parameters \(\theta\)s
Two variables \(x, y\) are independent if and only if \(P(x\mid y) = P(x)\)
Two variables \(x, y\) are conditional independent given some other variable \(z\) if and only if \(P(x\mid y, z) = P(x\mid z)\)

Strategy 2: See from the graph
Markovian Assumtion of DAG:
\[I(V, Parents(V), Non\_Descendants(V)) for\,all\, variables\,V\,in\,DAG\]

This is saying that for any variable \(V\), given parents of \(V\), \(V\) is independent from all variables that are not its descendents.


Given \(P\) and \(R\), \(O\) is independent of all other variables since \(O\) does not have descendants.


Given parents of \(E\) (not shown in this fig), \(E\) is independent of its non-descendants such as \(C, P\) and etc.

There are theorems proving that the results from the two strategies are equivalent! Bayesian Network can represent a complicated probabilistic distribution of a bunch of variables!

SO BAYESIAN NETWORK IS GOOD!

How to construct a Bayesian Network?

The construction of a Bayesian network involves three major steps:

  1. pick a set of relevant variables and their possible a prior values
  2. build the network structure adding arrows between the variables
  3. define the conditional probability table for each network variable

Picking variables

There are three types of variables: query variables, evidence variables, intermediate variables. Usually, query variables do not have parents, evidence variables do not have descendants. Intermediate variables have both parents and descendants.

Query variables:

Evidence variables:

Intermediate variables:

To determine query variables, we need to identify those events about which we need to ask questions.

To determine evidence variables, we need to identify those events about which we can collect information.

Intermediate levels are less strict, adding them when necessary and useful.

Adding edges

Edges in Bayesian Networks represent the causal relationshipts between variables. For each variable \(x\), to find its parents, we ask "what is the set of variables that we regard as tthe direct causes of \(x\)"? To find its descendants, we ask "what does \(x\) cause or have influence on?"

This knowledge usually comes from expert knowledge.

Constructing CPTs

CPTs fall into two categories:

  1. CPTs for conditional probabilities such as Pr(\(x\mid y\))
  2. CPTs for a prior single query variables such as Pr(\(y\))

For first category:

For second category:

CPT knowledge is much harder to build than other knowledge. It usually comes from collecting data.

What is Bayesian Network used for? INFERENCE!

There are at least four general types of queries that can be posed with respect to Bayesian network:

  1. probability of some specific set of variables
  2. posteror query given observation
  3. maximum a posteriori hypothesis (MAP)
  4. most probable explanation (MPE)

Evaluation of specific set of values of variables

This is one of the simplest queries that can be made. The computation is completely followed from probabilities and conditional probability tables given.

For example, let's ask what's the probability for an \(A\) to be observed: Pr\((A = true)\).

Capital letters are variables, non-capital letters are values, which is either \(true\) or \(false\) here.

Pr(\(A\))
= \(\Sigma_{p, r, c, e, q_3, q_4, j}\)Pr(\(A = true\mid P = p, R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j\))Pr(\(P = p, R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j\)) (Marginalization)
= \(\Sigma_{p, r, c, e, q_3, q_4, j}\)Pr(\(A = true\mid P = p, R = r\))Pr(\(P = p, R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j\)) (Markovian Assumption)
= \(\Sigma_{p, r, c, e, q_3, q_4, j}\)Pr(\(A = true\mid P = p, R = r\))Pr(\(P = p \mid R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j\))Pr(\(R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j\)) (Chain rule)
= \(\Sigma_{p, r, c, e, q_3, q_4, j}\)Pr(\(A = true\mid P = p, R = r\))Pr(\(P\))Pr(\(R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j\)) (Markovian Assumption)
= (using chain rule in probability theory)

Computation Tools: SAMLAM

Posterior probabilities

The most common query in Bayesian network: when some evidence is observed, how the prior probabilities of other variables can be changed? Bayes' Rule can be used here.

Assume we know:
(1) a prior knowledge about \(\theta\), which is a probabilistic distribution \(P(\theta)\)
(2) the "causal" relationship between \(\theta\) and the occurrence of events: \(P(x\mid\theta)\)
\(\Rightarrow P(\theta\mid x) = \frac{P(\theta)P(x\mid \theta)}{P(x)}\) where \(P(x_i) = \Sigma_{\theta'}P(\theta')P(x\mid\theta)\)

For example, it is observed that \(J\) is false, \(\textit{i.e.}\) Jack didn't confirm the answer. Then what about the probability of \(Q_4\) being true or false?

Based on Bayes' rule:
Pr(\(Q_4 = true\mid J = false\))
= Pr(\(Q_4 = true\))Pr(\(J = false\mid Q_4 = true)/\)Pr\((J = false)\)
= 0.5*0.20/0.405 (all the values are copied from a prior probability tables)
= 0.247

Maximum a posteriori hypothesis (MAP)

Given what we know about some evidence \(x\), what is most likely event to happen?

For example, if we saw that John confirmed the answer, \(\textit{i.e.}\) \(J\) is \(true\), is \(Q_4\) likely to be \(true\) or not?

Formally, given the value \(x\) of observation, we select a value \(\theta'\) of \(\theta\) that maximizes the posterior distribution \(P(\theta\mid x)\), maximizing the posterior probability:
\[\theta' = argmax_{\theta'}P(\theta\mid x)\]

In our example:
\(P(Q_4 = true\mid J = true)\)
\(= (P(Q_4= true)*P(J=true\mid Q_4 = true))/P(J=true)\)
\(= (P(Q_4= true)*P(J=true\mid Q_4 = true))/0.595\)
\(= (0.5*0.99)/1 = 0.832\)

\(P(Q_4 = false\mid J = true)\)
\(= (P(Q_4= false)*P(J=true\mid Q_4 = false))/P(J=true)\)
\(= (P(Q_4= false)*P(J=true\mid Q_4 = false))/0.595\)
\(= (0.2*0.99)/1 = 0.328\)

Since 0.832 > 0.328, \(Q_4\) is likely to be \(true\).

So given that John confirmed the answer, we can confidently infer that question 4 should be answered right.

Most probable explanation

Given the observation, what's the most likely explanation? The goal here is to identify the most probable instantiation of network variables given some evidence.

MPE is a sub-problem of MAP
MPE: given some evidence, ask for the joint probability of all other variables
MAP: given some evidence, ask for the probability of some other variables

Given observed values for some variables, what is the configuration of the whole Bayesian Network that has the highest probability?

Formally, given observed evidence \(x\), MPE refers to the instantiation of all other variables such that their joint probability is highest possible:
\[MPE(x) = argmax_{v_1,v_2...v_3}Pr(V_1=v_1, V_2=v_2...\mid E = x)\]

For example, if "A" is observed, \(\textit{i.e.}\) \(O = true\), how to assign the values for all other variables?
MPE(\(O=true\)) = \(argmax_{p, r, c, e, j, q_3, q_4}\)Pr(\(P = p, R = r, C = c, E = e, J = j, Q_3 = q_3, Q_4 = q_4\mid A = true\))

It's too hard to compute so

Select a repo