owned this note
owned this note
Published
Linked with GitHub
---
tags: MathLing_presentation
---
Bayesian Network
===
<!-- table of content -->
[TOC]
<!--
css: number headings,
choose none or one
-->
{%hackmd 4StIcxzCSv--MU1mxH8cQg %}
# Why Bayesian Network?
## Automated reasoning in CS
Since very early days of Computer Science, much work has been done to develop computer programs to solve problems which requires a high level of intelligence: artificial intelligence.
First, we have a **formal deductive system** proposed by John McCarthy ([1959](https://www.sciencedirect.com/science/article/pii/S0049237X09700990)) like this ...
![](https://i.imgur.com/j6riSrN.png)
Knowledge base (KB) consists a HUGE set of logical statements written in suitable logic. Inference Engine deduce conclusions based on knowledge in KB and observations. This system dominates early days of AI, focusing on **KB and logic**:
![](https://i.imgur.com/QaLPjMr.png)
## The limits of formal deduction: common sense reasoning and non-monotonicity
In everyday life, it is common to modify our believes when new observations are made. For example:
> When we see a bird, we believe that it can fly. However, if we take a closer look and realize that it has a broken wing. We **retract** our belief and instead believes that it cannot fly.
HOWEVER, deductive logic cannot do this simple thing ...
Classical logical systems can never invalidate the deduction by acquiring more knowledge due to the axiom below:
If $\Delta$ logically implies $\alpha$, then $\Delta\land\Gamma$ also logically implies $\alpha$ for any $\Gamma$.
This difficulty leads to two main strains of development:
(1) a new generation of logic: non-monotonic logics ([Gardenfors, 1988](https://dl.acm.org/citation.cfm?id=1029726))
(2) probabilistic reasoning ([Pearl, 1988](https://www.sciencedirect.com/book/9780080514895/probabilistic-reasoning-in-intelligent-systems)), which leads to Bayesian networks
## Bayesian Networks
![](https://i.imgur.com/Qt0GWC1.png)
A Bayesian network consists of two parts: a qualitative structure corresponding to a directed acyclic graph (DAG), and a quantative part corresponding to the strength of relationship between variables in the DAG.
For a formal deductive system, a similar system is still adopted when Bayesian models are used. But:
Statements in Logic $\Rightarrow$ Bayesian networks
Logical Deduction $\Rightarrow$ laws of probability theory
![](https://i.imgur.com/49cwkDZ.png)
# Basics of Probability Theory
I assume people here know the most basic idea of discrete probability:
For a given variable $X$ with 4 possible values $x_1, x_2, x_3, x_4$:
Pr$(x_1)$ = 0.31, Pr$(x_2)$ = 0.15, Pr$(x_3)$ = 0.24 ,Pr$(x_4)$ = 0.30
## Joint Probability Distribution
Joint probability is the probability of multiple events/variables/propositions etc.
Example 1:
Tossing an even die. Let event $x$ be "the outcome is even", let event $y$ be "the outcome is smaller than 4".
Pr$(x) = \frac{1}{2}$, Pr$(y) = \frac{1}{2}$
Pr$(x\cap y) =$ Pr(event: the outcome is 2) = $\frac{1}{6}\neq$ Pr$(x)*$Pr$(y)$
Example 2:
| world | Earthquake | Burgalary | Alarm | Pr$()$ |
| -------- | -------- | -------- | -------- | --------|
| $w_1$ | 1 | 1 | 1 | 0.0190 |
| $w_2$ | 1 | 1 | 0 | 0.0010 |
| $w_3$ | 1 | 0 | 1 | 0.0560 |
| $w_4$ | 1 | 0 | 0 | 0.0240 |
| $w_5$ | 0 | 1 | 1 | 0.1620 |
| $w_6$ | 0 | 1 | 0 | 0.0180 |
| $w_7$ | 0 | 0 | 1 | 0.0072 |
| $w_8$ | 0 | 0 | 0 | 0.7128 |
So based on the table above:
$Pr(earthquake = true) = Pr(w_1) + Pr(w_2) + Pr(w_3) + Pr(w_4) = 0.1$
$Pr(earthquake = true\cap alarm = true) = Pr(w_1) + Pr(w_5) = 0.181$
$Pr(earthquake = true \cap burgalary = false) = Pr(w_3) + Pr(w_4) = 0.08$
...
## Conditional Probability
Conditional probability provides us with a way to reason about the outcome of an event based on partial information.
$$Pr(x\mid y) = \frac{Pr(x\cap y)}{Pr(y)}$$
Example 1:
![](https://i.imgur.com/TC8rQ20.png =300x220)
Given that "the outcome is even", what's the probability that "the outcome is smaller than 4"?
Pr$(x\mid y)$
$= \frac{Pr(x\cap y)}{Pr(y)}$
$= \frac{1}{6}/\frac{1}{2}$
$= \frac{1}{3}$
On the other hand, we can also ask: Given that "the outcome is smaller than 4", what's the probability that "the outcome is even"?
Pr$(y\mid x)$
$= \frac{Pr(y\cap x)}{Pr(x)}$
$= \frac{1}{6}/\frac{1}{2}$
$= \frac{1}{3}$
## Bayes' Rule
$Pr(x\mid y) = \frac{Pr(x\cap y)}{Pr(y)} = \frac{Pr(y\cap x)}{Pr(y)} = \frac{Pr(x)P(y\mid x)}{Pr(y)}\Rightarrow$
$$Pr(x\mid y) = \frac{Pr(x)Pr(y\mid x)}{Pr(y)}$$
prior probability: $Pr(x)$
posterior probability: $Pr(x\mid y)$
Bayes' rule is often used for inference. Suppose we know the causal relationship between the effect $y$ and possible causes $x_1...x_n$, $\textit{i.e.}$ we know the probability that $y$ occurs when the cause, $x_i$, happens:
$$Pr(y\mid x_i)$$ is given.
When the effect is observed, we can infer the probability that the cause $x_i$ happens, which helps identify the cause behind the effect:
$$Pr(x_i\mid y) $$
To compute Pr$(x_i\mid y)$, we need Pr$(x_i)$, Pr$(y\mid x_i)$, and Pr$(y)$.
Sometimes Pr$(y)$ is directly given, but sometimes not. Then we can compute Pr$(y)$ by:
$P(y) = P(x_1)P(y\mid x_1) + P(x_2)P(y\mid x_2) + ... P(x_n)P(y\mid x_n)$
## Independence
An interesting and important special case arises when the occurrence of $y$ provides no such information and does not alter the probability that $x$ has occured.
$x$ is independent of $y$: $$P(x\mid y) = P(x)$$
Since $P(x\mid y) = \frac{P(x\cap y)}{P(y)} = P(x)$, $P(x\cap y) = P(x)P(y).$ Now we have:
$$P(x\mid y) = P(x) \Leftrightarrow P(x\cap y) = P(x)P(y)$$
### Conditional Independence
We say two events $x_1, x_2$ are conditional independent if
$$P(x_1\cap x_2\mid y) = P(x_1\mid y)P(x_2\mid y)$$
A more intuitive formula can convey conditional independence:
$$P(x_1\mid x_2\cap y) = P(x_1\mid y)$$
In words, this relation states that if $y$ is known to have occur, the additional knowledge that $x_2$ occurs does not change the probability about $x_1$.
# What is a Bayesian Network?
![](https://i.imgur.com/aNEUMVa.png)
$Q_3$| Pr($Q_3$) | $Q_4$ | Pr($Q_4$) | $C$ | Pr($C$)|$P$|Pr($P$)
--- | --- | --- | --- | --- | --- |---|---
true | 0.5 | true | 0.5 | true | 0.001 |true|0.001
false|0.5 | false | 0.5 |false | 0.999 |false|0.999
| $Q_3$ | $Q_4$ | E | Pr($E\mid Q_3, Q_4$) | $Q_4$ | $J$ | Pr($J\mid Q_4$) |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| true | true | true | 1 | true | true | 0.99 |
| true | false | true | 0 | false | true | 0.20 |
| false | true | true | 0 | true | false | 0.01 |
| false | false | true | 0 | false | false | 0.80 |
| true | true | false | 0 | | | |
| true | false | false | 1 | | | |
| false | true | false | 1 | | | |
| false | false | false | 1 | | | |
| $C$ | $E$ | $R$ | Pr($R\mid C, E$) | $P$ | $R$ | $O$ | Pr($O\mid P, R$) |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| true | true | true | 0 | true | true | true | 0 |
| true | false | true | 0 | true | false | true | 0 |
| false | true | true | 1 | false | true | true | 1 |
| false | false | true | 0 | false | false | true | 0 |
| true | true | false | 1 | true | true | false | 1 |
| true | false | false | 1 | true | false | false | 1 |
| false | true | false | 0 | false | true | false | 0 |
| false | false | false | 1 | false | false | false | 1 |
## Qualitative Structure
The structure of a Bayesian Network indicates the causal relationships between events/variables. For example:
(1) Having $Q_4$ (question 4 being correct) correct contributes to/causes getting $E$ (Earned A)
(2) Both $P$ (Perception Error) and $R$ (Reported "A") lead to $O$ (Obersering "A")
(3) $C$ (clerical error) and $Q_3$ are irrelevant
(4) ...
A DAG $(V, E)$ consist a set of variables and a set of edges.
variables:
Representattion of the primitive propositions that we deem relevant to our domain
edges:
Representation of the information about the dependencies between variables, which can be understood as causal relations.
## Quantative Probabilities
In addition to the structural information, we also need quantative information about the "strength" of the relationships between events $\textit{i.e.}$ how strongly they are related. The strength can be
For example, this is a conditional probability table for the variable $E$.
![](https://i.imgur.com/0o57s63.png =300x120)
## Definition of Bayesian Network
A Bayesian Network for variables $Z$ is a pair $(G, \Theta)$, where:
(1) $G$ is a directed acyclic graph over variables $Z$, called the network $\textit{structure}$.
(2) $\Theta$ is a set of conditional probability tables, one for each variable in $Z$, called the network $\textit{parametrization}$.
## Independence Encoded in Bayesian Network
Bayesian network can represent different events and causal relationship between events, can they also encode which variables are independent from each other?
A Bayesian Network encodes the structure of state of knowledge. Propositions, events that are causally related are represented by an arrow between variables in the graph; those that are not related are elicited independent in the graph.
![](https://i.imgur.com/ZnaxMAm.png)
Obvious from this graph,
(1) $O$ is **dependent** on both $P$ and $R$, $\textit{i.e.}$ whether there is a perception error and whether "A" is reported affect the probability that "A" is observed.
(2) $J$ is **dependent** on $Q_4$, $\textit{i.e.}$ if question 4 is answered, Jack would confirm the answer.
(3) $Q_3$ and $Q_4$ are **independent**, of course.
(4) ...
What about other dependence/independence relations?
**Strategy 1:** calculate from the parameters $\theta$s
Two variables $x, y$ are independent if and only if $P(x\mid y) = P(x)$
Two variables $x, y$ are conditional independent given some other variable $z$ if and only if $P(x\mid y, z) = P(x\mid z)$
**Strategy 2:** See from the graph
Markovian Assumtion of DAG:
$$I(V, Parents(V), Non\_Descendants(V)) for\,all\, variables\,V\,in\,DAG$$
This is saying that for any variable $V$, given parents of $V$, $V$ is independent from all variables that are not its descendents.
![](https://i.imgur.com/8ugADiN.png =400x200)
Given $P$ and $R$, $O$ is independent of all other variables since $O$ does not have descendants.
![](https://i.imgur.com/vV9S0WC.png =400x300)
Given parents of $E$ (not shown in this fig), $E$ is independent of its non-descendants such as $C, P$ and etc.
> There are theorems proving that the results from the two strategies are equivalent! Bayesian Network can represent a complicated probabilistic distribution of a bunch of variables!
>
> **SO BAYESIAN NETWORK IS GOOD!**
# How to construct a Bayesian Network?
The construction of a Bayesian network involves three major steps:
1. pick a set of relevant variables and their possible a prior values
1. build the network structure adding arrows between the variables
1. define the conditional probability table for each network variable
## Picking variables
There are three types of variables: **query variables, evidence variables, intermediate variables**. Usually, query variables do not have parents, evidence variables do not have descendants. Intermediate variables have both parents and descendants.
Query variables:
![](https://i.imgur.com/a2bFHXK.png)
Evidence variables:
![](https://i.imgur.com/5WieD41.png =300x120)
![](https://i.imgur.com/nYmCOdR.png =250x120)
Intermediate variables:
![](https://i.imgur.com/wZ7YhSA.png =300x120)
![](https://i.imgur.com/gUsIBDr.png =300x120)
To determine query variables, we need to identify those events about which we need to ask questions.
To determine evidence variables, we need to identify those events about which we can collect information.
Intermediate levels are less strict, adding them when necessary and useful.
## Adding edges
Edges in Bayesian Networks represent the causal relationshipts between variables. For each variable $x$, to find its parents, we ask "what is the set of variables that we regard as tthe direct causes of $x$"? To find its descendants, we ask "what does $x$ cause or have influence on?"
This knowledge usually comes from expert knowledge.
## Constructing CPTs
CPTs fall into two categories:
1. CPTs for conditional probabilities such as Pr($x\mid y$)
1. CPTs for a prior single query variables such as Pr($y$)
For first category:
![](https://i.imgur.com/FYlnWFX.png =300x120)
For second category:
![](https://i.imgur.com/ttg1bdG.png)
CPT knowledge is much harder to build than other knowledge. It usually comes from collecting data.
# What is Bayesian Network used for? INFERENCE!
There are at least **four** general types of queries that can be posed with respect to Bayesian network:
1. probability of some specific set of variables
1. posteror query given observation
1. maximum a posteriori hypothesis (MAP)
1. most probable explanation (MPE)
## Evaluation of specific set of values of variables
This is one of the simplest queries that can be made. The computation is completely followed from probabilities and conditional probability tables given.
For example, let's ask what's the probability for an $A$ to be observed: Pr$(A = true)$.
![](https://i.imgur.com/gL3C9Y0.png)
> Capital letters are variables, non-capital letters are values, which is either $true$ or $false$ here.
Pr($A$)
= $\Sigma_{p, r, c, e, q_3, q_4, j}$Pr($A = true\mid P = p, R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j$)Pr($P = p, R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j$) (Marginalization)
= $\Sigma_{p, r, c, e, q_3, q_4, j}$Pr($A = true\mid P = p, R = r$)Pr($P = p, R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j$) (Markovian Assumption)
= $\Sigma_{p, r, c, e, q_3, q_4, j}$Pr($A = true\mid P = p, R = r$)Pr($P = p \mid R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j$)Pr($R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j$) (Chain rule)
= $\Sigma_{p, r, c, e, q_3, q_4, j}$Pr($A = true\mid P = p, R = r$)Pr($P$)Pr($R = r, C = c, E = e, Q_3 = q_3, Q_4 = q_4, J = j$) (Markovian Assumption)
= ... (using chain rule in probability theory)
Computation Tools: [SAMLAM](http://reasoning.cs.ucla.edu/samiam/)
## Posterior probabilities
The most common query in Bayesian network: when some evidence is observed, how the prior probabilities of other variables can be changed? **Bayes' Rule** can be used here.
Assume we know:
(1) a prior knowledge about $\theta$, which is a probabilistic distribution $P(\theta)$
(2) the "causal" relationship between $\theta$ and the occurrence of events: $P(x\mid\theta)$
$\Rightarrow P(\theta\mid x) = \frac{P(\theta)P(x\mid \theta)}{P(x)}$ where $P(x_i) = \Sigma_{\theta'}P(\theta')P(x\mid\theta)$
![](https://i.imgur.com/74YmjkQ.png)
For example, it is observed that $J$ is false, $\textit{i.e.}$ Jack didn't confirm the answer. Then what about the probability of $Q_4$ being true or false?
Based on Bayes' rule:
Pr($Q_4 = true\mid J = false$)
= Pr($Q_4 = true$)Pr($J = false\mid Q_4 = true)/$Pr$(J = false)$
= 0.5*0.20/0.405 (all the values are copied from a prior probability tables)
= 0.247
## Maximum a posteriori hypothesis (MAP)
Given what we know about some evidence $x$, what is most likely event to happen?
For example, if we saw that John confirmed the answer, $\textit{i.e.}$ $J$ is $true$, is $Q_4$ likely to be $true$ or not?
Formally, given the value $x$ of observation, we select a value $\theta'$ of $\theta$ that maximizes the posterior distribution $P(\theta\mid x)$, maximizing the posterior probability:
$$\theta' = argmax_{\theta'}P(\theta\mid x)$$
![](https://i.imgur.com/oJVndhY.png)
In our example:
$P(Q_4 = true\mid J = true)$
$= (P(Q_4= true)*P(J=true\mid Q_4 = true))/P(J=true)$
$= (P(Q_4= true)*P(J=true\mid Q_4 = true))/0.595$
$= (0.5*0.99)/1 = 0.832$
$P(Q_4 = false\mid J = true)$
$= (P(Q_4= false)*P(J=true\mid Q_4 = false))/P(J=true)$
$= (P(Q_4= false)*P(J=true\mid Q_4 = false))/0.595$
$= (0.2*0.99)/1 = 0.328$
Since 0.832 > 0.328, $Q_4$ is likely to be $true$.
So given that John confirmed the answer, we can confidently infer that question 4 should be answered right.
## Most probable explanation
Given the observation, what's the most likely explanation? The goal here is to identify the most probable instantiation of network variables given some evidence.
MPE is a sub-problem of MAP
MPE: given some evidence, ask for the joint probability of **all** other variables
MAP: given some evidence, ask for the probability of **some** other variables
> Given observed values for some variables, what is the configuration of the **whole Bayesian Network** that has the highest probability?
Formally, given observed evidence $x$, MPE refers to the instantiation of **all** other variables such that their **joint probability** is highest possible:
$$MPE(x) = argmax_{v_1,v_2...v_3}Pr(V_1=v_1, V_2=v_2...\mid E = x)$$
For example, if "A" is observed, $\textit{i.e.}$ $O = true$, how to assign the values for all other variables?
MPE($O=true$) = $argmax_{p, r, c, e, j, q_3, q_4}$Pr($P = p, R = r, C = c, E = e, J = j, Q_3 = q_3, Q_4 = q_4\mid A = true$)
> It's too hard to compute so ...