CS188 Notes Part 2
Matthew Tang · Last edited Dec 9, 2019
Lecture 13: 10/15/19
Uncertainty
Observed variables: Agent knows things about the state (sensor readings)
Unobserved variables: Agent reasons about other aspects (where an object is)
Random variables
Note that $+r$ means True, $-r$ means False
Probability distributions
Marginal distributions: sub-tables that eliminate variables
"Summing out" is marginalization over a variable
Conditional probabilities: $P(a \mid b) = \frac{P(a, b)}{P(b)}$
Select joint probabilities matching the evidence
Then normalize the section
Probabilistic Inference
Compute a desired probability from other known probabilities (ex. conditional from joint)
Probabilities change with new (given) evidence
Inference by enumeration
Evidence variables: $E_1, \dots, E_k$
Query variables: $Q$
Hidden variables: $H_1, \dots, H_r$
We want $P(Q \mid e_1, \dots, e_k)$
Steps
Select entries consistent with evidence
Sum out H to get joint of Query and evidence
Normalize (multiply by $\frac{1}{P(e_1, \dots, e_k)}$)
Time: $O(d^n)$
Space: $O(d^n)$
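A minimal Python sketch of these steps, using a tiny made-up joint table $P(Q, E, H)$ (variable names and numbers are illustrative only):

```python
# Inference by enumeration on a made-up joint P(Q, E, H):
# select entries consistent with evidence, sum out H, normalize.
joint = {
    ('+q', '+e', '+h'): 0.10, ('+q', '+e', '-h'): 0.20,
    ('+q', '-e', '+h'): 0.05, ('+q', '-e', '-h'): 0.15,
    ('-q', '+e', '+h'): 0.05, ('-q', '+e', '-h'): 0.15,
    ('-q', '-e', '+h'): 0.10, ('-q', '-e', '-h'): 0.20,
}

def query_given_evidence(joint, e_obs):
    unnormalized = {}
    for (q, e, h), p in joint.items():
        if e == e_obs:                                      # select consistent entries
            unnormalized[q] = unnormalized.get(q, 0.0) + p  # sum out H
    z = sum(unnormalized.values())                          # z = P(e_obs)
    return {q: p / z for q, p in unnormalized.items()}      # normalize

print(query_given_evidence(joint, '+e'))  # P(Q | +e) = {'+q': 0.6, '-q': 0.4}
```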
Sometimes we have a conditional distribution but want the joint
Product rule: $P(y)\,P(x \mid y) = P(x, y)$
Chain rule: $P(x_1, x_2, \dots, x_n) = \prod_i P(x_i \mid x_1, \dots, x_{i-1})$
Bayes' Rule: $P(x, y) = P(x \mid y)\,P(y) = P(y \mid x)\,P(x)$, so $P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$, where $P(y) = \sum_{x'} P(y \mid x')\,P(x')$
Lecture 14: 10/22/19
Bayes' Nets
Models are always simplifications of the world
Use probabilistic models to reason about unknown variables
Absolute/unconditional independence
If $X \perp\!\!\!\perp Y$, then $P(x, y) = P(x)\,P(y)$
Conditional independence
$\forall x, y, z: P(x, y \mid z) = P(x \mid z)\,P(y \mid z)$
Equivalently, $P(x \mid z, y) = P(x \mid z)$
Joint distribution tables are too large to explicitly represent
Hard to learn anything empirically about more than a few variables at a time
Bayes' Net: technique for describing complex joint distribution (model) using conditional probabilities
Graphical model, how variables locally interact
Graphical Model Notation
Bayes net = DAG + conditional probability table (CPT)
$P(X \mid a_1, \dots, a_n)$
Nodes: 1 per variable (with domains)
Can be assigned (observed) or unassigned (unobserved)
Arcs: interactions/direct influence between variables
Encode conditional independence
Absolute independence = no interactions between variables
Implicitly encode joint distributions
Product of local conditional distributions
$P(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid \text{parents}(X_i))$
Not every BN can represent every joint distribution
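A small sketch of the product-of-CPTs idea, using a hypothetical 3-node chain A → B → C with made-up CPT numbers:

```python
# Joint probability of one full assignment in a chain A -> B -> C,
# computed as the product of the local conditional tables.
P_A = {'+a': 0.6, '-a': 0.4}
P_B_given_A = {('+b', '+a'): 0.7, ('-b', '+a'): 0.3,
               ('+b', '-a'): 0.2, ('-b', '-a'): 0.8}
P_C_given_B = {('+c', '+b'): 0.9, ('-c', '+b'): 0.1,
               ('+c', '-b'): 0.5, ('-c', '-b'): 0.5}

def joint(a, b, c):
    # P(a, b, c) = P(a) * P(b | a) * P(c | b)
    return P_A[a] * P_B_given_A[(b, a)] * P_C_given_B[(c, b)]

print(joint('+a', '+b', '-c'))  # 0.6 * 0.7 * 0.1 = 0.042
```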
Lecture 15: 10/24/19
Inference by Enumeration in BN
Evidence variables: Evidence you have observed
Query variables: Probability you want
Hidden variables: Not observed
$P(B \mid {+j}, {+m}) \propto_B P(B, {+j}, {+m})$
$ = \sum\limits_{e,a}P(B, e, a, +j, +m)$
$ = \sum\limits_{e,a} P(B)P(e)P(a|B,e)P(+j|a)P(+m|a)$
Computing the joint distribution is time and memory intensive
Track intermediate tables called factors
Procedure: Join all factors, eliminate hidden variables, normalize
Join: $\forall r, t: P(r, t) = P(r)\,P(t \mid r)$
Eliminate/marginalize. Take hidden variables out
Projection operation
Variable Elimination
NP-hard in the worst case, but tends to be much faster in practice
No general efficient algorithm: 3-SAT can be encoded as BN inference
Inference by enumeration: $\sum_t \sum_r P(L \mid t)\,P(r)\,P(t \mid r)$
Variable elimination: $\sum_t P(L \mid t) \sum_r P(r)\,P(t \mid r)$
Interleave joins and elimination
Eliminate all hidden variables
Normalize at the end
Move summations as far right as possible
Elimination order matters
An ordering with small factors doesn't always exist
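A sketch of the enumeration-vs-elimination comparison above, on a hypothetical R → T → L chain (all numbers made up); eliminating R first builds a small factor over T instead of summing over the full joint:

```python
# R -> T -> L chain with made-up CPTs; query P(L).
P_R = {'+r': 0.1, '-r': 0.9}
P_T_given_R = {('+t', '+r'): 0.8, ('-t', '+r'): 0.2,
               ('+t', '-r'): 0.1, ('-t', '-r'): 0.9}
P_L_given_T = {('+l', '+t'): 0.3, ('-l', '+t'): 0.7,
               ('+l', '-t'): 0.1, ('-l', '-t'): 0.9}

# Inference by enumeration: sum over every full assignment.
P_L_enum = {l: sum(P_L_given_T[(l, t)] * P_R[r] * P_T_given_R[(t, r)]
                   for t in ('+t', '-t') for r in ('+r', '-r'))
            for l in ('+l', '-l')}

# Variable elimination: eliminate R first into a factor f(T), then sum over T.
f_T = {t: sum(P_R[r] * P_T_given_R[(t, r)] for r in ('+r', '-r'))
       for t in ('+t', '-t')}
P_L_elim = {l: sum(P_L_given_T[(l, t)] * f_T[t] for t in ('+t', '-t'))
            for l in ('+l', '-l')}

print(P_L_enum)   # {'+l': 0.134, '-l': 0.866}
print(P_L_elim)   # same distribution, built from smaller intermediate factors
```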
Lecture 16: 10/29/19
Independence in a BN
Yes: Prove using algebra
No: Give a counterexample
Typically look for independence relations from the graph structure (the numbers could just work out that way)
Independence Properties
Study independence properties for triples
For 1 variable: never independent of itself
For 2 variables: just check whether there is an edge between them
For 3+ variables: analysis reduces to the canonical triples below
Causal Chains
$P(x, y, z) = P(x)\,P(y \mid x)\,P(z \mid y)$
Not guaranteed: $X \perp\!\!\!\perp Z$
Guaranteed: $X \perp\!\!\!\perp Z \mid Y$
Common Cause
$P(x, y, z) = P(y)\,P(x \mid y)\,P(z \mid y)$
Not guaranteed: $X \perp\!\!\!\perp Z$
Guaranteed: $X \perp\!\!\!\perp Z \mid Y$
Common Effect
$P(x, y, z) = P(x)\,P(y)\,P(z \mid x, y)$
Guaranteed: $X \perp\!\!\!\perp Y$
NOT guaranteed: $X \perp\!\!\!\perp Y \mid Z$
Observing an effect activates influence between causes
D-Separation
Reachability is not enough (fails for common effects)
$X \perp\!\!\!\perp Y \mid Z$ if X and Y are d-separated by Z
Consider all undirected paths from X to Y
No active paths = independent
Active path if each triple is active
Causal chain: A->B->C where B is unobserved
Common cause: A <- B -> C where B is unobserved
Common effect: A -> B <- C where B or one of its descendants is OBSERVED
A single inactive segment can block a path
If 1 or more path is active, independence not guaranteed
If all paths are inactive, independence guaranteed
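A tiny Python sketch of the triple rules above; the triple types and observation flags are supplied by hand here rather than derived from a graph (illustrative only):

```python
# Is a single triple X - Y - Z active, given whether the middle node
# (or, for common effect, one of its descendants) is observed?
def triple_active(kind, middle_observed):
    if kind in ('causal_chain', 'common_cause'):
        return not middle_observed      # blocked when the middle node is observed
    if kind == 'common_effect':
        return middle_observed          # active only when effect/descendant is observed
    raise ValueError(kind)

# A path is active only if every triple along it is active;
# X and Y are d-separated (independence guaranteed) if no path is active.
def path_active(triples):
    return all(triple_active(kind, obs) for kind, obs in triples)

print(path_active([('causal_chain', False), ('common_effect', True)]))  # True: active path
```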
Topology Limits Distributions
A fully connected network can encode any distribution
Lecture 17: 10/31/19
Sampling
Repeated simulations
Draw N samples from a sampling distribution S
Compute an approximate posterior probability which converges to true probability P
Get a sample $u$ from the uniform distribution over $[0, 1)$
Divide $[0, 1)$ into subintervals whose sizes equal the outcome probabilities
Prior Sampling
Run through nodes in topological order
Sample $x_i$ from $P(X_i \mid \text{Parents}(X_i))$
Return $(x_1, x_2, \dots, x_n)$
Sampling procedure is consistent
If we do procedure long enough, it converges
Monte Carlo Estimation: Estimate expected value of f(X)
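A small prior-sampling sketch on a hypothetical two-node net A → B (made-up CPTs): sample each node from its CPT in topological order, then use the samples as a Monte Carlo estimate.

```python
import random

# Prior sampling for a 2-node net A -> B with made-up CPTs.
P_A = {'+a': 0.6, '-a': 0.4}
P_B_given_A = {'+a': {'+b': 0.7, '-b': 0.3},
               '-a': {'+b': 0.2, '-b': 0.8}}

def sample_from(dist):
    u, cum = random.random(), 0.0          # u ~ Uniform[0, 1)
    for value, p in dist.items():          # subintervals sized by probability
        cum += p
        if u < cum:
            return value
    return value

def prior_sample():
    a = sample_from(P_A)                   # parents before children
    b = sample_from(P_B_given_A[a])
    return a, b

# Monte Carlo estimate of P(+b): fraction of samples with B = +b.
samples = [prior_sample() for _ in range(10000)]
print(sum(b == '+b' for _, b in samples) / len(samples))  # ≈ 0.6*0.7 + 0.4*0.2 = 0.50
```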
Rejection Sampling
Only consider samples that match evidence
Input: evidence instantiation
Run through nodes in topological order
Sample $x_i$ from $P(X_i \mid \text{Parents}(X_i))$
If $x_i$ is not consistent with the evidence, reject the sample
Return $(x_1, x_2, \dots, x_n)$
Sampling procedure is consistent
Problem: If evidence is unlikely, reject lots of samples
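Rejection sampling only adds a consistency check; a sketch reusing `prior_sample` and the made-up CPTs from the prior-sampling sketch above:

```python
# Rejection sampling: estimate P(A | B = +b) by keeping only samples with B = +b.
def rejection_samples(n, b_evidence='+b'):
    kept = []
    for _ in range(n):
        a, b = prior_sample()
        if b == b_evidence:                 # reject samples inconsistent with evidence
            kept.append(a)
    return kept

kept = rejection_samples(10000)
print(sum(a == '+a' for a in kept) / len(kept))  # ≈ P(+a | +b) = 0.42 / 0.50 = 0.84
```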
Likelihood Weighting
Fix evidence variables, sample the rest
Problem: Sample distribution is not consistent
Solution: Weight by probability of evidence given parents
Algorithm
Initialize the weight $w = 1$
Run through nodes in topological order
If $X_i$ is an evidence variable: set $x_i$ to its observed value, and set $w = w \cdot P(x_i \mid \text{Parents}(X_i))$
Else: sample $x_i$ from $P(X_i \mid \text{Parents}(X_i))$
Return $(x_1, x_2, \dots, x_n), w$
Problem: Evidence influences choice of downstream variables, but not upstream variables
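A likelihood-weighting sketch, again reusing `sample_from` and the made-up A → B net from the prior-sampling sketch: the evidence variable is fixed and contributes to the weight, the other variable is sampled.

```python
# Likelihood weighting: estimate P(A | B = +b) without rejecting any samples.
def weighted_sample(b_evidence='+b'):
    w = 1.0
    a = sample_from(P_A)                   # A is not evidence: sample it
    b = b_evidence                         # B is evidence: fix it...
    w *= P_B_given_A[a][b]                 # ...and multiply in P(b | parents)
    return (a, b), w

def lw_estimate(n, b_evidence='+b'):
    num = den = 0.0
    for _ in range(n):
        (a, _), w = weighted_sample(b_evidence)
        num += w * (a == '+a')
        den += w
    return num / den

print(lw_estimate(10000))                  # ≈ 0.84, same target as rejection sampling
```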
Gibbs Sampling
Fix evidence variables
Start with random instantiation.
Resample X from P(X | rest of variables)
Sample 1 variable at a time conditioned on the rest
Idea: No hidden variables
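A self-contained Gibbs sketch on a hypothetical chain A → B → C with evidence C = +c (made-up CPTs; burn-in is ignored for brevity): each non-evidence variable is resampled in turn, conditioned on all the others.

```python
import random

# Gibbs sampling on a chain A -> B -> C with evidence C = +c.
P_A = {'+a': 0.6, '-a': 0.4}
P_B = {'+a': {'+b': 0.7, '-b': 0.3}, '-a': {'+b': 0.2, '-b': 0.8}}  # P(B | A)
P_C = {'+b': {'+c': 0.9, '-c': 0.1}, '-b': {'+c': 0.5, '-c': 0.5}}  # P(C | B)

def normalized_sample(weights):
    z = sum(weights.values())
    u, cum = random.random() * z, 0.0
    for value, w in weights.items():
        cum += w
        if u < cum:
            return value
    return value

a, b, c = '+a', '+b', '+c'                 # arbitrary start; C is fixed evidence
counts = {'+a': 0, '-a': 0}
for _ in range(20000):
    # Resample each non-evidence variable conditioned on all the others.
    a = normalized_sample({x: P_A[x] * P_B[x][b] for x in P_A})              # P(A | b)
    b = normalized_sample({y: P_B[a][y] * P_C[y][c] for y in ('+b', '-b')})  # P(B | a, +c)
    counts[a] += 1
print(counts['+a'] / sum(counts.values()))  # ≈ P(+a | +c) ≈ 0.67
```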
Lecture 18: 11/5/19
Decision Networks
Maximize Expected Utility (MEU)
Bayes net with nodes for utility
Calculate expected utility for each action
Chance nodes, action nodes, utility nodes
Procedure
Instantiate all evidence
Set the action node(s) each possible way
Calculate posteriors for the parents of the utility node
Calculate expected utility for each action
Choose maximizing action
Value of information = expected gain in MEU with new info
When you get evidence, your MEU might be less
But now you can make the correct decision knowing that life sucks
Value of Perfect Information
Ex: $P(F{=}\text{bad})\,\text{MEU}(F{=}\text{bad}) + P(F{=}\text{good})\,\text{MEU}(F{=}\text{good}) - \text{MEU}(\varnothing)$
$VPI(E' \mid e) = \left(\sum_{e'} P(e' \mid e)\,\text{MEU}(e, e')\right) - \text{MEU}(e)$
$\text{MEU}(e) = \max_a \sum_s P(s \mid e)\,U(s, a)$
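A tiny worked sketch of MEU and VPI with made-up numbers (two states, two actions): compute MEU with no evidence, then the expected MEU if the state were observed perfectly.

```python
# Does observing the weather help? All numbers are made up.
P_S = {'sun': 0.7, 'rain': 0.3}                        # P(s), no evidence
U = {('sun', 'out'): 100, ('sun', 'in'): 70,
     ('rain', 'out'): 0,  ('rain', 'in'): 70}          # U(s, a)

def meu(p_s):
    # MEU = max over actions of expected utility under belief p_s
    return max(sum(p_s[s] * U[(s, a)] for s in p_s) for a in ('out', 'in'))

meu_none = meu(P_S)                                     # max(70, 70) = 70
# With the state observed: MEU for each observation, weighted by its probability.
meu_observed = sum(P_S[s] * meu({'sun': float(s == 'sun'), 'rain': float(s == 'rain')})
                   for s in P_S)                        # 0.7*100 + 0.3*70 = 91
print(meu_observed - meu_none)                          # VPI = 21 >= 0
```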
Properties
VPI is always nonnegative
VPI is nonadditive
$VPI(E_j, E_k \mid e) \neq VPI(E_j \mid e) + VPI(E_k \mid e)$
VPI is order independent
$VPI(E_j, E_k \mid e) = VPI(E_j \mid e) + VPI(E_k \mid e, E_j)$ in any order
Lecture 19: 11/7/19
POMDPs
Observation function
$O(s, o) = P(o \mid s)$
POMDPs are MDPs over belief states b
One way to solve: Use truncated expectimax for approximate values of actions
VPI based agent
Hidden Markov Models
Reasoning about a sequence of observations over time
Value of X at a given time is called the state
Parameters: transition probabilities/dynamics show how state evolves over time
Stationary assumption: transition probabilities the same at all times
Initial distribution
$P(X_1)$
Transitions: $P(X_t \mid X_{t-1})$
Emissions: $P(E_t \mid X_t)$
Hidden Markov process: the future depends on the past via the present
Current observation is independent of everything else given current state
Mini-Forward Algorithm
Find $P(X_t)$ for some day $t$
$P(x_t) = \sum_{x_{t-1}} P(x_t \mid x_{t-1})\,P(x_{t-1})$
Can converge as $t \to \infty$
Stationary distribution: $P_\infty = P_{\infty + 1}$
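A mini-forward sketch with a made-up weather transition model: repeatedly push the distribution through the transition probabilities until it stops changing.

```python
# Mini-forward algorithm: P(x_t) = sum_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1}).
T = {'sun': {'sun': 0.9, 'rain': 0.1},        # made-up P(X_t | X_{t-1})
     'rain': {'sun': 0.3, 'rain': 0.7}}
p = {'sun': 1.0, 'rain': 0.0}                 # P(X_1)

for _ in range(100):                          # iterate toward the stationary distribution
    p = {x: sum(T[prev][x] * p[prev] for prev in p) for x in p}
print(p)                                      # ≈ {'sun': 0.75, 'rain': 0.25}
```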
Lecture 20: 11/12/19
Localization: any single sensor reading could be wrong, so compute a probability distribution over where the robot can be
$B(X_t) = P(X_t \mid e_{1:t})$
Inference (Passage of time)
Current belief: $B(X_t) = P(X_t \mid e_{1:t})$
After 1 time step: $P(X_{t+1} \mid e_{1:t})$
Prediction of new time step from old evidence
$B'(X_{t+1}) = \sum_{x_t} P(X_{t+1} \mid x_t)\,B(x_t)$
No $e_{t+1}$ yet
Observation
We have current belief $B'(X_{t+1}) = \sum_{x_t} P(X_{t+1} \mid x_t)\,B(x_t)$
New evidence comes in: $P(X_{t+1} \mid e_{1:t+1}) \propto P(X_{t+1}, e_{t+1} \mid e_{1:t}) = P(e_{t+1} \mid e_{1:t}, X_{t+1})\,P(X_{t+1} \mid e_{1:t})$
Or compactly, $P(X_{t+1} \mid e_{1:t+1}) \propto P(e_{t+1} \mid X_{t+1})\,B'(X_{t+1})$
Must renormalize
Forward Algorithm
$P(x_t \mid e_{1:t}) \propto P(x_t, e_{1:t}) = P(e_t \mid x_t) \sum_{x_{t-1}} P(x_t \mid x_{t-1})\,P(x_{t-1}, e_{1:t-1})$
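A forward-algorithm sketch combining the elapse-time and observation updates, with made-up transition and emission models and renormalization at each step:

```python
# Forward algorithm: elapse time, weight by evidence likelihood, renormalize.
T = {'sun': {'sun': 0.9, 'rain': 0.1},          # P(X_t | X_{t-1})
     'rain': {'sun': 0.3, 'rain': 0.7}}
E = {'sun': {'umbrella': 0.2, 'none': 0.8},     # P(e_t | X_t)
     'rain': {'umbrella': 0.9, 'none': 0.1}}

def forward(belief, evidence):
    # Passage of time: B'(X_{t+1}) = sum_x P(X_{t+1} | x) B(x)
    predicted = {x: sum(T[prev][x] * belief[prev] for prev in belief) for x in belief}
    # Observation: B(X_{t+1}) ∝ P(e | X_{t+1}) B'(X_{t+1}), then renormalize
    unnormalized = {x: E[x][evidence] * predicted[x] for x in predicted}
    z = sum(unnormalized.values())
    return {x: p / z for x, p in unnormalized.items()}

belief = {'sun': 0.5, 'rain': 0.5}
for e in ['umbrella', 'umbrella', 'none']:
    belief = forward(belief, e)
print(belief)                                    # posterior P(X_t | e_{1:t})
```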
Particle Filtering
Just use sampling instead of dealing with exact beliefs
Good if $|X|$ is too big to store in memory (e.g. if $X$ is continuous)
Track samples of X, not all values
Time per step is linear in the number of samples
Generally, the number of particles $N \ll |X|$
$x' = \text{sample}(P(X' \mid x))$
Captures passage of time
Observation:
Compute a weight for each particle of how likely is based on evidence
$B(X) \propto P(e \mid X)\,B'(X)$
Resample:
Resample particles to go back to all weight 1 particles
Resample N times from weighted sample distribution (equivalent to renormalizing)
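A particle-filtering sketch on the same hypothetical weather HMM used in the forward-algorithm sketch: sample the transition for each particle, weight by the evidence, and resample.

```python
import random

# Particle filter: elapse time, weight by evidence, resample (made-up HMM).
T = {'sun': {'sun': 0.9, 'rain': 0.1}, 'rain': {'sun': 0.3, 'rain': 0.7}}
E = {'sun': {'umbrella': 0.2, 'none': 0.8}, 'rain': {'umbrella': 0.9, 'none': 0.1}}

def particle_filter_step(particles, evidence):
    # Elapse time: x' = sample(P(X' | x)) for each particle.
    moved = [random.choices(list(T[x]), weights=list(T[x].values()))[0] for x in particles]
    # Observe: weight each particle by P(e | x'), then resample N particles by weight.
    weights = [E[x][evidence] for x in moved]
    return random.choices(moved, weights=weights, k=len(particles))

particles = ['sun'] * 50 + ['rain'] * 50
for e in ['umbrella', 'umbrella']:
    particles = particle_filter_step(particles, e)
print(particles.count('rain') / len(particles))   # approximate P(rain | e_{1:2})
```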
Dynamic Bayes Nets (DBNs)
Track multiple variables over time using multiple evidence
Idea: Repeat a fixed Bayes net structure at each time
Lecture 21: 11/14/19
Classification
Spam vs ham prediction of new future emails
Features:
Text Patterns, words
Other features unrelated to email content: Sender in contacts
Ex. Digit recognition
Get a large collection of example images, labeled
Want to learn to predict labels of new, future digit images
Features: pixels, num components, aspect ratio, numloops
Model based classification
Naive Bayes: assume all features are independent given the label
Inference: compute the joint probability distribution
Count the probability that each pixel is on for that label
Bag of words model: features $W_i$ = word at position $i$
Insensitive to word order or reordering
Weight each word along the way to give a total score for spam vs ham
Training vs Test
Training set, Held out set, test data
Use test data to get a sense of accuracy in the real world
Features: attribute-value pairs to characterize each x
Never peek at the test set
Evaluation:
Accuracy: fraction of instances predicted correctly
Don't want false positives
Friend classified as a gorilla
Overfitting:
When you try to maximize accuracy on training data
Odds ratios blow up to infinity for features seen with only one label
Parameter Estimation
Estimating the distribution of a RV
Elicitation: ask a human
Empirically: use training data (learning)
Maximum Likelihood Estimation
Data: Observed set D of heads and tails
Hypothesis space: binomial distributions
Learning: finding $\theta$ is an optimization problem
MLE: $\hat{\theta} = \arg\max_\theta \ln\left[\theta^{\alpha_H}(1-\theta)^{\alpha_T}\right]$
$ = \frac{\alpha_H}{\alpha_H + \alpha_T}$
Another option is to consider the most likely parameter value given the data
Laplace Smoothing
Pretend that you saw every outcome once more than you actually did
Avoids the division by 0 problem
Extended Laplace estimate: pretend we saw every outcome $k$ extra times
Tuning on held out data
Learn parameters from training data
Tune hyperparameters on different data
For each value of hyperparameters, train and test on held out data
Choose the best value, do a final test on test data
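A sketch of maximum likelihood vs. Laplace-k estimates from counts (counts are made up):

```python
from collections import Counter

# Estimate P(word | spam) from counts, with and without Laplace smoothing.
counts = Counter({'free': 3, 'money': 2, 'hello': 0})   # made-up counts, 5 tokens total
vocab = list(counts)

def mle(word):
    return counts[word] / sum(counts.values())

def laplace(word, k=1):
    # Pretend every outcome was seen k extra times: avoids zero probabilities.
    return (counts[word] + k) / (sum(counts.values()) + k * len(vocab))

print(mle('hello'), laplace('hello'))   # 0.0 vs 1/8 = 0.125
```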
Lecture 22: 11/19/19
Reinforcement Learning vs Supervised Learning
RL finds policy - mapping from states to actions
Perceptron
Inputs are feature vectors
Each feature has a weight, sum is the activation
Activation(x) = $w \cdot f(x)$
Positive activation outputs +1, negative activation outputs -1
Weight updates
Start with weights = 0
For each training instance:
Classify with current weights
If correct, don't do anything. If wrong, adjust the weight vector: $w = w + y^* \cdot f(x)$
i.e., add/subtract the feature vector
Multi-Class Perceptron
For multiclass, there is a weight vector for each class
Score of class $y$: $w_y \cdot f(x)$
Prediction: highest score wins, $y = \arg\max_y w_y \cdot f(x)$
Algorithm:
Start with weights all 0
Pick up training examples 1 by 1
Predict with current weights
If wrong: $w_y = w_y - f(x)$, $w_{y^*} = w_{y^*} + f(x)$
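A multi-class perceptron sketch with made-up feature vectors and labels:

```python
import numpy as np

# Multi-class perceptron: one weight vector per class (toy data).
X = np.array([[1.0, 2.0], [2.0, 0.5], [0.2, 0.1]])   # feature vectors f(x)
Y = np.array([0, 1, 1])                               # true classes y*
w = np.zeros((2, X.shape[1]))                         # one weight vector per class

for _ in range(10):                                   # a few passes over the data
    for f, y_true in zip(X, Y):
        y_pred = int(np.argmax(w @ f))                # highest score wins
        if y_pred != y_true:                          # on a mistake:
            w[y_pred] -= f                            #   lower the wrong class
            w[y_true] += f                            #   raise the right class
print(w)
```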
Properties
Separability: true if some parameters get the training set perfectly correct
Convergence: if the training is separable, perceptron will eventually converge (binary case)
Mistake bound: $\frac{k}{\delta^2}$
Problems
If data is not separable, weights might thrash
Averaged perceptron can help
Mediocre generalization: often finds a solution that barely separates the data
Overtraining/overfitting
Non-separable case: use probabilistic decision
Very positive score, probability close to 1
Sigmoid function: $\phi(z) = \frac{1}{1 + e^{-z}}$
MLE
$\max_w\, ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}, w)$
Lecture 23: 11/21/19
Gradient Ascent
Perform update in uphill direction for each coordinate
For $g(w_1, w_2)$: $w_1 \leftarrow w_1 + \alpha \frac{\partial g}{\partial w_1}(w_1, w_2)$
$w \leftarrow w + \alpha \nabla_w g(w)$
$g(w) = \sum_i \log P(y^{(i)} \mid x^{(i)}, w)$
Keep doing this until the update changes $w$ by only around 0.1-1%
Can't make $\alpha$ too big
Mini-batch gradient ascent: compute gradients for points in parallel
Stop when log likelihood of hold-out data begins to decrease
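A batch gradient-ascent sketch maximizing the log likelihood of a logistic regression model (toy data; the step size and iteration count are made up):

```python
import numpy as np

# Gradient ascent on ll(w) = sum_i log P(y_i | x_i, w) for logistic regression.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, 0, 0])                      # toy labels
w = np.zeros(X.shape[1])
alpha = 0.1                                     # step size: not too big

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(1000):
    p = sigmoid(X @ w)                          # P(y = 1 | x, w)
    gradient = X.T @ (y - p)                    # gradient of the log likelihood
    w = w + alpha * gradient                    # move uphill
print(w, sigmoid(X @ w))
```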
Neural Networks
2 layer neural network can approximate any continuous function
Last layer = still logistic regression
More layers before this last layer
Features are learned rather than hand-designed
Lecture 24: 12/3/19
Deep Neural Networks
Softmax: $P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}$
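A small softmax sketch turning class scores $z$ into probabilities:

```python
import numpy as np

def softmax(z):
    # P(y_k | x; w) = e^{z_k} / sum_j e^{z_j}; subtract max(z) for numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1
```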
Deep neural network: $\max_w\, ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$
w is just a much larger vector
Run gradient ascent, stop when log likelihood of hold out data starts to decrease
Covariate shift: correlations present in the training data but not in the test data
Inverse RL
Usually we try to find an optimal policy
Inverse planning: we observe (near-)optimal behavior and try to recover the reward function that explains the demonstrated behavior
Reward engineering is hard, try demonstrating behavior and extract reward function
Aka inverse planning, inverse optimal control
More causal model of learning
IRL Formulation
Given $\xi_D$, find $R(s, a) = \theta^T \phi(s, a)$ such that $R(\xi_D) \geq \max_\xi R(\xi)$
But 0/constant reward ends up being a solution
Maximum Margin Planning
$R(\xi_D) \geq \max_\xi \left[ R(\xi) + l(\xi, \xi_D) \right]$
Search for a reward function giving a bonus to trajectories farther from the demonstration
Makes it a little more difficult to pick the demonstration
$\max_\theta \left[ \theta^T \phi(\xi_D) - \max_\xi \left[ \theta^T \phi(\xi) + l(\xi, \xi_D) \right] \right]$
$\xi^*_\theta$ is the arg max of the inner maximization
Taking the gradient is hard because there's a nested max
Take a subgradient: Pick different directions
$\nabla_\theta = \phi(\xi_D) - \phi(\xi^*_\theta)$
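A sketch of subgradient ascent for max margin planning over a tiny made-up set of trajectories (feature counts, losses, and the step size are all invented for illustration; a real implementation would get $\xi^*_\theta$ from a loss-augmented planner):

```python
import numpy as np

# One run of subgradient ascent on theta^T phi(xi_D) - max_xi [theta^T phi(xi) + l(xi, xi_D)].
phi = {'demo': np.array([1.0, 0.0]),             # feature counts phi(xi)
       'alt1': np.array([0.5, 1.0]),
       'alt2': np.array([0.0, 0.5])}
loss = {'demo': 0.0, 'alt1': 1.0, 'alt2': 2.0}   # l(xi, xi_D): distance from the demo
theta = np.zeros(2)

for _ in range(50):
    # xi*_theta: arg max over trajectories of theta^T phi(xi) + l(xi, xi_D)
    best = max(phi, key=lambda xi: theta @ phi[xi] + loss[xi])
    # Subgradient of the objective: phi(xi_D) - phi(xi*_theta)
    theta = theta + 0.1 * (phi['demo'] - phi[best])
print(theta)   # reward weights under which the demo (approximately) wins by the margin
```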
What if the demonstrator is not optimal?
Bayesian view: $P(\xi_D \mid \theta)$; run inference on $\theta$
Use softmax: $P(\xi_D \mid \theta) \propto e^{\beta\, \theta^T \phi(\xi_D)}$
Bayesian belief update using demonstration as evidence
Maximum Entropy IRL just uses the maximum likelihood estimate