# CS188 Notes Part 2

# Lecture 13: 10/15/19

## Uncertainty
* Observed variables: agent knows things about the state (sensor readings)
* Unobserved variables: agent reasons about other aspects (where an object is)
* Random variables
    * Note that $+r$ means True, $-r$ means False
* Probability distributions
    * Marginal distributions: sub-tables that eliminate variables
    * "Summing out" is marginalization over a variable
* Conditional probabilities
    * $P(a|b) = \frac{P(a,b)}{P(b)}$
    * Select the joint probabilities matching the evidence
    * Then normalize the selection

## Probabilistic Inference
* Compute a desired probability from other known probabilities (e.g., a conditional from the joint)
* Probabilities change with new (given) evidence
* Inference by enumeration
    * Evidence variables: $E_1, \ldots, E_k$
    * Query variable: $Q$
    * Hidden variables: $H_1, \ldots, H_r$
    * We want $P(Q | e_1, \ldots, e_k)$
    * Steps (sketched in code below)
        * 1) Select the entries consistent with the evidence
        * 2) Sum out $H$ to get the joint of the query and evidence
        * 3) Normalize (multiply by $\frac{1}{P(e_1, \ldots, e_k)}$)
    * Time: $O(d^n)$
    * Space: $O(d^n)$
* Sometimes we have a conditional distribution but want the joint
    * Product rule: $P(y)P(x|y) = P(x,y)$
    * Chain rule: $P(x_1, x_2, \ldots, x_n) = \prod_i P(x_i | x_1, \ldots, x_{i-1})$
* Bayes' Rule: $P(x,y) = P(x|y)P(y) = P(y|x)P(x)$
    * So $P(x|y) = \frac{P(y|x)P(x)}{P(y)}$
    * $P(y) = \sum\limits_{x'} P(y|x')P(x')$

# Lecture 14: 10/22/19

## Bayes' Nets
* Models are always simplifications of the world
* Use probabilistic models to reason about unknown variables
* Absolute/unconditional independence
    * If $X \perp\!\!\!\perp Y$, then $P(x,y) = P(x)P(y)$
* Conditional independence
    * $\forall x,y,z: P(x, y | z) = P(x | z) P(y | z)$
    * Equivalently, $P(x|z, y) = P(x|z)$
* Joint distribution tables are too large to represent explicitly
    * Hard to learn anything empirically about more than a few variables at a time
* Bayes' net: a technique for describing a complex joint distribution (model) using local conditional probabilities
    * A graphical model of how variables locally interact

### Graphical Model Notation
* Bayes net = DAG + conditional probability tables (CPTs)
    * $P(X | a_1, \ldots, a_n)$
* Nodes: 1 per variable (with domains)
    * Can be assigned (observed) or unassigned (unobserved)
* Arcs: interactions/direct influence between variables
    * Encode conditional independence
    * Absolute independence = no interactions between variables
* A BN implicitly encodes a joint distribution
    * As a product of local conditional distributions
    * $P(x_1, x_2, \ldots, x_n) = \prod^{n}_{i=1} P(x_i | parents(X_i))$
* Not every BN can represent every joint distribution

# Lecture 15: 10/24/19

## Inference by Enumeration in BN
* Evidence variables: evidence you have observed
* Query variables: the probability you want
* Hidden variables: not observed
* $P(B | +j, +m) \propto_B P(B, +j, +m)$
    * $= \sum\limits_{e,a} P(B, e, a, +j, +m)$
    * $= \sum\limits_{e,a} P(B)P(e)P(a|B,e)P(+j|a)P(+m|a)$
* Computing the full joint distribution is time and memory intensive
* Track intermediate tables called factors
* Procedure: join all factors, eliminate hidden variables, normalize
    * 1) Join: $\forall r,t: P(r,t) = P(r)P(t|r)$
    * 2) Eliminate/marginalize: sum the hidden variables out
        * A projection operation
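The select/sum-out/normalize steps above translate directly into a few lines of code. Below is a minimal sketch over a hypothetical joint table with a query variable Q, a hidden variable H, and an evidence variable E; the table values are made up for illustration.

```python
# A minimal sketch of inference by enumeration (select, sum out, normalize)
# over a hypothetical joint table P(Q, H, E); the values are invented.
joint = {
    # (q, h, e): P(q, h, e)
    ('+', '+', '+'): 0.10, ('+', '+', '-'): 0.05,
    ('+', '-', '+'): 0.20, ('+', '-', '-'): 0.05,
    ('-', '+', '+'): 0.15, ('-', '+', '-'): 0.10,
    ('-', '-', '+'): 0.05, ('-', '-', '-'): 0.30,
}

def infer(joint, e_obs):
    """P(Q | E=e_obs): 1) select consistent rows, 2) sum out H, 3) normalize."""
    summed = {}
    for (q, h, e), p in joint.items():
        if e == e_obs:                              # 1) consistent with evidence
            summed[q] = summed.get(q, 0.0) + p      # 2) sum out H
    z = sum(summed.values())                        # z = P(E=e_obs)
    return {q: p / z for q, p in summed.items()}    # 3) normalize

print(infer(joint, '+'))    # {'+': 0.6, '-': 0.4}
```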
## Variable Elimination
* Exact inference is NP-hard in general (it can encode 3-SAT), but variable elimination tends to be much faster than enumeration in practice
* Inference by enumeration: $\sum\limits_t \sum\limits_r P(L|t)P(r)P(t|r)$
* Variable elimination: $\sum\limits_t P(L|t) \sum\limits_r P(r)P(t|r)$
* Interleave joins and eliminations
    * Eliminate all hidden variables
    * Normalize at the end
* Move summations as far right as possible
* Elimination order matters
    * An ordering with small factors doesn't always exist

# Lecture 16: 10/29/19

* Independence in a BN
    * Yes: prove it using algebra
    * No: give a counterexample
    * Typically look for independence relations from the graph structure (the numbers could just happen to work out that way)

## Independence Properties
* Study independence properties for triples
    * For 1 variable: never independent of itself
    * For 2: just look at whether there is an edge
    * For 3+: can be reduced to triples of these canonical chains
* 1) Causal chain
    * $P(x,y,z) = P(x)P(y|x)P(z|y)$
    * Not guaranteed: $X \perp\!\!\!\perp Z$
    * Guaranteed: $X \perp\!\!\!\perp Z \mid Y$
* 2) Common cause
    * $P(x,y,z) = P(y)P(x|y)P(z|y)$
    * Not guaranteed: $X \perp\!\!\!\perp Z$
    * Guaranteed: $X \perp\!\!\!\perp Z \mid Y$
* 3) Common effect
    * $P(x,y,z) = P(x)P(y)P(z|x,y)$
    * Guaranteed: $X \perp\!\!\!\perp Z$
    * NOT guaranteed: $X \perp\!\!\!\perp Z \mid Y$
    * Observing an effect activates influence between its causes

## D-Separation
* Reachability is not enough (it fails for common effects)
* $X \perp\!\!\!\perp Y \mid Z$ if X and Y are d-separated by Z
* Consider all undirected paths from X to Y
    * No active paths = independence guaranteed
    * A path is active if each triple along it is active
        * Causal chain: $A \to B \to C$ where B is unobserved
        * Common cause: $A \leftarrow B \to C$ where B is unobserved
        * Common effect: $A \to B \leftarrow C$ where B or one of its descendants is OBSERVED
    * A single inactive segment blocks a path
* If 1 or more paths are active, independence is not guaranteed
* If all paths are inactive, independence is guaranteed

## Topology Limits Distributions
* A fully connected BN can encode any distribution

# Lecture 17: 10/31/19

## Sampling
* Repeated simulation
* Draw N samples from a sampling distribution S
* Compute an approximate posterior probability, which converges to the true probability P
* 1) Get a sample u from the uniform distribution on $[0, 1)$
* 2) Divide the interval into subintervals whose sizes equal the outcome probabilities; the subinterval containing u gives the sample

## Prior Sampling
* 1) Run through the nodes in topological order
    * Sample $x_i$ from $P(X_i \mid Parents(X_i))$
* 2) Return $(x_1, x_2, \ldots, x_n)$
* The sampling procedure is consistent
    * If we run the procedure long enough, it converges
* Monte Carlo estimation: estimate the expected value of $f(X)$

## Rejection Sampling
* Only consider samples that match the evidence
* Input: an evidence instantiation
* 1) Run through the nodes in topological order
    * Sample $x_i$ from $P(X_i \mid Parents(X_i))$
* 2) If $x_i$ is not consistent with the evidence, reject the sample
* 3) Return $(x_1, x_2, \ldots, x_n)$
* The sampling procedure is consistent
* Problem: if the evidence is unlikely, we reject lots of samples

## Likelihood Weighting
* Fix the evidence variables, sample the rest
* Problem: the sample distribution is then not consistent
* Solution: weight each sample by the probability of the evidence given its parents
* Algorithm (see the sketch below)
    * 1) Initialize the weight w to 1
    * 2) Run through the nodes in topological order
        * If $X_i$ is an evidence variable:
            * Set $x_i$ to the observed value of $X_i$
            * Set $w = w \cdot P(x_i \mid Parents(X_i))$
        * Else: sample $x_i$ from $P(X_i \mid Parents(X_i))$
    * 3) Return $(x_1, x_2, \ldots, x_n)$, w
* Problem: evidence influences the choice of downstream variables, but not upstream variables
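A minimal sketch of likelihood weighting on a hypothetical two-node net Cloudy → Rain (the CPT values are invented, not from the notes): evidence variables are clamped and contribute to the weight, while the rest are sampled from their local conditionals.

```python
import random

# Hypothetical net: Cloudy -> Rain, with invented CPTs.
P_cloudy = {True: 0.5, False: 0.5}
P_rain_given_cloudy = {True:  {True: 0.8, False: 0.2},
                       False: {True: 0.1, False: 0.9}}

def weighted_sample(rain_evidence):
    """Sample non-evidence variables in topological order; weight by the evidence."""
    w = 1.0
    cloudy = random.random() < P_cloudy[True]    # sampled (not evidence)
    rain = rain_evidence                         # clamped to the observed value
    w *= P_rain_given_cloudy[cloudy][rain]       # w *= P(evidence | parents)
    return cloudy, w

def estimate_p_cloudy_given_rain(n=10000, rain_evidence=True):
    num = den = 0.0
    for _ in range(n):
        cloudy, w = weighted_sample(rain_evidence)
        num += w * cloudy
        den += w
    return num / den

# Close to the exact posterior 0.8*0.5 / (0.8*0.5 + 0.1*0.5) ≈ 0.889
print(estimate_p_cloudy_given_rain())
```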
## Gibbs Sampling
* 1) Fix the evidence variables
* 2) Start with a random instantiation of the non-evidence variables
* 3) Repeatedly resample one variable X from P(X | rest of the variables)
    * Sample 1 variable at a time, conditioned on the rest
* Idea: all variables are instantiated at all times (no unassigned variables)

# Lecture 18: 11/5/19

## Decision Networks
* Maximize Expected Utility (MEU)
* A Bayes net with extra nodes for actions and utility
* Calculate the expected utility for each action
* Chance nodes, action nodes, utility nodes
* Procedure
    * 1) Instantiate all evidence
    * 2) Set the action node(s) each possible way
    * 3) Calculate the posterior over the parents of the utility node
    * 4) Calculate the expected utility for each action
    * 5) Choose the maximizing action
* Value of information = expected gain in MEU with the new information
    * When you get evidence, your MEU might be less
    * But now you can make the correct decision knowing that life sucks

## Value of Perfect Information
* Example with a forecast F: $P(F=\text{bad})\,MEU(F=\text{bad}) + P(F=\text{good})\,MEU(F=\text{good}) - MEU(\varnothing)$
* $VPI(E' \mid e) = \left(\sum\limits_{e'} P(e'\mid e)\,MEU(e,e') \right) - MEU(e)$
* $MEU(e) = \max\limits_a \sum\limits_s P(s\mid e)\,U(s,a)$
* Properties
    * VPI is always nonnegative
    * VPI is nonadditive
        * $VPI(E_j, E_k \mid e) \neq VPI(E_j\mid e) + VPI(E_k\mid e)$ in general
    * VPI is order independent
        * $VPI(E_j, E_k \mid e) = VPI(E_j\mid e) + VPI(E_k\mid e, e_j)$ in any order

# Lecture 19: 11/7/19

## POMDPs
* Observation function $O(s,o) = P(o\mid s)$
* POMDPs are MDPs over belief states b
* One way to solve: use truncated expectimax to get approximate values of actions
* VPI-based agent

## Hidden Markov Models
* Reasoning about a sequence of observations over time
* The value of X at a given time is called the state
* Parameters: transition probabilities/dynamics specify how the state evolves over time
    * Stationarity assumption: transition probabilities are the same at all times
* Initial distribution: $P(X_1)$
* Transitions: $P(X_t \mid X_{t-1})$
* Emissions: $P(E_t \mid X_t)$
* Markov hidden process: the future depends on the past via the present
* The current observation is independent of everything else given the current state

## Mini-Forward Algorithm
* Find $P(X_t)$ on some day t (see the sketch below)
* $P(x_t) = \sum\limits_{x_{t-1}} P(x_t \mid x_{t-1})P(x_{t-1})$
* Can converge as $t \to \infty$
* Stationary distribution: $P_\infty = P_{\infty + 1}$
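A minimal sketch of the mini-forward update above, iterated until it approaches a stationary distribution. The two-state weather transition model is hypothetical.

```python
# Hypothetical two-state transition model P(x_t | x_{t-1}).
P_trans = {
    'sun':  {'sun': 0.9, 'rain': 0.1},
    'rain': {'sun': 0.3, 'rain': 0.7},
}
P = {'sun': 1.0, 'rain': 0.0}    # initial distribution P(X_1)

for t in range(100):
    # Mini-forward update: P(x_t) = sum_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1})
    P = {x: sum(P_trans[prev][x] * P[prev] for prev in P) for x in P_trans}

print(P)   # approaches the stationary distribution {'sun': 0.75, 'rain': 0.25}
```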
# Lecture 20: 11/12/19

* Localization: any single sensor reading could be wrong
    * Compute a probability distribution over where the robot can be
    * $B(X_t) = P(X_t \mid e_{1:t})$

## Inference (Passage of Time)
* Current belief: $B(X_t) = P(X_t \mid e_{1:t})$
* After 1 time step: $P(X_{t+1} \mid e_{1:t})$
    * A prediction of the new time step from the old evidence
* $B'(X_{t+1}) = \sum\limits_{x_t} P(X_{t+1} \mid x_t)B(x_t)$
    * No $e_{t+1}$ yet

## Observation
* We have the predicted belief $B'(X_{t+1}) = \sum\limits_{x_t} P(X_{t+1} \mid x_t)B(x_t)$
* New evidence comes in: $P(X_{t+1} \mid e_{1:t+1}) \propto P(X_{t+1}, e_{t+1} \mid e_{1:t})$
    * $= P(e_{t+1} \mid e_{1:t}, X_{t+1})P(X_{t+1} \mid e_{1:t})$
* Or compactly, $P(X_{t+1} \mid e_{1:t+1}) \propto P(e_{t+1} \mid X_{t+1})B'(X_{t+1})$
* Must renormalize

### Forward Algorithm
* $P(x_t, e_{1:t}) = P(e_t \mid x_t) \sum\limits_{x_{t-1}} P(x_t \mid x_{t-1})P(x_{t-1}, e_{1:t-1})$

## Particle Filtering
* Just use sampling instead of maintaining exact beliefs
* Good if $|X|$ is too big to store in memory (e.g., if X is continuous)
* Track samples of X, not all values
* Time per step is linear in the number of samples
* Generally N particles << $|X|$
* Passage of time: move each particle by sampling $x' \sim P(X' \mid x)$
* Observation:
    * Compute a weight for each particle: how likely it is given the evidence
    * $B(X) \propto P(e\mid X)B'(X)$
* Resample:
    * Resample particles to go back to all weight-1 particles
    * Resample N times from the weighted sample distribution (equivalent to renormalizing)

## Dynamic Bayes Nets (DBNs)
* Track multiple variables over time using multiple sources of evidence
* Idea: repeat a fixed Bayes net structure at each time step

# Lecture 21: 11/14/19

## Classification
* Spam vs. ham prediction for new, future emails
* Features:
    * Text patterns, words
    * Other features unrelated to email content: sender in contacts
* Ex. digit recognition
    * Get a large collection of example images, labeled
    * Want to learn to predict labels of new, future digit images
    * Features: pixels, number of components, aspect ratio, number of loops

## Model-Based Classification
* Naive Bayes: assume all features are independent given the label
* Inference: compute the joint probability distribution
    * Count the probability that each pixel is on for each label
* Bag-of-words model: features = $W_i$ is the word at position i
    * Insensitive to word order or reordering
    * Weight each word to give a total score for spam vs. ham

## Training vs. Test
* Training set, held-out set, test data
* Use test data to get a sense of accuracy in the real world
* Features: attribute-value pairs that characterize each x
* Never peek at the test set
* Evaluation:
    * Accuracy: fraction of instances predicted correctly
    * Don't want false positives
        * Ex. a friend classified as a gorilla
* Overfitting:
    * When you try too hard to maximize accuracy on the training data
    * An odds ratio of infinity blows up

## Parameter Estimation
* Estimating the distribution of a random variable
* Elicitation: ask a human
* Empirically: use training data (learning)

### Maximum Likelihood Estimation
* Data: an observed set D of heads and tails
* Hypothesis space: binomial distributions
* Learning: finding $\theta$ is an optimization problem
* MLE: $\hat{\theta} = \arg\max\limits_{\theta} \ln \theta^{\alpha_H} (1 - \theta)^{\alpha_T}$
    * $= \frac{\alpha_H}{\alpha_H + \alpha_T}$
* Another option is to take the most likely parameter value given the data

## Laplace Smoothing
* Pretend that you saw every outcome once more than you actually did (see the sketch below)
* Avoids the division-by-0 problem
* Extended Laplace estimate: pretend we saw every outcome $k$ extra times
* Tuning on held-out data
    * Learn parameters from the training data
    * Tune hyperparameters on different data
    * For each value of the hyperparameters, train and test on the held-out data
    * Choose the best value, then do a final test on the test data
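A minimal sketch of the (extended) Laplace estimate: pretend every outcome in the domain was seen k extra times before normalizing. The sample data and domain below are a hypothetical toy example.

```python
from collections import Counter

def laplace_estimate(samples, domain, k=1):
    """P_LAP,k(x) = (count(x) + k) / (N + k * |domain|)."""
    counts = Counter(samples)
    n = len(samples)
    return {x: (counts[x] + k) / (n + k * len(domain)) for x in domain}

data = ['r', 'r', 'b']                                   # observed: 2 red, 1 blue
print(laplace_estimate(data, domain=['r', 'b'], k=0))    # MLE: {'r': 0.67, 'b': 0.33}
print(laplace_estimate(data, domain=['r', 'b'], k=1))    # {'r': 0.6, 'b': 0.4}
print(laplace_estimate(data, domain=['r', 'b'], k=100))  # approaches uniform
```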
# Lecture 22: 11/19/19

* Reinforcement learning vs. supervised learning
    * RL finds a policy: a mapping from states to actions

## Perceptron
* Inputs are feature vectors
* Each feature has a weight; the weighted sum is the activation
    * $\text{activation}(x) = w \cdot f(x)$
* Positive activation outputs +1, negative outputs -1
* Weight updates (see the sketch below)
    * Start with weights = 0
    * For each training instance:
        * Classify with the current weights
        * If correct, don't do anything; if wrong, adjust the weight vector: $w = w + y^* \cdot f$
        * i.e., add/subtract the feature vector

## Multi-Class Perceptron
* For multiclass, there is a weight vector for each class
* Score of class y: $w_y \cdot f(x)$
* Prediction: the highest score wins, $y = \arg\max\limits_y w_y \cdot f(x)$
* Algorithm:
    * Start with all weights 0
    * Pick up training examples one by one
    * Predict with the current weights
    * If wrong: $w_y = w_y - f(x)$ for the predicted class $y$, and $w_{y^*} = w_{y^*} + f(x)$ for the true class $y^*$

## Properties
* Separability: true if some parameter setting gets the training set perfectly correct
* Convergence: if the training data are separable, the perceptron will eventually converge (binary case)
* Mistake bound: $\frac{k}{\delta^2}$

## Problems
* If the data are not separable, the weights might thrash
    * An averaged perceptron can help
* Mediocre generalization: it sometimes barely finds a separating solution
* Overtraining/overfitting
* Non-separable case: use a probabilistic decision
    * Very positive score → probability close to 1
    * Sigmoid function: $\phi(z) = \frac{1}{1+e^{-z}}$

## MLE
* $\max\limits_w ll(w) = \max\limits_w \sum\limits_i \log P(y^{(i)} \mid x^{(i)}, w)$
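A minimal sketch of the binary perceptron update described above, run on a hypothetical 2-D toy dataset (features and labels invented for illustration).

```python
# Toy training data: (feature vector, label in {+1, -1}); values are invented.
data = [([1.0, 1.0], +1), ([2.0, 0.5], +1),
        ([-1.0, -0.5], -1), ([-0.5, -2.0], -1)]
w = [0.0, 0.0]                       # start with weights = 0

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

for _ in range(10):                  # a few passes over the training data
    for f, y_star in data:
        y_pred = +1 if dot(w, f) >= 0 else -1
        if y_pred != y_star:         # wrong: w = w + y* . f
            w = [wi + y_star * fi for wi, fi in zip(w, f)]

print(w)                             # weights that separate the toy data
```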
# Lecture 23: 11/21/19

## Gradient Ascent
* Perform an update in the uphill direction for each coordinate
* For $g(w_1, w_2)$:
    * $w_1 \leftarrow w_1 + \alpha \frac{\partial g}{\partial w_1}(w_1,w_2)$
    * In vector form: $w \leftarrow w + \alpha \nabla_w g(w)$
* $g(w) = \sum\limits_i \log P(y^{(i)} \mid x^{(i)}, w)$
* Keep updating until an update changes $w$ by only around $0.1$ to $1\%$
* Can't make $\alpha$ too big
* Mini-batch gradient ascent: compute gradients for a batch of points in parallel
* Stop when the log likelihood of hold-out data begins to decrease

## Neural Networks
* A 2-layer neural network can approximate any continuous function
* The last layer is still logistic regression
    * More layers come before this last layer
* Features are learned rather than hand-designed

# Lecture 24: 12/3/19

## Deep Neural Networks
* Softmax: $P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}$
* Deep neural network training: $\max\limits_w ll(w) = \max\limits_w \sum\limits_i \log P(y^{(i)} \mid x^{(i)}; w)$
    * w is just a much larger vector
* Run gradient ascent; stop when the log likelihood of hold-out data starts to decrease
* Covariate shift: correlations present in the training data are not present in the test data

## Inverse RL
* Usually we try to find an optimal policy for a given reward
* Inverse planning: we see (near-)optimal behavior and try to recover the reward function that explains the demonstrated behavior
* Reward engineering is hard, so instead demonstrate the behavior and extract a reward function
* Aka inverse planning, inverse optimal control
* A more causal model of learning

## IRL Formulation
* Given a demonstration $\xi_D$, find $R(s,a) = \theta^T \phi(s,a)$ such that $R(\xi_D) \geq \max\limits_\xi R(\xi)$
    * But a 0/constant reward ends up being a solution
* Maximum Margin Planning (see the sketch below)
    * Require $R(\xi_D) \geq \max\limits_\xi \left[R(\xi) + l(\xi, \xi_D)\right]$
    * Search for a reward function while giving a bonus to trajectories farther from the demonstration
    * Makes it a little more difficult for the demonstration to be the maximizer
    * $\max\limits_\theta\left[\theta^T \phi(\xi_D) - \max\limits_\xi \left[\theta^T \phi(\xi) + l(\xi, \xi_D)\right]\right]$
    * $\xi_\theta^*$ is the inner arg max
    * Taking the gradient is hard because of the nested max
    * Take a subgradient instead (pick one of the valid directions)
        * $\nabla_\theta = \phi(\xi_D) - \phi(\xi^*_\theta)$
* What if the demonstrator is not optimal?
    * Bayesian view: model $P(\xi_D \mid \theta)$ and run inference on $\theta$
    * Use a softmax: $P(\xi_D \mid \theta) \propto e^{\beta \theta^T \phi(\xi_D)}$
    * Bayesian belief update using the demonstration as evidence
    * Maximum entropy IRL just looks at the maximum likelihood estimate
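To make the nested-max objective concrete, here is a minimal sketch of the maximum margin planning subgradient step. It is a simplification under invented assumptions: trajectories are represented only by hypothetical feature counts, and a small enumerated candidate set stands in for the planner that would normally solve the loss-augmented inner max.

```python
import numpy as np

# Hypothetical feature counts phi(xi) for the demonstration and a few candidates.
phi_demo = np.array([1.0, 0.0, 2.0])
candidates = [np.array([1.0, 1.0, 0.0]),
              np.array([0.0, 2.0, 1.0]),
              phi_demo]

def loss(phi):
    # Structured loss l(xi, xi_D): here just the L1 distance between feature counts.
    return np.abs(phi - phi_demo).sum()

theta = np.zeros(3)
alpha = 0.1
for _ in range(100):
    # Loss-augmented inner max: the candidate maximizing theta^T phi + l(xi, xi_D).
    xi_star = max(candidates, key=lambda phi: theta @ phi + loss(phi))
    # Subgradient step: grad_theta = phi(xi_D) - phi(xi*_theta).
    theta += alpha * (phi_demo - xi_star)

print(theta)   # the demonstration now maximizes the loss-augmented score
```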