# CS188 Notes Part 2
# Lecture 13: 10/15/19
## Uncertainty
* Observed variables: Agent knows things about the state (sensor readings)
* Unobserved variables: Agent reasons about other aspects (where an object is)
* Random variables
* Note that $+r$ means True, $-r$ means False
* Probability distributions
* Marginal distributions: sub-tables that eliminate variables
* "Summing out" is marginalization over a variable
* Conditional probabilities
* $P(a|b) = \frac{P(a,b)}{P(b)}$
* Select joint probabilities matching the evidence
* Then normalize the section
## Probabilistic Inference
* Compute a desired probability from other known probabilities (ex. conditional from joint)
* Probabilities change with new (given) evidence
* Inference by enumeration
* Evidence variables: $E_1, \ldots E_k$
* Query variables: $Q$
* Hidden variables: $H_1, \ldots H_r$
* We want $P(Q | e_1, \ldots e_k)$
* Steps
* 1) Select entries consistent with evidence
* 2) Sum out H to get joint of Query and evidence
* 3) Normalize (divide by the evidence probability $P(e_1, \ldots e_k)$, i.e., the sum over $Q$; sketch at the end of this section)
* Time: $O(d^n)$
* Space: $O(d^n)$
* Sometimes have conditional dist but want joint
* Product rule: $P(y)P(x|y) = P(x,y)$
* Chain rule: $P(x_1, x_2, \ldots x_n) = \Pi_i P(x_i | x_1 \ldots x_{i-1})$
* Bayes' Rule: $P(x,y) = P(x|y)P(y) = P(y|x)P(x)$
* $P(y) = \sum\limits_{x'} P(y|x')P(x')$
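
A minimal sketch of inference by enumeration on a toy two-variable joint table; the variable names and probabilities below are made up for illustration:

```python
# Inference by enumeration over an explicit joint table.
# Toy joint P(Weather, Temp) -- illustrative numbers only.
joint = {
    ('sun', 'hot'): 0.4,
    ('sun', 'cold'): 0.1,
    ('rain', 'hot'): 0.1,
    ('rain', 'cold'): 0.4,
}

def query(joint, query_index, evidence):
    """P(Q | evidence): select consistent entries, sum out the rest, normalize."""
    totals = {}
    for assignment, p in joint.items():
        # 1) keep only entries consistent with the evidence
        if all(assignment[i] == v for i, v in evidence.items()):
            # 2) sum out everything except the query variable
            q = assignment[query_index]
            totals[q] = totals.get(q, 0.0) + p
    # 3) normalize by the probability of the evidence
    z = sum(totals.values())
    return {q: p / z for q, p in totals.items()}

print(query(joint, query_index=0, evidence={1: 'hot'}))  # P(Weather | Temp=hot) -> {'sun': 0.8, 'rain': 0.2}
```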
# Lecture 14: 10/22/19
## Bayes' Nets
* Models are always simplifications of the world
* Use probabilistic models to reason about unknown variables
* Absolute/unconditional independence
* If $X \perp\!\!\!\perp Y, P(x,y) = P(x)P(y)$
* Conditional independence
* $\forall x,y,z:P(x, y | z) = P(x | z) P(y | z)$
* Equivalently, $P(x|z, y) = P(x| z)$
* Joint distribution tables are too large to explicitly represent
* Hard to learn anything empirically about more than a few variables at a time
* Bayes' Net: technique for describing complex joint distribution (model) using conditional probabilities
* Graphical model, how variables locally interact
### Graphical Model Notation
* Bayes net = DAG + conditional probability table (CPT)
* $P(X | a_1, \ldots a_n)$
* Nodes: 1 per variable (with domains)
* Can be assigned (observed) or unassigned (unobserved)
* Arcs: interactions/direct influence between variables
* Encode conditional independence
* Absolute independence = no interactions between variables
* Implicitly encode joint distributions
* Product of local conditional distributions
* $P(x_1, x_2, \ldots x_n) = \Pi^{n}_{i=1}P(x_i|parents(X_i))$ (sketch after this list)
* Not every BN can represent every joint distribution
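
A minimal sketch of how a Bayes net's CPTs implicitly encode the joint, using a hypothetical two-node net Rain → Traffic with made-up numbers:

```python
# Joint probability from a tiny Bayes net: Rain -> Traffic.
# CPT numbers are made up for illustration.
P_rain = {True: 0.1, False: 0.9}                      # P(R)
P_traffic_given_rain = {                              # P(T | R)
    True:  {True: 0.8, False: 0.2},
    False: {True: 0.1, False: 0.9},
}

def joint(r, t):
    # P(r, t) = P(r) * P(t | r): product of local conditional distributions
    return P_rain[r] * P_traffic_given_rain[r][t]

# The entries of the implicit joint table sum to 1.
total = sum(joint(r, t) for r in (True, False) for t in (True, False))
print(joint(True, True), total)   # 0.08, 1.0
```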
# Lecture 15: 10/24/19
## Inference by Enumeration in BN
* Evidence variables: Evidence you have observed
* Query variables: Probability you want
* Hidden variables: Not observed
* $P(B | +j, +m) \propto_B P(B, +j, +m)$
* $ = \sum\limits_{e,a}P(B, e, a, +j, +m)$
* $ = \sum\limits_{e,a} P(B)P(e)P(a|B,e)P(+j|a)P(+m|a)$
* Computing the joint distribution is time and memory intensive
* Track intermediate tables called factors
* Procedure: Join all factors, eliminate hidden variables, normalize
* 1) Join: $\forall r,t: P(r,t) = P(r)P(t|r)$
* 2) Eliminate/marginalize. Take hidden variables out
* Projection operation
## Variable Elimination
* NP-hard in the worst case, but tends to be much faster than enumeration in practice
* Worst case: inference can encode 3-SAT, so don't expect it to be efficient on every network
* Inference by enumeration: $\sum\limits_t \sum\limits_r P(L|t)P(r)P(t|r)$
* Variable elimination $\sum\limits_t P(L|t) \sum\limits_r P(r)P(t|r)$
* Interleave joins and elimination
* Eliminate all hidden variables
* Normalize at the end
* Move summations as far to the right (inward) as possible
* Elimination order matters for the size of the intermediate factors
* An ordering that keeps all factors small doesn't always exist (sketch below)
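
A minimal sketch of variable elimination on the chain $R \to T \to L$ used in the formulas above, with made-up CPT numbers; eliminating $R$ first keeps every intermediate factor over a single variable:

```python
# Variable elimination on the chain R -> T -> L, matching
# sum_t P(L|t) sum_r P(r) P(t|r).  CPT numbers are made up.
P_R = {'+r': 0.1, '-r': 0.9}
P_T_given_R = {('+t', '+r'): 0.8, ('-t', '+r'): 0.2,
               ('+t', '-r'): 0.1, ('-t', '-r'): 0.9}
P_L_given_T = {('+l', '+t'): 0.3, ('-l', '+t'): 0.7,
               ('+l', '-t'): 0.1, ('-l', '-t'): 0.9}

# Eliminate R first: join P(R) with P(T|R), then sum out R -> factor f1(T)
f1 = {}
for t in ('+t', '-t'):
    f1[t] = sum(P_R[r] * P_T_given_R[(t, r)] for r in ('+r', '-r'))

# Then join f1(T) with P(L|T) and sum out T -> P(L)
P_L = {}
for l in ('+l', '-l'):
    P_L[l] = sum(P_L_given_T[(l, t)] * f1[t] for t in ('+t', '-t'))

print(f1)   # intermediate factor over T only (the full joint is never built)
print(P_L)  # already normalized here since no evidence was used
```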
# Lecture 16: 10/29/19
* Independence in a BN
* Yes: Prove using algebra
* No: Give a counterexample
* Typically read independence relations off the graph structure (independence could also hold just because the CPT numbers happen to work out)
## Independence Properties
* Study independence properties for triples
* For 1 variable: never independent of itself
* For 2 variables: just check whether there is an edge
* For 3+ variables: reduce to these canonical triples
* 1) Causal Chains
* $P(x,y,z) = P(x)P(y|x)P(z|y)$
* Not guaranteed $X \perp\!\!\!\perp Z$
* Guaranteed $X \perp\!\!\!\perp Z | Y$
* 2) Common Cause
* $P(x,y,z) = P(y)P(x|y)P(z|y)$
* Not guaranteed $X \perp\!\!\!\perp Z$
* Guaranteed $X \perp\!\!\!\perp Z | Y$
* 3) Common Effect
* $P(x,y,z) = P(x)P(y)P(z|x,y)$
* Guaranteed $X \perp\!\!\!\perp Z$
* NOT guaranteed $X \perp\!\!\!\perp Z | Y$
* Observing an effect activates influence between causes
## D-Separation
* Reachability is not enough (fails for common effects)
* $X \perp\!\!\!\perp Y | Z$ if X and Y are d-separated by Z
* Consider all undirected paths from X to Y
* No active paths = independent
* A path is active if each triple along it is active (per-triple check sketched after this list)
* Causal chain: A->B->C where B is unobserved
* Common cause: A <- B -> C where B is unobserved
* Common effect: A -> B <- C where B or one of its descendants is OBSERVED
* A single inactive segment can block a path
* If 1 or more path is active, independence not guaranteed
* If all paths are inactive, independence guaranteed
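
A minimal sketch of the per-triple activity rules above (the function name and interface are mine, for illustration); a full d-separation check would enumerate every undirected path from X to Y and apply this test to each triple along it:

```python
# Activity check for one triple along an undirected path, per the rules above.
def triple_active(kind, b_or_descendant_observed):
    """kind: 'chain' (A->B->C), 'common_cause' (A<-B->C), 'common_effect' (A->B<-C).
    For chain/common cause, pass whether the middle node B itself is observed."""
    if kind in ('chain', 'common_cause'):
        # active only when the middle node is unobserved
        return not b_or_descendant_observed
    if kind == 'common_effect':
        # active only when B or one of its descendants IS observed
        return b_or_descendant_observed
    raise ValueError(kind)

# A path is active iff every triple along it is active; X indep Y | Z is
# guaranteed iff no undirected path from X to Y is active.
print(triple_active('chain', b_or_descendant_observed=True))          # False
print(triple_active('common_effect', b_or_descendant_observed=True))  # True
```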
## Topology Limits Distributions
* A fully connected BN (each variable conditioned on all earlier ones, as in the chain rule) can encode any distribution
# Lecture 17: 10/31/19
## Sampling
* Repeated simulations
* Draw N samples from a sampling distribution S
* Compute an approximate posterior probability which converges to true probability P
* 1) Get sample u from uniform distribution $[0, 1)$
* 2) Divide $[0, 1)$ into subintervals with sizes equal to the outcome probabilities; the subinterval containing $u$ determines the sample (sketch below)
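
A minimal sketch of sampling one value from a discrete distribution with a single uniform draw (the distribution here is made up):

```python
import random

# Sample one value from a discrete distribution using a single uniform draw,
# by carving [0, 1) into subintervals of the given sizes.
def sample_discrete(dist):
    """dist: {value: probability}, probabilities summing to 1."""
    u = random.random()           # step 1: u ~ Uniform[0, 1)
    cumulative = 0.0
    for value, p in dist.items():
        cumulative += p           # step 2: u falls in some subinterval
        if u < cumulative:
            return value
    return value                  # guard against floating-point round-off

# Illustrative distribution (numbers made up)
print(sample_discrete({'red': 0.6, 'green': 0.1, 'blue': 0.3}))
```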
## Prior Sampling
* 1) Run through nodes in topological order
* Sample $x_i$ from $P(X_i | Parents(X_i))$
* 2) Return ($x_1, x_2, \ldots x_n$)
* Sampling procedure is consistent
* If we run the procedure long enough, the estimates converge to the true probabilities
* Monte Carlo Estimation: Estimate expected value of f(X)
## Rejection Sampling
* Only consider samples that match evidence
* Input: evidence instantiation
* 1) Run through nodes in topological order
* Sample $x_i$ from $P(X_i | Parents(X_i))$
* 2) If $x_i$ is not consistent with the evidence, reject the whole sample
* 3) Return ($x_1, x_2, \ldots x_n$)
* Sampling procedure is consistent
* Problem: If evidence is unlikely, reject lots of samples
## Likelihood Weighting
* Fix evidence variables, sample the rest
* Problem: fixing evidence without correcting for it makes the sample distribution inconsistent
* Solution: Weight by probability of evidence given parents
* Algorithm
* 1) Instantiate weights to be 1
* 2) Run through nodes in topological order
* If $X_i$ is an evidence variable:
* Set $x_i$ = the observed value of $X_i$
* Set $w = w * P(x_i | Parents(X_i))$
* Else: sample $x_i$ from $P(X_i | Parents(X_i))$ (sketch after this list)
* 3) Return ($x_1, x_2, \ldots x_n$), w
* Problem: Evidence influences choice of downstream variables, but not upstream variables
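
A minimal sketch of likelihood weighting on a hypothetical two-node net Rain → Traffic with evidence Traffic = true (CPT numbers are made up):

```python
import random

# Likelihood weighting on the toy net Rain -> Traffic with evidence T = True.
P_R = {True: 0.1, False: 0.9}
P_T_given_R = {True: {True: 0.8, False: 0.2},
               False: {True: 0.1, False: 0.9}}

def weighted_sample(evidence_t):
    w = 1.0
    # R is not evidence: sample it from P(R | parents) = P(R)
    r = random.random() < P_R[True]
    # T is evidence: fix it and multiply in P(t | parents)
    t = evidence_t
    w *= P_T_given_R[r][t]
    return (r, t), w

def estimate_P_rain_given_traffic(n=100_000):
    num = den = 0.0
    for _ in range(n):
        (r, _), w = weighted_sample(evidence_t=True)
        den += w
        if r:
            num += w
    return num / den   # weighted fraction of samples with R = True

print(estimate_P_rain_given_traffic())  # approaches 0.08 / 0.17 ≈ 0.47
```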
## Gibbs Sampling
* 1) Fix evidence variables
* 2) Start with random instantiation.
* 3) Resample X from P(X | rest of variables)
* Sample 1 variable at a time conditioned on the rest
* Idea: each resample conditions on a full assignment of all other variables, so both upstream and downstream evidence influence the sampled value
# Lecture 18: 11/5/19
## Decision Networks
* Maximize Expected Utility (MEU)
* Bayes net with nodes for utility
* Calculate expected utility for each action
* Chance nodes, action nodes, utility nodes
* Procedure
* 1) Instantiate all evidence
* 2) Set the action node(s) each possible way
* 3) Calculate the posterior over the parents of the utility node
* 4) Calculate expected utility for each action
* 5) Choose maximizing action
* Value of information = expected gain in MEU with new info
* For a particular evidence outcome, the resulting MEU might be lower
* But now you can make the correct decision knowing that life sucks
## Value of Perfect Information
* Ex: $P(F=bad)MEU(F=bad) + P(F=good)MEU(F=good) - MEU(none)$ (numeric sketch below)
* $VPI(E' | e) = \left(\sum\limits_{e'}P(e'|e)MEU(e,e') \right) - MEU(e)$
* $MEU(e) = \max\limits_a \sum\limits_sP(s|e)U(s,a)$
* Properties
* VPI is always nonnegative
* VPI is nonadditive
* $VPI(E_j, E_k | e) \neq VPI(E_j|e) + VPI(E_k|e)$
* VPI is order independent
* $VPI(E_j, E_k | e) = VPI(E_j|e) + VPI(E_k|e, E_j)$, expanded in either order
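
A small numeric sketch of VPI for a made-up weather/umbrella decision problem; all states, actions, forecast probabilities, and utilities below are illustrative, not from lecture:

```python
# VPI of a forecast F in a toy weather/umbrella decision problem.
P_W = {'sun': 0.7, 'rain': 0.3}                        # prior over the state
P_F_given_W = {'sun': {'good': 0.8, 'bad': 0.2},       # forecast model
               'rain': {'good': 0.3, 'bad': 0.7}}
U = {('leave', 'sun'): 100, ('leave', 'rain'): 0,      # utility U(a, s)
     ('take', 'sun'): 70,  ('take', 'rain'): 70}

def meu(belief):
    """max_a sum_s P(s) U(a, s) for a given belief over the weather."""
    return max(sum(belief[s] * U[(a, s)] for s in belief)
               for a in ('leave', 'take'))

# MEU with no evidence
meu_none = meu(P_W)

# MEU after observing each forecast value, weighted by its probability
vpi = -meu_none
for f in ('good', 'bad'):
    p_f = sum(P_W[w] * P_F_given_W[w][f] for w in P_W)
    posterior = {w: P_W[w] * P_F_given_W[w][f] / p_f for w in P_W}
    vpi += p_f * meu(posterior)

print(meu_none, vpi)   # 70 and ~10.5 with these numbers; VPI is nonnegative
```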
# Lecture 19: 11/7/19
## POMDPs
* Observation function $O(s,o) = P(o|s)$
* POMDPs are MDPs over belief states b
* One way to solve: Use truncated expectimax for approximate values of actions
* VPI based agent
## Hidden Markov Models
* Reasoning about a sequence of observations over time
* Value of X at a given time is called the state
* Parameters: transition probabilities/dynamics show how state evolves over time
* Stationary assumption: transition probabilities the same at all times
* Initial distribution $P(X_1)$
* Transitions: $P(X_t | X_{t-1})$
* Emissions: $P(E_t | X_t)$
* Hidden Markov process: the future depends on the past only via the present
* Current observation is independent of everything else given current state
## Mini-Forward Algorithm
* Find $P(X_t)$ on some day $t$ (sketch after this list)
* $P(x_t) = \sum\limits_{x_{t-1}} P(x_t | x_{t-1})P(x_{t-1})$
* Can converge as $t \to \infty$
* Stationary distribution: $P_\infty = P_{\infty + 1}$
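
A minimal sketch of the mini-forward algorithm converging to a stationary distribution (transition numbers are illustrative):

```python
# Mini-forward algorithm: push P(X_t) forward through the transition model
# until it stops changing.
P_trans = {'sun': {'sun': 0.9, 'rain': 0.1},    # P(X_t | X_{t-1})
           'rain': {'sun': 0.3, 'rain': 0.7}}

def step(p):
    # P(x_t) = sum_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1})
    return {x: sum(P_trans[prev][x] * p[prev] for prev in p) for x in p}

p = {'sun': 1.0, 'rain': 0.0}     # initial distribution P(X_1)
for _ in range(100):
    p = step(p)
print(p)   # ~{'sun': 0.75, 'rain': 0.25}: the stationary distribution
```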
# Lecture 20: 11/12/19
* Localization: any single sensor reading could be wrong, so compute a probability distribution over where the robot can be
* $B(X_t) = P(X_t|e_{1\colon t})$
## Inference (Passage of time)
* Current belief $B(X_t) = P(X_t|e_{1\colon t})$
* After 1 time step: $P(X_{t+1}|e_{1 \colon t})$
* Prediction of new time step from old evidence
* $B'(X_{t+1}) = \sum_{x_t} P(X_{t+1}|x_t)B(x_t)$
* No $e_{t+1}$
## Observation
* We have the predicted belief $B'(X_{t+1}) = \sum_{x_t} P(X_{t+1}|x_t)B(x_t)$
* New evidence comes in: $P(X_{t+1}|e_{1\colon t+1}) \propto P(X_{t+1}, e_{t+1}|e_{1\colon t})$
* $= P(e_{t+1}|e_{1\colon t}, X_{t+1})P(X_{t+1}|e_{1\colon t})$
* Or compactly, $P(X_{t+1}|e_{1\colon t+1}) \propto P(e_{t+1}|X_{t+1})B'(X_{t+1})$
* Must renormalize
### Forward Algorithm
* $P(x_t, e_{1\colon t}) = P(e_t | x_t) \sum\limits_{x_{t-1}} P(x_t|x_{t-1})P(x_{t-1}, e_{1\colon t-1})$; normalize at the end to get $P(x_t|e_{1\colon t})$ (sketch below)
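
A minimal sketch of one forward-algorithm step (elapse time, weight by evidence, normalize) on a made-up weather/umbrella HMM:

```python
# One step of the forward algorithm for a toy weather/umbrella HMM.
P_trans = {'sun': {'sun': 0.9, 'rain': 0.1},
           'rain': {'sun': 0.3, 'rain': 0.7}}
P_emit = {'sun': {'umbrella': 0.2, 'no_umbrella': 0.8},
          'rain': {'umbrella': 0.9, 'no_umbrella': 0.1}}

def forward_step(belief, evidence):
    # passage of time: B'(x') = sum_x P(x' | x) B(x)
    predicted = {x2: sum(P_trans[x1][x2] * belief[x1] for x1 in belief)
                 for x2 in belief}
    # observation: multiply in P(e | x'), then renormalize
    unnorm = {x: P_emit[x][evidence] * predicted[x] for x in predicted}
    z = sum(unnorm.values())
    return {x: p / z for x, p in unnorm.items()}

belief = {'sun': 0.5, 'rain': 0.5}
for e in ['umbrella', 'umbrella', 'no_umbrella']:
    belief = forward_step(belief, e)
    print(belief)
```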
## Particle Filtering
* Just use sampling instead of dealing with exact beliefs
* Good if $|X|$ is too big to store in memory (if X is continuous)
* Track samples of X, not all values
* Time per step is linear in the number of samples
* Generally the number of particles $N \ll |X|$
* Elapse time: for each particle, $x' = sample(P(X'|x))$
* Captures passage of time
* Observation:
* Compute a weight for each particle: how likely the evidence is given that particle, $w(x) = P(e|x)$ (end-to-end sketch after this list)
* $B(X) \propto P(e|X)B'(X)$
* Resample:
* Resample particles to go back to all weight 1 particles
* Resample N times from weighted sample distribution (equivalent to renormalizing)
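
A minimal end-to-end sketch of particle filtering on the same made-up weather/umbrella HMM:

```python
import random

# Particle filter for the toy weather/umbrella HMM: elapse time, weight, resample.
P_trans = {'sun': {'sun': 0.9, 'rain': 0.1},
           'rain': {'sun': 0.3, 'rain': 0.7}}
P_emit = {'sun': {'umbrella': 0.2, 'no_umbrella': 0.8},
          'rain': {'umbrella': 0.9, 'no_umbrella': 0.1}}

def sample_from(dist):
    return random.choices(list(dist), weights=dist.values())[0]

def particle_filter_step(particles, evidence):
    # 1) elapse time: move each particle through the transition model
    particles = [sample_from(P_trans[x]) for x in particles]
    # 2) observe: weight each particle by P(e | x)
    weights = [P_emit[x][evidence] for x in particles]
    # 3) resample N particles in proportion to the weights (all weights back to 1)
    return random.choices(particles, weights=weights, k=len(particles))

particles = [random.choice(['sun', 'rain']) for _ in range(1000)]
for e in ['umbrella', 'umbrella']:
    particles = particle_filter_step(particles, e)
print(particles.count('rain') / len(particles))   # approximate B(rain)
```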
## Dynamic Bayes Nets (DBNs)
* Track multiple variables over time using multiple evidence
* Idea: Repeat a fixed Bayes net structure at each time
# Lecture 21: 11/14/19
## Classification
* Spam vs ham prediction of new future emails
* Features:
* Text Patterns, words
* Other features unrelated to email content: Sender in contacts
* Ex. Digit recognition
* Get a large collection of example images, labeled
* Want to learn to predict labels of new, future digit images
* Features: pixels, number of connected components, aspect ratio, number of loops
## Model based classification
* Naive Bayes: assume all features are independent given the label
* Inference: compute the joint probability distribution
* Count the probability that each pixel is on for that label
* Bag of words model: Features = Wi is word at position i
* Insensitive to word order or reordering
* Weight each word along the way to give a total score for spam vs ham (scoring sketch below)
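
A minimal sketch of bag-of-words Naive Bayes scoring in log space; the label prior and word probabilities are made up for illustration:

```python
import math

# Naive Bayes bag-of-words scoring: sum log P(word | label) over the words.
P_label = {'spam': 0.4, 'ham': 0.6}
P_word = {'spam': {'free': 0.05, 'money': 0.04, 'meeting': 0.001},
          'ham':  {'free': 0.002, 'money': 0.003, 'meeting': 0.03}}

def score(label, words):
    # log P(label) + sum_i log P(w_i | label); logs avoid numerical underflow
    return math.log(P_label[label]) + sum(math.log(P_word[label][w]) for w in words)

words = ['free', 'money', 'money']
prediction = max(P_label, key=lambda y: score(y, words))
print(prediction)   # 'spam' with these numbers
```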
## Training vs Test
* Training set, Held out set, test data
* Use test data to get a sense of accuracy in the real world
* Features: attribute-value pairs to characterize each x
* Never peek at the test set
* Evaluation:
* Accuracy: fraction of instances predicted correctly
* Don't want false positives
* Friend classified as a gorilla
* Overfitting:
* When you try to maximize accuracy on training data
* Features seen with only one label get an odds ratio of infinity, which blows up
## Parameter Estimation
* Estimating the distribution of a RV
* Elicitation: ask a human
* Empirically: use training data (learning)
### Maximum Likelihood Estimation
* Data: Observed set D of heads and tails
* Hypothesis space: binomial distributions
* Learning: Finding $\theta$ is an optimization problem
* MLE: $\hat{\theta} = arg\max_{\theta} \ln \theta^{\alpha_H} (1 - \theta)^{\alpha_T}$
* $ = \frac{\alpha_H}{\alpha_H + \alpha_T}$
* Another option is to consider the most likely parameter value given the data
## Laplace Smoothing
* Pretend that you saw every outcome once more than you actually did
* Avoids the division by 0 problem
* Extended Laplace estimate: pretend we saw every outcome $k$ extra times (sketch after this list)
* Tuning on held out data
* Learn parameters from training data
* Tune hyperparameters on different data
* For each value of hyperparameters, train and test on held out data
* Choose the best value, do a final test on test data
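
A minimal sketch of the add-$k$ (Laplace) estimate; the function name is mine and the counts are illustrative:

```python
from collections import Counter

# Laplace (add-k) smoothing of a distribution estimated from counts.
def laplace_estimate(samples, domain, k=1):
    """P_LAP,k(x) = (count(x) + k) / (N + k * |domain|)."""
    counts = Counter(samples)
    n = len(samples)
    return {x: (counts[x] + k) / (n + k * len(domain)) for x in domain}

# 'b' was never observed, but smoothing keeps its probability above zero.
print(laplace_estimate(['a', 'a', 'c'], domain=['a', 'b', 'c'], k=1))
# {'a': 0.5, 'b': 0.1666..., 'c': 0.333...}
```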
# Lecture 22: 11/19/19
* Reinforcement Learning vs Supervised Learning
* RL finds a policy: a mapping from states to actions
## Perceptron
* Inputs are feature vectors
* Each feature has a weight, sum is the activation
* Activation(x) = $w \cdot f(x)$
* Positive activation outputs +1, negative outputs -1
* Weight updates
* Start with weights = 0
* For each training instance:
* Classify with current weights
* If correct, don't do anything. If wrong, adjust the weight vector ($w = w + y^* \cdot f$)
* Add/subtract the feature vector
## Multi-Class Perceptron
* For multiclass, there is a weight vector for each class
* Score of class y: $w_y \cdot f(x)$
* Prediction highest score wins: $y = arg \max\limits_y w_y \cdot f(x)$
* Algorithm:
* Start with weights all 0
* Pick up training examples 1 by 1
* Predict with current weights
* If wrong: lower the predicted class, $w_y = w_y - f(x)$; raise the true class, $w_{y^*} = w_{y^*} + f(x)$ (sketch below)
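
A minimal sketch of the multi-class perceptron update loop on a made-up toy dataset:

```python
# Multi-class perceptron: one weight vector per class, mistake-driven updates.
def train_perceptron(data, classes, num_features, passes=10):
    weights = {y: [0.0] * num_features for y in classes}
    for _ in range(passes):
        for f, y_true in data:                       # one example at a time
            scores = {y: sum(wi * fi for wi, fi in zip(weights[y], f))
                      for y in classes}
            y_pred = max(scores, key=scores.get)     # highest score wins
            if y_pred != y_true:                     # on a mistake:
                weights[y_pred] = [wi - fi for wi, fi in zip(weights[y_pred], f)]
                weights[y_true] = [wi + fi for wi, fi in zip(weights[y_true], f)]
    return weights

# Tiny separable toy set: feature = (bias, x); labels 'A' vs 'B' (made up)
data = [((1.0, 2.0), 'A'), ((1.0, 3.0), 'A'), ((1.0, -1.0), 'B'), ((1.0, -2.0), 'B')]
w = train_perceptron(data, classes=['A', 'B'], num_features=2)
print(w)
```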
## Properties
* Separability: true if some parameters get the training set perfectly correct
* Convergence: if the training is separable, perceptron will eventually converge (binary case)
* Mistake bound: at most $\frac{k}{\delta^2}$ mistakes, where $\delta$ is the margin of separability
## Problems
* If data is not separable, weights might thrash
* Averaged perceptron can help
* Mediocre generalization: finds a solution that barely separates the data
* Overtraining/overfitting
* Non-separable case: use probabilistic decision
* Very positive score, probability close to 1
* Sigmoid function: $\phi(z) = \frac{1}{1+e^{-z}}$
## MLE
* $\max\limits_w ll(w) = \max\limits_w \sum\limits_i \log P(y^{(i)} | x^{(i)}, w)$
# Lecture 23: 11/21/19
## Gradient Ascent
* Perform update in uphill direction for each coordinate
* For $g(w_1, w_2)$
* $w_1 \leftarrow w_1 + \alpha \frac{\partial g}{\partial w_1}(w_1,w_2)$
* $w \leftarrow w + \alpha \nabla_w g(w)$
* $g(w) = \sum\limits_i \log P(y^{(i)} | x^{(i)}, w)$
* Keep doing this until update changes $w$ by around $0.1 - 1\%$
* Can't make $\alpha$ too big
* Mini-batch gradient ascent: compute gradients over a small batch of points (parallelizable)
* Stop when the log likelihood of held-out data begins to decrease (sketch below)
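
A minimal sketch of batch gradient ascent on the logistic-regression log likelihood; the data, learning rate, and step count are made up, and the held-out early-stopping check is omitted:

```python
import math

# Batch gradient ascent on the log likelihood of logistic regression.
# Labels are +1 / -1; P(y=+1 | x; w) = sigmoid(w . f(x)).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_ascent(data, num_features, alpha=0.1, steps=500):
    w = [0.0] * num_features
    for _ in range(steps):
        grad = [0.0] * num_features
        for f, y in data:
            # d/dw log P(y | x; w) = (1 - sigmoid(y * w.f)) * y * f
            margin = y * sum(wi * fi for wi, fi in zip(w, f))
            coeff = (1.0 - sigmoid(margin)) * y
            grad = [g + coeff * fi for g, fi in zip(grad, f)]
        w = [wi + alpha * gi for wi, gi in zip(w, grad)]   # uphill step
    return w

data = [((1.0, 2.0), +1), ((1.0, 1.5), +1), ((1.0, -1.0), -1), ((1.0, -2.0), -1)]
print(gradient_ascent(data, num_features=2))
```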
## Neural Networks
* A 2-layer neural network can approximate any continuous function
* Last layer = still logistic regression
* More layers before this last layer
* Features are learned rather than hand-designed
# Lecture 24: 12/3/19
## Deep Neural Networks
* Softmax: $P(y_1| x; w) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}$
* Deep neural network: $\max\limits_w ll(w) = \max\limits_w \sum\limits_i \log P(y^{(i)}| x^{(i)}; w)$
* w is just a much larger vector
* Run gradient ascent, stop when log likelihood of hold out data starts to decrease
* Covariate shift: correlations present in training data but not in test data
## Inverse RL
* Usually we try to find an optimal policy
* Inverse planning: we observe (near-)optimal behavior and try to recover the reward function that explains the demonstrated behavior
* Reward engineering is hard, try demonstrating behavior and extract reward function
* Aka inverse planning, inverse optimal control
* More causal model of learning
## IRL Formulation
* Given $\xi_D$, find $R(s,a) = \theta^T \phi(s,a)$ such that $R(\xi_D) \geq \max\limits_\xi R(\xi)$
* But 0/constant reward ends up being a solution
* Maximum Margin Planning
* $R(\xi_D) \geq \max\limits_\xi [R(\xi) + l(\xi, \xi_D)]$
* Search for a reward function giving a bonus to trajectories farther from the demonstration
* The bonus makes the constraint harder to satisfy: the demonstration must beat dissimilar trajectories by a margin, ruling out the trivial constant-reward solution
* $\max\limits_\theta[\theta^T \phi(\xi_D) - \max\limits_\xi [\theta^T \phi(\xi) + l(\xi, \xi_D)]]$
* $\xi_\theta^*$ is the arg max of the inner (loss-augmented) maximization
* Taking the gradient is hard because of the nested max
* Take a subgradient instead: pick one of the valid directions
* $\nabla_\theta = \phi(\xi_D) - \phi(\xi^*_\theta)$
* What if the demonstrator is not optimal?
* Bayesian view: $P(\xi_D | \theta)$, run inference on $\theta$
* Use softmax: $P(\xi_D | \theta) \propto e^{\beta \theta^T \phi(\xi_D)}$
* Bayesian belief update using demonstration as evidence
* Maximum Entropy IRL just uses the maximum likelihood estimate of $\theta$ under this model