# CS188 Notes Part 2

# Lecture 13: 10/15/19

## Uncertainty
* Observed variables: agent knows things about the state (sensor readings)
* Unobserved variables: agent reasons about other aspects (where an object is)
* Random variables
    * Note that $+r$ means True, $-r$ means False
* Probability distributions
    * Marginal distributions: sub-tables that eliminate variables
    * "Summing out" is marginalization over a variable
* Conditional probabilities
    * $P(a|b) = \frac{P(a,b)}{P(b)}$
    * Select the joint probabilities matching the evidence
    * Then normalize the selection

## Probabilistic Inference
* Compute a desired probability from other known probabilities (e.g., a conditional from the joint)
* Probabilities change with new (given) evidence
* Inference by enumeration
    * Evidence variables: $E_1, \ldots, E_k$
    * Query variable: $Q$
    * Hidden variables: $H_1, \ldots, H_r$
    * We want $P(Q | e_1, \ldots, e_k)$
    * Steps (sketched in code below)
        * 1) Select the entries consistent with the evidence
        * 2) Sum out $H$ to get the joint of the query and evidence
        * 3) Normalize (multiply by $\frac{1}{P(e_1, \ldots, e_k)}$)
    * Time: $O(d^n)$
    * Space: $O(d^n)$
* Sometimes we have a conditional distribution but want the joint
    * Product rule: $P(y)P(x|y) = P(x,y)$
    * Chain rule: $P(x_1, x_2, \ldots, x_n) = \prod_i P(x_i | x_1, \ldots, x_{i-1})$
* Bayes' Rule: $P(x,y) = P(x|y)P(y) = P(y|x)P(x)$
    * So $P(x|y) = \frac{P(y|x)P(x)}{P(y)}$
    * $P(y) = \sum\limits_{x'} P(y|x')P(x')$

# Lecture 14: 10/22/19

## Bayes' Nets
* Models are always simplifications of the world
* Use probabilistic models to reason about unknown variables
* Absolute/unconditional independence
    * If $X \perp\!\!\!\perp Y$, then $P(x,y) = P(x)P(y)$
* Conditional independence
    * $\forall x,y,z: P(x, y | z) = P(x | z) P(y | z)$
    * Equivalently, $P(x|z, y) = P(x|z)$
* Joint distribution tables are too large to represent explicitly
    * Hard to learn anything empirically about more than a few variables at a time
* Bayes' net: a technique for describing a complex joint distribution (model) using local conditional probabilities
    * A graphical model of how variables locally interact

### Graphical Model Notation
* Bayes net = DAG + conditional probability tables (CPTs)
    * $P(X | a_1, \ldots, a_n)$
* Nodes: 1 per variable (with domains)
    * Can be assigned (observed) or unassigned (unobserved)
* Arcs: interactions/direct influence between variables
    * Encode conditional independence
    * Absolute independence = no interactions between variables
* A BN implicitly encodes a joint distribution
    * As a product of local conditional distributions
    * $P(x_1, x_2, \ldots, x_n) = \prod^{n}_{i=1} P(x_i | parents(X_i))$
* Not every BN can represent every joint distribution

# Lecture 15: 10/24/19

## Inference by Enumeration in BN
* Evidence variables: evidence you have observed
* Query variables: the probability you want
* Hidden variables: not observed
* $P(B | +j, +m) \propto_B P(B, +j, +m)$
    * $= \sum\limits_{e,a} P(B, e, a, +j, +m)$
    * $= \sum\limits_{e,a} P(B)P(e)P(a|B,e)P(+j|a)P(+m|a)$
* Computing the full joint distribution is time and memory intensive
* Track intermediate tables called factors
* Procedure: join all factors, eliminate hidden variables, normalize
    * 1) Join: $\forall r,t: P(r,t) = P(r)P(t|r)$
    * 2) Eliminate/marginalize: sum the hidden variables out
        * A projection operation
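The select/sum-out/normalize steps above translate directly into a few lines of code. Below is a minimal sketch over a hypothetical joint table with a query variable Q, a hidden variable H, and an evidence variable E; the table values are made up for illustration.

```python
# A minimal sketch of inference by enumeration (select, sum out, normalize)
# over a hypothetical joint table P(Q, H, E); the values are invented.
joint = {
    # (q, h, e): P(q, h, e)
    ('+', '+', '+'): 0.10, ('+', '+', '-'): 0.05,
    ('+', '-', '+'): 0.20, ('+', '-', '-'): 0.05,
    ('-', '+', '+'): 0.15, ('-', '+', '-'): 0.10,
    ('-', '-', '+'): 0.05, ('-', '-', '-'): 0.30,
}

def infer(joint, e_obs):
    """P(Q | E=e_obs): 1) select consistent rows, 2) sum out H, 3) normalize."""
    summed = {}
    for (q, h, e), p in joint.items():
        if e == e_obs:                              # 1) consistent with evidence
            summed[q] = summed.get(q, 0.0) + p      # 2) sum out H
    z = sum(summed.values())                        # z = P(E=e_obs)
    return {q: p / z for q, p in summed.items()}    # 3) normalize

print(infer(joint, '+'))    # {'+': 0.6, '-': 0.4}
```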
## Variable Elimination
* Exact inference is NP-hard in general (it can encode 3-SAT), but variable elimination tends to be much faster than enumeration in practice
* Inference by enumeration: $\sum\limits_t \sum\limits_r P(L|t)P(r)P(t|r)$
* Variable elimination: $\sum\limits_t P(L|t) \sum\limits_r P(r)P(t|r)$
* Interleave joins and eliminations
    * Eliminate all hidden variables
    * Normalize at the end
* Move summations as far right as possible
* Elimination order matters
    * An ordering with small factors doesn't always exist

# Lecture 16: 10/29/19

* Independence in a BN
    * Yes: prove it using algebra
    * No: give a counterexample
    * Typically look for independence relations from the graph structure (the numbers could just happen to work out that way)

## Independence Properties
* Study independence properties for triples
    * For 1 variable: never independent of itself
    * For 2: just look at whether there is an edge
    * For 3+: can be reduced to triples of these canonical chains
* 1) Causal chain
    * $P(x,y,z) = P(x)P(y|x)P(z|y)$
    * Not guaranteed: $X \perp\!\!\!\perp Z$
    * Guaranteed: $X \perp\!\!\!\perp Z \mid Y$
* 2) Common cause
    * $P(x,y,z) = P(y)P(x|y)P(z|y)$
    * Not guaranteed: $X \perp\!\!\!\perp Z$
    * Guaranteed: $X \perp\!\!\!\perp Z \mid Y$
* 3) Common effect
    * $P(x,y,z) = P(x)P(y)P(z|x,y)$
    * Guaranteed: $X \perp\!\!\!\perp Z$
    * NOT guaranteed: $X \perp\!\!\!\perp Z \mid Y$
    * Observing an effect activates influence between its causes

## D-Separation
* Reachability is not enough (it fails for common effects)
* $X \perp\!\!\!\perp Y \mid Z$ if X and Y are d-separated by Z
* Consider all undirected paths from X to Y
    * No active paths = independence guaranteed
    * A path is active if each triple along it is active
        * Causal chain: $A \to B \to C$ where B is unobserved
        * Common cause: $A \leftarrow B \to C$ where B is unobserved
        * Common effect: $A \to B \leftarrow C$ where B or one of its descendants is OBSERVED
    * A single inactive segment blocks a path
* If 1 or more paths are active, independence is not guaranteed
* If all paths are inactive, independence is guaranteed

## Topology Limits Distributions
* A fully connected BN can encode any distribution

# Lecture 17: 10/31/19

## Sampling
* Repeated simulation
* Draw N samples from a sampling distribution S
* Compute an approximate posterior probability, which converges to the true probability P
* 1) Get a sample u from the uniform distribution on $[0, 1)$
* 2) Divide the interval into subintervals whose sizes equal the outcome probabilities; the subinterval containing u gives the sample

## Prior Sampling
* 1) Run through the nodes in topological order
    * Sample $x_i$ from $P(X_i \mid Parents(X_i))$
* 2) Return $(x_1, x_2, \ldots, x_n)$
* The sampling procedure is consistent
    * If we run the procedure long enough, it converges
* Monte Carlo estimation: estimate the expected value of $f(X)$

## Rejection Sampling
* Only consider samples that match the evidence
* Input: an evidence instantiation
* 1) Run through the nodes in topological order
    * Sample $x_i$ from $P(X_i \mid Parents(X_i))$
* 2) If $x_i$ is not consistent with the evidence, reject the sample
* 3) Return $(x_1, x_2, \ldots, x_n)$
* The sampling procedure is consistent
* Problem: if the evidence is unlikely, we reject lots of samples

## Likelihood Weighting
* Fix the evidence variables, sample the rest
* Problem: the sample distribution is then not consistent
* Solution: weight each sample by the probability of the evidence given its parents
* Algorithm (see the sketch below)
    * 1) Initialize the weight w to 1
    * 2) Run through the nodes in topological order
        * If $X_i$ is an evidence variable:
            * Set $x_i$ to the observed value of $X_i$
            * Set $w = w \cdot P(x_i \mid Parents(X_i))$
        * Else: sample $x_i$ from $P(X_i \mid Parents(X_i))$
    * 3) Return $(x_1, x_2, \ldots, x_n)$, w
* Problem: evidence influences the choice of downstream variables, but not upstream variables
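A minimal sketch of likelihood weighting on a hypothetical two-node net Cloudy → Rain (the CPT values are invented, not from the notes): evidence variables are clamped and contribute to the weight, while the rest are sampled from their local conditionals.

```python
import random

# Hypothetical net: Cloudy -> Rain, with invented CPTs.
P_cloudy = {True: 0.5, False: 0.5}
P_rain_given_cloudy = {True:  {True: 0.8, False: 0.2},
                       False: {True: 0.1, False: 0.9}}

def weighted_sample(rain_evidence):
    """Sample non-evidence variables in topological order; weight by the evidence."""
    w = 1.0
    cloudy = random.random() < P_cloudy[True]    # sampled (not evidence)
    rain = rain_evidence                         # clamped to the observed value
    w *= P_rain_given_cloudy[cloudy][rain]       # w *= P(evidence | parents)
    return cloudy, w

def estimate_p_cloudy_given_rain(n=10000, rain_evidence=True):
    num = den = 0.0
    for _ in range(n):
        cloudy, w = weighted_sample(rain_evidence)
        num += w * cloudy
        den += w
    return num / den

# Close to the exact posterior 0.8*0.5 / (0.8*0.5 + 0.1*0.5) ≈ 0.889
print(estimate_p_cloudy_given_rain())
```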
## Gibbs Sampling
* 1) Fix the evidence variables
* 2) Start with a random instantiation of the non-evidence variables
* 3) Repeatedly resample one variable X from P(X | rest of the variables)
    * Sample 1 variable at a time, conditioned on the rest
* Idea: all variables are instantiated at all times (no unassigned variables)

# Lecture 18: 11/5/19

## Decision Networks
* Maximize Expected Utility (MEU)
* A Bayes net with extra nodes for actions and utility
* Calculate the expected utility for each action
* Chance nodes, action nodes, utility nodes
* Procedure
    * 1) Instantiate all evidence
    * 2) Set the action node(s) each possible way
    * 3) Calculate the posterior over the parents of the utility node
    * 4) Calculate the expected utility for each action
    * 5) Choose the maximizing action
* Value of information = expected gain in MEU with the new information
    * When you get evidence, your MEU might be less
    * But now you can make the correct decision knowing that life sucks

## Value of Perfect Information
* Example with a forecast F: $P(F=\text{bad})\,MEU(F=\text{bad}) + P(F=\text{good})\,MEU(F=\text{good}) - MEU(\varnothing)$
* $VPI(E' \mid e) = \left(\sum\limits_{e'} P(e'\mid e)\,MEU(e,e') \right) - MEU(e)$
* $MEU(e) = \max\limits_a \sum\limits_s P(s\mid e)\,U(s,a)$
* Properties
    * VPI is always nonnegative
    * VPI is nonadditive
        * $VPI(E_j, E_k \mid e) \neq VPI(E_j\mid e) + VPI(E_k\mid e)$ in general
    * VPI is order independent
        * $VPI(E_j, E_k \mid e) = VPI(E_j\mid e) + VPI(E_k\mid e, e_j)$ in any order

# Lecture 19: 11/7/19

## POMDPs
* Observation function $O(s,o) = P(o\mid s)$
* POMDPs are MDPs over belief states b
* One way to solve: use truncated expectimax to get approximate values of actions
* VPI-based agent

## Hidden Markov Models
* Reasoning about a sequence of observations over time
* The value of X at a given time is called the state
* Parameters: transition probabilities/dynamics specify how the state evolves over time
    * Stationarity assumption: transition probabilities are the same at all times
* Initial distribution: $P(X_1)$
* Transitions: $P(X_t \mid X_{t-1})$
* Emissions: $P(E_t \mid X_t)$
* Markov hidden process: the future depends on the past via the present
* The current observation is independent of everything else given the current state

## Mini-Forward Algorithm
* Find $P(X_t)$ on some day t (see the sketch below)
* $P(x_t) = \sum\limits_{x_{t-1}} P(x_t \mid x_{t-1})P(x_{t-1})$
* Can converge as $t \to \infty$
* Stationary distribution: $P_\infty = P_{\infty + 1}$
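A minimal sketch of the mini-forward update above, iterated until it approaches a stationary distribution. The two-state weather transition model is hypothetical.

```python
# Hypothetical two-state transition model P(x_t | x_{t-1}).
P_trans = {
    'sun':  {'sun': 0.9, 'rain': 0.1},
    'rain': {'sun': 0.3, 'rain': 0.7},
}
P = {'sun': 1.0, 'rain': 0.0}    # initial distribution P(X_1)

for t in range(100):
    # Mini-forward update: P(x_t) = sum_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1})
    P = {x: sum(P_trans[prev][x] * P[prev] for prev in P) for x in P_trans}

print(P)   # approaches the stationary distribution {'sun': 0.75, 'rain': 0.25}
```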
# Lecture 20: 11/12/19

* Localization: any single sensor reading could be wrong
    * Compute a probability distribution over where the robot can be
    * $B(X_t) = P(X_t \mid e_{1:t})$

## Inference (Passage of Time)
* Current belief: $B(X_t) = P(X_t \mid e_{1:t})$
* After 1 time step: $P(X_{t+1} \mid e_{1:t})$
    * A prediction of the new time step from the old evidence
* $B'(X_{t+1}) = \sum\limits_{x_t} P(X_{t+1} \mid x_t)B(x_t)$
    * No $e_{t+1}$ yet

## Observation
* We have the predicted belief $B'(X_{t+1}) = \sum\limits_{x_t} P(X_{t+1} \mid x_t)B(x_t)$
* New evidence comes in: $P(X_{t+1} \mid e_{1:t+1}) \propto P(X_{t+1}, e_{t+1} \mid e_{1:t})$
    * $= P(e_{t+1} \mid e_{1:t}, X_{t+1})P(X_{t+1} \mid e_{1:t})$
* Or compactly, $P(X_{t+1} \mid e_{1:t+1}) \propto P(e_{t+1} \mid X_{t+1})B'(X_{t+1})$
* Must renormalize

### Forward Algorithm
* $P(x_t, e_{1:t}) = P(e_t \mid x_t) \sum\limits_{x_{t-1}} P(x_t \mid x_{t-1})P(x_{t-1}, e_{1:t-1})$

## Particle Filtering
* Just use sampling instead of maintaining exact beliefs
* Good if $|X|$ is too big to store in memory (e.g., if X is continuous)
* Track samples of X, not all values
* Time per step is linear in the number of samples
* Generally N particles << $|X|$
* Passage of time: move each particle by sampling $x' \sim P(X' \mid x)$
* Observation:
    * Compute a weight for each particle: how likely it is given the evidence
    * $B(X) \propto P(e\mid X)B'(X)$
* Resample:
    * Resample particles to go back to all weight-1 particles
    * Resample N times from the weighted sample distribution (equivalent to renormalizing)

## Dynamic Bayes Nets (DBNs)
* Track multiple variables over time using multiple sources of evidence
* Idea: repeat a fixed Bayes net structure at each time step

# Lecture 21: 11/14/19

## Classification
* Spam vs. ham prediction for new, future emails
* Features:
    * Text patterns, words
    * Other features unrelated to email content: sender in contacts
* Ex. digit recognition
    * Get a large collection of example images, labeled
    * Want to learn to predict labels of new, future digit images
    * Features: pixels, number of components, aspect ratio, number of loops

## Model-Based Classification
* Naive Bayes: assume all features are independent given the label
* Inference: compute the joint probability distribution
    * Count the probability that each pixel is on for each label
* Bag-of-words model: features = $W_i$ is the word at position i
    * Insensitive to word order or reordering
    * Weight each word to give a total score for spam vs. ham

## Training vs. Test
* Training set, held-out set, test data
* Use test data to get a sense of accuracy in the real world
* Features: attribute-value pairs that characterize each x
* Never peek at the test set
* Evaluation:
    * Accuracy: fraction of instances predicted correctly
    * Don't want false positives
        * Ex. a friend classified as a gorilla
* Overfitting:
    * When you try too hard to maximize accuracy on the training data
    * An odds ratio of infinity blows up

## Parameter Estimation
* Estimating the distribution of a random variable
* Elicitation: ask a human
* Empirically: use training data (learning)

### Maximum Likelihood Estimation
* Data: an observed set D of heads and tails
* Hypothesis space: binomial distributions
* Learning: finding $\theta$ is an optimization problem
* MLE: $\hat{\theta} = \arg\max\limits_{\theta} \ln \theta^{\alpha_H} (1 - \theta)^{\alpha_T}$
    * $= \frac{\alpha_H}{\alpha_H + \alpha_T}$
* Another option is to take the most likely parameter value given the data

## Laplace Smoothing
* Pretend that you saw every outcome once more than you actually did (see the sketch below)
* Avoids the division-by-0 problem
* Extended Laplace estimate: pretend we saw every outcome $k$ extra times
* Tuning on held-out data
    * Learn parameters from the training data
    * Tune hyperparameters on different data
    * For each value of the hyperparameters, train and test on the held-out data
    * Choose the best value, then do a final test on the test data
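A minimal sketch of the (extended) Laplace estimate: pretend every outcome in the domain was seen k extra times before normalizing. The sample data and domain below are a hypothetical toy example.

```python
from collections import Counter

def laplace_estimate(samples, domain, k=1):
    """P_LAP,k(x) = (count(x) + k) / (N + k * |domain|)."""
    counts = Counter(samples)
    n = len(samples)
    return {x: (counts[x] + k) / (n + k * len(domain)) for x in domain}

data = ['r', 'r', 'b']                                   # observed: 2 red, 1 blue
print(laplace_estimate(data, domain=['r', 'b'], k=0))    # MLE: {'r': 0.67, 'b': 0.33}
print(laplace_estimate(data, domain=['r', 'b'], k=1))    # {'r': 0.6, 'b': 0.4}
print(laplace_estimate(data, domain=['r', 'b'], k=100))  # approaches uniform
```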
# Lecture 22: 11/19/19

* Reinforcement learning vs. supervised learning
    * RL finds a policy: a mapping from states to actions

## Perceptron
* Inputs are feature vectors
* Each feature has a weight; the weighted sum is the activation
    * $\text{activation}(x) = w \cdot f(x)$
* Positive activation outputs +1, negative outputs -1
* Weight updates (see the sketch below)
    * Start with weights = 0
    * For each training instance:
        * Classify with the current weights
        * If correct, don't do anything; if wrong, adjust the weight vector: $w = w + y^* \cdot f$
        * i.e., add/subtract the feature vector

## Multi-Class Perceptron
* For multiclass, there is a weight vector for each class
* Score of class y: $w_y \cdot f(x)$
* Prediction: the highest score wins, $y = \arg\max\limits_y w_y \cdot f(x)$
* Algorithm:
    * Start with all weights 0
    * Pick up training examples one by one
    * Predict with the current weights
    * If wrong: $w_y = w_y - f(x)$ for the predicted class $y$, and $w_{y^*} = w_{y^*} + f(x)$ for the true class $y^*$

## Properties
* Separability: true if some parameter setting gets the training set perfectly correct
* Convergence: if the training data are separable, the perceptron will eventually converge (binary case)
* Mistake bound: $\frac{k}{\delta^2}$

## Problems
* If the data are not separable, the weights might thrash
    * An averaged perceptron can help
* Mediocre generalization: it sometimes barely finds a separating solution
* Overtraining/overfitting
* Non-separable case: use a probabilistic decision
    * Very positive score → probability close to 1
    * Sigmoid function: $\phi(z) = \frac{1}{1+e^{-z}}$

## MLE
* $\max\limits_w ll(w) = \max\limits_w \sum\limits_i \log P(y^{(i)} \mid x^{(i)}, w)$
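A minimal sketch of the binary perceptron update described above, run on a hypothetical 2-D toy dataset (features and labels invented for illustration).

```python
# Toy training data: (feature vector, label in {+1, -1}); values are invented.
data = [([1.0, 1.0], +1), ([2.0, 0.5], +1),
        ([-1.0, -0.5], -1), ([-0.5, -2.0], -1)]
w = [0.0, 0.0]                       # start with weights = 0

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

for _ in range(10):                  # a few passes over the training data
    for f, y_star in data:
        y_pred = +1 if dot(w, f) >= 0 else -1
        if y_pred != y_star:         # wrong: w = w + y* . f
            w = [wi + y_star * fi for wi, fi in zip(w, f)]

print(w)                             # weights that separate the toy data
```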
# Lecture 23: 11/21/19

## Gradient Ascent
* Perform an update in the uphill direction for each coordinate
* For $g(w_1, w_2)$:
    * $w_1 \leftarrow w_1 + \alpha \frac{\partial g}{\partial w_1}(w_1,w_2)$
    * In vector form: $w \leftarrow w + \alpha \nabla_w g(w)$
* $g(w) = \sum\limits_i \log P(y^{(i)} \mid x^{(i)}, w)$
* Keep updating until an update changes $w$ by only around $0.1$ to $1\%$
* Can't make $\alpha$ too big
* Mini-batch gradient ascent: compute gradients for a batch of points in parallel
* Stop when the log likelihood of hold-out data begins to decrease

## Neural Networks
* A 2-layer neural network can approximate any continuous function
* The last layer is still logistic regression
    * More layers come before this last layer
* Features are learned rather than hand-designed

# Lecture 24: 12/3/19

## Deep Neural Networks
* Softmax: $P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}$
* Deep neural network training: $\max\limits_w ll(w) = \max\limits_w \sum\limits_i \log P(y^{(i)} \mid x^{(i)}; w)$
    * w is just a much larger vector
* Run gradient ascent; stop when the log likelihood of hold-out data starts to decrease
* Covariate shift: correlations present in the training data are not present in the test data

## Inverse RL
* Usually we try to find an optimal policy for a given reward
* Inverse planning: we see (near-)optimal behavior and try to recover the reward function that explains the demonstrated behavior
* Reward engineering is hard, so instead demonstrate the behavior and extract a reward function
* Aka inverse planning, inverse optimal control
* A more causal model of learning

## IRL Formulation
* Given a demonstration $\xi_D$, find $R(s,a) = \theta^T \phi(s,a)$ such that $R(\xi_D) \geq \max\limits_\xi R(\xi)$
    * But a 0/constant reward ends up being a solution
* Maximum Margin Planning (see the sketch below)
    * Require $R(\xi_D) \geq \max\limits_\xi \left[R(\xi) + l(\xi, \xi_D)\right]$
    * Search for a reward function while giving a bonus to trajectories farther from the demonstration
    * Makes it a little more difficult for the demonstration to be the maximizer
    * $\max\limits_\theta\left[\theta^T \phi(\xi_D) - \max\limits_\xi \left[\theta^T \phi(\xi) + l(\xi, \xi_D)\right]\right]$
    * $\xi_\theta^*$ is the inner arg max
    * Taking the gradient is hard because of the nested max
    * Take a subgradient instead (pick one of the valid directions)
        * $\nabla_\theta = \phi(\xi_D) - \phi(\xi^*_\theta)$
* What if the demonstrator is not optimal?
    * Bayesian view: model $P(\xi_D \mid \theta)$ and run inference on $\theta$
    * Use a softmax: $P(\xi_D \mid \theta) \propto e^{\beta \theta^T \phi(\xi_D)}$
    * Bayesian belief update using the demonstration as evidence
    * Maximum entropy IRL just looks at the maximum likelihood estimate
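To make the nested-max objective concrete, here is a minimal sketch of the maximum margin planning subgradient step. It is a simplification under invented assumptions: trajectories are represented only by hypothetical feature counts, and a small enumerated candidate set stands in for the planner that would normally solve the loss-augmented inner max.

```python
import numpy as np

# Hypothetical feature counts phi(xi) for the demonstration and a few candidates.
phi_demo = np.array([1.0, 0.0, 2.0])
candidates = [np.array([1.0, 1.0, 0.0]),
              np.array([0.0, 2.0, 1.0]),
              phi_demo]

def loss(phi):
    # Structured loss l(xi, xi_D): here just the L1 distance between feature counts.
    return np.abs(phi - phi_demo).sum()

theta = np.zeros(3)
alpha = 0.1
for _ in range(100):
    # Loss-augmented inner max: the candidate maximizing theta^T phi + l(xi, xi_D).
    xi_star = max(candidates, key=lambda phi: theta @ phi + loss(phi))
    # Subgradient step: grad_theta = phi(xi_D) - phi(xi*_theta).
    theta += alpha * (phi_demo - xi_star)

print(theta)   # the demonstration now maximizes the loss-augmented score
```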