CS188 Notes Part 2

Lecture 13: 10/15/19

Uncertainty

  • Observed variables: Agent knows things about the state (sensor readings)
  • Unobserved variables: Agent reasons about other aspects (where an object is)
  • Random variables
    • Note that +r means true, -r means false
  • Probability distributions
  • Marginal distributions: sub-tables that eliminate variables
    • "Summing out" is marginalization over a variable
  • Conditional probabilities
    • $P(a|b) = \frac{P(a,b)}{P(b)}$
    • Select the joint probability entries matching the evidence
    • Then normalize the selection (see the sketch below)
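
A minimal sketch of the select-then-normalize procedure. The joint table over two boolean variables is hypothetical; the numbers are made up for illustration.

```python
# Hypothetical joint distribution P(A, B) over two boolean variables.
joint = {
    (True, True): 0.3, (True, False): 0.1,
    (False, True): 0.2, (False, False): 0.4,
}

def conditional(joint, b_value):
    """P(A | B=b_value): select entries matching the evidence, then normalize."""
    selected = {a: p for (a, b), p in joint.items() if b == b_value}
    total = sum(selected.values())          # this total is P(B=b_value)
    return {a: p / total for a, p in selected.items()}

print(conditional(joint, True))   # {True: 0.6, False: 0.4}
```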

Probabilistic Inference

  • Compute a desired probability from other known probabilities (ex. conditional from joint)
  • Probabilities change with new (given) evidence
  • Inference by enumeration
    • Evidence variables: $E_1, \ldots, E_k$
    • Query variable: $Q$
    • Hidden variables: $H_1, \ldots, H_r$
  • We want $P(Q \mid e_1, \ldots, e_k)$
  • Steps (see the enumeration sketch after this list)
      1. Select entries consistent with the evidence
      2. Sum out the hidden variables H to get the joint of the query and evidence
      3. Normalize (divide by $P(e_1, \ldots, e_k)$ so the result sums to 1)
  • Time: $O(d^n)$ (d = domain size, n = number of variables)
  • Space: $O(d^n)$
  • Sometimes have conditional dist but want joint
  • Product rule: $P(y)P(x|y) = P(x,y)$
  • Chain rule: $P(x_1, x_2, \ldots, x_n) = \prod_i P(x_i \mid x_1, \ldots, x_{i-1})$
  • Bayes' Rule: $P(x|y)P(y) = P(x,y) = P(y|x)P(x)$, so $P(x|y) = \frac{P(y|x)P(x)}{P(y)}$
  • $P(y) = \sum_x P(y|x)P(x)$
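
A minimal sketch of inference by enumeration over a small hypothetical joint table (query Q, evidence E, hidden H, all boolean; the probabilities are made up):

```python
from itertools import product

# Hypothetical joint P(Q, E, H); the eight values below sum to 1.
vals = [0.20, 0.05, 0.10, 0.15, 0.05, 0.25, 0.10, 0.10]
joint = dict(zip(product([True, False], repeat=3), vals))

def infer(joint, e_value):
    """P(Q | E=e_value) by enumeration: select, sum out H, normalize."""
    unnormalized = {}
    for (q, e, h), p in joint.items():
        if e == e_value:                                      # 1. select entries consistent with evidence
            unnormalized[q] = unnormalized.get(q, 0.0) + p    # 2. sum out the hidden variable H
    z = sum(unnormalized.values())
    return {q: p / z for q, p in unnormalized.items()}        # 3. normalize

print(infer(joint, True))
```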

Lecture 14: 10/22/19

Bayes' Nets

  • Models are always simplifications of the world
  • Use probabilistic models to reason about unknown variables
  • Absolute/unconditional independence
    • If $X \perp Y$: $\forall x, y: P(x,y) = P(x)P(y)$
  • Conditional independence
    • $X \perp Y \mid Z$: $\forall x, y, z: P(x,y|z) = P(x|z)P(y|z)$
    • Equivalently, $P(x \mid z, y) = P(x \mid z)$
  • Joint distribution tables are too large to explicitly represent
  • Hard to learn anything empirically about more than a few variables at a time
  • Bayes' Net: technique for describing complex joint distribution (model) using conditional probabilities
    • Graphical model, how variables locally interact

Graphical Model Notation

  • Bayes net = DAG + a conditional probability table (CPT) per node
    • $P(X \mid A_1, \ldots, A_n)$ for each node X with parents $A_1, \ldots, A_n$
  • Nodes: 1 per variable (with domains)
    • Can be assigned (observed) or unassigned (unobserved)
  • Arcs: interactions/direct influence between variables
    • Encode conditional independence
  • Absolute independence = no interactions between variables
  • Implicitly encode joint distributions
    • Product of local conditional distributions
    • $P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \text{parents}(X_i))$ (see the sketch below)
    • Not every BN can represent every joint distribution
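
A small sketch of how a BN implicitly encodes the joint as a product of local conditionals. The two-node net (Rain -> Traffic) and its CPT numbers are hypothetical.

```python
# Hypothetical 2-node net: Rain -> Traffic, both boolean.
P_rain = {True: 0.1, False: 0.9}
P_traffic_given_rain = {            # P(Traffic | Rain)
    True:  {True: 0.8, False: 0.2},
    False: {True: 0.1, False: 0.9},
}

def joint(rain, traffic):
    """P(rain, traffic) = product of local conditionals P(x_i | parents(X_i))."""
    return P_rain[rain] * P_traffic_given_rain[rain][traffic]

print(joint(True, True))   # 0.1 * 0.8 = 0.08
```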

Lecture 15: 10/24/19

Inference by Enumeration in BN

  • Evidence variables: Evidence you have observed
  • Query variables: Probability you want
  • Hidden variables: Not observed
  • $P(B \mid +j, +m) \propto_B P(B, +j, +m)$
    • $ = \sum\limits_{e,a}P(B, e, a, +j, +m)$
    • $ = \sum\limits_{e,a} P(B)P(e)P(a|B,e)P(+j|a)P(+m|a)$
  • Computing the joint distribution is time and memory intensive
  • Track intermediate probability tables called factors
  • Procedure: Join all factors, eliminate hidden variables, normalize
      1. Join: $\forall r, t: P(r,t) = P(r)P(t|r)$
      2. Eliminate/marginalize: sum the hidden variables out
      • This is a projection operation

Variable Elimination

  • Still NP-hard in the worst case, but tends to be much faster than enumeration in practice
    • Inference in general BNs can encode 3-SAT, so no algorithm is efficient on every network
  • Inference by enumeration: $\sum_t \sum_r P(L|t) P(r) P(t|r)$
  • Variable elimination: $\sum_t P(L|t) \sum_r P(r) P(t|r)$ (see the sketch after this list)
    • Interleave joins and elimination
  • Eliminate all hidden variables
  • Normalize at the end
  • Move summations as far to the right (inward) as possible
  • Elimination order matters
    • An ordering that keeps all factors small doesn't always exist
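
A minimal sketch of variable elimination on the chain R -> T -> L used above; the CPT numbers are made up. R is eliminated first, then T.

```python
# Hypothetical chain R -> T -> L with made-up CPTs (boolean variables).
P_R = {True: 0.1, False: 0.9}
P_T_given_R = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
P_L_given_T = {True: {True: 0.3, False: 0.7}, False: {True: 0.1, False: 0.9}}

# Eliminate R: f1(T) = sum_r P(r) P(T | r)
f1 = {t: sum(P_R[r] * P_T_given_R[r][t] for r in (True, False)) for t in (True, False)}

# Eliminate T: f2(L) = sum_t P(L | t) f1(t)
f2 = {l: sum(P_L_given_T[t][l] * f1[t] for t in (True, False)) for l in (True, False)}

# With no evidence f2 is already P(L); normalize at the end in general.
z = sum(f2.values())
print({l: p / z for l, p in f2.items()})
```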

Lecture 16: 10/29/19

  • Independence in a BN
    • Yes: Prove using algebra
    • No: Give a counterexample
  • Typically look for independence relations from the graph structure (the numbers could just work out that way)

Independence Properties

  • Study independence properties for triples
    • For 1 variable: a variable is never independent of itself
    • For 2 variables: just check whether there is an edge
    • For 3+ variables: can be reduced to the canonical triples below
    1. Causal chain (X -> Y -> Z)
    • $P(x,y,z) = P(x)P(y|x)P(z|y)$
    • $X \perp Z$ not guaranteed
    • $X \perp Z \mid Y$ guaranteed
    2. Common cause (X <- Y -> Z)
    • $P(x,y,z) = P(y)P(x|y)P(z|y)$
    • $X \perp Z$ not guaranteed
    • $X \perp Z \mid Y$ guaranteed
    3. Common effect (X -> Y <- Z)
    • $P(x,y,z) = P(x)P(y)P(z|x,y)$
    • $X \perp Z$ guaranteed
    • $X \perp Z \mid Y$ NOT guaranteed
    • Observing an effect activates influence between causes

D-Separation

  • Reachability is not enough (fails for common effects)
  • $X \perp Y \mid Z$ if X and Y are d-separated by Z
  • Consider all undirected paths from X to Y
  • No active paths = independent
  • A path is active if each triple along it is active (see the sketch after this list)
    • Causal chain: A -> B -> C where B is unobserved
    • Common cause: A <- B -> C where B is unobserved
    • Common effect: A -> B <- C where B or one of its descendants is OBSERVED
  • A single inactive segment can block a path
  • If 1 or more path is active, independence not guaranteed
  • If all paths are inactive, independence guaranteed
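
A tiny sketch of the per-triple activity rules above; the `kind` names are my own labels, and a full d-separation check would combine this over every undirected path from X to Y.

```python
def triple_active(kind, middle_observed, descendant_observed=False):
    """Is a single triple along an undirected path active?"""
    if kind == 'causal_chain':    # A -> B -> C: active iff B is unobserved
        return not middle_observed
    if kind == 'common_cause':    # A <- B -> C: active iff B is unobserved
        return not middle_observed
    if kind == 'common_effect':   # A -> B <- C: active iff B or a descendant is observed
        return middle_observed or descendant_observed
    raise ValueError(kind)

# A path is active only if every triple along it is active;
# X ⊥ Y | Z is guaranteed iff no undirected path from X to Y is active.
print(triple_active('common_effect', middle_observed=False))   # False: blocked
```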

Topology Limits Distributions

  • Adding arcs only grows the set of representable distributions; a fully connected net can encode any distribution

Lecture 17: 10/31/19

Sampling

  • Repeated simulations
  • Draw N samples from a sampling distribution S
  • Compute an approximate posterior probability which converges to true probability P
    1. Get a sample u from the uniform distribution over $[0, 1)$
    2. Divide $[0, 1)$ into subintervals whose sizes equal the outcome probabilities; return the outcome whose subinterval contains u (see the sketch below)
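
A minimal sketch of sampling from a discrete distribution by dividing $[0,1)$ into subintervals; the example distribution is made up.

```python
import random

def sample(dist):
    """Sample from a discrete distribution {outcome: probability}."""
    u = random.random()            # u ~ Uniform[0, 1)
    cumulative = 0.0
    for outcome, p in dist.items():
        cumulative += p            # each outcome owns a subinterval of length p
        if u < cumulative:
            return outcome
    return outcome                 # guard against floating-point round-off

print(sample({'sun': 0.6, 'rain': 0.1, 'fog': 0.3}))
```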

Prior Sampling

    1. Run through the nodes in topological order
    • Sample $x_i$ from $P(X_i \mid \text{Parents}(X_i))$
    2. Return $(x_1, x_2, \ldots, x_n)$
  • Sampling procedure is consistent
    • If we do procedure long enough, it converges
  • Monte Carlo Estimation: Estimate expected value of f(X)
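
A sketch of prior sampling and a Monte Carlo estimate on the hypothetical Rain -> Traffic net from earlier; names and numbers are illustrative only.

```python
import random

# Hypothetical net: Rain -> Traffic (topological order: Rain, then Traffic).
P_rain = {True: 0.1, False: 0.9}
P_traffic_given_rain = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}

def bernoulli(p_true):
    """Sample a boolean that is True with probability p_true."""
    return random.random() < p_true

def prior_sample():
    rain = bernoulli(P_rain[True])                          # sample from P(Rain)
    traffic = bernoulli(P_traffic_given_rain[rain][True])   # sample from P(Traffic | rain)
    return rain, traffic

# Monte Carlo estimate of P(Traffic=true) from N samples.
N = 10000
samples = [prior_sample() for _ in range(N)]
print(sum(t for _, t in samples) / N)
```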

Rejection Sampling

  • Only consider samples that match evidence
  • Input: evidence instantiation
    1. Run through the nodes in topological order
    • Sample $x_i$ from $P(X_i \mid \text{Parents}(X_i))$
    2. If $x_i$ is not consistent with the evidence, reject the sample
    3. Return $(x_1, x_2, \ldots, x_n)$
  • Sampling procedure is consistent
  • Problem: If evidence is unlikely, reject lots of samples
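
A minimal rejection-sampling sketch on the same hypothetical Rain -> Traffic net, estimating P(Rain | Traffic=true):

```python
import random

# Same hypothetical Rain -> Traffic net as in the prior-sampling sketch.
P_rain = {True: 0.1, False: 0.9}
P_traffic_given_rain = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}

def rejection_sample(evidence_traffic, n):
    """Estimate P(Rain | Traffic=evidence_traffic), rejecting inconsistent samples."""
    kept = []
    for _ in range(n):
        rain = random.random() < P_rain[True]
        traffic = random.random() < P_traffic_given_rain[rain][True]
        if traffic != evidence_traffic:
            continue                     # sample inconsistent with evidence: reject
        kept.append(rain)
    return sum(kept) / len(kept) if kept else None

print(rejection_sample(True, 10000))     # ~ P(+rain | +traffic) ≈ 0.47 for these numbers
```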

Likelihood Weighting

  • Fix evidence variables, sample the rest
  • Problem: Sample distribution is not consistent
  • Solution: Weight by probability of evidence given parents
  • Algorithm
      1. Initialize the weight w = 1
      2. Run through the nodes in topological order
      • If $X_i$ is an evidence variable: set $X_i$ to the observed value $x_i$ and set $w = w \cdot P(x_i \mid \text{Parents}(X_i))$
      • Else: sample $x_i$ from $P(X_i \mid \text{Parents}(X_i))$
      3. Return $(x_1, x_2, \ldots, x_n)$, w
  • Problem: Evidence influences choice of downstream variables, but not upstream variables
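
A minimal likelihood-weighting sketch on the same hypothetical Rain -> Traffic net, with Traffic=true as evidence:

```python
import random

# Same hypothetical Rain -> Traffic net; evidence is Traffic = true.
P_rain = {True: 0.1, False: 0.9}
P_traffic_given_rain = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}

def weighted_sample(evidence_traffic):
    """Fix evidence variables, sample the rest, and weight by P(evidence | parents)."""
    w = 1.0
    rain = random.random() < P_rain[True]            # Rain is not evidence: sample it
    traffic = evidence_traffic                       # Traffic is evidence: fix it ...
    w *= P_traffic_given_rain[rain][traffic]         # ... and multiply in its probability
    return (rain, traffic), w

# Weighted estimate of P(Rain=true | Traffic=true).
samples = [weighted_sample(True) for _ in range(10000)]
total = sum(w for _, w in samples)
print(sum(w for (rain, _), w in samples if rain) / total)   # ≈ 0.47 for these numbers
```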

Gibbs Sampling

    1. Fix the evidence variables
    2. Start with a random instantiation of all other variables
    3. Repeatedly resample one variable X from P(X | all other variables)
    • Sample 1 variable at a time, conditioned on the rest
    • Idea: the sample is always a full, consistent instantiation (no unassigned variables)
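
A small Gibbs-sampling sketch on a hypothetical common-effect net (Rain -> Traffic <- Ballgame) with Traffic=true as evidence; all names and numbers are made up.

```python
import random

# Hypothetical common-effect net: Rain -> Traffic <- Ballgame, all booleans.
P_rain = 0.1
P_ball = 0.3
P_traffic = {  # P(Traffic=true | Rain, Ballgame)
    (True, True): 0.95, (True, False): 0.8,
    (False, True): 0.6, (False, False): 0.1,
}

def resample_given_rest(prior, other, evidence_traffic, is_rain):
    """Resample one hidden variable from P(X | all other variables)."""
    weights = {}
    for value in (True, False):
        rain, ball = (value, other) if is_rain else (other, value)
        p_t = P_traffic[(rain, ball)]
        p_t = p_t if evidence_traffic else 1 - p_t
        p_x = prior if value else 1 - prior
        weights[value] = p_x * p_t                 # P(x) * P(traffic | rain, ball)
    z = weights[True] + weights[False]
    return random.random() < weights[True] / z

# Fix evidence (Traffic=true), start from a random instantiation,
# then resample one hidden variable at a time conditioned on the rest.
rain, ball = random.random() < 0.5, random.random() < 0.5
rain_count, steps = 0, 20000
for _ in range(steps):
    rain = resample_given_rest(P_rain, ball, True, is_rain=True)
    ball = resample_given_rest(P_ball, rain, True, is_rain=False)
    rain_count += rain
print(rain_count / steps)    # ≈ P(Rain=true | Traffic=true)
```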

Lecture 18: 11/5/19

Decision Networks

  • Maximize Expected Utility (MEU)
  • Bayes net with nodes for utility
  • Calculate expected utility for each action
  • Chance nodes, action nodes, utility nodes
  • Procedure
      1. Instantiate all evidence
      2. Set the action node(s) each possible way
      3. Calculate the posterior for all parents of the utility node
      4. Calculate the expected utility for each action
      5. Choose the maximizing action
  • Value of information = expected gain in MEU with new info
  • When you get a particular piece of evidence, your MEU given that evidence might be lower than before
    • But now you can make the correct decision knowing that life sucks

Value of Perfect Information

  • Example with a forecast F: $VPI(F) = P(F=\text{bad})\,MEU(F=\text{bad}) + P(F=\text{good})\,MEU(F=\text{good}) - MEU(\varnothing)$
  • $VPI(E' \mid e) = \left(\sum_{e'} P(e'|e)\,MEU(e, e')\right) - MEU(e)$ (see the sketch after this list)
  • $MEU(e) = \max_a \sum_s P(s|e)\,U(s,a)$
  • Properties
    • VPI is always nonnegative
    • VPI is nonadditive
      • $VPI(E_j, E_k \mid e) \neq VPI(E_j \mid e) + VPI(E_k \mid e)$ in general
    • VPI is order independent
      • $VPI(E_j, E_k \mid e) = VPI(E_j \mid e) + VPI(E_k \mid e, e_j) = VPI(E_k \mid e) + VPI(E_j \mid e, e_k)$
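
A minimal sketch of computing VPI for a hypothetical forecast/umbrella decision; the priors, forecast accuracies, and utilities are all made up for illustration.

```python
# Hypothetical umbrella decision: utility depends on the action and the weather.
P_weather = {'sun': 0.7, 'rain': 0.3}
P_forecast_given_weather = {'sun': {'good': 0.8, 'bad': 0.2},
                            'rain': {'good': 0.3, 'bad': 0.7}}
U = {('leave', 'sun'): 100, ('leave', 'rain'): 0,
     ('take', 'sun'): 20,  ('take', 'rain'): 70}

def meu(p_weather):
    """MEU = max over actions of the expected utility under belief p_weather."""
    return max(sum(p_weather[w] * U[(a, w)] for w in p_weather)
               for a in ('leave', 'take'))

# Posterior over weather for each forecast value (Bayes' rule), and P(forecast).
p_forecast = {f: sum(P_weather[w] * P_forecast_given_weather[w][f] for w in P_weather)
              for f in ('good', 'bad')}
posterior = {f: {w: P_weather[w] * P_forecast_given_weather[w][f] / p_forecast[f]
                 for w in P_weather} for f in ('good', 'bad')}

# VPI(Forecast) = sum_f P(f) * MEU(f)  -  MEU(no evidence); always >= 0.
vpi = sum(p_forecast[f] * meu(posterior[f]) for f in p_forecast) - meu(P_weather)
print(vpi)   # 3.5 for these made-up numbers
```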

Lecture 19: 11/7/19

POMDPs

  • Observation function: $O(s, o) = P(o \mid s)$
  • POMDPs are MDPs over belief states b
  • One way to solve: Use truncated expectimax for approximate values of actions
  • VPI based agent

Hidden Markov Models

  • Reasoning about a sequence of observations over time
  • Value of X at a given time is called the state
  • Parameters: transition probabilities/dynamics show how state evolves over time
  • Stationary assumption: transition probabilities the same at all times
  • Initial distribution: $P(X_1)$
  • Transitions: $P(X_t \mid X_{t-1})$
  • Emissions: $P(E_t \mid X_t)$
  • Hidden Markov process: the future depends on the past only via the present
  • Current observation is independent of everything else given current state

Mini-Forward Algorithm

  • Find $P(X_t)$ on some day t
  • $P(x_t) = \sum_{x_{t-1}} P(x_t \mid x_{t-1}) P(x_{t-1})$
  • Can converge as $t \to \infty$
  • Stationary distribution: $P_\infty = P_{\infty+1}$, i.e. $P_\infty(x) = \sum_{x'} P(x \mid x') P_\infty(x')$ (see the sketch below)
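
A minimal mini-forward sketch on a hypothetical two-state weather chain; repeated updates approach the stationary distribution.

```python
# Hypothetical weather chain with transition probabilities P(X_t | X_{t-1}).
P_trans = {'sun': {'sun': 0.9, 'rain': 0.1},
           'rain': {'sun': 0.3, 'rain': 0.7}}
P = {'sun': 1.0, 'rain': 0.0}          # initial distribution P(X_1)

# Mini-forward update: P(x_t) = sum_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1}).
for _ in range(50):
    P = {x: sum(P_trans[prev][x] * P[prev] for prev in P) for x in P}

print(P)   # converges toward the stationary distribution (0.75 sun / 0.25 rain here)
```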

Lecture 20: 11/12/19

  • Localization: any single sensor reading can be wrong, so compute a probability distribution over where the robot can be
  • $B(X_t) = P(X_t \mid e_{1:t})$

Inference (Passage of time)

  • Current belief: $B(X_t) = P(X_t \mid e_{1:t})$
  • After 1 time step: $P(X_{t+1} \mid e_{1:t})$
    • Prediction of the new time step from the old evidence
    • $B'(X_{t+1}) = \sum_{x_t} P(X_{t+1} \mid x_t) B(x_t)$
    • No $e_{t+1}$ yet

Observation

  • We have the predicted belief $B'(X_{t+1}) = \sum_{x_t} P(X_{t+1} \mid x_t) B(x_t)$
  • New evidence comes in: $P(X_{t+1} \mid e_{1:t+1}) \propto P(X_{t+1}, e_{t+1} \mid e_{1:t})$
    • $= P(e_{t+1} \mid e_{1:t}, X_{t+1}) P(X_{t+1} \mid e_{1:t})$
    • Or compactly, $B(X_{t+1}) \propto P(e_{t+1} \mid X_{t+1}) B'(X_{t+1})$
    • Must renormalize

Forward Algorithm

  • $P(x_t \mid e_{1:t}) \propto P(x_t, e_{1:t}) = P(e_t \mid x_t) \sum_{x_{t-1}} P(x_t \mid x_{t-1}) P(x_{t-1}, e_{1:t-1})$ (see the sketch below)
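
A minimal forward-algorithm sketch on a hypothetical umbrella HMM; transition and emission numbers are made up.

```python
# Hypothetical HMM: weather states with transitions and umbrella observations.
P_trans = {'sun': {'sun': 0.9, 'rain': 0.1}, 'rain': {'sun': 0.3, 'rain': 0.7}}
P_emit = {'sun': {'umbrella': 0.2, 'none': 0.8}, 'rain': {'umbrella': 0.9, 'none': 0.1}}
belief = {'sun': 0.5, 'rain': 0.5}     # B(X_1) before any evidence

def forward(belief, evidence):
    """One forward-algorithm step: pass time, weight by evidence, renormalize."""
    # Passage of time: B'(X_{t+1}) = sum_{x_t} P(X_{t+1} | x_t) B(x_t)
    predicted = {x: sum(P_trans[prev][x] * belief[prev] for prev in belief) for x in belief}
    # Observation: B(X_{t+1}) ∝ P(e_{t+1} | X_{t+1}) B'(X_{t+1})
    unnormalized = {x: P_emit[x][evidence] * predicted[x] for x in predicted}
    z = sum(unnormalized.values())
    return {x: p / z for x, p in unnormalized.items()}

for e in ['umbrella', 'umbrella', 'none']:
    belief = forward(belief, e)
print(belief)
```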

Particle Filtering

  • Just use sampling instead of dealing with exact beliefs
  • Good if $|X|$ is too big to store in memory (e.g. if X is continuous)
  • Track samples of X, not all values
  • Time per step is linear in the number of samples
  • Generally the number of particles N << $|X|$
  • $x' = \text{sample}(P(X' \mid x))$
    • Captures the passage of time
  • Observation:
    • Compute a weight for each particle: how likely the evidence is given that particle's state
    • $B(X) \propto P(e \mid X) B'(X)$
  • Resample:
    • Resample particles to go back to all weight 1 particles
    • Resample N times from weighted sample distribution (equivalent to renormalizing)
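
A minimal particle-filtering sketch (move, weight, resample) on the same hypothetical umbrella HMM:

```python
import random

# Same hypothetical HMM as in the forward-algorithm sketch.
P_trans = {'sun': {'sun': 0.9, 'rain': 0.1}, 'rain': {'sun': 0.3, 'rain': 0.7}}
P_emit = {'sun': {'umbrella': 0.2, 'none': 0.8}, 'rain': {'umbrella': 0.9, 'none': 0.1}}

N = 1000
particles = [random.choice(['sun', 'rain']) for _ in range(N)]

def step(particles, evidence):
    # 1. Passage of time: move each particle by sampling x' ~ P(X' | x).
    moved = [random.choices(list(P_trans[x]), weights=P_trans[x].values())[0]
             for x in particles]
    # 2. Observation: weight each particle by how likely the evidence is in its state.
    weights = [P_emit[x][evidence] for x in moved]
    # 3. Resample: draw N new unweighted particles from the weighted distribution.
    return random.choices(moved, weights=weights, k=N)

for e in ['umbrella', 'umbrella', 'none']:
    particles = step(particles, e)
print(particles.count('rain') / N)   # approximate B(rain) after the evidence
```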

Dynamic Bayes Nets (DBNs)

  • Track multiple variables over time using multiple evidence
  • Idea: Repeat a fixed Bayes net structure at each time

Lecture 21: 11/14/19

Classification

  • Spam vs ham prediction of new future emails
  • Features:
    • Text Patterns, words
    • Other features unrelated to email content: Sender in contacts
  • Ex. Digit recognition
    • Get a large collection of example images, labeled
    • Want to learn to predict labels of new, future digit images
    • Features: pixels, number of components, aspect ratio, number of loops

Model based classification

  • Naive Bayes: assume all features are independent given the label
  • Inference: compute the joint probability distribution $P(Y, F_1, \ldots, F_n) = P(Y) \prod_i P(F_i \mid Y)$
  • Count the probability that each pixel is on for that label
  • Bag of words model: features are $W_i$ = the word at position i
    • Insensitive to word order or reordering
    • Weight each word along the way to give a total score for spam vs ham
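
A minimal bag-of-words Naive Bayes sketch. The word counts, priors, and example email are made up, and the word probabilities use the add-k (Laplace) smoothing discussed later in these notes.

```python
import math

# Hypothetical word counts from labeled training emails.
counts = {
    'spam': {'free': 20, 'money': 15, 'meeting': 2},
    'ham':  {'free': 3,  'money': 4,  'meeting': 30},
}
priors = {'spam': 0.4, 'ham': 0.6}

def log_posterior(words, label, k=1):
    """log P(label) + sum_i log P(word_i | label), with Laplace smoothing k."""
    total = sum(counts[label].values())
    vocab = {w for c in counts.values() for w in c}
    score = math.log(priors[label])
    for w in words:
        score += math.log((counts[label].get(w, 0) + k) / (total + k * len(vocab)))
    return score

email = ['free', 'money', 'money']
print(max(priors, key=lambda y: log_posterior(email, y)))   # predicted label
```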

Training vs Test

  • Training set, Held out set, test data
  • Use test data to get a sense of accuracy in the real world
  • Features: attribute-value pairs to characterize each x
  • Never peek at the test set
  • Evaluation:
    • Accuracy: fraction of instances predicted correctly
      • Accuracy alone can hide costly errors; some false positives matter more than others (e.g. a friend classified as a gorilla)
    • Overfitting:
      • Happens when you maximize accuracy on the training data itself
  • Unsmoothed estimates can give an odds ratio of infinity, which blows up predictions

Parameter Estimation

  • Estimating the distribution of a RV
  • Elicitation: ask a human
  • Empirically: use training data (learning)

Maximum Likelihood Estimation

  • Data: Observed set D of heads and tails
  • Hypothesis space: binomial distributions
  • Learning: finding $\theta$ is an optimization problem
  • MLE: $\hat{\theta} = \arg\max_\theta \ln \theta^{\alpha_H}(1-\theta)^{\alpha_T}$
    • $ = \frac{\alpha_H}{\alpha_H + \alpha_T}$
  • Another option is to consider the most likely parameter value given the data

Laplace Smoothing

  • Pretend that you saw every outcome once more than you actually did
  • Avoids the division by 0 problem
  • Extended Laplace estimate: pretend we saw every outcome k extra times (see the sketch after this list)
  • Tuning on held out data
    • Learn parameters from training data
    • Tune hyperparameters on separate (held-out) data
    • For each value of hyperparameters, train and test on held out data
    • Choose the best value, do a final test on test data
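
A minimal sketch of the Laplace (add-k) estimate described above; the coin-flip counts are made up.

```python
def laplace_estimate(counts, k=1):
    """Pretend each outcome was seen k extra times, then normalize."""
    total = sum(counts.values()) + k * len(counts)
    return {x: (c + k) / total for x, c in counts.items()}

# e.g. 3 heads and 0 tails observed: unsmoothed gives P(tails)=0, smoothed does not.
print(laplace_estimate({'heads': 3, 'tails': 0}, k=1))   # {'heads': 0.8, 'tails': 0.2}
```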

Lecture 22: 11/19/19

  • Reinforcement Learning vs Supervised Learning
    • RL finds policy - mapping from states to actions

Perceptron

  • Inputs are feature vectors
  • Each feature has a weight, sum is the activation
  • $\text{activation}(x) = w \cdot f(x)$
    • Positive activation outputs +1, negative outputs -1
  • Weight updates
    • Start with weights = 0
  • For each training instance:
    • Classify current weights
    • If correct, don't do anything. If wrong, adjust the weight vector ($w = w + y \cdot f(x)$)
    • Add/subtract the feature vector

Multi-Class Perceptron

  • For multiclass, there is a weight vector for each class
  • Score of class y: $w_y \cdot f(x)$
  • Prediction: highest score wins, $y = \arg\max_y w_y \cdot f(x)$
  • Algorithm:
    • Start with weights all 0
    • Pick up training examples 1 by 1
    • Predict with current weights
    • If wrong: $w_{y_{\text{wrong}}} = w_{y_{\text{wrong}}} - f(x)$, $w_{y^*} = w_{y^*} + f(x)$ (lower the wrong answer, raise the right answer; see the sketch below)
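
A minimal multi-class perceptron sketch; the toy emails, labels, and bag-of-words features below are made up for illustration.

```python
def train(examples, classes, feature_fn, passes=10):
    weights = {y: {} for y in classes}                 # one weight vector per class
    for _ in range(passes):
        for x, y_true in examples:
            f = feature_fn(x)
            # Predict: highest score w_y . f(x) wins.
            y_pred = max(classes, key=lambda y: sum(weights[y].get(k, 0.0) * v
                                                    for k, v in f.items()))
            if y_pred != y_true:                       # on a mistake, adjust both vectors
                for k, v in f.items():
                    weights[y_true][k] = weights[y_true].get(k, 0.0) + v   # raise the right class
                    weights[y_pred][k] = weights[y_pred].get(k, 0.0) - v   # lower the wrong guess
    return weights

# Tiny usage example with bag-of-words features.
data = [("win money now", "spam"), ("meeting at noon", "ham"), ("free money", "spam")]
bow = lambda text: {w: 1.0 for w in text.split()}
w = train(data, ["ham", "spam"], bow)
print(max(["ham", "spam"],
          key=lambda y: sum(w[y].get(k, 0.0) * v for k, v in bow("free money win").items())))
```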

Properties

  • Separability: true if some parameters get the training set perfectly correct
  • Convergence: if the training is separable, perceptron will eventually converge (binary case)
  • Mistake bound: $\frac{k}{\delta^2}$ (k = feature dimension, $\delta$ = margin of separability)

Problems

  • If data is not separable, weights might thrash
  • Averaged perceptron can help
  • Mediocre generalization: sometimes finds a solution that only barely separates the data
  • Overtraining/overfitting
  • Non-separable case: use probabilistic decision
    • Very positive score, probability close to 1
    • Sigmoid function: $\phi(z) = \frac{1}{1 + e^{-z}}$

MLE

  • $\max_w ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}, w)$

Lecture 23: 11/21/19

Gradient Ascent

  • Perform update in uphill direction for each coordinate
  • For $g(w_1, w_2)$
    • $w_1 \leftarrow w_1 + \alpha \frac{\partial g}{\partial w_1}(w_1, w_2)$
    • In vector form: $w \leftarrow w + \alpha \nabla_w g(w)$
    • $g(w) = \sum_i \log P(y^{(i)} \mid x^{(i)}, w)$
    • Keep doing this until the update changes $w$ by only a very small amount
  • Can't make $\alpha$ too big
  • Mini-batch gradient ascent: compute gradients for points in parallel
  • Stop when log likelihood of hold-out data begins to decrease
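
A minimal gradient-ascent sketch on the log likelihood of a logistic-regression model; the tiny one-feature dataset and step size are made up for illustration.

```python
import math

# (feature x, label y in {+1, -1}); made-up, linearly separable data.
data = [(0.5, 1), (2.0, 1), (-1.0, -1), (-0.3, -1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, alpha = 0.0, 0.1
for _ in range(1000):
    # d/dw log P(y | x, w) = y * x * (1 - sigmoid(y * w * x)) for the logistic model
    grad = sum(y * x * (1.0 - sigmoid(y * w * x)) for x, y in data)
    w += alpha * grad                                 # step uphill on the log likelihood
print(w, sigmoid(w * 2.0))                            # P(y=+1 | x=2.0) should be near 1
```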

Neural Networks

  • 2 layer neural network can approximate any continuous function
  • Last layer is still logistic regression
  • More layers are added before this last layer
  • Features are learned rather than hand-designed

Lecture 24: 12/3/19

Deep Neural Networks

  • Softmax: $P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}$
  • Deep neural network: $\max_w ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$
    • w is just a much larger vector
    • Run gradient ascent, stop when log likelihood of hold out data starts to decrease
  • Covariate shift: correlations present in train data not in test data

Inverse RL

  • Usually we try to find an optimal policy
  • Inverse planning: we see optimal behavior and try to recover the reward function that explains the demonstrated behavior
  • Reward engineering is hard, try demonstrating behavior and extract reward function
  • Aka inverse planning, inverse optimal control
  • More causal model of learning

IRL Formulation

  • Given demonstrations $\xi_D$, find $R(s,a) = \theta^T \phi(s,a)$ such that $R(\xi_D) \geq \max_\xi R(\xi)$
    • But a 0/constant reward ends up being a (trivial) solution
  • Maximum Margin Planning
    • $R(\xi_D) \geq \max_\xi \left[ R(\xi) + l(\xi, \xi_D) \right]$
    • The loss term gives a bonus to trajectories farther from the demonstration
    • This makes it harder for the demonstration to win, so the reward must prefer it by a margin
  • $\max_\theta \left[ \theta^T \phi(\xi_D) - \max_\xi \left[ \theta^T \phi(\xi) + l(\xi, \xi_D) \right] \right]$
    • $\xi_\theta$ is the arg max of the inner maximization
    • Taking the gradient is hard because there's a max nested inside
    • Take a subgradient instead (one of the valid directions where the max is not differentiable)
      • $\nabla_\theta = \phi(\xi_D) - \phi(\xi_\theta)$
  • What if the demonstrator is not optimal?
  • Bayesian view: model $P(\xi_D \mid \theta)$ and run inference on $\theta$
    • Use a softmax: $P(\xi_D \mid \theta) \propto e^{\beta \theta^T \phi(\xi_D)}$
    • Bayesian belief update using demonstration as evidence
    • Maximum Entropy IRL just uses the maximum likelihood estimate of $\theta$