
CS467 Cheatsheet

Gradient Descent

  • Gradient: $\nabla_w f(w) = \left[\frac{\partial f}{\partial w_0}, \ldots, \frac{\partial f}{\partial w_d}\right] \in \mathbb{R}^d$, where $w \in \mathbb{R}^d$
  • gradient is the direction of steepest ascent
  • negative gradient is the direction of steepest descent
  • update rule: $w \leftarrow w - \eta\, \nabla_w L(w)$, where $L(w)$ is the loss function (see the sketch below)
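
A minimal sketch of the update rule in Python (the objective, step size, and iteration count below are illustrative choices, not from the course):

```python
import numpy as np

def gradient_descent(grad_L, w0, eta=0.1, n_steps=100):
    """Repeatedly step against the gradient: w <- w - eta * grad L(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_L(w)
    return w

# Example: L(w) = ||w||^2 has gradient 2w, so the minimizer is w = 0.
w_star = gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0])
print(w_star)   # close to [0, 0]
```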

Convexity

  • convex function $f$: the line segment connecting $(x_1, f(x_1))$ and $(x_2, f(x_2))$ must lie on or above the function
  • If $f''(x) \geq 0$ and exists everywhere, then $f$ is convex (see the numerical check below)
  • If $f$ is convex, then $g(x) = f(Ax + b)$ is convex
  • If $f(x)$ and $g(x)$ are convex, so is $f(x) + g(x)$
  • For a convex function, any local minimum is a global minimum
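
A quick numerical check of the second-derivative rule, applied to the logistic loss $-\log\sigma(z) = \log(1 + e^{-z})$ that appears later in this sheet (the finite-difference helper is only an illustration):

```python
import numpy as np

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

# -log(sigma(z)) = log(1 + e^{-z}) has second derivative sigma(z)(1 - sigma(z)) >= 0,
# so the check below should pass at every sampled point.
f = lambda z: np.log1p(np.exp(-z))
assert all(second_derivative(f, z) >= 0 for z in np.linspace(-10, 10, 101))
```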

Maximum Likelihood Estimation (MLE)

  • posit a probabilistic process that generated our data
  • find parameters $w$ that make the observed data seem most likely
    • $\max_w P(D; w)$, where the data $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^n$
    • view $w$ as being constant-valued but unknown
  • Recipe: take the negative log-likelihood of the data as the loss function, then use gradient descent to find the parameters $w$ that minimize the loss (see the sketch below)
    • e.g., for discriminative models, $L(w) = -\sum_{i=1}^n \log P(y^{(i)} \mid x^{(i)}; w)$
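
A toy instance of the recipe (the coin-flip data and Bernoulli model are illustrative assumptions, not from the course): estimate the bias $p$ of a coin by running gradient descent on the negative log-likelihood; the result should match the known closed-form MLE, the sample mean.

```python
import numpy as np

flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # observed data D

def nll_grad(p):
    """Gradient of L(p) = -sum_i [ y_i log p + (1 - y_i) log(1 - p) ]."""
    return -(flips.sum() / p - (1 - flips).sum() / (1 - p))

p = 0.5                       # initial guess
for _ in range(500):          # gradient descent on the negative log-likelihood
    p -= 0.01 * nll_grad(p)

print(p, flips.mean())        # both approach the MLE, here 0.75
```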

Maximum A Posteriori Estimation (MAP)

  • assume a prior $P(w)$, which models our prior belief or preference about $w$
    • view $w$ as a random variable
  • $\max_w P(w \mid D)$: find the most probable parameters $w$ after seeing the data $D$
    • $= \max_w P(D \mid w)\, P(w)$, since by Bayes' rule $P(w \mid D) \propto P(D \mid w)\, P(w)$ and $P(D)$ does not depend on $w$
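
As a worked example (the Gaussian prior below is an assumption for illustration; the cheatsheet does not fix a particular prior), taking $P(w) \propto e^{-\|w\|^2 / (2\sigma^2)}$ and moving to negative logs turns MAP into MLE plus an L2 penalty:

$$
\arg\max_w P(D \mid w)\, P(w)
\;=\; \arg\min_w \Big[\, -\log P(D \mid w) + \tfrac{1}{2\sigma^2}\, \|w\|^2 \,\Big]
$$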

Linear/Logistic/Softmax Regression

  • Linear regression for regression tasks (predict scalars)
    • has a closed-form solution, aka the normal equation (see the sketch after this list):
      • set $\nabla_w L(w)$ to $0$
      • $w = (X^\top X)^{-1} X^\top y$, where $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$
  • Logistic regression for binary classification tasks
  • Softmax regression for multiclass classification tasks
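
A minimal sketch of the normal equation on synthetic data (the random data and shapes are illustrative; solving the linear system is preferred over forming an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # X in R^{n x d}
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)   # y in R^n

# Normal equation: w = (X^T X)^{-1} X^T y, computed by solving (X^T X) w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # close to true_w
```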

Logistic vs. Softmax

  • logistic function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
    • range: $(0, 1)$
  • softmax function: $s(z_i) = \frac{e^{z_i}}{\sum_{k=1}^C e^{z_k}}$
    • $\sum_{i=1}^C s(z_i) = 1$
  • when there are only two classes (binary classification), softmax regression is equivalent to logistic regression (see the check after this list)
    • $\frac{e^{z_1}}{e^{z_0} + e^{z_1}} = \frac{1}{e^{z_0 - z_1} + 1} = \frac{1}{1 + e^{-z}}$, where $z = z_1 - z_0$
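
A small numerical check of the two-class equivalence (the scores $z_0, z_1$ are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))     # subtract max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z0, z1 = 0.3, 1.7
# softmax probability of class 1 equals sigmoid of z = z1 - z0
print(softmax(np.array([z0, z1]))[1], sigmoid(z1 - z0))   # both ~0.802
```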

Logistic Regression

  • $P(y = 1 \mid x) = \sigma(w^\top x)$
  • $P(y = -1 \mid x) = 1 - P(y = 1 \mid x) = 1 - \frac{1}{1 + e^{-w^\top x}} = \sigma(-w^\top x)$
  • Thus, for labels $y \in \{-1, +1\}$, $P(y \mid x)$ can be written as $\sigma(y\, w^\top x)$
  • With MLE, the loss is $L(w) = -\sum_{i=1}^n \log \sigma(y^{(i)} w^\top x^{(i)})$ (see the sketch after this list)
    • we call $y^{(i)} w^\top x^{(i)}$ the margin; the larger the margin, the lower the loss
    • $-\log \sigma(\cdot)$ is convex $\Rightarrow$ the logistic regression loss function is convex
    • $\nabla_w L(w) = -\sum_{i=1}^n \sigma(-y^{(i)} w^\top x^{(i)})\, y^{(i)} x^{(i)}$
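
A minimal sketch of the loss and its gradient for labels $y \in \{-1, +1\}$ (the toy data is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss_and_grad(w, X, y):
    """L(w) = -sum_i log sigma(y_i w^T x_i); grad = -sum_i sigma(-y_i w^T x_i) y_i x_i."""
    margins = y * (X @ w)                      # margins y^(i) w^T x^(i)
    loss = -np.sum(np.log(sigmoid(margins)))
    grad = -(X.T @ (sigmoid(-margins) * y))
    return loss, grad

# One gradient-descent step on toy data.
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.zeros(2)
loss, grad = logistic_loss_and_grad(w, X, y)
w -= 0.1 * grad
```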

Discriminative vs. Generative Models

Discriminative

  • directly learn $P(y \mid x)$
  • MLE maximizes the conditional likelihood $P(y \mid x)$
    • Likelihood of data: $\prod_{i=1}^n P(y^{(i)} \mid x^{(i)}; w)$
    • Log-likelihood: $\sum_{i=1}^n \log P(y^{(i)} \mid x^{(i)}; w)$
  • e.g., Logistic Regression

Generative

  • learn $P(x \mid y)$ (features given the class)
  • learn $P(y)$ (class prior)
  • MLE maximizes the joint likelihood $P(x, y) = P(x \mid y)\, P(y)$
  • prediction: plug the learned $P(x \mid y)$ and $P(y)$ into Bayes' rule to get $P(y \mid x)$ (see the sketch after this list)
    • $P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$, where $P(x) = \sum_{k=1}^C P(x \mid y = k)\, P(y = k)$
    • e.g., for binary classification, $P(y = 1 \mid x) = \frac{P(x \mid y = 1)\, P(y = 1)}{P(x \mid y = 0)\, P(y = 0) + P(x \mid y = 1)\, P(y = 1)}$
  • e.g., Naive Bayes
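
A minimal sketch of generative prediction via Bayes' rule (the 1-D Gaussian class-conditionals and the prior values are illustrative assumptions, not something the cheatsheet specifies):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

prior = {0: 0.6, 1: 0.4}                  # learned class prior P(y)
cond = {0: (0.0, 1.0), 1: (2.0, 1.0)}     # learned (mu, sigma) of P(x|y)

def posterior_y1(x):
    """P(y=1|x) = P(x|y=1)P(y=1) / sum_k P(x|y=k)P(y=k)."""
    joint = {k: gaussian_pdf(x, *cond[k]) * prior[k] for k in (0, 1)}
    return joint[1] / (joint[0] + joint[1])

print(posterior_y1(1.5))   # probability that x = 1.5 belongs to class 1
```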

kNN

  • keep the training set around when making predictions ("non-parametric")
  • define a metric that measures the distance between two datapoints, e.g., Euclidean distance
  • choose a hyperparameter $k$
  • To predict the label of a test input $x$ (see the sketch after this list):
    • find its $k$ nearest neighbors in the training set
    • predict the label that is most common among the neighbors
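
A minimal sketch of the prediction step (the training points and test point are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Majority label among the k training points closest to x (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance from x to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.2, 2.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2.8, 3.1])))   # -> 1
```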