CS189 MT Notes

Lecture 1: 1/20/21

  • Decision boundary has d-1 dimensions
  • Training set error: Fraction of training images not classified correctly
  • Test set error: Fraction of misclassified NEW images, not seen during training
  • Outliers: points whose labels are atypical
  • Overfitting: When test error deteroirates because classifier is sensitive to outliers
  • Validation data is taken from training data
    • Train classifier with different hyperparams
  • Training set: Used to learn model weights
  • Validation set: Used to tune hyperparams
  • Test set: Used as final evaluation of model
  • Supervised learning:
    • Classification: Is this email spam?
    • Regression: How likely does this patient have cancer?
  • Unsupervised learning:
    • Clustering: Which DNA sequences are similar?
    • Dimensionality reduction: What are common features of faces?

Lecture 2: 1/25/21

  • Decision boundary: Boundary chosen by classifer to separate items
  • Overfitting: When decision boundary fits sample points so well that it does not classify future points well
  • Some classifiers work by computing a decision/predictor/discriminant function
    • A function that maps a pt x to a scalar st
      f(x)>0xC
      and vice versa. Pick one for the 0 case
    • Decision boundary is
      {xRd:f(x)=0}
    • This set is in
      d1
      dimensional space (isosurface of f, iso value = 0)
  • Decision boundary is a linear plane (linear decision function)
    • f(x)=wx+α
      ,
    • Hyperplane:
      H={x:wx=α}
    • w
      is the normal vector of H (perpendicular to H)
    • If
      w
      is unit vector,
      wx+α
      is the signed distance from x to H
    • Distance from H to origin is
      α
      ,
      alpha=0
      passes through origin
    • Coefficients in
      w,α
      are weights/params
  • Linearly separable - if there exists a hyperplane that separates all sample pts in class C from all pts not in C

Math Review

  • Vectors default to be column vectors
  • Uppercase roman = matrix, RVs, sets
  • Lowercase roman = vectors
  • Greek = scalar
  • Other scalars:
    • n = # of sample pts
    • d = # of features (per point) aka dimensionality
    • i, j, k are indices
  • functions (ofen scalar):
    f()

Centroid method

  • Centroid method: compute mean
    μc
    of all pts in class C, mean
    μx
    of all pts not in C
  • Use decision fn
    f(x)=(μcμx)x(μcμx)μc+μx2
    • α
      is such that
      f(x)=0
      if it is the midpoint between
      μc,μx

Perceptron Algorithm

  • Slow, but correct for linearly sep pts
  • Numerical optimization algorithm - gradient descent
  • For n sample pts
    X1,Xn
    with labels
    yi{1,1}
  • Consider decision boundary that passes through origin
  • Goal: find weights w/ constraint:
    yiXiw0
    • Xiw0
      if
      yi=1
    • Xiw0
      if
      yi=1
  • Objective/cost/risk function R:
    • positive if a constraint is violated
    • Use optimization to choose w that minimizes R
  • Define loss function
    • L(z,yi)={0yiz0yizotherwise
    • z=Xiw
      is classifier prediction,
      yi
      is correct label
    • If
      z
      has same sign as
      yi
      , loss fn is zero
    • If z has wrong sign, loss fn is positive
  • Risk/objective/cost fn:
    • R(w)=1ni=1nL(Xiw,yi)
    • R(w)=1niVyiXiw
      with
      V
      having all i st
      yiXiw<0
    • R(w)=0
      if all X classified correctly, otherwise positive
  • Goal: Find w that minimizes
    R(w)
    • w=0
      is useless, don't want that

Lecture 3: 1/27/21

  • Gradient descent: Find gradient of R wrt w, take step in opposite direction
  • R(W)=[Rw1Rwd]
  • w(zw)=z
  • R(W)=iVyiXi
  • Take step in
    R(W)
    direction
  • Algorithm:
  • w
    arbitrary nonzero vector (good choice is any
    yiXi
    )
  • while
    R(w)>0
    :
    • V
      set of indices that
      yiXiw<0
    • ww+ϵiVyiXi
  • return w
  • ϵ>0
    is the step size/learning rate
    • Chosen empirically
  • Cons: Slow, each step takes
    O(nd)
    time - n dot product of vectors in
    Rn
  • Perceptron algorithm guarantees convergence no matter step size

Stochastic Gradient Descent

  • Each step, pick 1 misclassified
    Xi
    • do gradient descent on loss fn
      L(Xiw,yi)
  • Known as the perceptron algorithm
  • Each step takes
    O(d)
    time
  • while
    yiXiw<0
    :
    • ww+ϵyiXi
  • return w

Fictitious dimension

  • If separating hyperplane does not pass origin, add a fictitious dimension
  • f(x)=wx+α=[w1w2α][x1x21]
  • Sample pts in
    Rd+1
    all lying on hyerplane
    xd+1=1
  • Run perceptron algorithm in
    d+1
    dimension

Hard-margin SVM

  • Margin of a linear classifier is dist from deicision boundary to nearest sample pt
  • Try to make margin as wide as possible
  • If
    ||w||=1
    , signed dist is
    wXi+α
    • Otherwise signed dist is
      w||w||Xi+α||w||
  • Hence the margin is
    mini1||w|||wXi+α|1||w||
  • QP in
    d+1
    dimensions,
    n
    constraints
  • Objective:
    minw,α||w||22
  • Constraints:
    yi(wxi+α)1
    for
    i=1n
    • Impossible for
      w
      to be set to 0
  • Has a unique solution (if any)
  • Maximum margin classifier aka hard-margin SVM
  • Margin =
    1||w||
  • Slab = 2*margin, margin is dist from boundary to nearest pt

Lecture 4: 2/1/21

Soft Margin SVM

  • Margin is no longer dist from decision boundary to nearest pt
    • Just
      1||w||
  • Prevent abuse of slack by adding a loss term
  • Objective: Find
    w,α,ξi
    to
    min||w||2+Ci=1nξi
  • Constraints:
    yi(Xiw+α)1ξi
    • ξi0
  • QP in
    d+n+1
    dimensional space and
    2n
    constraints
  • There is a tradeoff between C and
    ||w||2
    • C>0
      is a hyperparameter scalar
Column 1 Small C Big C
Desire Max margin
1|w|
Keep most slack variables 0 or small
Danger Underfitting (misclassifies training data) Overfitting (awesome training, awful test)
Outliers Less sensitive Very sensitive
Boundary More "flat" More sinuous/curvy

Features

  • Make nonlinear features that lift points to a higher dimensional space
  • High-d linear classifier <> low-d nonlinear classifier
  • Parabolic lifting map:
    Φ:RdRd+1
    • Φ(x)=[xx2]
    • Lifts x onto paraboloid
      xd+1=x2
    • Linear classifier in
      Φ
      space induces a sphere classifier in
      x
      space
  • Theorem:
    Φ(X1),Φ(Xn)
    are linearly separable iff
    X1,Xn
    are separable by a hypersphere
  • Proof: Consider hypersphere in
    Rd
    center c, radius
    ρ
  • xc2<ρ2
  • x22cx+c2<ρ2
  • [2c1][xx2]<ρ2c2
  • Normal vector in
    Rd+1
    ,
    Φ(x)
  • Points inside sphere means same side of hyperplane in
    Φ
    space

Axis-aligned ellipsoid/hyperboloid

  • Φ:RdR2d
  • Φ(X)=[x12x22x32xd2x1x2xd]
  • 3D:
    Ax12+Bx22+Cx32+Dx1+Ex2+Fx3+α=0
  • Hyerplane is
    wΦ(x)+α=0
    • w=[ABCDEF]

Ellipsoid/hyperboloid

  • Φ(x):RdR(d2+3d)/2
  • 3D:
    Ax12+Bx22+Cx32+Dx1x2+Ex2x3+Fx3x1+Gx1+Hx2+Ix3+α=0
  • Isosurface dfined by this equation is a quadric (2d: conic section)

Degree-p polynomial

  • Ex. cubic in
    R2
  • Φ(x)=[x13x12x2x1x22x23x12x1x2x22x1x2]
  • Φ(x):RdRO(dp)
  • Linear has smaller margin, higher degree gives wider margins
  • Going to a dimension high enough makes it linearly separable, but could overfit

Edge Detection

  • Edge detector: algorithm for approximating grayscale/color gradients in images
    • tap filter, sobel filter, oriented Gaussian derivative filter
  • Compute gradients pretending it is a continuous field
  • Collect line orientations in local histograms (each having 12 orientations), use histograms as features instead of raw pixels

Lecture 5: 2/3/21

ML Abstractions

  • Application/data:
    • Is data labeled/classified?
    • Yes: Categorical (classification), quantitative(regression)
    • No: Similarity (clustering), positioning (dimensionality reduction)
  • Model:
    • Decision functions: linear, polynomial, logistic
    • nearest neighbors, decision trees
    • Features
    • Low vs high capacity (linear has low, poly have high, neural nets have high): affect overfitting, underfitting, inference
    • Capacity - a measure of how complicated your model can get
  • Optimization Problem
    • Variables, objective fn, constraints
      • Ex. unconstrainted, convex programs, least squares, PCA
  • Optimization Algorithm
    • Gradient descent, simplex, SVD

Optimization Problems

  • Unconstrained: Find w that minimizes a continuous objective fn
    f(w)
    , f is smooth if gradient is continuous too
  • Global min:
    f(w)f(v),v
  • Local min:
    f(w)f(v),v
    in tiny ball centered at w
  • Finding local min is easy, gloal min is hard
  • Except for convex functions
    • Line segment connecting 2 points does not go below f
    • Perceptron risk fn is convex and nonsmooth
  • A convex fn has either:
    • no min (goes to
      )
    • just one local min which is the global min
    • Connected set of local min which are all global min
  • Algorithms for smooth f
    • Gradient descent
      • blind : repeat
        wwϵf(w)
      • stochastic (just 1 training pt at a time):
        wwϵL(yi,zi)
      • line search
    • Newton's method (needs Hessian matrix)
    • Nonlinear conjugate gradient
  • Algorithms for nonsmooth f
    • Gradient descent
    • BFGS
  • These algs find local min
  • Line search: Find local min along search direction by solving an optimization problem in 1D
    • Secant method (smooth only)
    • Newton-Raphson (smooth only, may need Hessian)
    • Direct line search (non smooth) like golden section search
  • Constrained Optimization (smooth equality constraints)
    • Goal: Find w that minimizes
      f(w)
      subject to
      g(w)=0
      where g is smooth
    • Alg: Use lagrange multipliers
  • Linear Program
    • Linear objective, linear ineq constraints
    • Goal: Find w that maximizes
      cw
      subject to
      Awb
      • Aiwbi
        for
        i[1,n]
  • Feasible region: Set of pts that satisfy all constraints
    • LP feasible region is a polytope (not always bounded/finite)
  • Point set P is convex if every pair of points' line segment lies fully in P
  • Active constraints: constraints that achieve equality at optimum

Lecture 6: 2/8/21

  • Decision theory/risk minimization
    • Multiple sample pts w/ diff classes can lie at same pt
    • Want a probabilistic classifier
  • Example: 10% population has cancer, 90% doesn't
  • Posterior probability:
    P(Y=1|X)
  • Prior probability:
    P(Y=1)
  • Loss function
    L(z,y)
    where z is prediction, y is true label
    • Loss function specifies how bad it is if classifier predicts z but true class is y
    • 36% loss of 5 is worse than 64% loss of 1, so recommend further cancer screening
    • 0-1 loss function is 1 for incorrect, 0 for correct
    • Loss function does not need to be convex
  • Risk for decision rule r is expected loss over all values of x,y
    • R(r)=E[L(r(X),Y)]
    • x(L(r(x),1)P(Y=1|X=x)+L(r(x),1)P(Y=1|X=x))P(X=x)
  • Bayes decision rule/Bayes classifier is fn
    r
    that minimizes functional
    R(r)

  • If L is symmetric, pick the class with the biggest posterior probability
  • Bayes risk/optimal risk is risk of Bayes classifier
  • No decision rule that gives lower risk
  • Deriving/using
    r
    is called risk minimization
  • Expected values of
    g(X)=E[g(X)]=g(x)f(x)dx
  • Mean:
    μ=xf(x)dx
  • If L is 0-1 loss,
    R(r)
    is probability that r(x) is wrong
    • Bayes optimal decision boundary is
      {x:P(Y=1|X=x)=0.5}

3 ways to build classifiers

  • Generative models (ex. LDA: Linear Discriminant Analysis)
    • Assume sample pts come from probability distributions, different for each class
    • Guess form of distribution
    • For each class C, fit distribution params to class C pts, giving
      f(X|Y=c)
    • For each c, estimate
      P(Y=c)
    • Bayes thm gives
      P(Y|X)
    • If 0-1 loss, pick class C that maximizes
      P(Y=c|X=x)
      equivalently maximizes
      f(X=x|Y=c)P(Y=c)
  • Discriminative models (ex. Logistic regression)
    • Skip prior probabilities, model
      P(Y|X)
      directly
  • Find decision boundary (ex. SVM)
  • Advantages of 1,2:
    P(Y|X)
    tells you probability that your guess is wrong
  • Advantage of 1: Diagnose outliers if P(X) is very small
  • Disadvantage of 1: Often hard to estimate distributions correctly, real distributions rarely match standard ones

Lecture 7: 2/10/21

  • Gaussian Discriminant Analysis (GDA): Each class comes from a normal distribution
  • Isotropic normal dist:
    XN(μ,σ2)
    :
    f(x)=1(2πσ)dexp(xμ22σ2)
    • μ,x
      are length d vectors,
      σ
      is a scalar
    • For each class C, estimate mean
      μc
      , variance
      σc2
      , prior
      πc=P(Y=C)
    • Given x, Bayes decision rule
      r(x)
      predicts class C that maximizes
      f(X=x|Y=C)πC
  • Qc(x)=ln((2π)d)fc(x)πc)=xμ22σ2dln(σc)+ln(πc)
    • Quadratic in x, easier to optimize
    • Find
      argmaxcQc(x)
  • Baye's decision boundary is where
    Qc(x)Qd(x)=0
    • In 1D, may have 1 or 2 or 0 pts
  • Use Bayes' to recover posterior probabilities in 2 clsas
    • P(Y=C|X)=f(X|Y=C)πcf(X|Y=C)πc+f(X|Y=D)πd
    • =s(QC(x)QD(x))
  • Logistic/sigmoid function:
    s(γ)=11+eγ

Linear Discriminant Analysis (LDA)

  • Variant of QDA that has linear decision boundaries
    • Less likely to overfit
    • Less true probabilistic estimation
    • Use validation of both methods to see which is better
  • Assume that all gaussians have same variance
  • QC(x)QD(x)=((μCμD)σ2)x+(μC2μD22σ2ln(πc)ln(πd))
  • Choose C that maximizes linear discriminant fn
    • (μCσ2)xμC22σ2+ln(πc)
  • Generative model as opposed to discriminative model

Maximum Likelihood Estimation (MLE)

  • Flip biased coin, heads with prob
    p
    and tails with prob
    1p
  • Goal: estimate
    p
  • num of heads are
    XB(n,p)
    , binomial dist
  • L
    is the likelihood fn
    • L(p)=P(X=8)=45p8(1p)2
    • Should be as a function of distribution parameters
  • MLE is a method of estimating params of a statistical distribution by picking params that maximize
    L
    • Is a method of density estimation: estimating PDF from data
    • Find p that maximizes
      L(p)
  • Likelihood of Gaussian
    • L(μ,σ;X1,X2,Xn)=f(X1)f(X2)f(Xn)
    • l
      is
      ln
      likelihood of
      L
    • Maximizing l and log likelihood are same
    • l(μ,σ;X1,X2,Xn)=lnf(X1)+lnf(X2)+lnf(Xn)
    • =i=1nXiμ22σ2dln2πdlnσ)
      • Which is a summation of ln of normal PDF
    • Want to set
      μl=0,lσ=0
      • μl=0μ^=1ni=1nXi
      • lσ=0σ^2=1dni=1nXiμ2
        • Use
          μ^
          for
          μ
  • Tells us to use sample mean and variance of pts in class C to estimate mean and variance of Gaussian for class C
  • For QDA: estimate conditional mean
    μc^
    and conditional variance
    σc^2
    of each class C separately
    • πc^=ncnD
      • Denominator is total sample pts in all classes
  • For LDA: SAme means and priors, 1 variance for all classes
    • Pooled within-class variance:
      σ^2=1dnC{i:yi=c}Xiμ^c2

Lecture 8: 2/17/21

  • For
    v,λ,A
    , v is an evec of
    Ak
    with
    λk
  • If A is invertible,
    A1
    has evec v with
    1λ
  • Spectral theorem: Every real symmetric nxn matrix has real eigenvalues and n mutually orthogonal evecs
  • A1x2=1
    isan ellpisoid with axes
    v1,vn
    and radii
    λ1,λn
  • Special case if A is diagonal, evecs are coordinate axes and ellipsoid is axis-aligned
  • Symmetric matrix is:
    • PD: All pos eval
    • PSD: All nonneg evals
    • Indefinite matrix: has at least 1 pos eval and 1 neg eval
    • Invertible: Has no zero eigenvalue
  • xA2x
    will be PD bc evecs are squared
  • Orthonormal matrix: acts like rotation or reflection
  • Thm:
    A=VΛV=i=1nλivivi


Lecture 8: 2/22/21

Anisotropic Gaussians

  • Normal PDF:
    f(x)=n(q(x))=1(2π)d|Σ|exp(12(xμ)Σ1(xμ))
    • n(q)=1(2π)d|Σ|eq/2
      • RR
        (exponential)
    • q(x)=(xμ)Σ1(xμ)
      • RdR
        (quadratic)
  • Covariance matrix:
    Σ=VΓV
    • Eigenvalues of
      Σ
      are variances along evecs
      Γii=σi2
  • Σ12=VΓ12V
    • Maps spheres to ellipsoids
    • Eigenvalues of
      Σ12
      are stdev
      Γii12=σi
      (Gaussian width/ellipsoid radii)

MLE for Anisotropic Gaussians

  • Given sample pts and classes, find best-fit Gaussians
  • QDA:
    Σ^c=1nci:yi=C(Xiμ^c)(Xiμ^c)
    • Conditional covariance (for pts in class C)
    • Outer product, dxd matrix
    • π^c,μ^c
      same as before
  • sigmac always PSD but not always PD
  • Choose C to maximize discriminant fn
  • Decision fn is quadratic, may be indefinite
  • LDA interpeted as projecting points on a line
  • For 2 classes
    • LDA has d+1 parameters (w,
      α
      )
    • QDA has
      d(d+3)2+1
      params
    • QDA more likely to overfit
    • With added features LDA gan give nonlinear, QDA nonquadratic
    • We don't get true optimum Bayes classifier (estimate distributions from finite data)
    • Changing priors/loss = adding constants to discriminant fns
    • Posteroir gives decision boundaries for 10% probability, 50%, 90%
      • Choosing isovalue = p same as
      • Choosing asymmetric loss p for false pos, 1-p for false neg OR
        πc=1p,πd=p

Terminology

  • Let X be n x d design matrix of sample pts
  • Each row i of X is a sample pt
    Xi
  • Centering X: Subtract
    μ
    from each row of X
    XX˙
  • Let R be uniform distribution on sample pts
  • Sample cov matrix is
    Var(R)=1nX˙X˙
  • Decorrelating
    X˙
    : Apply rotation
    Z=X˙V
    • Var(R)=VΛV
    • Var(Z)=Λ
    • Sphering
      X˙
      :
      W=X˙Var(R)1/2
  • Whitening X: Centering + sphering
    • W has cov matrix I

Lecture 9: 2/24/21

  • Regression: Fitting curves to data
  • Classification: Given pt, predict its class
  • Regression: Given pt, predict numerical value
  • Choose form of regerssion fn
    h(x;p)
    with parameter p
    • Like decision fn in classification
  • Choose a cost fn (objective fn) to optimize
    • Usually based on a loss fn
  • Regression fns:
      1. Linear:
        h(x;w,α)=wx+α
      1. Polynomial: Add features for linear
      1. Logistic:
        h(x:w,α)=s(wx+α)
      • s(γ)=11+eγ
  • Loss fns: z is prediction
    h(x)
    , y is true label
    • A) Squared error:
      L(z,y)=(zy)2
    • B) Absolute error:
      L(z,y)=|zy|
    • C) Logistic loss/cross-entropy:
      L(z,y)=yln(z)(1y)ln(1z)
  • Cost fns to minimize
    • a) Mean loss:
      J(h)=1ni=1nL(h(Xi),yi)
    • b) Max loss:
      J(h)=maxi=1nL(h(Xi),yi)
    • c) Weighted sum:
      J(h)=i=1nωiL(h(Xi),yi)
    • d)
      l2
      penalized/regularized:
      J(h)=+λw22
      • is one of the previous options
    • e)
      l1
      penalized/regularized:
      J(h)=+λw1
  • Famous regression methods:
    • Least squares linear regression: 1Aa
      • Quadratic cost, min w/ calculus
    • Weighted least-squares linear: 1Ac
      • Quadratic cost, min w/ calculus
    • Ridge regression: 1Ad
      • Quadratic cost, min w/ calculus
    • LASSO: 1Ae
      • QP w/ exponential num of constraints
    • Logistic regression: 3Ca
      • Convex cost fn, min w/ gradient descent/Newton's
    • Least absolute deviations: 1Ba
      • LP
    • Chebyshev criterion: 1Bb
      • LP

Least Squares Linear Regression

  • Linear regression fn (1) + squared loss fn (A) + cost fn (a)
  • Optimization problem:
    minw,αi=1n(Xiw+αyi)2
  • X is n x d design matrix of sample pts
  • y is n-vector of scalar labels
  • Xi
    transposes the column vector point
  • Feature column:
    Xj
  • Usually
    n>d
  • X is now n x (d+1) matrix, w is a (d+1) vector
  • [x1x21][w1w2α]
  • Residual sum of squares:
    RSS(w)=minwXwy2
  • Optimize with calculus:
    • If
      XX
      is singular, problem is unconstraind (all pts fall on hyperplane)
    • Never overconstrained
    • Linear solver to find
      w=(XX)1Xy
    • Pseudoinverse (d+1, n):
      X+=(XX)1X
    • Pseudoinverse is a left inverse:
      X+X=(XX)1XX=I
    • Predicted values are
      y^=Xw=XX+y=Hy
    • Hat matrix:
      H=XX+
      (n x n matrix)
  • Advantage: Easy to compute, unique/stable solution
  • Disadvantage: Very sensitive to outliers bc errors are squared, fails if
    XX
    is singular

Logistic Regression

  • Logistic regerssion fn (3) + logistic loss fn © + cost fn (a)
  • Fits probabilities in range (0,1)
  • Discriminative model (as opposed to QDA/LDA being generative models)
  • minwJ(w)=minwi=1nL(Xiw,yi)=minwi=1n(yiln(Xiw)+(1yi)ln(1Xiw))
  • J(w)
    is convex, solve by gradient descent
  • s(γ)=s(γ)(1s(γ))
  • si=s(Xiw)
    ,
    wJ=X(ys(Xw))
    , with
    s(Xw)
    applied elementwise
  • Gradient descent rule:
    ww+ϵX(ys(Xw))
  • SGD:
    ww+ϵ(yis(Xiw))Xi
    • Shuffle pts into random order, process one by one
    • For large n, sometimes converges before visiting all pts
  • Logistic regression always separates linearly separable pts
  • w2J(w)=i=1nsi(1si)XiXi=XΩX
    • Ω=diag(si(1si))0
      making it PD
    • So
      XΩX0
      so
      J(w)
      is convex
    • Find solution to
      (XΩX)e=X(ys)
    • Example of iteratively reweighted least squares
    • Misclassified points far from decision boundary have large influence
    • Points close to decision boundary also has medium to large influence
  • LDA vs Logistic regression
  • LDA Advantages:
    • For well separated classes LDA is stable, logreg is unstable
    • >2
      classes easy and elegant, logreg needs modifications (softmax regression)
    • LDA slightly more accurate when classes nearly normal, esp if n is small
  • Logreg advantages:
    • More emphasis on decision boundary, always separates linearly separable pts
    • More robust on some non-Gaussian distributions (w/ large skew)

Lecture 10: 3/1/21

Least Squares Polynomial Regression

  • Replace each
    Xi
    with feature vector
    Φ(Xi)
    • Ex.
      Φ(Xi)=[Xi12Xi1Xi2Xi22Xi1Xi21]
    • Can also use non polynomimal features
  • Logistic regression + quadratic features = some form of posteriors like QDA
  • Very easy to overfit

Weighted Least-Squares Regression

  • Linear regression fn (1) + squared loss fn (A) + cost fn ©
  • Assign each sample pt a weight
    ωi
    , create
    Ω=diag(ωi)
    • Bigger
      ωi
      means work harder to minimize
      |y^iyi|2
  • minw(Xwy)Ω(Xwy)=i=1nωi(Xiwyi)2
    • XΩXw=XΩy
      , solve for
      w

Newton's Method

  • Iterative optimization method for smooth fn
    J(w)
  • Often much faster than gradient descent
  • Approximate
    J(w)
    by quadratic fn
  • J(w)=J(v)+(2J(v))(wv)
  • w=v(2J(v))1J(v)
  • Repeat until convergence
  • Warnings:
    • Doesn't know diff between minima, maxima, saddle pts
    • Starting pt must be close enough to desired critical point

Roc Curves

  • For test sets to evaluate classifier
  • Receiver Operator Characteristics
  • Rate of FP vs TP for range of settings
  • x-axis is False Positive rate = % of negative items incorrectly classified as pos
  • y-axis is True positive rate = % of pos classified as pos
  • Top right: always classify positive (
    P0
    )
  • Bottom left: Always classify negative (
    P>1
    )
  • Want high area under the curve (close to 1)

Lecture 11: 3/3/21

Statistical Justifications for Regression

  • Sample pts come from unknown prob dist
    XiD
  • y values are sum of unknown nonrandom fn + random noise
    • yi=g(Xi)+ϵi,ϵD
      with
      D
      having mean 0
  • Goal of regression: find h that estimates
    g
  • Ideal approach: Choose
    h(x)=EY[Y|X=x]=g(x)+E[ϵ]=g(x)

Least Squares regression for MLE

  • Suppose
    ϵiN(0,σ2)
    • yiN(g(Xi),σ2)
  • l(g;X,y)=ln(F(y1)f(y2)f(yn))=12σ2(yig(Xi))2const
    • MLE over g is minimizing sum of squared errors

Empirical Risk

  • Risk for hypothesis h is xpected loss
    R(h)=E[L]
  • Discriminative: Don't know X dist D
  • Empirical distribution: discrete uniform distribution over the sample pts
  • Empirical risk: Expected loss under empirical distribution
    • R^(h)=1ni=1nL(h(Xi),yi)
    • So minimize sum of loss fns

Logistic Loss for MLE

  • Actual probability pt
    Xi
    is in class C is
    yi
    , predicted prob is
    h(Xi)
  • Imagine
    β
    duplicate copies of
    Xi
    • yiβ
      are in class c,
      (qyi)β
      copies are not
    • L(h;X,y)=h(Xi)yiβ(1h(Xi))(1yi)β
    • minl(h)=β
      logistic loss fn
      L(h(Xi),yi)
      • Logistic loss is
        yiln(zi)(1yi)ln(qzi)
    • Max Likelihood > Min sum of logistic loss fns

Bias-Variance Decomposition

  • 2 source of error in hypothesis:
    • bias: Error due to inability of hypothesis to fit g perfectly
    • variance: Error due to random noise in data
  • Model:
    XiD,ϵiD,yi=g(Xi)+ϵi
    • Fit hypothesis h to X,y
    • h is a RV with random weights
  • Consider arbitrary pt
    z,γ=g(z)+ϵ
    • E[γ]=g(z),var(γ)=var(ϵ)
  • Risk fn when loss = squared error:
    r(h)=E[L(h(z),γ)]
    • Taking expectation over all possible training sets X,y, values of
      gamma
    • R(h)=E[(h(z)γ)2]
    • R(h)=(E[h(z)]g(z))2+Var(h(z))+Var(ϵ)
      • First term: bias^2
      • Second term: variance of method
      • Third term: Irreducible error
  • Underfitting = too much bias
  • Most overfitting = too much variance
  • Training error reflects bias but not variance
  • Most dist var
    0
    as
    n
  • If h can fit g exactly, many distributions bias
    0
    as
    n
  • If h cannotfit g well, bias is large for most points
  • Adding a good feature reduces bias, adding a bad feature rarely increases it
    • Good features reduce bias more than it increases variance
  • Adding a feature usually increases variance
  • Noise in test set affects only var(
    ϵ
    )
    • Noise in training set affects only bias and var of h
  • Can't precisely measure bias and variance of real world data
  • Can test learning algs by choosing g and making synthetic data

Lecture 12: 3/8/21

Ridge Regression

  • 1 + A + d
    • l2 penalized mean loss
  • minXwy2+λw2
    • w
      is
      w
      with component
      α
      replaced by zero
    • Don't penalize
      α
    • Regularization term/penality term for shrinkage
    • Small
      w
      guarantees PD normal eqs
    • Reduces overfitting by reducing variance by penalizing large weights
    • Normal equations:
      (XX+λI)w=Xy
      • II
        with bottom right = 0
  • Solve for w, return
    h(z)=wz
  • Increasing
    λ
    means more regularization
  • For
    y=Xv+e
    , e being noise
    • Variance of RR is
      var(z(XX+λI)1Xe)

Bayesian Justification

  • Prior probability:
    wN(0,σ2)
  • Apply MLE
  • Max log posterior:
    ln(L(w))+ln(f(w))const
    • Min
      minXwy2+λw2

Feature Subset Selection

  • All features increase variance, not all features reduce bias
  • Idea: Identify poorly predictive features, ignore them (weight 0)
    • Less overfitting, smaller test errors
    • 2nd motivation: inference
  • Usually hard b/c features can partly encode same info
  • Algorithm: Try all
    2d1
    subsets of features
    • Slow, use cross validation
  • Heuristic 1: Forward stepwise selection
    • Start with null model (0 features)
    • Repeatedly add best feature until validation error increases
    • Train
      O(d2)
      models
    • Not perfect, won't find best 2-features model if neither yields best 1-feature model
  • Heuristic 2: Backward stepwise selection
    • Start with all d features, repeatedely remove features that give best reduction in val error
    • Trains
      O(d2)
      models

LASSO

  • Regression w/ regularization: 1 + A + e
    • l1 penalized mean loss
    • minXwy2+λw1
  • Algs: subgradient descent, least-angle regression