# CS189 MT Notes

# Lecture 1: 1/20/21

* Decision boundary has `d-1` dimensions
* Training set error: Fraction of training images not classified correctly
* Test set error: Fraction of misclassified NEW images, not seen during training
* Outliers: points whose labels are atypical
* Overfitting: When test error deteriorates because the classifier is sensitive to outliers
* Validation data is taken from the training data
    * Train the classifier with different hyperparams
* Training set: Used to learn model weights
* Validation set: Used to tune hyperparams
* Test set: Used as the final evaluation of the model
* Supervised learning:
    * Classification: Is this email spam?
    * Regression: How likely does this patient have cancer?
* Unsupervised learning:
    * Clustering: Which DNA sequences are similar?
    * Dimensionality reduction: What are common features of faces?

# Lecture 2: 1/25/21

* Decision boundary: Boundary chosen by the classifier to separate items
* Overfitting: When the decision boundary fits the sample points so well that it does not classify future points well
* Some classifiers work by computing a decision/predictor/discriminant function
    * A function that maps a pt x to a scalar s.t. $f(x) > 0 \implies x \in C$ and vice versa. Pick one convention for the $f(x) = 0$ case
    * Decision boundary is $\{ x \in \mathbb{R}^d \colon f(x) = 0 \}$
        * This set is $d-1$ dimensional (an isosurface of f with isovalue 0)
* Linear decision function: the decision boundary is a hyperplane
    * $f(x) = w^\top x + \alpha$
    * Hyperplane: $H = \{ x \colon w^\top x = -\alpha\}$
    * $w$ is the normal vector of H (perpendicular to H)
    * If $w$ is a unit vector, $w^\top x + \alpha$ is the signed distance from x to H
    * Distance from H to the origin is $|\alpha|$; $\alpha = 0 \longleftrightarrow$ H passes through the origin
    * Coefficients in $w, \alpha$ are the weights/params
* Linearly separable: there exists a hyperplane that separates all sample pts in class C from all pts not in C

## Math Review

* Vectors default to column vectors
* Uppercase roman = matrices, RVs, sets
* Lowercase roman = vectors
* Greek = scalars
* Other scalars:
    * n = # of sample pts
    * d = # of features (per point), aka dimensionality
    * i, j, k are indices
* Functions (often scalar): $f(\cdot)$

## Centroid method

* Centroid method: compute the mean $\mu_c$ of all pts in class C and the mean $\mu_x$ of all pts not in C
* Use decision fn $f(x) = (\mu_c - \mu_x)^\top x - (\mu_c - \mu_x)^\top\frac{\mu_c + \mu_x}{2}$
    * $\alpha$ is chosen so that $f(x) = 0$ at the midpoint between $\mu_c$ and $\mu_x$

## Perceptron Algorithm

* Slow, but correct for linearly separable pts
* Numerical optimization algorithm: gradient descent
* For n sample pts $X_1, \ldots X_n$ with labels $y_i \in \{-1, 1\}$
* Consider a decision boundary that passes through the origin
* Goal: find weights w subject to the constraints $y_iX_i^\top w \geq 0$
    * $X_i^\top w \geq 0$ if $y_i = 1$
    * $X_i^\top w \leq 0$ if $y_i = -1$
* Objective/cost/risk function R:
    * positive if a constraint is violated
    * Use optimization to choose the w that minimizes R
* Define a loss function
    * $L(z, y_i) = \begin{cases} 0 & y_i z \geq 0\\ -y_i z & \text{otherwise} \end{cases}$
    * $z = X_i^\top w$ is the classifier prediction, $y_i$ is the correct label
    * If $z$ has the same sign as $y_i$, the loss is zero
    * If $z$ has the wrong sign, the loss is positive
* Risk/objective/cost fn:
    * $R(w) = \frac{1}{n} \sum\limits_{i=1}^n L(X_i^\top w, y_i)$
    * $R(w) = \frac{1}{n} \sum\limits_{i \in V} -y_iX_i^\top w$ with $V$ containing all i s.t. $y_iX_i^\top w < 0$
    * $R(w) = 0$ if all $X_i$ are classified correctly, otherwise positive
* Goal: Find the w that minimizes $R(w)$ (a small numpy sketch of L and R follows)
    * $w=0$ is useless; we don't want that
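A minimal numpy sketch of the perceptron loss and risk defined above, assuming `X` is an n×d array of sample points and `y` holds labels in {-1, +1}; the function names and toy data are my own, not from the lecture.

```python
import numpy as np

def perceptron_loss(z, y):
    """L(z, y): zero if the prediction z has the same sign as the label y, else -y*z."""
    return np.where(y * z >= 0, 0.0, -y * z)

def perceptron_risk(w, X, y):
    """R(w) = (1/n) * sum_i L(X_i^T w, y_i) over all n sample points."""
    z = X @ w                          # classifier predictions, one per sample point
    return perceptron_loss(z, y).mean()

# toy example: 3 points with 2 features, labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5]])
y = np.array([1, 1, -1])
w = np.array([1.0, -1.0])              # this w misclassifies two of the points
print(perceptron_risk(w, X, y))        # positive risk, since constraints are violated
```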
# Lecture 3: 1/27/21

* Gradient descent: Compute the gradient of R w.r.t. w, take a step in the opposite direction
    * $\nabla R(w) = \begin{bmatrix} \frac{\partial R}{\partial w_1} & \cdots & \frac{\partial R}{\partial w_d}\end{bmatrix}^\top$
    * $\nabla_w(z^\top w) = z$
    * $\nabla R(w) = - \sum\limits_{i \in V} y_i X_i$
    * Take a step in the $-\nabla R(w)$ direction
* Algorithm:
    * $w \leftarrow$ arbitrary nonzero vector (a good choice is any $y_i X_i$)
    * while $R(w) > 0$:
        * $V \leftarrow$ set of indices i such that $y_iX_i^\top w < 0$
        * $w \leftarrow w + \epsilon \sum\limits_{i \in V} y_i X_i$
    * return w
* $\epsilon > 0$ is the step size/learning rate
    * Chosen empirically
* Cons: Slow; each step takes $\mathcal{O}(nd)$ time (n dot products of vectors in $\mathbb{R}^d$)
* The perceptron algorithm is guaranteed to converge (on linearly separable pts) no matter the step size

## Stochastic Gradient Descent

* Each step, pick one misclassified $X_i$
    * Do gradient descent on the loss fn $L(X_i^\top w, y_i)$
* Known as the perceptron algorithm
* Each step takes $\mathcal{O}(d)$ time
* while some i has $y_iX_i^\top w < 0$:
    * $w \leftarrow w + \epsilon y_i X_i$
* return w

## Fictitious dimension

* If the separating hyperplane does not pass through the origin, add a fictitious dimension
* $f(x) = w^\top x + \alpha = \begin{bmatrix}w_1 & w_2 & \alpha \end{bmatrix} \begin{bmatrix}x_1 \\ x_2 \\ 1 \end{bmatrix}$
* Sample pts in $\mathbb{R}^{d+1}$ all lie on the hyperplane $x_{d+1} = 1$
* Run the perceptron algorithm in $d+1$ dimensions (see the sketch below)
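A minimal sketch of the stochastic perceptron update with a fictitious dimension appended so the hyperplane need not pass through the origin. The step size, iteration cap, and toy data are assumptions; convergence is only guaranteed for linearly separable points.

```python
import numpy as np

def train_perceptron(X, y, epsilon=1.0, max_iters=1000):
    """Perceptron with a fictitious dimension.  X is n x d, y has entries in {-1, +1}."""
    n, d = X.shape
    X1 = np.hstack([X, np.ones((n, 1))])     # lift points onto the hyperplane x_{d+1} = 1
    w = y[0] * X1[0]                         # arbitrary nonzero start, e.g. y_i * X_i
    for _ in range(max_iters):
        misclassified = np.flatnonzero(y * (X1 @ w) < 0)   # the set V
        if misclassified.size == 0:          # R(w) = 0: all points classified correctly
            break
        i = misclassified[0]                 # stochastic step: fix one violated constraint
        w = w + epsilon * y[i] * X1[i]
    return w[:-1], w[-1]                     # weights w and bias alpha

# usage on a linearly separable toy set
X = np.array([[2.0, 3.0], [3.0, 4.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, alpha = train_perceptron(X, y)
print(np.sign(X @ w + alpha))                # should match y
```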
## Hard-margin SVM

* The margin of a linear classifier is the distance from the decision boundary to the nearest sample pt
* Try to make the margin as wide as possible
* If $\|w\| = 1$, the signed dist is $w^\top X_i + \alpha$
    * Otherwise the signed dist is $\frac{w^\top}{\|w\|} X_i + \frac{\alpha}{\|w\|}$
* The constraints below enforce $|w^\top X_i + \alpha| \geq 1$, hence the margin is $\min\limits_i \frac{1}{\|w\|} |w^\top X_i + \alpha| \geq \frac{1}{\|w\|}$
* QP in $d+1$ dimensions with $n$ constraints
    * Objective: $\min\limits_{w, \alpha} \|w\|^2_2$
    * Constraints: $y_i (w^\top X_i + \alpha) \geq 1$ for $i = 1 \ldots n$
    * Impossible for $w$ to be set to 0
    * Has a unique solution (if any exists)
* Maximum margin classifier, aka hard-margin SVM
    * Margin = $\frac{1}{\|w^*\|}$
    * Slab width = 2 × margin; the margin is the dist from the boundary to the nearest pt

# Lecture 4: 2/1/21

## Soft-Margin SVM

* The margin is no longer the dist from the decision boundary to the nearest pt
    * It is just $\frac{1}{\|w\|}$
* Allow slack $\xi_i$ in the constraints; prevent abuse of slack by adding a loss term
* Objective: Find $w, \alpha, \xi_i$ to $\min \|w\|^2 + C\sum\limits^n_{i=1} \xi_i$
* Constraints: $y_i (X_i^\top w + \alpha) \geq 1 - \xi_i$
    * $\xi_i \geq 0$
* QP in $d+n+1$ dimensions with $2n$ constraints
* There is a tradeoff between the slack term and $\|w\|^2$
* $C > 0$ is a scalar hyperparameter

|          | Small C  | Big C    |
| -------- | -------- | -------- |
| Desire   | Maximize margin $\frac{1}{\|w\|}$ | Keep most slack variables 0 or small |
| Danger   | Underfitting (misclassifies training data) | Overfitting (awesome training, awful test) |
| Outliers | Less sensitive | Very sensitive |
| Boundary | More "flat" | More sinuous/curvy |

## Features

* Make nonlinear features that lift points to a higher-dimensional space
* High-d linear classifier <--> low-d nonlinear classifier
* Parabolic lifting map: $\Phi \colon \mathbb{R}^d \rightarrow \mathbb{R}^{d+1}$
    * $\Phi(x) = \begin{bmatrix} x \\ \|x\|^2 \end{bmatrix}$
    * Lifts x onto the paraboloid $x_{d+1} = \|x\|^2$
    * A linear classifier in $\Phi$-space induces a sphere classifier in $x$-space
* Theorem: $\Phi(X_1), \ldots \Phi(X_n)$ are linearly separable iff $X_1, \ldots X_n$ are separable by a hypersphere
* Proof: Consider a hypersphere in $\mathbb{R}^d$ with center c and radius $\rho$; points inside satisfy
    * $\|x-c\|^2 < \rho^2$
    * $\|x\|^2 - 2c^\top x + \|c\|^2 < \rho^2$
    * $\begin{bmatrix} -2c^\top & 1\end{bmatrix} \begin{bmatrix} x \\ \|x\|^2\end{bmatrix} < \rho^2 - \|c\|^2$
    * The left factor is a normal vector in $\mathbb{R}^{d+1}$; the right factor is $\Phi(x)$
    * Points inside the sphere land on the same side of the hyperplane in $\Phi$-space

## Axis-aligned ellipsoid/hyperboloid

* $\Phi \colon \mathbb{R}^d \rightarrow \mathbb{R}^{2d}$
* $\Phi(x) = \begin{bmatrix} x_1^2& x_2^2& x_3^2 &\ldots& x_d^2& x_1 &x_2 &\ldots& x_d \end{bmatrix}^\top$
* 3D: $Ax_1^2 + B x_2^2 + Cx_3^2 + Dx_1 + Ex_2 + Fx_3 + \alpha = 0$
* The hyperplane is $w^\top \Phi(x) + \alpha = 0$
    * $w = \begin{bmatrix} A & B & C&D&E&F \end{bmatrix}$

## Ellipsoid/hyperboloid

* $\Phi \colon \mathbb{R}^d \rightarrow \mathbb{R}^{(d^2+3d)/2}$
* 3D: $Ax_1^2 + B x_2^2 + Cx_3^2 + Dx_1x_2 + Ex_2x_3 + Fx_3x_1 + Gx_1 + Hx_2 + Ix_3 + \alpha = 0$
* The isosurface defined by this equation is a quadric (in 2D: a conic section)

## Degree-p polynomial

* Ex. cubic in $\mathbb{R}^2$
    * $\Phi(x) = \begin{bmatrix} x_1^3 & x_1^2 x_2 & x_1x_2^2 & x_2^3 & x_1^2 & x_1x_2 & x_2^2 & x_1 & x_2 \end{bmatrix}^\top$
* $\Phi \colon \mathbb{R}^d \rightarrow \mathbb{R}^{\mathcal{O}(d^p)}$
* A linear boundary may have a small margin; higher-degree features can give wider margins
* Going to a high enough dimension makes the data linearly separable, but could overfit (feature-map sketch below)
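A small sketch of two of the lifting maps above: the parabolic lift $\Phi(x) = [x;\ \|x\|^2]$ and the degree-2 map for d = 2. The function names and toy points are mine; a linear classifier would then be trained on the lifted features.

```python
import numpy as np

def parabolic_lift(X):
    """Phi: R^d -> R^{d+1}, appending ||x||^2 so hyperspheres in x-space
    correspond to hyperplanes in Phi-space."""
    return np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])

def quadratic_features_2d(X):
    """Degree-2 map for d = 2: [x1^2, x1*x2, x2^2, x1, x2]; the bias/alpha is handled separately."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, x1 * x2, x2**2, x1, x2])

X = np.array([[1.0, 2.0], [0.5, -1.0]])
print(parabolic_lift(X).shape)          # (2, 3): R^2 -> R^3
print(quadratic_features_2d(X).shape)   # (2, 5): R^2 -> R^5
```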
## Edge Detection

* Edge detector: an algorithm for approximating grayscale/color gradients in images
    * e.g. tap filter, Sobel filter, oriented Gaussian derivative filter
* Compute gradients pretending the image is a continuous field
* Collect edge orientations in local histograms (each having 12 orientations); use the histograms as features instead of raw pixels

# Lecture 5: 2/3/21

## ML Abstractions

* Application/data:
    * Is the data labeled/classified?
        * Yes: categorical (classification), quantitative (regression)
        * No: similarity (clustering), positioning (dimensionality reduction)
* Model:
    * Decision functions: linear, polynomial, logistic
    * Nearest neighbors, decision trees
    * Features
    * Low vs high capacity (linear is low; polynomials and neural nets are high): affects overfitting, underfitting, inference
        * Capacity: a measure of how complicated your model can get
* Optimization problem:
    * Variables, objective fn, constraints
    * Ex. unconstrained, convex programs, least squares, PCA
* Optimization algorithm:
    * Gradient descent, simplex, SVD

## Optimization Problems

* Unconstrained: Find the w that minimizes a continuous objective fn $f(w)$; f is smooth if its gradient is continuous too
    * Global min: $f(w) \leq f(v), \forall v$
    * Local min: $f(w) \leq f(v), \forall v$ in a tiny ball centered at w
    * Finding a local min is easy; finding the global min is hard
        * Except for convex functions: a line segment connecting any 2 points on the graph does not go below f
        * The perceptron risk fn is convex and nonsmooth
    * A convex fn has either:
        * no min (goes to $-\infty$)
        * just one local min, which is the global min
        * a connected set of local minima which are all global minima
* Algorithms for smooth f
    * Gradient descent
        * blind: repeat $w \leftarrow w - \epsilon \nabla f(w)$
        * stochastic (just 1 training pt at a time): $w \leftarrow w - \epsilon \nabla_w L(X_i^\top w, y_i)$
        * with line search
    * Newton's method (needs the Hessian matrix)
    * Nonlinear conjugate gradient
* Algorithms for nonsmooth f
    * Gradient descent
    * BFGS
* These algs find a local min
* Line search: Find a local min along the search direction by solving an optimization problem in 1D
    * Secant method (smooth only)
    * Newton-Raphson (smooth only, may need the Hessian)
    * Direct line search (nonsmooth ok), like golden section search
* Constrained optimization (smooth equality constraints)
    * Goal: Find the w that minimizes $f(w)$ subject to $g(w)=0$ where g is smooth
    * Alg: Use Lagrange multipliers
* Linear program
    * Linear objective, linear inequality constraints
    * Goal: Find the w that maximizes $c^\top w$ subject to $Aw \leq b$
        * i.e. $A_i^\top w \leq b_i$ for $i \in [1, n]$
    * Feasible region: set of pts that satisfy all the constraints
    * The LP feasible region is a polytope (not always bounded)
    * A point set P is convex if every pair of points' connecting line segment lies fully in P
    * Active constraints: constraints that achieve equality at the optimum (a scipy sketch follows)
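A sketch of the LP formulation above using scipy (an assumption; the lecture does not prescribe a solver). `linprog` minimizes, so the objective $c^\top w$ is negated; the particular `c`, `A`, `b` are made-up toy numbers.

```python
import numpy as np
from scipy.optimize import linprog

# Toy LP: maximize c^T w subject to A w <= b.
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
b = np.array([4.0, 3.0, 3.0])

# linprog minimizes, so pass -c; bounds=(None, None) leaves w unconstrained in sign.
res = linprog(-c, A_ub=A, b_ub=b, bounds=(None, None))
print(res.x)       # optimal vertex of the feasible polytope
print(c @ res.x)   # optimal objective value
```

The optimum lands on a vertex of the polytope where some constraints are active (hold with equality), which is the geometric picture the bullets above describe.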
# Lecture 6: 2/8/21

* Decision theory / risk minimization
* Multiple sample pts with different classes can lie at the same pt
* Want a probabilistic classifier
* Example: 10% of the population has cancer, 90% doesn't
  ![](https://i.imgur.com/O6xGb6b.png)
* Posterior probability: $P(Y=1|X)$
* Prior probability: $P(Y=1)$
* Loss function $L(z,y)$ where z is the prediction and y is the true label
    * The loss function specifies how bad it is if the classifier predicts z but the true class is y
    * ![](https://i.imgur.com/rAiHqd5.png)
    * A 36% chance of a loss of 5 is worse than a 64% chance of a loss of 1, so recommend further cancer screening
    * The 0-1 loss function is 1 for incorrect predictions, 0 for correct ones
    * The loss function does not need to be convex
* The risk for a decision rule r is the expected loss over all values of x, y
    * $R(r) = E[L(r(X),Y)]$
    * $= \sum\limits_x \left(L(r(x),1)P(Y=1|X=x) + L(r(x),-1)P(Y=-1|X=x)\right)P(X=x)$
* The Bayes decision rule / Bayes classifier is the fn $r^*$ that minimizes the functional $R(r)$ (see the numerical sketch at the end of this lecture's notes)
  ![](https://i.imgur.com/qvuuBF8.png)
    * If L is symmetric, pick the class with the biggest posterior probability
* The Bayes risk / optimal risk is the risk of the Bayes classifier
    * No decision rule gives lower risk
* Deriving/using $r^*$ is called risk minimization
* Expected value of $g(X)$: $E[g(X)] = \int\limits^\infty_{-\infty} g(x)f(x)\,dx$
    * Mean: $\mu = \int\limits^\infty_{-\infty} xf(x)\,dx$
* If L is the 0-1 loss, $R(r)$ is the probability that r(x) is wrong
    * The Bayes optimal decision boundary is $\{ x \colon P(Y=1 | X=x) = 0.5 \}$

## 3 ways to build classifiers

* Generative models (ex. LDA, Linear Discriminant Analysis)
    * Assume sample pts come from probability distributions, different for each class
    * Guess the form of each distribution
    * For each class C, fit the distribution params to the class C pts, giving $f(X|Y=C)$
    * For each class C, estimate $P(Y=C)$
    * Bayes' theorem gives $P(Y|X)$
    * If 0-1 loss, pick the class C that maximizes $P(Y=C|X=x)$, equivalently maximizes $f(X=x|Y=C)P(Y=C)$
* Discriminative models (ex. logistic regression)
    * Skip the prior probabilities; model $P(Y|X)$ directly
* Find the decision boundary directly (ex. SVM)
* Advantage of 1, 2: $P(Y|X)$ tells you the probability that your guess is wrong
* Advantage of 1: can diagnose outliers if P(X) is very small
* Disadvantage of 1: Often hard to estimate distributions correctly; real distributions rarely match standard ones
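A minimal sketch of the Bayes decision rule with an asymmetric loss. The loss values below are assumptions chosen to match the "36% chance of a loss of 5 vs 64% chance of a loss of 1" reasoning above, not necessarily the lecture's exact table (which is in the embedded image).

```python
import numpy as np

# Illustrative asymmetric loss L[z, y]: rows index the prediction z, columns the
# true class y, with class order (no cancer = 0, cancer = 1).  These numbers are
# assumptions for this sketch.
L = np.array([[0.0, 5.0],    # predict "no cancer": big loss if the patient has cancer
              [1.0, 0.0]])   # predict "cancer":    small loss if the patient is healthy

def bayes_rule(posterior):
    """Pick the prediction z minimizing the expected loss sum_y L(z, y) P(Y=y | X=x)."""
    p = np.array([1.0 - posterior, posterior])   # [P(Y=0|x), P(Y=1|x)]
    expected_loss = L @ p
    return int(np.argmin(expected_loss)), expected_loss

# With P(cancer | x) = 0.36: loss 0.36*5 = 1.8 for "no cancer" vs 0.64*1 = 0.64 for
# "cancer", so the rule recommends further screening.
print(bayes_rule(0.36))
```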
# Lecture 7: 2/10/21

* Gaussian Discriminant Analysis (GDA): Assume each class comes from a normal distribution
* Isotropic normal dist $X \sim N(\mu, \sigma^2)$: $f(x) = \frac{1}{(\sqrt{2 \pi} \sigma)^d} \exp\left(-\frac{\|x-\mu\|^2}{2\sigma^2}\right)$
    * $\mu, x$ are length-d vectors, $\sigma$ is a scalar
* For each class C, estimate the mean $\mu_C$, variance $\sigma_C^2$, and prior $\pi_C = P(Y=C)$
* Given x, the Bayes decision rule $r^*(x)$ predicts the class C that maximizes $f(X=x|Y=C)\pi_C$
    * $Q_C(x) = \ln\left((\sqrt{2 \pi})^d f_C(x)\, \pi_C\right) = -\frac{\|x-\mu_C\|^2}{2\sigma_C^2} - d \ln \sigma_C + \ln \pi_C$
    * Quadratic in x, easier to optimize
    * Find $\arg \max\limits_C Q_C(x)$
* The Bayes decision boundary is where $Q_C(x) - Q_D(x) = 0$
    * In 1D, it may have 1, 2, or 0 pts
      ![](https://i.imgur.com/13GjAok.png)
* Use Bayes' theorem to recover the posterior probabilities in the 2-class case
    * $P(Y=C|X) = \frac{f(X|Y=C) \pi_C}{f(X|Y=C) \pi_C + f(X|Y=D) \pi_D}$
    * $= s(Q_C(x) - Q_D(x))$
    * Logistic/sigmoid function: $s(\gamma) = \frac{1}{1 + e^{-\gamma}}$

## Linear Discriminant Analysis (LDA)

* Variant of QDA that has linear decision boundaries
    * Less likely to overfit
    * Less faithful probabilistic estimation
    * Use validation on both methods to see which is better
* Assume all Gaussians have the same variance
* $Q_C(x) - Q_D(x) = \left(\frac{\mu_C - \mu_D}{\sigma^2}\right)^\top x + \left(-\frac{\|\mu_C\|^2 - \|\mu_D\|^2}{2\sigma^2} + \ln \pi_C - \ln \pi_D \right)$
* Choose the class C that maximizes the linear discriminant fn
    * $\left(\frac{\mu_C}{\sigma^2}\right)^\top x -\frac{\|\mu_C\|^2}{2\sigma^2} + \ln \pi_C$
* Generative model, as opposed to a discriminative model

## Maximum Likelihood Estimation (MLE)

* Flip a biased coin: heads with prob $p$, tails with prob $1-p$
    * Goal: estimate $p$
    * The number of heads is $X \sim B(n,p)$, a binomial dist
    * $\mathcal{L}$ is the likelihood fn; e.g. with $n = 10$ flips and 8 heads observed, $\mathcal{L}(p) = P(X=8) = \binom{10}{8} p^8 (1-p)^2 = 45p^8 (1-p)^2$
    * Read it as a function of the distribution parameters
* MLE is a method of estimating the params of a statistical distribution by picking the params that maximize $\mathcal{L}$
    * It is a method of density estimation: estimating a PDF from data
    * Find the p that maximizes $\mathcal{L}(p)$
* Likelihood of a Gaussian
    * $\mathcal{L}(\mu, \sigma; X_1, X_2,\dots X_n) = f(X_1)f(X_2)\cdots f(X_n)$
    * $\ell = \ln \mathcal{L}$ is the log likelihood
    * Maximizing $\mathcal{L}$ and maximizing $\ell$ give the same params
    * $\ell(\mu, \sigma; X_1, X_2,\dots X_n) = \ln f(X_1) + \ln f(X_2) + \cdots + \ln f(X_n)$
    * $= \sum\limits_{i=1}^n \left( -\frac{\|X_i - \mu\|^2}{2\sigma^2} - d \ln \sqrt{2\pi} - d \ln \sigma \right)$
    * Which is a sum of logs of the normal PDF
    * Want to set $\nabla_\mu \ell = 0, \frac{\partial \ell }{\partial \sigma} = 0$
    * $\nabla_\mu \ell = 0 \implies \hat{\mu} = \frac{1}{n} \sum\limits_{i=1}^n X_i$
    * $\frac{\partial \ell }{\partial \sigma} = 0 \implies \hat{\sigma}^2 = \frac{1}{dn}\sum\limits_{i=1}^n \|X_i - \mu\|^2$
        * Use $\hat{\mu}$ in place of $\mu$
* Tells us to use the sample mean and variance of the pts in class C to estimate the mean and variance of the Gaussian for class C
* For QDA: estimate the conditional mean $\hat{\mu}_C$ and conditional variance $\hat{\sigma}_C^2$ of each class C separately
    * $\hat{\pi}_C = \frac{n_C}{\sum_D n_D}$
    * The denominator is the total number of sample pts in all classes
* For LDA: same means and priors, but one variance for all classes
    * Pooled within-class variance: $\hat{\sigma}^2 = \frac{1}{dn}\sum\limits_{C} \sum\limits_{\{i \colon y_i = C\}} \|X_i - \hat{\mu}_C\|^2$ (see the estimation sketch below)
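A sketch of the MLE estimates above for isotropic QDA/LDA: per-class means, per-class variances, priors, and the pooled within-class variance. The function name and toy data are my own assumptions.

```python
import numpy as np

def fit_isotropic_gda(X, y):
    """MLE estimates for isotropic QDA/LDA: per-class mean, isotropic variance,
    prior, and the pooled within-class variance shared by LDA.
    X is n x d; y holds class labels."""
    n, d = X.shape
    params, pooled = {}, 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)                          # sample mean of class c
        var = ((Xc - mu) ** 2).sum() / (d * len(Xc))  # sigma_c^2 = (1/(d*n_c)) sum ||X_i - mu_c||^2
        params[c] = {"mu": mu, "var": var, "prior": len(Xc) / n}
        pooled += ((Xc - mu) ** 2).sum()
    return params, pooled / (d * n)                   # per-class params, pooled variance

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y = np.array([0, 0, 1, 1])
params, pooled_var = fit_isotropic_gda(X, y)
print(params[0]["mu"], params[0]["prior"], pooled_var)
```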
# Lecture 8: 2/17/21

* If v is an eigenvector of A with eigenvalue $\lambda$, then v is an eigenvector of $A^k$ with eigenvalue $\lambda^k$
* If A is invertible, $A^{-1}$ has eigenvector v with eigenvalue $\frac{1}{\lambda}$
* Spectral theorem: Every real symmetric n×n matrix has real eigenvalues and n mutually orthogonal eigenvectors
* $\|A^{-1}x\|^2 = 1$ is an ellipsoid with axes $v_1, \ldots v_n$ and radii $\lambda_1, \ldots \lambda_n$
    * Special case: if A is diagonal, the eigenvectors are the coordinate axes and the ellipsoid is axis-aligned
* A symmetric matrix is:
    * PD: all positive eigenvalues
    * PSD: all nonnegative eigenvalues
    * Indefinite: has at least 1 positive eigenvalue and 1 negative eigenvalue
    * Invertible: has no zero eigenvalue
* The quadratic form $x^\top A^{-2} x$ is PD because the eigenvalues of $A^{-2}$ are squared (hence positive)
* Orthonormal matrix: acts like a rotation or reflection
* Thm: $A = V \Lambda V^\top = \sum\limits^n_{i=1} \lambda_i v_i v_i^\top$
  ![](https://i.imgur.com/2Y8a7Lg.png)
  ![](https://i.imgur.com/SMYxFKv.png)

# Lecture 8: 2/22/21

## Anisotropic Gaussians

* Normal PDF: $f(x) = n(q(x)) = \frac{1}{\sqrt{(2 \pi)^d | \Sigma |}} \exp\left(- \frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$
    * $n(q) = \frac{1}{\sqrt{(2 \pi)^d | \Sigma |}}e^{-q/2}$
        * $\mathbb{R} \rightarrow \mathbb{R}$ (exponential)
    * $q(x) = (x - \mu)^\top \Sigma^{-1} (x - \mu)$
        * $\mathbb{R}^d \rightarrow \mathbb{R}$ (quadratic)
* Covariance matrix: $\Sigma = V \Gamma V^\top$
    * The eigenvalues of $\Sigma$ are the variances along the eigenvectors, $\Gamma_{ii} = \sigma_i^2$
    * $\Sigma ^{\frac{1}{2}} = V \Gamma^{\frac{1}{2}} V^\top$
        * Maps spheres to ellipsoids
        * The eigenvalues of $\Sigma ^{\frac{1}{2}}$ are the stdevs $\Gamma_{ii}^{\frac{1}{2}} = \sigma_i$ (Gaussian widths/ellipsoid radii)

## MLE for Anisotropic Gaussians

* Given sample pts and classes, find the best-fit Gaussians
* QDA: $\hat{\Sigma}_C = \frac{1}{n_C} \sum\limits_{i \colon y_i = C} (X_i - \hat{\mu}_C)(X_i - \hat{\mu}_C)^\top$
    * Conditional covariance (for pts in class C)
    * Sum of outer products, a d×d matrix
    * $\hat{\pi}_C, \hat{\mu}_C$ same as before
    * $\hat{\Sigma}_C$ is always PSD but not always PD
* Choose the class C that maximizes the discriminant fn
  ![](https://i.imgur.com/5KxJkeE.png)
    * The decision fn is quadratic and may be indefinite
* LDA can be interpreted as projecting points onto a line
* For 2 classes:
    * LDA has d+1 parameters (w, $\alpha$)
    * QDA has $\frac{d(d+3)}{2} + 1$ params
    * QDA is more likely to overfit
* With added features, LDA can give nonlinear boundaries and QDA nonquadratic ones
* We don't get the true optimal Bayes classifier (we estimate the distributions from finite data)
* Changing the priors or loss = adding constants to the discriminant fns
* The posterior gives decision boundaries for 10%, 50%, 90% probability
    * Choosing isovalue p is the same as
        * choosing an asymmetric loss of p for false positives and 1-p for false negatives, OR $\pi_C = 1-p, \pi_D = p$

## Terminology

* Let X be the n x d design matrix of sample pts
    * Each row i of X is a sample pt $X_i^\top$
* Centering X: Subtract $\mu^\top$ from each row of X: $X \rightarrow \dot{X}$
* Let R be the uniform distribution on sample pts
* Sample cov matrix is $Var(R) = \frac{1}{n} \dot{X}^\top \dot{X}$
* Decorrelating $\dot{X}$: Apply the rotation $Z = \dot{X}V$, where $Var(R) = V \Lambda V^\top$
    * $Var(Z) = \Lambda$
* Sphering $\dot{X}$: $W = \dot{X}\, Var(R)^{-1/2}$
* Whitening X: Centering + sphering
    * W has covariance matrix I (sketch below)
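A sketch of centering, decorrelating, and sphering the design matrix via the eigendecomposition of the sample covariance. It scales the decorrelated coordinates $Z = \dot{X}V$ by $\Gamma^{-1/2}$; the lecture's $W = \dot{X}\,Var(R)^{-1/2}$ differs only by a final rotation, and both yield identity covariance. Names and the toy data are mine.

```python
import numpy as np

def whiten(X, eps=1e-12):
    """Center, decorrelate, and sphere the n x d design matrix X.
    Returns a whitened matrix with (approximately) identity covariance."""
    Xc = X - X.mean(axis=0)                 # centering: subtract mu^T from each row
    cov = (Xc.T @ Xc) / len(X)              # sample covariance Var(R)
    lam, V = np.linalg.eigh(cov)            # eigendecomposition V Lambda V^T
    Z = Xc @ V                              # decorrelating: rotate onto the eigenvector axes
    W = Z / np.sqrt(lam + eps)              # sphering: divide out the standard deviations
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.3]])
W = whiten(X)
print(np.round((W.T @ W) / len(W), 2))      # ~ identity covariance
```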
# Lecture 9: 2/24/21

* Regression: Fitting curves to data
    * Classification: Given a pt, predict its class
    * Regression: Given a pt, predict a numerical value
* Choose the form of a regression fn $h(x; p)$ with parameters p
    * Like the decision fn in classification
* Choose a cost fn (objective fn) to optimize
    * Usually based on a loss fn
* Regression fns:
    * 1) Linear: $h(x; w, \alpha) = w ^\top x + \alpha$
    * 2) Polynomial: Add features to linear
    * 3) Logistic: $h(x; w, \alpha) = s(w^\top x + \alpha)$
        * $s(\gamma) = \frac{1}{1 + e^{- \gamma}}$
* Loss fns: z is the prediction $h(x)$, y is the true label
    * A) Squared error: $L(z,y) = (z-y)^2$
    * B) Absolute error: $L(z,y) = |z-y|$
    * C) Logistic loss/cross-entropy: $L(z,y) = -y \ln(z) - (1-y) \ln(1-z)$
* Cost fns to minimize:
    * a) Mean loss: $J(h) = \frac{1}{n} \sum\limits^n_{i=1} L(h(X_i), y_i)$
    * b) Max loss: $J(h) = \max\limits_{i=1 \ldots n} L(h(X_i), y_i)$
    * c) Weighted sum: $J(h) = \sum\limits^n_{i=1} \omega_i L(h(X_i), y_i)$
    * d) $l_2$ penalized/regularized: $J(h) = \ldots + \lambda \|w\|_2^2$
        * $\ldots$ is one of the previous options
    * e) $l_1$ penalized/regularized: $J(h) = \ldots + \lambda \|w\|_1$
* Famous regression methods:
    * Least-squares linear regression: 1Aa
        * Quadratic cost, minimize w/ calculus
    * Weighted least-squares linear: 1Ac
        * Quadratic cost, minimize w/ calculus
    * Ridge regression: 1Ad
        * Quadratic cost, minimize w/ calculus
    * LASSO: 1Ae
        * QP w/ an exponential number of constraints
    * Logistic regression: 3Ca
        * Convex cost fn, minimize w/ gradient descent or Newton's method
    * Least absolute deviations: 1Ba
        * LP
    * Chebyshev criterion: 1Bb
        * LP

## Least Squares Linear Regression

* Linear regression fn (1) + squared loss fn (A) + cost fn (a)
* Optimization problem: $\min\limits_{w, \alpha} \sum\limits^n_{i=1} (X_i ^\top w + \alpha - y_i)^2$
    * X is the n x d design matrix of sample pts
    * y is the n-vector of scalar labels
    * $X_i^\top$ transposes the column vector point
    * Feature column: $X_{*j}$
    * Usually $n > d$
* With the fictitious dimension, X is an n x (d+1) matrix and w is a (d+1)-vector
    * $\begin{bmatrix} x_1 & x_2 & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \alpha \end{bmatrix}$
* Residual sum of squares: $RSS(w) = \|Xw-y\|^2$; solve $\min\limits_w RSS(w)$
* Optimize with calculus:
    * If $X^\top X$ is singular, the problem is underconstrained (all pts fall on a common hyperplane)
    * Never overconstrained
    * Use a linear solver to find $w^* = (X^\top X)^{-1} X^\top y$
    * Pseudoinverse (a $(d+1) \times n$ matrix): $X^+ = (X^\top X)^{-1} X^\top$
    * The pseudoinverse is a left inverse: $X^+X = (X^\top X)^{-1} X^\top X = I$
    * Predicted values are $\hat{y} = Xw = XX^+y = Hy$
    * Hat matrix: $H = XX^+$ (an n x n matrix)
* Advantages: Easy to compute, unique/stable solution
* Disadvantages: Very sensitive to outliers because errors are squared; fails if $X^\top X$ is singular

## Logistic Regression

* Logistic regression fn (3) + logistic loss fn (C) + cost fn (a)
* Fits probabilities in the range (0,1)
* Discriminative model (as opposed to QDA/LDA, which are generative models)
* $\min\limits_w J(w) = \min\limits_w \sum\limits^n_{i=1} L(s(X_i^\top w), y_i) = \min\limits_w - \sum\limits^n_{i=1} \left( y_i \ln s(X_i^\top w) + (1-y_i) \ln(1-s(X_i^\top w)) \right)$
* $J(w)$ is convex; solve by gradient descent
    * $s'(\gamma) = s(\gamma)(1 - s(\gamma))$
    * $s_i = s(X_i^\top w)$, $\nabla_w J = -X^\top (y-s(Xw))$, with $s(Xw)$ applied elementwise
    * Gradient descent rule: $w \leftarrow w + \epsilon X^\top (y-s(Xw))$ (see the sketch after this lecture's notes)
    * SGD: $w \leftarrow w + \epsilon (y_i-s(X_i^\top w))X_i$
        * Shuffle the pts into random order, process one by one
        * For large n, sometimes converges before visiting all pts
* Logistic regression always separates linearly separable pts
* $\nabla^2_w J(w) = \sum\limits^n_{i=1} s_i(1-s_i)X_iX_i^\top = X^\top \Omega X$
    * $\Omega = \mathrm{diag}(s_i(1-s_i)) \succ 0$, making it PD
    * So $X^\top \Omega X \succeq 0$, so $J(w)$ is convex
    * Newton's method: solve $(X^\top \Omega X)e = X^\top(y-s)$ for e, then update $w \leftarrow w + e$
    * An example of iteratively reweighted least squares
    * Misclassified points far from the decision boundary have a large influence
    * Points close to the decision boundary also have medium to large influence
* LDA vs logistic regression
    * LDA advantages:
        * For well-separated classes LDA is stable, logreg is unstable
        * $>2$ classes are easy and elegant; logreg needs modifications (softmax regression)
        * LDA is slightly more accurate when the classes are nearly normal, especially if n is small
    * Logreg advantages:
        * More emphasis on the decision boundary; always separates linearly separable pts
        * More robust on some non-Gaussian distributions (e.g. with large skew)
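A sketch of batch gradient descent for logistic regression using the update rule $w \leftarrow w + \epsilon X^\top(y - s(Xw))$ above, with the bias folded in via a fictitious dimension. Labels are in {0, 1}; the learning rate, iteration count, and toy data are assumptions.

```python
import numpy as np

def sigmoid(g):
    g = np.clip(g, -500, 500)              # avoid overflow in exp for extreme inputs
    return 1.0 / (1.0 + np.exp(-g))

def fit_logistic(X, y, epsilon=0.1, iters=2000):
    """Batch gradient descent: w <- w + epsilon * X^T (y - s(Xw)), y in {0, 1}."""
    n, d = X.shape
    X1 = np.hstack([X, np.ones((n, 1))])   # fold the bias alpha into w
    w = np.zeros(d + 1)
    for _ in range(iters):
        w = w + epsilon * X1.T @ (y - sigmoid(X1 @ w))
    return w

def predict_proba(w, X):
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return sigmoid(X1 @ w)

X = np.array([[0.0, 0.5], [1.0, 1.0], [3.0, 2.5], [4.0, 3.0]])
y = np.array([0, 0, 1, 1])
w = fit_logistic(X, y)
print(np.round(predict_proba(w, X), 2))    # approaches y for linearly separable data
```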
# Lecture 10: 3/1/21

## Least Squares Polynomial Regression

* Replace each $X_i$ with a feature vector $\Phi(X_i)$
    * Ex. $\Phi(X_i) = \begin{bmatrix} X_{i1}^2 & X_{i1}X_{i2} & X_{i2}^2 & X_{i1} & X_{i2} & 1 \end{bmatrix}$
* Can also use non-polynomial features
* Logistic regression + quadratic features gives posteriors of the same form as QDA
* Very easy to overfit

## Weighted Least-Squares Regression

* Linear regression fn (1) + squared loss fn (A) + cost fn (c)
* Assign each sample pt a weight $\omega_i$, and form $\Omega = \mathrm{diag}(\omega_i)$
    * A bigger $\omega_i$ means work harder to minimize $|\hat{y}_i - y_i|^2$
* $\min\limits_w (Xw-y)^\top \Omega (Xw-y) = \sum\limits_{i=1}^n \omega_i (X_i^\top w - y_i)^2$
* $X^\top \Omega X w = X^\top \Omega y$; solve for $w$

## Newton's Method

* Iterative optimization method for a smooth fn $J(w)$
* Often much faster than gradient descent
* Approximate $J(w)$ near v by a quadratic fn:
    * $\nabla J(w) \approx \nabla J(v) + (\nabla^2 J(v))(w-v)$
    * Setting this to zero gives $w = v - (\nabla^2 J(v))^{-1} \nabla J(v)$
    * Repeat until convergence
* Warnings:
    * Doesn't know the difference between minima, maxima, and saddle pts
    * The starting pt must be close enough to the desired critical point

## ROC Curves

* For test sets, to evaluate a classifier
* Receiver Operating Characteristic
* Rate of FP vs TP over a range of threshold settings
    * x-axis is the false positive rate = % of negative items incorrectly classified as positive
    * y-axis is the true positive rate = % of positives classified as positive
* Top right: always classify as positive (accept any posterior $\geq 0$)
* Bottom left: always classify as negative (require posterior $> 1$)
* Want a high area under the curve (close to 1)
  ![](https://i.imgur.com/03KlejE.png)

# Lecture 11: 3/3/21

## Statistical Justifications for Regression

* Sample pts come from an unknown prob dist: $X_i \sim D$
* y values are the sum of an unknown nonrandom fn + random noise
    * $y_i = g(X_i) + \epsilon_i$, $\epsilon_i \sim D'$ with $D'$ having mean 0
* Goal of regression: find an h that estimates $g$
* Ideal approach: Choose $h(x) = E_Y[Y|X=x] = g(x) + E[\epsilon] = g(x)$

## Least Squares Regression from MLE

* Suppose $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$
    * Then $y_i \sim \mathcal{N}(g(X_i), \sigma^2)$
* $\ell(g; X, y) = \ln(f(y_1) f(y_2) \cdots f(y_n)) = -\frac{1}{2\sigma^2} \sum (y_i - g(X_i))^2 - \text{const}$
* MLE over g $\Leftrightarrow$ minimizing the sum of squared errors

## Empirical Risk

* The risk for hypothesis h is the expected loss $R(h) = E[L]$
* Discriminative: we don't know the dist D of X
* Empirical distribution: the discrete uniform distribution over the sample pts
* Empirical risk: expected loss under the empirical distribution
    * $\hat{R}(h) = \frac{1}{n} \sum\limits_{i=1}^n L(h(X_i), y_i)$
* So we minimize the sum of loss fns

## Logistic Loss from MLE

* The actual probability that pt $X_i$ is in class C is $y_i$; the predicted prob is $h(X_i)$
* Imagine $\beta$ duplicate copies of $X_i$
    * $y_i \beta$ copies are in class C, $(1 - y_i) \beta$ copies are not
* $\mathcal{L}(h; X,y) = \prod h(X_i)^{y_i \beta} (1-h(X_i))^{(1-y_i)\beta}$
* $\ell(h) = -\beta \sum$ of the logistic losses $L(h(X_i), y_i)$
    * The logistic loss is $-y_i \ln(z_i) - (1-y_i) \ln(1 - z_i)$
* Max likelihood --> min sum of logistic loss fns

## Bias-Variance Decomposition

* 2 sources of error in a hypothesis:
    * bias: error due to the inability of the hypothesis to fit g perfectly
    * variance: error due to fitting random noise in the data
* Model: $X_i \sim D, \epsilon_i \sim D', y_i = g(X_i) + \epsilon_i$
* Fit hypothesis h to X, y
    * h is a RV; its weights are random because they depend on the random X, y
* Consider an arbitrary pt z and $\gamma = g(z) + \epsilon$
    * $E[\gamma] = g(z)$, $Var(\gamma) = Var(\epsilon)$
* Risk fn when loss = squared error: $R(h) = E[L(h(z), \gamma)]$
    * Taking the expectation over all possible training sets X, y and values of $\gamma$
    * $R(h) = E[(h(z)- \gamma)^2]$
    * $R(h) = (E[h(z)]-g(z))^2 + Var(h(z)) + Var(\epsilon)$
        * First term: bias²
        * Second term: variance of the method
        * Third term: irreducible error
* Underfitting = too much bias
* Most overfitting = too much variance
* Training error reflects bias but not variance
* For most distributions, variance $\to 0$ as $n \to \infty$
* If h can fit g exactly, for many distributions bias $\to 0$ as $n \to \infty$
* If h cannot fit g well, bias is large at most points
* Adding a good feature reduces bias; adding a bad feature rarely increases it
* A good feature reduces bias more than it increases variance
* Adding a feature usually increases variance
* Noise in the test set affects only Var($\epsilon$)
* Noise in the training set affects only bias and Var(h)
* Can't precisely measure bias and variance on real-world data
* Can test learning algs by choosing g and making synthetic data (see the sketch below)
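A minimal synthetic experiment of the kind the last bullet suggests: choose a g, repeatedly sample training sets, and estimate the bias² and variance of least squares at one test point z. All the constants (g, n, the noise level, the number of trials) are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):                       # ground-truth function chosen for the experiment
    return 2.0 * x + 1.0

def fit_least_squares(X, y):    # linear least squares with a bias feature
    X1 = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return w

z, n, trials, noise_sd = 0.5, 30, 2000, 1.0
preds = np.empty(trials)
for t in range(trials):
    X = rng.uniform(-1, 1, size=n)                 # X_i ~ D
    y = g(X) + rng.normal(0, noise_sd, size=n)     # y_i = g(X_i) + eps_i
    w = fit_least_squares(X, y)
    preds[t] = w[0] * z + w[1]                     # h(z) for this training set

bias_sq = (preds.mean() - g(z)) ** 2               # (E[h(z)] - g(z))^2, ~0 since h can fit g
variance = preds.var()                             # Var(h(z)), shrinks as n grows
print(bias_sq, variance, noise_sd ** 2)            # third term is the irreducible error
```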
# Lecture 12: 3/8/21

## Ridge Regression

* 1 + A + d
    * $l_2$-penalized mean loss
* $\min \|Xw - y\|^2 + \lambda \|w'\|^2$
    * $w'$ is $w$ with the component $\alpha$ replaced by zero
    * Don't penalize $\alpha$
* Regularization term / penalty term, causes shrinkage
* The penalty guarantees positive definite normal equations (a unique solution)
* Reduces overfitting by reducing variance: penalizes large weights
* Normal equations: $(X^\top X + \lambda I')w = X^\top y$
    * $I' \triangleq I$ with the bottom-right entry set to 0
* Solve for w; return $h(z) = w^\top z$
* Increasing $\lambda$ means more regularization
* For $y=Xv+e$, with e being noise:
    * The variance of ridge regression is $Var(z^\top (X^\top X + \lambda I')^{-1}X^\top e)$

## Bayesian Justification

* Prior probability: $w' \sim N(0, \sigma^2)$
* Apply MLE to the posterior
    * Maximize the log posterior: $\ln \mathcal{L}(w) + \ln f(w') - \text{const}$
    * Equivalent to minimizing $\|Xw - y\|^2 + \lambda \|w'\|^2$

## Feature Subset Selection

* All features increase variance, but not all features reduce bias
* Idea: Identify poorly predictive features and ignore them (give them weight 0)
    * Less overfitting, smaller test errors
    * 2nd motivation: inference
    * Usually hard because features can partly encode the same information
* Algorithm: Try all $2^d - 1$ nonempty subsets of features
    * Too slow; choose the best subset by (cross-)validation
* Heuristic 1: Forward stepwise selection
    * Start with the null model (0 features)
    * Repeatedly add the best feature until validation error increases
    * Trains $O(d^2)$ models
    * Not perfect: it won't find the best 2-feature model if neither of those features yields the best 1-feature model
* Heuristic 2: Backward stepwise selection
    * Start with all d features; repeatedly remove the feature whose removal best reduces validation error
    * Trains $O(d^2)$ models

## LASSO

* Regression w/ regularization: 1 + A + e
    * $l_1$-penalized mean loss
* $\min \|Xw - y\|^2 + \lambda \|w'\|_1$
* Algs: subgradient descent, least-angle regression
* (A small sketch of ridge's normal equations follows)
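A sketch of solving the ridge normal equations from this lecture directly, with $I'$ zeroing out the penalty on the bias $\alpha$ (the fictitious-dimension weight). The toy data and $\lambda$ values are assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lambda * I') w = X^T y, where I' is the identity with its
    bottom-right entry zeroed so the bias alpha is not penalized."""
    n, d = X.shape
    X1 = np.hstack([X, np.ones((n, 1))])       # append the fictitious dimension
    I_prime = np.eye(d + 1)
    I_prime[-1, -1] = 0.0                      # don't penalize alpha
    return np.linalg.solve(X1.T @ X1 + lam * I_prime, X1.T @ y)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])
for lam in (0.0, 1.0, 10.0):                   # more regularization shrinks w'
    print(lam, np.round(ridge_fit(X, y, lam), 3))
```

As $\lambda$ grows the non-bias weights shrink toward zero, which is the variance-reduction effect the ridge bullets describe; LASSO's $l_1$ penalty would instead drive some weights exactly to zero.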