# CS189 MT Notes
# Lecture 1: 1/20/21
* Decision boundary has `d-1` dimensions
* Training set error: Fraction of training images not classified correctly
* Test set error: Fraction of misclassified NEW images, not seen during training
* Outliers: points whose labels are atypical
* Overfitting: When test error deteriorates because the classifier is too sensitive to outliers
* Validation data is taken from training data
* Train classifier with different hyperparams
* Training set: Used to learn model weights
* Validation set: Used to tune hyperparams
* Test set: Used as final evaluation of model
* Supervised learning:
* Classification: Is this email spam?
* Regression: How likely is it that this patient has cancer?
* Unsupervised learning:
* Clustering: Which DNA sequences are similar?
* Dimensionality reduction: What are common features of faces?
# Lecture 2: 1/25/21
* Decision boundary: Boundary chosen by the classifier to separate items
* Overfitting: When decision boundary fits sample points so well that it does not classify future points well
* Some classifiers work by computing a decision/predictor/discriminant function
* A function that maps a pt x to a scalar such that $f(x) > 0 \implies x \in C$ and $f(x) < 0 \implies x \notin C$; pick a convention for the $f(x) = 0$ case
* Decision boundary is $\{ x \in \mathbb{R}^d \colon f(x) = 0 \}$
* This set is in $d-1$ dimensional space (isosurface of f, iso value = 0)
* Decision boundary is a linear plane (linear decision function)
* $f(x) = w^\top x + \alpha$,
* Hyperplane: $H = \{ x: w^\top x = -\alpha\}$
* $w$ is the normal vector of H (perpendicular to H)
* If $w$ is unit vector, $w^\top x + \alpha$ is the signed distance from x to H
* If $\|w\| = 1$, the distance from H to the origin is $|\alpha|$; $\alpha = 0 \iff$ H passes through the origin
* Coefficients in $w, \alpha$ are weights/params
* Linearly separable - if there exists a hyperplane that separates all sample pts in class C from all pts not in C
## Math Review
* Vectors default to be column vectors
* Uppercase roman = matrix, RVs, sets
* Lowercase roman = vectors
* Greek = scalar
* Other scalars:
* n = # of sample pts
* d = # of features (per point) aka dimensionality
* i, j, k are indices
* functions (often scalar): $f(\cdot)$
## Centroid method
* Centroid method: compute mean $\mu_c$ of all pts in class C, mean $\mu_x$ of all pts not in C
* Use decision fn $f(x) = (\mu_c - \mu_x)^\top x - (\mu_c - \mu_x)^\top \frac{\mu_c + \mu_x}{2}$ (see the sketch below)
* $\alpha$ is chosen so that $f(x) = 0$ at the midpoint between $\mu_c$ and $\mu_x$
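A minimal NumPy sketch of the centroid method described above (the function name, variable names, and toy data are mine, not from the lecture):

```python
import numpy as np

def centroid_classifier(X_in, X_out):
    """Centroid method: X_in holds pts in class C, X_out holds pts not in C.
    Returns a decision function f(x) = w.x + alpha (positive => class C)."""
    mu_c = X_in.mean(axis=0)            # mean of class C
    mu_x = X_out.mean(axis=0)           # mean of pts not in C
    w = mu_c - mu_x                     # normal vector of the decision boundary
    alpha = -w @ ((mu_c + mu_x) / 2)    # f = 0 at the midpoint of the two means
    return lambda x: w @ x + alpha

rng = np.random.default_rng(0)
f = centroid_classifier(rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2)))
print(f(np.array([2.0, 2.0])) > 0)      # expected: True (class C side)
```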
## Perceptron Algorithm
* Slow, but correct for linearly sep pts
* Numerical optimization algorithm - gradient descent
* For n sample pts $X_1, \ldots X_n$ with labels $y_i \in \{-1, 1\}$
* Consider decision boundary that passes through origin
* Goal: find weights w/ constraint: $y_iX_i^\top w \geq 0$
* $X_i^\top w \geq 0$ if $y_i = 1$
* $X_i^\top w \leq 0$ if $y_i = -1$
* Objective/cost/risk function R:
* positive if a constraint is violated
* Use optimization to choose w that minimizes R
* Define loss function
* $L(z, y_i) = \begin{cases} 0 & y_i z \geq 0\\ -y_i z & \text{otherwise} \end{cases}$
* $z = X_i^\top w$ is classifier prediction, $y_i$ is correct label
* If $z$ has same sign as $y_i$, loss fn is zero
* If z has wrong sign, loss fn is positive
* Risk/objective/cost fn:
* $R(w) = \frac{1}{n} \sum\limits_{i=1}^n L(X_i^\top w, y_i)$
* $R(w) = \frac{1}{n} \sum\limits_{i \in V} -y_iX_i^\top w$ with $V$ having all i st $y_iX_i^\top w < 0$
* $R(w) = 0$ if all X classified correctly, otherwise positive
* Goal: Find w that minimizes $R(w)$
* $w=0$ is useless, don't want that
# Lecture 3: 1/27/21
* Gradient descent: Find gradient of R wrt w, take step in opposite direction
* $\nabla R(w) = \begin{bmatrix} \frac{\partial R}{\partial w_1} & \cdots & \frac{\partial R}{\partial w_d}\end{bmatrix}^\top$
* $\nabla_w(z^\top w) = z$
* $\nabla R(w) = - \sum\limits_{i \in V} y_i X_i$
* Take step in $-\nabla R(w)$ direction
* Algorithm:
* $w \leftarrow$ arbitrary nonzero vector (good choice is any $y_i X_i$)
* while $R(w) > 0$:
* $V \leftarrow$ set of indices that $y_iX_i^\top w < 0$
* $w \leftarrow w + \epsilon \sum\limits_{i \in V} y_i X_i$
* return w
* $\epsilon > 0$ is the step size/learning rate
* Chosen empirically
* Cons: Slow; each step takes $\mathcal{O}(nd)$ time - n dot products of vectors in $\mathbb{R}^d$
* For linearly separable pts, the perceptron algorithm is guaranteed to converge no matter the step size
## Stochastic Gradient Descent
* Each step, pick 1 misclassified $X_i$
* do gradient descent on loss fn $L(X_i^\top w, y_i)$
* Known as the perceptron algorithm
* Each step takes $\mathcal{O}(d)$ time
* while some $X_i$ is misclassified ($y_iX_i^\top w < 0$):
* $w \leftarrow w + \epsilon y_i X_i$
* return w
## Fictitious dimension
* If separating hyperplane does not pass origin, add a fictitious dimension
* $f(x) = w^\top x + \alpha = \begin{bmatrix}w_1 & w_2 & \alpha \end{bmatrix} \begin{bmatrix}x_1 \\ x_2 \\ 1 \end{bmatrix}$
* Sample pts in $\mathbb{R}^{d+1}$ all lying on hyperplane $x_{d+1} = 1$
* Run perceptron algorithm in $d+1$ dimension
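A short NumPy sketch of the perceptron with stochastic updates and the fictitious dimension appended, so the learned hyperplane need not pass through the origin (the step size and epoch cap are illustrative):

```python
import numpy as np

def perceptron(X, y, eps=0.1, max_epochs=1000):
    """X: n x d sample pts; y: labels in {-1, +1}.
    A fictitious 1 is appended so the last weight plays the role of alpha."""
    X1 = np.hstack([X, np.ones((len(X), 1))])   # lift to R^{d+1}, x_{d+1} = 1
    w = y[0] * X1[0]                            # arbitrary nonzero start, e.g. y_i X_i
    for _ in range(max_epochs):
        updated = False
        for Xi, yi in zip(X1, y):
            if yi * (Xi @ w) < 0:               # constraint y_i X_i.w >= 0 violated
                w = w + eps * yi * Xi           # SGD step: w <- w + eps y_i X_i
                updated = True
        if not updated:                         # R(w) = 0: every pt classified correctly
            break
    return w
```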
## Hard-margin SVM
* Margin of a linear classifier is the dist from the decision boundary to the nearest sample pt
* Try to make margin as wide as possible
* If $||w|| = 1$, signed dist is $w ^\top X_i + \alpha$
* Otherwise signed dist is $\frac{w}{||w||} ^\top X_i + \frac{\alpha}{||w||}$
* Hence the margin is $\min\limits_i \frac{1}{||w||} |w^\top X_i + \alpha| \geq \frac{1}{||w||}$
* QP in $d+1$ dimensions, $n$ constraints
* Objective: $\min\limits_{w, \alpha} ||w||^2_2$
* Constraints: $y_i (w^\top x_i + \alpha) \geq 1$ for $i = 1 \ldots n$
* Impossible for $w$ to be set to 0
* Has a unique solution (if any)
* Maximum margin classifier aka hard-margin SVM
* Margin = $\frac{1}{||w^*||}$
* Slab = 2*margin, margin is dist from boundary to nearest pt
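For intuition only, a hard margin can be approximated with scikit-learn's linear SVC by making the slack penalty C huge (this reuses a soft-margin solver rather than the exact QP above; assumes scikit-learn is installed and the toy data are mine):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(3, 0.5, (20, 2)), rng.normal(-3, 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)              # two linearly separable classes

clf = SVC(kernel="linear", C=1e10)              # huge C ~ no slack ~ hard margin
clf.fit(X, y)
w, alpha = clf.coef_[0], clf.intercept_[0]
print("margin ~", 1 / np.linalg.norm(w))        # margin = 1 / ||w*||
```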
# Lecture 4: 2/1/21
## Soft Margin SVM
* Allow some pts to violate the margin: slack variable $\xi_i \geq 0$ measures how badly pt i violates it
* Margin is no longer the dist from decision boundary to nearest pt; it is just $\frac{1}{\|w\|}$
* Prevent abuse of slack by adding a loss term to the objective
* Objective: Find $w, \alpha, \xi_i$ to $\min ||w||^2 + C\sum\limits^n_{i=1} \xi_i$
* Constraints: $y_i (X_i^\top w + \alpha) \geq 1 - \xi_i$
* $\xi_i \geq 0$
* QP in $d+n+1$ dimensional space and $2n$ constraints
* There is a tradeoff between C and $||w||^2$
* $C > 0$ is a hyperparameter scalar
|  | Small C | Big C |
| -------- | -------- | -------- |
| Desire | Max margin $\frac{1}{\|w\|}$ | Keep most slack variables 0 or small |
| Danger | Underfitting (misclassifies training data) | Overfitting (awesome training, awful test) |
| Outliers | Less sensitive | Very sensitive |
| Boundary | More "flat" | More sinuous/curvy |
## Features
* Make nonlinear features that lift points to a higher dimensional space
* High-d linear classifier <--> low-d nonlinear classifier
* Parabolic lifting map: $\Phi \colon \mathbb{R}^d \rightarrow \mathbb{R}^{d+1}$
* $\Phi(x) = \begin{bmatrix} x \\ \|x\|^2 \end{bmatrix}$
* Lifts x onto paraboloid $x_{d+1} = \|x\|^2$
* Linear classifier in $\Phi$ space induces a sphere classifier in $x$ space
* Theorem: $\Phi(X_1), \ldots \Phi(X_n)$ are linearly separable iff $X_1, \ldots X_n$ are separable by a hypersphere
* Proof: Consider hypersphere in $\mathbb{R}^d$ center c, radius $\rho$
* $\|x-c\|^2 < \rho^2$
* $\|x\|^2 - 2c^\top x + \|c\|^2 < \rho^2$
* $\begin{bmatrix} -2c^\top & 1\end{bmatrix} \begin{bmatrix} x \\ \|x\|^2\end{bmatrix} < \rho^2 - \|c\|^2$
* So $\begin{bmatrix} -2c \\ 1\end{bmatrix}$ is the normal vector of a hyperplane in $\mathbb{R}^{d+1}$, dotted with $\Phi(x)$
* Points inside the sphere correspond to points on the same side of this hyperplane in $\Phi$ space (and vice versa)
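A quick numerical check of the lifting-map theorem on toy data (the data and the choice c = 0, ρ = 1 are mine): points inside a sphere land on one side of a hyperplane in Φ space.

```python
import numpy as np

def lift(X):
    """Parabolic lifting map Phi: R^d -> R^{d+1}, appending ||x||^2."""
    return np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (200, 2))
y = np.where((X ** 2).sum(axis=1) < 1, 1, -1)   # class = inside the unit circle

c, rho = np.zeros(2), 1.0                        # sphere with center c, radius rho
w = np.append(-2 * c, 1.0)                       # hyperplane normal [-2c^T, 1] in Phi space
f = lift(X) @ w - (rho ** 2 - np.linalg.norm(c) ** 2)
print(np.all((f < 0) == (y == 1)))               # inside sphere <=> negative side in Phi space
```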
## Axis-aligned ellipsoid/hyperboloid
* $\Phi \colon \mathbb{R}^d \rightarrow \mathbb{R}^{2d}$
* $\Phi(x) = \begin{bmatrix} x_1^2& x_2^2& x_3^2 &\ldots& x_d^2& x_1 &x_2 &\ldots& x_d \end{bmatrix}^\top$
* 3D: $Ax_1^2 + B x_2^2 + Cx_3^2 + Dx_1 + Ex_2 + Fx_3 + \alpha = 0$
* Hyperplane is $w^\top \Phi(x) + \alpha = 0$
* $w = \begin{bmatrix} A & B & C&D&E&F \end{bmatrix}^\top$
## Ellipsoid/hyperboloid
* $\Phi \colon \mathbb{R}^d \rightarrow \mathbb{R}^{(d^2+3d)/2}$
* 3D: $Ax_1^2 + B x_2^2 + Cx_3^2 + Dx_1x_2 + Ex_2x_3 + Fx_3x_1 + Gx_1 + Hx_2 + Ix_3 + \alpha = 0$
* Isosurface defined by this equation is a quadric (in 2D: a conic section)
## Degree-p polynomial
* Ex. cubic in $\mathbb{R}^2$
* $\Phi(x) = \begin{bmatrix} x_1^3 & x_1^2 x_2 & x_1x_2^2 & x_2^3 & x_1^2 & x_1x_2 & x_2^2 & x_1 & x_2 \end{bmatrix}^\top$
* $\Phi \colon \mathbb{R}^d \rightarrow \mathbb{R}^{\mathcal{O}(d^p)}$
* A linear decision boundary often has a small margin; higher-degree features can give a wider margin in the lifted space
* Going to a dimension high enough makes it linearly separable, but could overfit
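scikit-learn's `PolynomialFeatures` computes this kind of degree-p lift; a small sketch (my own toy data) showing how quickly the lifted dimension grows:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.default_rng(0).normal(size=(5, 2))   # n = 5 sample pts in R^2
for p in (1, 2, 3):
    Phi = PolynomialFeatures(degree=p, include_bias=False).fit_transform(X)
    print(f"degree {p}: lifted dimension = {Phi.shape[1]}")
# degree 1: 2, degree 2: 5, degree 3: 9 (matches the 9-term cubic example above)
```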
## Edge Detection
* Edge detector: algorithm for approximating grayscale/color gradients in images
* Tap filter, Sobel filter, oriented Gaussian derivative filter
* Compute gradients pretending it is a continuous field
* Collect line orientations in local histograms (each having 12 orientations), use histograms as features instead of raw pixels
# Lecture 5: 2/3/21
## ML Abstractions
* Application/data:
* Is data labeled/classified?
* Yes: Categorical (classification), quantitative(regression)
* No: Similarity (clustering), positioning (dimensionality reduction)
* Model:
* Decision functions: linear, polynomial, logistic
* nearest neighbors, decision trees
* Features
* Low vs high capacity (linear has low, poly have high, neural nets have high): affect overfitting, underfitting, inference
* Capacity - a measure of how complicated your model can get
* Optimization Problem
* Variables, objective fn, constraints
* Ex. unconstrained, convex programs, least squares, PCA
* Optimization Algorithm
* Gradient descent, simplex, SVD
## Optimization Problems
* Unconstrained: Find w that minimizes a continuous objective fn $f(w)$, f is smooth if gradient is continuous too
* Global min: $f(w) \leq f(v), \forall v$
* Local min: $f(w) \leq f(v), \forall v$ in tiny ball centered at w
* Finding a local min is easy, a global min is hard
* Except for convex functions
* Line segment connecting any 2 points on the graph of f does not go below f
* Perceptron risk fn is convex and nonsmooth
* A convex fn has either:
* no min (goes to $-\infty$)
* just one local min which is the global min
* Connected set of local min which are all global min
* Algorithms for smooth f
* Gradient descent
* blind : repeat $w \leftarrow w - \epsilon \nabla f(w)$
* stochastic (just 1 training pt at a time): $w \leftarrow w - \epsilon \nabla_w L(h(X_i), y_i)$
* line search
* Newton's method (needs Hessian matrix)
* Nonlinear conjugate gradient
* Algorithms for nonsmooth f
* Gradient descent
* BFGS
* These algs find local min
* Line search: Find local min along search direction by solving an optimization problem in 1D
* Secant method (smooth only)
* Newton-Raphson (smooth only, may need Hessian)
* Direct line search (non smooth) like golden section search
* Constrained Optimization (smooth equality constraints)
* Goal: Find w that minimizes $f(w)$ subject to $g(w)=0$ where g is smooth
* Alg: Use lagrange multipliers
* Linear Program
* Linear objective, linear ineq constraints
* Goal: Find w that maximizes $c^\top w$ subject to $Aw \leq b$
* $A_i^\top w \leq b_i$ for $i \in [1, n]$
* Feasible region: Set of pts that satisfy all constraints
* LP feasible region is a polytope (not always bounded/finite)
* Point set P is convex if every pair of points' line segment lies fully in P
* Active constraints: constraints that achieve equality at optimum
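A tiny LP of this form solved with SciPy; `linprog` minimizes, so negate c to maximize $c^\top w$ (the numbers are an arbitrary toy example, not from the lecture):

```python
import numpy as np
from scipy.optimize import linprog

# maximize c^T w subject to A w <= b (with w >= 0 to keep the toy problem bounded)
c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0],
              [2.0, 1.0]])
b = np.array([4.0, 5.0])

res = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)   # the optimum lies at a vertex of the feasible polytope
```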
# Lecture 6: 2/8/21
* Decision theory/risk minimization
* Multiple sample pts w/ diff classes can lie at same pt
* Want a probabilistic classifier
* Example: 10% population has cancer, 90% doesn't

* Posterior probability: $P(Y=1|X)$
* Prior probability: $P(Y=1)$
* Loss function $L(z,y)$ where z is prediction, y is true label
* Loss function specifies how bad it is if classifier predicts z but true class is y
* Example: if $P(\text{cancer} \mid x) = 0.36$, a 36% chance of loss 5 (missed cancer) is worse than a 64% chance of loss 1 (false positive): $0.36 \times 5 = 1.8 > 0.64 \times 1 = 0.64$, so recommend further cancer screening
* 0-1 loss function is 1 for incorrect, 0 for correct
* Loss function does not need to be convex
* Risk for decision rule r is expected loss over all values of x,y
* $R(r) = E[L(r(X),Y)]$
* $\sum\limits_x \left(L(r(x),1)P(Y=1|X=x) + L(r(x),-1)P(Y=-1|X=x)\right)P(X=x)$
* Bayes decision rule/Bayes classifier is fn $r^*$ that minimizes functional $R(r)$

* If L is symmetric, pick the class with the biggest posterior probability
* Bayes risk/optimal risk is risk of Bayes classifier
* No decision rule that gives lower risk
* Deriving/using $r^*$ is called risk minimization
* Expected value of $g(X)$: $E[g(X)] = \int\limits^\infty_{-\infty} g(x)f(x)dx$ (for continuous X with density f)
* Mean: $\mu = \int\limits^\infty_{-\infty} xf(x)dx$
* If L is 0-1 loss, $R(r)$ is probability that r(x) is wrong
* Bayes optimal decision boundary is $\{ x: P(Y=1 | X=x) = 0.5 \}$
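A sketch of the expected-loss comparison behind the screening example above (loss 5 for a missed cancer, loss 1 for a false positive, posterior 36%, as quoted earlier):

```python
# Asymmetric loss: predicting -1 when truth is +1 (missed cancer) costs 5;
# predicting +1 when truth is -1 (false positive) costs 1; correct predictions cost 0.
loss = {(-1, 1): 5, (1, -1): 1, (1, 1): 0, (-1, -1): 0}

def expected_loss(z, p_cancer):
    """Expected loss of predicting z when P(Y=1 | X=x) = p_cancer."""
    return loss[(z, 1)] * p_cancer + loss[(z, -1)] * (1 - p_cancer)

p = 0.36
print(expected_loss(-1, p), expected_loss(1, p))  # 1.8 vs 0.64 -> predict +1 (screen further)
```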
## 3 ways to build classifiers
* Generative models (ex. LDA: Linear Discriminant Analysis)
* Assume sample pts come from probability distributions, different for each class
* Guess form of distribution
* For each class C, fit distribution params to class C pts, giving $f(X|Y=c)$
* For each c, estimate $P(Y=c)$
* Bayes thm gives $P(Y|X)$
* If 0-1 loss, pick class C that maximizes $P(Y=c|X=x)$ equivalently maximizes $f(X=x|Y=c)P(Y=c)$
* Discriminative models (ex. Logistic regression)
* Skip prior probabilities, model $P(Y|X)$ directly
* Find decision boundary (ex. SVM)
* Advantages of 1,2: $P(Y|X)$ tells you probability that your guess is wrong
* Advantage of 1: Diagnose outliers if P(X) is very small
* Disadvantage of 1: Often hard to estimate distributions correctly, real distributions rarely match standard ones
# Lecture 7: 2/10/21
* Gaussian Discriminant Analysis (GDA): Each class comes from a normal distribution
* Isotropic normal dist: $X \sim N(\mu, \sigma^2)$: $f(x) = \frac{1}{(\sqrt{2 \pi} \sigma)^d} \exp(-\frac{\|x-\mu\|^2}{2\sigma^2})$
* $\mu, x$ are length d vectors, $\sigma$ is a scalar
* For each class C, estimate mean $\mu_c$, variance $\sigma_c^2$, prior $\pi_c = P(Y=C)$
* Given x, Bayes decision rule $r^*(x)$ predicts class C that maximizes $f(X=x|Y=C)\pi_C$
* $Q_C(x) = \ln\left((\sqrt{2 \pi})^d f_C(x) \pi_C\right) = -\frac{\|x-\mu_C\|^2}{2\sigma_C^2} - d \ln(\sigma_C) + \ln(\pi_C)$
* Quadratic in x, easier to optimize
* Find $\arg \max\limits_c Q_c(x)$
* Bayes decision boundary is where $Q_C(x) - Q_D(x) = 0$
* In 1D, the boundary may consist of 1, 2, or 0 pts (roots of a quadratic)

* Use Bayes' theorem to recover posterior probabilities in the 2-class case
* $P(Y=C|X) = \frac{f(X|Y=C) \pi_C}{f(X|Y=C) \pi_C + f(X|Y=D) \pi_D}$
* $=s(Q_C(x) - Q_D(x))$
* Logistic/sigmoid function: $s(\gamma) = \frac{1}{1 + e^{-\gamma}}$
## Linear Discriminant Analysis (LDA)
* Variant of QDA that has linear decision boundaries
* Less likely to overfit
* Less true probabilistic estimation
* Use validation of both methods to see which is better
* Assume that all gaussians have same variance
* $Q_C(x) - Q_D(x) = \left(\frac{\mu_C - \mu_D}{\sigma^2}\right)^\top x + \left(-\frac{\|\mu_C\|^2 - \|\mu_D\|^2}{2\sigma^2} + \ln(\pi_C) - \ln(\pi_D) \right)$
* Choose C that maximizes linear discriminant fn
* $\left(\frac{\mu_C}{\sigma^2}\right)^\top x -\frac{\|\mu_C\|^2}{2\sigma^2} + \ln(\pi_C)$
* Generative model as opposed to discriminative model
## Maximum Likelihood Estimation (MLE)
* Flip biased coin, heads with prob $p$ and tails with prob $1-p$
* Goal: estimate $p$
* Flip the coin $n = 10$ times; the number of heads is $X \sim B(n,p)$, binomial dist; suppose we observe 8 heads
* $\mathcal{L}$ is the likelihood fn
* $\mathcal{L}(p) = P(X=8) = \binom{10}{8}p^8 (1-p)^2 = 45p^8 (1-p)^2$
* Viewed as a function of the distribution parameters (here p), with the data held fixed
* MLE is a method of estimating params of a statistical distribution by picking params that maximize $\mathcal{L}$
* Is a method of density estimation: estimating PDF from data
* Find p that maximizes $\mathcal{L}(p)$
* Likelihood of Gaussian
* $\mathcal{L}(\mu, \sigma; X_1, X_2,\dots X_n) = f(X_1)f(X_2)\cdots f(X_n)$
* $\ell = \ln \mathcal{L}$ is the log-likelihood
* Maximizing $\mathcal{L}$ and maximizing $\ell$ are the same (ln is monotonic)
* $\ell(\mu, \sigma; X_1, X_2,\dots X_n) = \ln f(X_1) + \ln f(X_2) + \cdots + \ln f(X_n)$
* $= \sum\limits_{i=1}^n \left(-\frac{\|X_i - \mu\|^2}{2\sigma^2} - d \ln \sqrt{2\pi} - d \ln \sigma\right)$
* Which is a summation of ln of normal PDF
* Want to set $\nabla_\mu \ell = 0, \frac{\partial \ell }{\partial \sigma} = 0$
* $\nabla_\mu \ell = 0 \implies \hat{\mu} = \frac{1}{n} \sum\limits_{i=1}^n X_i$
* $\frac{\partial \ell }{\partial \sigma} = 0 \implies \hat{\sigma}^2 = \frac{1}{dn}\sum\limits_{i=1}^n \|X_i - \mu\|^2$
* Use $\hat{\mu}$ for $\mu$
* Tells us to use sample mean and variance of pts in class C to estimate mean and variance of Gaussian for class C
* For QDA: estimate conditional mean $\hat{\mu_c}$ and conditional variance $\hat{\sigma_c}^2$ of each class C separately
* $\hat{\pi_c} = \frac{n_c}{\sum n_D}$
* Denominator is total sample pts in all classes
* For LDA: Same estimates for means and priors, but one pooled variance for all classes
* Pooled within-class variance: $\hat{\sigma}^2 = \frac{1}{dn}\sum\limits_{C} \sum\limits_{\{i \colon y_i = c\}} \|X_i - \hat{\mu}_c\|^2$
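A NumPy sketch of these MLE estimates for isotropic QDA and LDA, given a design matrix X and integer labels y (the function and variable names are my own):

```python
import numpy as np

def fit_isotropic_gda(X, y):
    """MLE estimates: per-class mean, per-class variance (QDA), priors,
    and the pooled within-class variance (LDA)."""
    n, d = X.shape
    classes = np.unique(y)
    mu = {c: X[y == c].mean(axis=0) for c in classes}        # sample mean per class
    pi = {c: (y == c).mean() for c in classes}               # prior = n_c / n
    var = {c: ((X[y == c] - mu[c]) ** 2).sum() / (d * (y == c).sum())
           for c in classes}                                 # QDA: sigma_c^2 per class
    pooled = sum(((X[y == c] - mu[c]) ** 2).sum() for c in classes) / (d * n)  # LDA
    return mu, var, pi, pooled
```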
# Lecture 8: 2/17/21
* If $v$ is an eigenvector of $A$ with eigenvalue $\lambda$, then $v$ is an eigenvector of $A^k$ with eigenvalue $\lambda^k$
* If A is invertible, $A^{-1}$ has eigenvector $v$ with eigenvalue $\frac{1}{\lambda}$
* Spectral theorem: Every real symmetric nxn matrix has real eigenvalues and n mutually orthogonal evecs
* For symmetric PD A, $\|A^{-1}x\|^2 = 1$ is an ellipsoid with axes $v_1, \ldots v_n$ and radii $\lambda_1, \ldots \lambda_n$
* Special case if A is diagonal, evecs are coordinate axes and ellipsoid is axis-aligned
* Symmetric matrix is:
* PD: All pos eval
* PSD: All nonneg evals
* Indefinite matrix: has at least 1 pos eval and 1 neg eval
* Invertible: Has no zero eigenvalue
* $x^\top A^{-2} x$ is a PD quadratic form because the eigenvalues of $A^{-2}$ are the squared reciprocals $1/\lambda_i^2 > 0$
* Orthonormal matrix: acts like rotation or reflection
* Thm: $A = V \Lambda V^\top = \sum\limits^n_{i=1} \lambda_i v_i v_i^\top$


# Lecture 8: 2/22/21
## Anisotropic Gaussians
* Normal PDF: $f(x) = n(q(x)) = \frac{1}{\sqrt{(2 \pi)^d | \Sigma |}} \exp(- \frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu))$
* $n(q) = \frac{1}{\sqrt{(2 \pi)^d | \Sigma |}}e^{-q/2}$
* $\mathbb{R} \rightarrow \mathbb{R}$ (exponential)
* $q(x) = (x - \mu)^\top \Sigma^{-1} (x - \mu)$
* $\mathbb{R}^d \rightarrow \mathbb{R}$ (quadratic)
* Covariance matrix: $\Sigma = V \Gamma V^\top$
* Eigenvalues of $\Sigma$ are variances along evecs $\Gamma_{ii} = \sigma_i^2$
* $\Sigma ^{\frac{1}{2}} = V \Gamma^{\frac{1}{2}} V^\top$
* Maps spheres to ellipsoids
* Eigenvalues of $\Sigma ^{\frac{1}{2}}$ are stdev $\Gamma_{ii}^{\frac{1}{2}} = \sigma_i$ (Gaussian width/ellipsoid radii)
## MLE for Anisotropic Gaussians
* Given sample pts and classes, find best-fit Gaussians
* QDA: $\hat{\Sigma}_c = \frac{1}{n_c} \sum\limits_{i \colon y_i = C} (X_i - \hat{\mu}_c)(X_i - \hat{\mu}_c)^\top$
* Conditional covariance (for pts in class C)
* Outer product, dxd matrix
* $\hat{\pi}_c, \hat{\mu}_c$ same as before
* $\hat{\Sigma}_C$ is always PSD but not always PD
* Choose C to maximize the discriminant fn $Q_C(x) = -\frac{1}{2}(x-\hat{\mu}_C)^\top \hat{\Sigma}_C^{-1}(x-\hat{\mu}_C) - \frac{1}{2}\ln|\hat{\Sigma}_C| + \ln \hat{\pi}_C$

* Decision fn is quadratic, may be indefinite
* LDA can be interpreted as projecting points onto a line
* For 2 classes
* LDA has d+1 parameters (w, $\alpha$)
* QDA has $\frac{d(d+3)}{2} + 1$ params
* QDA more likely to overfit
* With added features, LDA can give nonlinear boundaries and QDA can give non-quadratic ones
* We don't get true optimum Bayes classifier (estimate distributions from finite data)
* Changing priors/loss = adding constants to discriminant fns
* Posterior gives decision boundaries (isocontours) for 10%, 50%, 90% probability
* Choosing isovalue = p same as
* Choosing asymmetric loss p for false pos, 1-p for false neg OR $\pi_c = 1-p, \pi_d = p$
## Terminology
* Let X be n x d design matrix of sample pts
* Each row i of X is a sample pt $X_i^\top$
* Centering X: Subtract $\mu^\top$ from each row of X $X \rightarrow \dot{X}$
* Let R be uniform distribution on sample pts
* Sample cov matrix is $Var(R) = \frac{1}{n} \dot{X}^\top \dot{X}$
* Decorrelating $\dot{X}$: Apply rotation $Z = \dot{X}V$
* $Var(R) = V \Lambda V^\top$
* $Var(Z) = \Lambda$
* Sphering $\dot{X}$: $W = \dot{X} Var(R)^{-1/2}$
* Whitening X: Centering + sphering
* W has cov matrix I
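A sketch of centering, decorrelating, and sphering a design matrix via an eigendecomposition of the sample covariance (assumes the covariance is nonsingular; names are mine):

```python
import numpy as np

def whiten(X):
    """Return the centered, decorrelated, and sphered versions of the n x d matrix X."""
    Xdot = X - X.mean(axis=0)                        # centering: subtract mu^T from each row
    cov = Xdot.T @ Xdot / len(X)                     # sample covariance Var(R)
    lam, V = np.linalg.eigh(cov)                     # Var(R) = V diag(lam) V^T
    Z = Xdot @ V                                     # decorrelating: Var(Z) = diag(lam)
    W = Xdot @ V @ np.diag(1 / np.sqrt(lam)) @ V.T   # sphering: Var(W) = I
    return Xdot, Z, W
```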
# Lecture 9: 2/24/21
* Regression: Fitting curves to data
* Classification: Given pt, predict its class
* Regression: Given pt, predict numerical value
* Choose form of regression fn $h(x; p)$ with parameter p
* Like decision fn in classification
* Choose a cost fn (objective fn) to optimize
* Usually based on a loss fn
* Regression fns:
* 1) Linear: $h(x; w, \alpha) = w ^\top x + \alpha$
* 2) Polynomial: Add features for linear
* 3) Logistic: $h(x; w, \alpha) = s(w^\top x + \alpha)$
* $s(\gamma) = \frac{1}{1 + e^{- \gamma}}$
* Loss fns: z is prediction $h(x)$, y is true label
* A) Squared error: $L(z,y) = (z-y)^2$
* B) Absolute error: $L(z,y) = |z-y|$
* C) Logistic loss/cross-entropy: $L(z,y) = -y \ln(z) - (1-y) \ln(1-z)$
* Cost fns to minimize
* a) Mean loss: $J(h) = \frac{1}{n} \sum\limits^n_{i=1} L(h(X_i), y_i)$
* b) Max loss: $J(h) = \max\limits_{i=1 \ldots n} L(h(X_i), y_i)$
* c) Weighted sum: $J(h) = \sum\limits^n_{i=1} \omega_i L(h(X_i), y_i)$
* d) $l_2$ penalized/regularized: $J(h) = \ldots + \lambda \|w\|_2^2$
* $\ldots$ is one of the previous options
* e) $l_1$ penalized/regularized: $J(h) = \ldots + \lambda \|w\|_1$
* Famous regression methods:
* Least squares linear regression: 1Aa
* Quadratic cost, min w/ calculus
* Weighted least-squares linear: 1Ac
* Quadratic cost, min w/ calculus
* Ridge regression: 1Ad
* Quadratic cost, min w/ calculus
* LASSO: 1Ae
* QP w/ exponential num of constraints
* Logistic regression: 3Ca
* Convex cost fn, min w/ gradient descent/Newton's
* Least absolute deviations: 1Ba
* LP
* Chebyshev criterion: 1Bb
* LP
## Least Squares Linear Regression
* Linear regression fn (1) + squared loss fn (A) + cost fn (a)
* Optimization problem: $\min\limits_{w, \alpha} \sum\limits^n_{i=1} (X_i ^\top w + \alpha - y_i)^2$
* X is n x d design matrix of sample pts
* y is n-vector of scalar labels
* $X_i^\top$ transposes the column vector point
* Feature column: $X_{*j}$
* Usually $n > d$
* X is now n x (d+1) matrix, w is a (d+1) vector
* $\begin{bmatrix} x_1 & x_2 & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \alpha \end{bmatrix}$
* Residual sum of squares: $RSS(w) = \|Xw-y\|^2$; minimize over w
* Optimize with calculus:
* If $X^\top X$ is singular, problem is underconstrained (sample pts all lie on a common hyperplane; solution not unique)
* Never overconstrained
* Linear solver to find $w^* = (X^\top X)^{-1} X^\top y$
* Pseudoinverse (d+1, n): $X^+ = (X^\top X)^{-1} X^\top$
* Pseudoinverse is a left inverse: $X^+X = (X^\top X)^{-1} X^\top X = I$
* Predicted values are $\hat{y} = Xw = XX^+y = Hy$
* Hat matrix: $H = XX^+$ (n x n matrix)
* Advantage: Easy to compute, unique/stable solution
* Disadvantage: Very sensitive to outliers bc errors are squared, fails if $X^\top X$ is singular
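A sketch of the normal-equation solution with the fictitious 1s column carrying $\alpha$; `np.linalg.solve` is used instead of forming the pseudoinverse explicitly (the toy data are mine):

```python
import numpy as np

def least_squares(X, y):
    """Solve min_w ||Xw - y||^2 via the normal equations X^T X w = X^T y."""
    X1 = np.hstack([X, np.ones((len(X), 1))])    # append 1 => last weight is alpha
    w = np.linalg.solve(X1.T @ X1, X1.T @ y)     # w* = (X^T X)^{-1} X^T y
    return w[:-1], w[-1]                         # (w, alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3 + 0.01 * rng.normal(size=100)
print(least_squares(X, y))                       # roughly [1, -2, 0.5] and alpha ~ 3
```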
## Logistic Regression
* Logistic regression fn (3) + logistic loss fn (C) + cost fn (a)
* Fits probabilities in range (0,1)
* Discriminative model (as opposed to QDA/LDA being generative models)
* $\min\limits_w J(w) = \min\limits_w \sum\limits^n_{i=1} L(s(X_iw), y_i) = \min\limits_w - \sum\limits^n_{i=1} \left( y_i \ln s(X_iw) + (1-y_i) \ln(1-s(X_iw)) \right)$
* $J(w)$ is convex, solve by gradient descent
* $s'(\gamma) = s(\gamma)(1 - s(\gamma))$
* $s_i = s(X_iw)$, $\nabla_w J = -X^\top (y-s(Xw))$, with $s(Xw)$ applied elementwise
* Gradient descent rule: $w \leftarrow w + \epsilon X^\top (y-s(Xw))$
* SGD: $w \leftarrow w + \epsilon (y_i-s(X_iw))X_i$
* Shuffle pts into random order, process one by one
* For large n, sometimes converges before visiting all pts
* Logistic regression always separates linearly separable pts
* $\nabla^2_w J(w) = \sum\limits^n_{i=1} s_i(1-s_i)X_iX_i^\top = X^\top \Omega X$
* $\Omega = diag(s_i(1-s_i)) \succ 0$ making it PD
* So $X^\top \Omega X \succeq 0$ so $J(w)$ is convex
* Newton's method: solve $(X^\top \Omega X)e = X^\top(y-s)$ for e, then update $w \leftarrow w + e$
* Example of iteratively reweighted least squares
* Misclassified points far from decision boundary have large influence
* Points close to the decision boundary also have a medium to large influence
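A sketch of batch gradient descent using the update $w \leftarrow w + \epsilon X^\top(y - s(Xw))$ derived above, with labels $y_i \in \{0, 1\}$ and a fictitious dimension for $\alpha$ (step size and iteration count are illustrative; the gradient is scaled by $1/n$ here, which just folds $n$ into $\epsilon$):

```python
import numpy as np

def sigmoid(g):
    return 1.0 / (1.0 + np.exp(-g))

def logistic_regression(X, y, eps=0.5, iters=5000):
    """Batch gradient descent on the logistic (cross-entropy) cost; y in {0, 1}."""
    X1 = np.hstack([X, np.ones((len(X), 1))])    # fictitious dimension for alpha
    w = np.zeros(X1.shape[1])
    for _ in range(iters):
        s = sigmoid(X1 @ w)                      # predicted probabilities s(X_i w)
        w += eps * X1.T @ (y - s) / len(y)       # w <- w + eps X^T (y - s(Xw)) / n
    return w
```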
* LDA vs Logistic regression
* LDA Advantages:
* For well separated classes LDA is stable, logreg is unstable
* $>2$ classes easy and elegant, logreg needs modifications (softmax regression)
* LDA slightly more accurate when classes nearly normal, esp if n is small
* Logreg advantages:
* More emphasis on decision boundary, always separates linearly separable pts
* More robust on some non-Gaussian distributions (w/ large skew)
# Lecture 10: 3/1/21
## Least Squares Polynomial Regression
* Replace each $X_i$ with feature vector $\Phi(X_i)$
* Ex. $\Phi(X_i) = \begin{bmatrix} X_{i1}^2 & X_{i1}X_{i2} & X_{i2}^2 & X_{i1} & X_{i2} & 1 \end{bmatrix}$
* Can also use non-polynomial features
* Logistic regression + quadratic features gives posteriors of the same form as QDA
* Very easy to overfit
## Weighted Least-Squares Regression
* Linear regression fn (1) + squared loss fn (A) + cost fn (c)
* Assign each sample pt a weight $\omega_i$, create $\Omega = diag(\omega_i)$
* Bigger $\omega_i$ means work harder to minimize $|\hat{y}_i - y_i|^2$
* $\min\limits_w (Xw-y)^\top \Omega (Xw-y) = \sum\limits_{i=1}^n \omega_i (X_iw - y_i)^2$
* $X^\top \Omega X w = X^\top \Omega y$, solve for $w$
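A one-function sketch of the weighted normal equations above (the per-point weights `omega` are supplied by the caller):

```python
import numpy as np

def weighted_least_squares(X, y, omega):
    """Solve min_w (Xw - y)^T Omega (Xw - y) with Omega = diag(omega)."""
    Omega = np.diag(omega)
    return np.linalg.solve(X.T @ Omega @ X, X.T @ Omega @ y)
```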
## Newton's Method
* Iterative optimization method for smooth fn $J(w)$
* Often much faster than gradient descent
* Approximate $J(w)$ by quadratic fn
* Approximate $\nabla J(w) \approx \nabla J(v) + (\nabla^2 J(v))(w-v)$; set to zero and solve for w
* $w = v - (\nabla^2 J(v))^{-1} \nabla J(v)$
* Repeat until convergence
* Warnings:
* Doesn't know diff between minima, maxima, saddle pts
* Starting pt must be close enough to desired critical point
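A generic sketch of the Newton iteration $w \leftarrow w - (\nabla^2 J)^{-1}\nabla J$, with the gradient and Hessian supplied by the caller (the tolerance and toy objective are mine):

```python
import numpy as np

def newtons_method(grad, hess, w0, tol=1e-8, max_iter=50):
    """Iterate w <- w - H(w)^{-1} grad(w) until the gradient is tiny."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:              # reached a critical point
            break
        w = w - np.linalg.solve(hess(w), g)      # Newton step
    return w

# usage: minimize J(w) = (w1 - 3)^2 + 2 (w2 + 1)^2
grad = lambda w: np.array([2 * (w[0] - 3), 4 * (w[1] + 1)])
hess = lambda w: np.diag([2.0, 4.0])
print(newtons_method(grad, hess, [0.0, 0.0]))    # -> [3, -1], found in one step
```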
## Roc Curves
* For test sets to evaluate classifier
* Receiver Operating Characteristic
* Plots true positive rate vs. false positive rate over a range of threshold settings
* x-axis is False Positive rate = % of negative items incorrectly classified as pos
* y-axis is True positive rate = % of pos classified as pos
* Top right: always classify positive (classify positive whenever $P \geq 0$)
* Bottom left: always classify negative (classify positive only if $P > 1$, which never happens)
* Want high area under the curve (close to 1)
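scikit-learn can sweep the threshold and compute the curve from classifier scores such as posterior probabilities (a minimal sketch with made-up labels and scores):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.6, 0.2, 0.8, 0.7, 0.4, 0.9])   # e.g. P(Y=1 | x)

fpr, tpr, thresholds = roc_curve(y_true, scores)   # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, scores))               # want area under the curve close to 1
```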

# Lecture 11: 3/3/21
## Statistical Justifications for Regression
* Sample pts come from unknown prob dist $X_i \sim D$
* y values are sum of unknown nonrandom fn + random noise
* $y_i = g(X_i) + \epsilon_i$, $\epsilon_i \sim D'$ with $D'$ having mean 0
* Goal of regression: find h that estimates $g$
* Ideal approach: Choose $h(x) = E_Y[Y|X=x] = g(x) + E[\epsilon] = g(x)$
## Least Squares regression for MLE
* Suppose $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$
* $y_i \sim \mathcal{N}(g(X_i), \sigma^2)$
* $\ell(g; X, y) = \ln(f(y_1) f(y_2) \cdots f(y_n)) = -\frac{1}{2\sigma^2} \sum (y_i - g(X_i))^2 - \text{const}$
* MLE over g is minimizing sum of squared errors
## Empirical Risk
* Risk for hypothesis h is the expected loss $R(h) = E[L]$
* Discriminative: Don't know X dist D
* Empirical distribution: discrete uniform distribution over the sample pts
* Empirical risk: Expected loss under empirical distribution
* $\hat{R}(h) = \frac{1}{n} \sum\limits_{i=1}^n L(h(X_i), y_i)$
* So minimize sum of loss fns
## Logistic Loss for MLE
* Actual probability pt $X_i$ is in class C is $y_i$, predicted prob is $h(X_i)$
* Imagine $\beta$ duplicate copies of $X_i$: $y_i \beta$ copies are in class C, $(1 - y_i) \beta$ copies are not
* $\mathcal{L}(h; X,y) = \prod h(X_i)^{y_i \beta} (1-h(X_i))^{(1-y_i)\beta}$
* Log-likelihood: $\ell(h) = -\beta \sum L(h(X_i), y_i)$ where $L$ is the logistic loss
* Logistic loss is $L(z_i, y_i) = -y_i \ln(z_i) - (1-y_i) \ln(1 - z_i)$
* Max Likelihood --> Min sum of logistic loss fns
## Bias-Variance Decomposition
* 2 source of error in hypothesis:
* bias: Error due to inability of hypothesis to fit g perfectly
* variance: Error due to random noise in data
* Model: $X_i \sim D, \epsilon_i \sim D', y_i = g(X_i) + \epsilon_i$
* Fit hypothesis h to X,y
* h is a RV with random weights
* Consider arbitrary pt $z, \gamma = g(z) + \epsilon$
* $E[\gamma] = g(z), var(\gamma) = var(\epsilon)$
* Risk fn when loss = squared error: $r(h) = E[L(h(z), \gamma)]$
* Taking expectation over all possible training sets X,y and values of $\gamma$
* $R(h) = E[(h(z)- \gamma)^2]$
* $R(h) = (E[h(z)]-g(z))^2 + Var(h(z)) + Var(\epsilon)$
* First term: bias^2
* Second term: variance of method
* Third term: Irreducible error
* Underfitting = too much bias
* Most overfitting = too much variance
* Training error reflects bias but not variance
* For most distributions, variance $\to 0$ as $n \to \infty$
* If h can fit g exactly, then for many distributions bias $\to 0$ as $n \to \infty$
* If h cannot fit g well, bias is large for most points
* Adding a good feature reduces bias, adding a bad feature rarely increases it
* A good feature reduces bias more than it increases variance
* Adding a feature usually increases variance
* Noise in test set affects only var($\epsilon$)
* Noise in training set affects only bias and var of h
* Can't precisely measure bias and variance of real world data
* Can test learning algs by choosing g and making synthetic data
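A sketch of that synthetic experiment: fix g, repeatedly draw training sets, fit a linear hypothesis h, and estimate the three terms of the decomposition at a test point z (the choices of g, sizes, and noise level are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: np.sin(3 * x)                      # known ground-truth function
z, sigma, n, trials = 0.5, 0.3, 30, 2000

preds = []
for _ in range(trials):
    X = rng.uniform(-1, 1, n)
    y = g(X) + sigma * rng.normal(size=n)        # y_i = g(X_i) + eps_i
    w = np.polyfit(X, y, deg=1)                  # fit a linear hypothesis h
    preds.append(np.polyval(w, z))               # h(z) for this training set

preds = np.array(preds)
print("bias^2      ~", (preds.mean() - g(z)) ** 2)   # (E[h(z)] - g(z))^2
print("variance    ~", preds.var())                  # Var(h(z))
print("irreducible =", sigma ** 2)                   # Var(eps)
```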
# Lecture 12: 3/8/21
## Ridge Regression
* 1 + A + d
* l2 penalized mean loss
* $\min \|Xw - y\|^2 + \lambda \|w'\|^2$
* $w'$ is $w$ with component $\alpha$ replaced by zero
* Don't penalize $\alpha$
* Regularization term/penalty term for shrinkage: encourages small $\|w'\|$
* The penalty guarantees PD normal equations (always a unique solution)
* Reduces overfitting by reducing variance by penalizing large weights
* Normal equations: $(X^\top X + \lambda I')w = X^\top y$
* $I' \triangleq I$ with bottom right = 0
* Solve for w, return $h(z) = w^\top z$
* Increasing $\lambda$ means more regularization
* For $y=Xv+e$, e being noise
* Variance of RR is $var(z^\top (X^\top X + \lambda I')^{-1}X^\top e)$
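A sketch of the ridge normal equations $(X^\top X + \lambda I')w = X^\top y$ with the bias $\alpha$ left unpenalized, per $I'$ above:

```python
import numpy as np

def ridge_regression(X, y, lam):
    """Solve (X^T X + lam * I') w = X^T y, not penalizing the bias alpha."""
    X1 = np.hstack([X, np.ones((len(X), 1))])    # fictitious dimension for alpha
    Iprime = np.eye(X1.shape[1])
    Iprime[-1, -1] = 0.0                         # I' = I with bottom-right entry zeroed
    w = np.linalg.solve(X1.T @ X1 + lam * Iprime, X1.T @ y)
    return w[:-1], w[-1]                         # (w, alpha)
```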
## Bayesian Justification
* Prior probability: $w' \sim N(0, \sigma^2)$
* Apply MLE to the posterior: choose w maximizing $P(w \mid X, y)$ (maximum a posteriori, MAP)
* Maximize log posterior: $\ln(\mathcal{L}(w)) + \ln(f(w')) - \text{const}$
* Equivalent to minimizing $\|Xw - y\|^2 + \lambda \|w'\|^2$
## Feature Subset Selection
* All features increase variance, not all features reduce bias
* Idea: Identify poorly predictive features, ignore them (weight 0)
* Less overfitting, smaller test errors
* 2nd motivation: inference
* Usually hard b/c features can partly encode same info
* Algorithm: Try all $2^d - 1$ subsets of features
* Slow; evaluate each subset by (cross-)validation
* Heuristic 1: Forward stepwise selection
* Start with null model (0 features)
* Repeatedly add best feature until validation error increases
* Train $O(d^2)$ models
* Not perfect; won't find the best 2-feature model if neither of its features yields the best 1-feature model
* Heuristic 2: Backward stepwise selection
* Start with all d features; repeatedly remove the feature whose removal gives the best reduction in validation error
* Trains $O(d^2)$ models
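A sketch of forward stepwise selection driven by validation error, with plain least squares as the underlying model (the helper function, data split, and stopping rule are mine; any regression method could be substituted):

```python
import numpy as np

def val_error(X_tr, y_tr, X_val, y_val, feats):
    """Validation MSE of least squares restricted to the chosen feature subset."""
    A = np.hstack([X_tr[:, feats], np.ones((len(X_tr), 1))])
    w, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    B = np.hstack([X_val[:, feats], np.ones((len(X_val), 1))])
    return np.mean((B @ w - y_val) ** 2)

def forward_stepwise(X_tr, y_tr, X_val, y_val):
    """Start with no features; greedily add the best one until val error stops improving."""
    chosen, best_err = [], np.inf
    while True:
        candidates = [f for f in range(X_tr.shape[1]) if f not in chosen]
        if not candidates:
            return chosen
        errs = {f: val_error(X_tr, y_tr, X_val, y_val, chosen + [f]) for f in candidates}
        f_best = min(errs, key=errs.get)
        if errs[f_best] >= best_err:             # validation error no longer decreases
            return chosen
        chosen.append(f_best)
        best_err = errs[f_best]
```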
## LASSO
* Regression w/ regularization: 1 + A + e
* l1 penalized mean loss
* $\min \|Xw - y\|^2 + \lambda \|w'\|_1$
* Algs: subgradient descent, least-angle regression