# CS189 MT Notes

# Lecture 1: 1/20/21

* Decision boundary has `d-1` dimensions
* Training set error: Fraction of training images not classified correctly
* Test set error: Fraction of misclassified NEW images, not seen during training
* Outliers: points whose labels are atypical
* Overfitting: When test error deteriorates because the classifier is sensitive to outliers
* Validation data is taken from the training data
    * Train the classifier with different hyperparams
* Training set: Used to learn model weights
* Validation set: Used to tune hyperparams
* Test set: Used as the final evaluation of the model
* Supervised learning:
    * Classification: Is this email spam?
    * Regression: How likely does this patient have cancer?
* Unsupervised learning:
    * Clustering: Which DNA sequences are similar?
    * Dimensionality reduction: What are common features of faces?

# Lecture 2: 1/25/21

* Decision boundary: Boundary chosen by the classifier to separate items
* Overfitting: When the decision boundary fits the sample points so well that it does not classify future points well
* Some classifiers work by computing a decision/predictor/discriminant function
    * A function that maps a pt x to a scalar s.t. $f(x) > 0 \implies x \in C$ and vice versa. Pick one convention for the $f(x) = 0$ case
    * Decision boundary is $\{ x \in \mathbb{R}^d \colon f(x) = 0 \}$
        * This set is $d-1$ dimensional (an isosurface of f with isovalue 0)
* Linear decision function: the decision boundary is a hyperplane
    * $f(x) = w^\top x + \alpha$
    * Hyperplane: $H = \{ x \colon w^\top x = -\alpha\}$
    * $w$ is the normal vector of H (perpendicular to H)
    * If $w$ is a unit vector, $w^\top x + \alpha$ is the signed distance from x to H
    * Distance from H to the origin is $|\alpha|$; $\alpha = 0 \longleftrightarrow$ H passes through the origin
    * Coefficients in $w, \alpha$ are the weights/params
* Linearly separable: there exists a hyperplane that separates all sample pts in class C from all pts not in C

## Math Review

* Vectors default to column vectors
* Uppercase roman = matrices, RVs, sets
* Lowercase roman = vectors
* Greek = scalars
* Other scalars:
    * n = # of sample pts
    * d = # of features (per point), aka dimensionality
    * i, j, k are indices
* Functions (often scalar): $f(\cdot)$

## Centroid method

* Centroid method: compute the mean $\mu_c$ of all pts in class C and the mean $\mu_x$ of all pts not in C
* Use decision fn $f(x) = (\mu_c - \mu_x)^\top x - (\mu_c - \mu_x)^\top\frac{\mu_c + \mu_x}{2}$
    * $\alpha$ is chosen so that $f(x) = 0$ at the midpoint between $\mu_c$ and $\mu_x$

## Perceptron Algorithm

* Slow, but correct for linearly separable pts
* Numerical optimization algorithm: gradient descent
* For n sample pts $X_1, \ldots X_n$ with labels $y_i \in \{-1, 1\}$
* Consider a decision boundary that passes through the origin
* Goal: find weights w subject to the constraints $y_iX_i^\top w \geq 0$
    * $X_i^\top w \geq 0$ if $y_i = 1$
    * $X_i^\top w \leq 0$ if $y_i = -1$
* Objective/cost/risk function R:
    * positive if a constraint is violated
    * Use optimization to choose the w that minimizes R
* Define a loss function
    * $L(z, y_i) = \begin{cases} 0 & y_i z \geq 0\\ -y_i z & \text{otherwise} \end{cases}$
    * $z = X_i^\top w$ is the classifier prediction, $y_i$ is the correct label
    * If $z$ has the same sign as $y_i$, the loss is zero
    * If $z$ has the wrong sign, the loss is positive
* Risk/objective/cost fn:
    * $R(w) = \frac{1}{n} \sum\limits_{i=1}^n L(X_i^\top w, y_i)$
    * $R(w) = \frac{1}{n} \sum\limits_{i \in V} -y_iX_i^\top w$ with $V$ containing all i s.t. $y_iX_i^\top w < 0$
    * $R(w) = 0$ if all $X_i$ are classified correctly, otherwise positive
* Goal: Find the w that minimizes $R(w)$ (a small numpy sketch of L and R follows)
    * $w=0$ is useless; we don't want that
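A minimal numpy sketch of the perceptron loss and risk defined above, assuming `X` is an n×d array of sample points and `y` holds labels in {-1, +1}; the function names and toy data are my own, not from the lecture.

```python
import numpy as np

def perceptron_loss(z, y):
    """L(z, y): zero if the prediction z has the same sign as the label y, else -y*z."""
    return np.where(y * z >= 0, 0.0, -y * z)

def perceptron_risk(w, X, y):
    """R(w) = (1/n) * sum_i L(X_i^T w, y_i) over all n sample points."""
    z = X @ w                          # classifier predictions, one per sample point
    return perceptron_loss(z, y).mean()

# toy example: 3 points with 2 features, labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5]])
y = np.array([1, 1, -1])
w = np.array([1.0, -1.0])              # this w misclassifies two of the points
print(perceptron_risk(w, X, y))        # positive risk, since constraints are violated
```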
# Lecture 3: 1/27/21

* Gradient descent: Compute the gradient of R w.r.t. w, take a step in the opposite direction
    * $\nabla R(w) = \begin{bmatrix} \frac{\partial R}{\partial w_1} & \cdots & \frac{\partial R}{\partial w_d}\end{bmatrix}^\top$
    * $\nabla_w(z^\top w) = z$
    * $\nabla R(w) = - \sum\limits_{i \in V} y_i X_i$
    * Take a step in the $-\nabla R(w)$ direction
* Algorithm:
    * $w \leftarrow$ arbitrary nonzero vector (a good choice is any $y_i X_i$)
    * while $R(w) > 0$:
        * $V \leftarrow$ set of indices i such that $y_iX_i^\top w < 0$
        * $w \leftarrow w + \epsilon \sum\limits_{i \in V} y_i X_i$
    * return w
* $\epsilon > 0$ is the step size/learning rate
    * Chosen empirically
* Cons: Slow; each step takes $\mathcal{O}(nd)$ time (n dot products of vectors in $\mathbb{R}^d$)
* The perceptron algorithm is guaranteed to converge (on linearly separable pts) no matter the step size

## Stochastic Gradient Descent

* Each step, pick one misclassified $X_i$
    * Do gradient descent on the loss fn $L(X_i^\top w, y_i)$
* Known as the perceptron algorithm
* Each step takes $\mathcal{O}(d)$ time
* while some i has $y_iX_i^\top w < 0$:
    * $w \leftarrow w + \epsilon y_i X_i$
* return w

## Fictitious dimension

* If the separating hyperplane does not pass through the origin, add a fictitious dimension
* $f(x) = w^\top x + \alpha = \begin{bmatrix}w_1 & w_2 & \alpha \end{bmatrix} \begin{bmatrix}x_1 \\ x_2 \\ 1 \end{bmatrix}$
* Sample pts in $\mathbb{R}^{d+1}$ all lie on the hyperplane $x_{d+1} = 1$
* Run the perceptron algorithm in $d+1$ dimensions (see the sketch below)
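A minimal sketch of the stochastic perceptron update with a fictitious dimension appended so the hyperplane need not pass through the origin. The step size, iteration cap, and toy data are assumptions; convergence is only guaranteed for linearly separable points.

```python
import numpy as np

def train_perceptron(X, y, epsilon=1.0, max_iters=1000):
    """Perceptron with a fictitious dimension.  X is n x d, y has entries in {-1, +1}."""
    n, d = X.shape
    X1 = np.hstack([X, np.ones((n, 1))])     # lift points onto the hyperplane x_{d+1} = 1
    w = y[0] * X1[0]                         # arbitrary nonzero start, e.g. y_i * X_i
    for _ in range(max_iters):
        misclassified = np.flatnonzero(y * (X1 @ w) < 0)   # the set V
        if misclassified.size == 0:          # R(w) = 0: all points classified correctly
            break
        i = misclassified[0]                 # stochastic step: fix one violated constraint
        w = w + epsilon * y[i] * X1[i]
    return w[:-1], w[-1]                     # weights w and bias alpha

# usage on a linearly separable toy set
X = np.array([[2.0, 3.0], [3.0, 4.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, alpha = train_perceptron(X, y)
print(np.sign(X @ w + alpha))                # should match y
```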
## Hard-margin SVM

* The margin of a linear classifier is the distance from the decision boundary to the nearest sample pt
* Try to make the margin as wide as possible
* If $\|w\| = 1$, the signed dist is $w^\top X_i + \alpha$
    * Otherwise the signed dist is $\frac{w^\top}{\|w\|} X_i + \frac{\alpha}{\|w\|}$
* The constraints below enforce $|w^\top X_i + \alpha| \geq 1$, hence the margin is $\min\limits_i \frac{1}{\|w\|} |w^\top X_i + \alpha| \geq \frac{1}{\|w\|}$
* QP in $d+1$ dimensions with $n$ constraints
    * Objective: $\min\limits_{w, \alpha} \|w\|^2_2$
    * Constraints: $y_i (w^\top X_i + \alpha) \geq 1$ for $i = 1 \ldots n$
    * Impossible for $w$ to be set to 0
    * Has a unique solution (if any exists)
* Maximum margin classifier, aka hard-margin SVM
    * Margin = $\frac{1}{\|w^*\|}$
    * Slab width = 2 × margin; the margin is the dist from the boundary to the nearest pt

# Lecture 4: 2/1/21

## Soft-Margin SVM

* The margin is no longer the dist from the decision boundary to the nearest pt
    * It is just $\frac{1}{\|w\|}$
* Allow slack $\xi_i$ in the constraints; prevent abuse of slack by adding a loss term
* Objective: Find $w, \alpha, \xi_i$ to $\min \|w\|^2 + C\sum\limits^n_{i=1} \xi_i$
* Constraints: $y_i (X_i^\top w + \alpha) \geq 1 - \xi_i$
    * $\xi_i \geq 0$
* QP in $d+n+1$ dimensions with $2n$ constraints
* There is a tradeoff between the slack term and $\|w\|^2$
* $C > 0$ is a scalar hyperparameter

|          | Small C  | Big C    |
| -------- | -------- | -------- |
| Desire   | Maximize margin $\frac{1}{\|w\|}$ | Keep most slack variables 0 or small |
| Danger   | Underfitting (misclassifies training data) | Overfitting (awesome training, awful test) |
| Outliers | Less sensitive | Very sensitive |
| Boundary | More "flat" | More sinuous/curvy |

## Features

* Make nonlinear features that lift points to a higher-dimensional space
* High-d linear classifier <--> low-d nonlinear classifier
* Parabolic lifting map: $\Phi \colon \mathbb{R}^d \rightarrow \mathbb{R}^{d+1}$
    * $\Phi(x) = \begin{bmatrix} x \\ \|x\|^2 \end{bmatrix}$
    * Lifts x onto the paraboloid $x_{d+1} = \|x\|^2$
    * A linear classifier in $\Phi$-space induces a sphere classifier in $x$-space
* Theorem: $\Phi(X_1), \ldots \Phi(X_n)$ are linearly separable iff $X_1, \ldots X_n$ are separable by a hypersphere
* Proof: Consider a hypersphere in $\mathbb{R}^d$ with center c and radius $\rho$; points inside satisfy
    * $\|x-c\|^2 < \rho^2$
    * $\|x\|^2 - 2c^\top x + \|c\|^2 < \rho^2$
    * $\begin{bmatrix} -2c^\top & 1\end{bmatrix} \begin{bmatrix} x \\ \|x\|^2\end{bmatrix} < \rho^2 - \|c\|^2$
    * The left factor is a normal vector in $\mathbb{R}^{d+1}$; the right factor is $\Phi(x)$
    * Points inside the sphere land on the same side of the hyperplane in $\Phi$-space

## Axis-aligned ellipsoid/hyperboloid

* $\Phi \colon \mathbb{R}^d \rightarrow \mathbb{R}^{2d}$
* $\Phi(x) = \begin{bmatrix} x_1^2& x_2^2& x_3^2 &\ldots& x_d^2& x_1 &x_2 &\ldots& x_d \end{bmatrix}^\top$
* 3D: $Ax_1^2 + B x_2^2 + Cx_3^2 + Dx_1 + Ex_2 + Fx_3 + \alpha = 0$
* The hyperplane is $w^\top \Phi(x) + \alpha = 0$
    * $w = \begin{bmatrix} A & B & C&D&E&F \end{bmatrix}$

## Ellipsoid/hyperboloid

* $\Phi \colon \mathbb{R}^d \rightarrow \mathbb{R}^{(d^2+3d)/2}$
* 3D: $Ax_1^2 + B x_2^2 + Cx_3^2 + Dx_1x_2 + Ex_2x_3 + Fx_3x_1 + Gx_1 + Hx_2 + Ix_3 + \alpha = 0$
* The isosurface defined by this equation is a quadric (in 2D: a conic section)

## Degree-p polynomial

* Ex. cubic in $\mathbb{R}^2$
    * $\Phi(x) = \begin{bmatrix} x_1^3 & x_1^2 x_2 & x_1x_2^2 & x_2^3 & x_1^2 & x_1x_2 & x_2^2 & x_1 & x_2 \end{bmatrix}^\top$
* $\Phi \colon \mathbb{R}^d \rightarrow \mathbb{R}^{\mathcal{O}(d^p)}$
* A linear boundary may have a small margin; higher-degree features can give wider margins
* Going to a high enough dimension makes the data linearly separable, but could overfit (feature-map sketch below)
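A small sketch of two of the lifting maps above: the parabolic lift $\Phi(x) = [x;\ \|x\|^2]$ and the degree-2 map for d = 2. The function names and toy points are mine; a linear classifier would then be trained on the lifted features.

```python
import numpy as np

def parabolic_lift(X):
    """Phi: R^d -> R^{d+1}, appending ||x||^2 so hyperspheres in x-space
    correspond to hyperplanes in Phi-space."""
    return np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])

def quadratic_features_2d(X):
    """Degree-2 map for d = 2: [x1^2, x1*x2, x2^2, x1, x2]; the bias/alpha is handled separately."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, x1 * x2, x2**2, x1, x2])

X = np.array([[1.0, 2.0], [0.5, -1.0]])
print(parabolic_lift(X).shape)          # (2, 3): R^2 -> R^3
print(quadratic_features_2d(X).shape)   # (2, 5): R^2 -> R^5
```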
## Edge Detection

* Edge detector: an algorithm for approximating grayscale/color gradients in images
    * e.g. tap filter, Sobel filter, oriented Gaussian derivative filter
* Compute gradients pretending the image is a continuous field
* Collect edge orientations in local histograms (each having 12 orientations); use the histograms as features instead of raw pixels

# Lecture 5: 2/3/21

## ML Abstractions

* Application/data:
    * Is the data labeled/classified?
        * Yes: categorical (classification), quantitative (regression)
        * No: similarity (clustering), positioning (dimensionality reduction)
* Model:
    * Decision functions: linear, polynomial, logistic
    * Nearest neighbors, decision trees
    * Features
    * Low vs high capacity (linear is low; polynomials and neural nets are high): affects overfitting, underfitting, inference
        * Capacity: a measure of how complicated your model can get
* Optimization problem:
    * Variables, objective fn, constraints
    * Ex. unconstrained, convex programs, least squares, PCA
* Optimization algorithm:
    * Gradient descent, simplex, SVD

## Optimization Problems

* Unconstrained: Find the w that minimizes a continuous objective fn $f(w)$; f is smooth if its gradient is continuous too
    * Global min: $f(w) \leq f(v), \forall v$
    * Local min: $f(w) \leq f(v), \forall v$ in a tiny ball centered at w
    * Finding a local min is easy; finding the global min is hard
        * Except for convex functions: a line segment connecting any 2 points on the graph does not go below f
        * The perceptron risk fn is convex and nonsmooth
    * A convex fn has either:
        * no min (goes to $-\infty$)
        * just one local min, which is the global min
        * a connected set of local minima which are all global minima
* Algorithms for smooth f
    * Gradient descent
        * blind: repeat $w \leftarrow w - \epsilon \nabla f(w)$
        * stochastic (just 1 training pt at a time): $w \leftarrow w - \epsilon \nabla_w L(X_i^\top w, y_i)$
        * with line search
    * Newton's method (needs the Hessian matrix)
    * Nonlinear conjugate gradient
* Algorithms for nonsmooth f
    * Gradient descent
    * BFGS
* These algs find a local min
* Line search: Find a local min along the search direction by solving an optimization problem in 1D
    * Secant method (smooth only)
    * Newton-Raphson (smooth only, may need the Hessian)
    * Direct line search (nonsmooth ok), like golden section search
* Constrained optimization (smooth equality constraints)
    * Goal: Find the w that minimizes $f(w)$ subject to $g(w)=0$ where g is smooth
    * Alg: Use Lagrange multipliers
* Linear program
    * Linear objective, linear inequality constraints
    * Goal: Find the w that maximizes $c^\top w$ subject to $Aw \leq b$
        * i.e. $A_i^\top w \leq b_i$ for $i \in [1, n]$
    * Feasible region: set of pts that satisfy all the constraints
    * The LP feasible region is a polytope (not always bounded)
    * A point set P is convex if every pair of points' connecting line segment lies fully in P
    * Active constraints: constraints that achieve equality at the optimum (a scipy sketch follows)
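A sketch of the LP formulation above using scipy (an assumption; the lecture does not prescribe a solver). `linprog` minimizes, so the objective $c^\top w$ is negated; the particular `c`, `A`, `b` are made-up toy numbers.

```python
import numpy as np
from scipy.optimize import linprog

# Toy LP: maximize c^T w subject to A w <= b.
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
b = np.array([4.0, 3.0, 3.0])

# linprog minimizes, so pass -c; bounds=(None, None) leaves w unconstrained in sign.
res = linprog(-c, A_ub=A, b_ub=b, bounds=(None, None))
print(res.x)       # optimal vertex of the feasible polytope
print(c @ res.x)   # optimal objective value
```

The optimum lands on a vertex of the polytope where some constraints are active (hold with equality), which is the geometric picture the bullets above describe.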
# Lecture 6: 2/8/21

* Decision theory / risk minimization
* Multiple sample pts with different classes can lie at the same pt
* Want a probabilistic classifier
* Example: 10% of the population has cancer, 90% doesn't
  ![](https://i.imgur.com/O6xGb6b.png)
* Posterior probability: $P(Y=1|X)$
* Prior probability: $P(Y=1)$
* Loss function $L(z,y)$ where z is the prediction and y is the true label
    * The loss function specifies how bad it is if the classifier predicts z but the true class is y
    * ![](https://i.imgur.com/rAiHqd5.png)
    * A 36% chance of a loss of 5 is worse than a 64% chance of a loss of 1, so recommend further cancer screening
    * The 0-1 loss function is 1 for incorrect predictions, 0 for correct ones
    * The loss function does not need to be convex
* The risk for a decision rule r is the expected loss over all values of x, y
    * $R(r) = E[L(r(X),Y)]$
    * $= \sum\limits_x \left(L(r(x),1)P(Y=1|X=x) + L(r(x),-1)P(Y=-1|X=x)\right)P(X=x)$
* The Bayes decision rule / Bayes classifier is the fn $r^*$ that minimizes the functional $R(r)$ (see the numerical sketch at the end of this lecture's notes)
  ![](https://i.imgur.com/qvuuBF8.png)
    * If L is symmetric, pick the class with the biggest posterior probability
* The Bayes risk / optimal risk is the risk of the Bayes classifier
    * No decision rule gives lower risk
* Deriving/using $r^*$ is called risk minimization
* Expected value of $g(X)$: $E[g(X)] = \int\limits^\infty_{-\infty} g(x)f(x)\,dx$
    * Mean: $\mu = \int\limits^\infty_{-\infty} xf(x)\,dx$
* If L is the 0-1 loss, $R(r)$ is the probability that r(x) is wrong
    * The Bayes optimal decision boundary is $\{ x \colon P(Y=1 | X=x) = 0.5 \}$

## 3 ways to build classifiers

* Generative models (ex. LDA, Linear Discriminant Analysis)
    * Assume sample pts come from probability distributions, different for each class
    * Guess the form of each distribution
    * For each class C, fit the distribution params to the class C pts, giving $f(X|Y=C)$
    * For each class C, estimate $P(Y=C)$
    * Bayes' theorem gives $P(Y|X)$
    * If 0-1 loss, pick the class C that maximizes $P(Y=C|X=x)$, equivalently maximizes $f(X=x|Y=C)P(Y=C)$
* Discriminative models (ex. logistic regression)
    * Skip the prior probabilities; model $P(Y|X)$ directly
* Find the decision boundary directly (ex. SVM)
* Advantage of 1, 2: $P(Y|X)$ tells you the probability that your guess is wrong
* Advantage of 1: can diagnose outliers if P(X) is very small
* Disadvantage of 1: Often hard to estimate distributions correctly; real distributions rarely match standard ones
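A minimal sketch of the Bayes decision rule with an asymmetric loss. The loss values below are assumptions chosen to match the "36% chance of a loss of 5 vs 64% chance of a loss of 1" reasoning above, not necessarily the lecture's exact table (which is in the embedded image).

```python
import numpy as np

# Illustrative asymmetric loss L[z, y]: rows index the prediction z, columns the
# true class y, with class order (no cancer = 0, cancer = 1).  These numbers are
# assumptions for this sketch.
L = np.array([[0.0, 5.0],    # predict "no cancer": big loss if the patient has cancer
              [1.0, 0.0]])   # predict "cancer":    small loss if the patient is healthy

def bayes_rule(posterior):
    """Pick the prediction z minimizing the expected loss sum_y L(z, y) P(Y=y | X=x)."""
    p = np.array([1.0 - posterior, posterior])   # [P(Y=0|x), P(Y=1|x)]
    expected_loss = L @ p
    return int(np.argmin(expected_loss)), expected_loss

# With P(cancer | x) = 0.36: loss 0.36*5 = 1.8 for "no cancer" vs 0.64*1 = 0.64 for
# "cancer", so the rule recommends further screening.
print(bayes_rule(0.36))
```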
# Lecture 7: 2/10/21

* Gaussian Discriminant Analysis (GDA): Assume each class comes from a normal distribution
* Isotropic normal dist $X \sim N(\mu, \sigma^2)$: $f(x) = \frac{1}{(\sqrt{2 \pi} \sigma)^d} \exp\left(-\frac{\|x-\mu\|^2}{2\sigma^2}\right)$
    * $\mu, x$ are length-d vectors, $\sigma$ is a scalar
* For each class C, estimate the mean $\mu_C$, variance $\sigma_C^2$, and prior $\pi_C = P(Y=C)$
* Given x, the Bayes decision rule $r^*(x)$ predicts the class C that maximizes $f(X=x|Y=C)\pi_C$
    * $Q_C(x) = \ln\left((\sqrt{2 \pi})^d f_C(x)\, \pi_C\right) = -\frac{\|x-\mu_C\|^2}{2\sigma_C^2} - d \ln \sigma_C + \ln \pi_C$
    * Quadratic in x, easier to optimize
    * Find $\arg \max\limits_C Q_C(x)$
* The Bayes decision boundary is where $Q_C(x) - Q_D(x) = 0$
    * In 1D, it may have 1, 2, or 0 pts
      ![](https://i.imgur.com/13GjAok.png)
* Use Bayes' theorem to recover the posterior probabilities in the 2-class case
    * $P(Y=C|X) = \frac{f(X|Y=C) \pi_C}{f(X|Y=C) \pi_C + f(X|Y=D) \pi_D}$
    * $= s(Q_C(x) - Q_D(x))$
    * Logistic/sigmoid function: $s(\gamma) = \frac{1}{1 + e^{-\gamma}}$

## Linear Discriminant Analysis (LDA)

* Variant of QDA that has linear decision boundaries
    * Less likely to overfit
    * Less faithful probabilistic estimation
    * Use validation on both methods to see which is better
* Assume all Gaussians have the same variance
* $Q_C(x) - Q_D(x) = \left(\frac{\mu_C - \mu_D}{\sigma^2}\right)^\top x + \left(-\frac{\|\mu_C\|^2 - \|\mu_D\|^2}{2\sigma^2} + \ln \pi_C - \ln \pi_D \right)$
* Choose the class C that maximizes the linear discriminant fn
    * $\left(\frac{\mu_C}{\sigma^2}\right)^\top x -\frac{\|\mu_C\|^2}{2\sigma^2} + \ln \pi_C$
* Generative model, as opposed to a discriminative model

## Maximum Likelihood Estimation (MLE)

* Flip a biased coin: heads with prob $p$, tails with prob $1-p$
    * Goal: estimate $p$
    * The number of heads is $X \sim B(n,p)$, a binomial dist
    * $\mathcal{L}$ is the likelihood fn; e.g. with $n = 10$ flips and 8 heads observed, $\mathcal{L}(p) = P(X=8) = \binom{10}{8} p^8 (1-p)^2 = 45p^8 (1-p)^2$
    * Read it as a function of the distribution parameters
* MLE is a method of estimating the params of a statistical distribution by picking the params that maximize $\mathcal{L}$
    * It is a method of density estimation: estimating a PDF from data
    * Find the p that maximizes $\mathcal{L}(p)$
* Likelihood of a Gaussian
    * $\mathcal{L}(\mu, \sigma; X_1, X_2,\dots X_n) = f(X_1)f(X_2)\cdots f(X_n)$
    * $\ell = \ln \mathcal{L}$ is the log likelihood
    * Maximizing $\mathcal{L}$ and maximizing $\ell$ give the same params
    * $\ell(\mu, \sigma; X_1, X_2,\dots X_n) = \ln f(X_1) + \ln f(X_2) + \cdots + \ln f(X_n)$
    * $= \sum\limits_{i=1}^n \left( -\frac{\|X_i - \mu\|^2}{2\sigma^2} - d \ln \sqrt{2\pi} - d \ln \sigma \right)$
    * Which is a sum of logs of the normal PDF
    * Want to set $\nabla_\mu \ell = 0, \frac{\partial \ell }{\partial \sigma} = 0$
    * $\nabla_\mu \ell = 0 \implies \hat{\mu} = \frac{1}{n} \sum\limits_{i=1}^n X_i$
    * $\frac{\partial \ell }{\partial \sigma} = 0 \implies \hat{\sigma}^2 = \frac{1}{dn}\sum\limits_{i=1}^n \|X_i - \mu\|^2$
        * Use $\hat{\mu}$ in place of $\mu$
* Tells us to use the sample mean and variance of the pts in class C to estimate the mean and variance of the Gaussian for class C
* For QDA: estimate the conditional mean $\hat{\mu}_C$ and conditional variance $\hat{\sigma}_C^2$ of each class C separately
    * $\hat{\pi}_C = \frac{n_C}{\sum_D n_D}$
    * The denominator is the total number of sample pts in all classes
* For LDA: same means and priors, but one variance for all classes
    * Pooled within-class variance: $\hat{\sigma}^2 = \frac{1}{dn}\sum\limits_{C} \sum\limits_{\{i \colon y_i = C\}} \|X_i - \hat{\mu}_C\|^2$ (see the estimation sketch below)
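A sketch of the MLE estimates above for isotropic QDA/LDA: per-class means, per-class variances, priors, and the pooled within-class variance. The function name and toy data are my own assumptions.

```python
import numpy as np

def fit_isotropic_gda(X, y):
    """MLE estimates for isotropic QDA/LDA: per-class mean, isotropic variance,
    prior, and the pooled within-class variance shared by LDA.
    X is n x d; y holds class labels."""
    n, d = X.shape
    params, pooled = {}, 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)                          # sample mean of class c
        var = ((Xc - mu) ** 2).sum() / (d * len(Xc))  # sigma_c^2 = (1/(d*n_c)) sum ||X_i - mu_c||^2
        params[c] = {"mu": mu, "var": var, "prior": len(Xc) / n}
        pooled += ((Xc - mu) ** 2).sum()
    return params, pooled / (d * n)                   # per-class params, pooled variance

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y = np.array([0, 0, 1, 1])
params, pooled_var = fit_isotropic_gda(X, y)
print(params[0]["mu"], params[0]["prior"], pooled_var)
```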
# Lecture 8: 2/17/21

* If v is an eigenvector of A with eigenvalue $\lambda$, then v is an eigenvector of $A^k$ with eigenvalue $\lambda^k$
* If A is invertible, $A^{-1}$ has eigenvector v with eigenvalue $\frac{1}{\lambda}$
* Spectral theorem: Every real symmetric n×n matrix has real eigenvalues and n mutually orthogonal eigenvectors
* $\|A^{-1}x\|^2 = 1$ is an ellipsoid with axes $v_1, \ldots v_n$ and radii $\lambda_1, \ldots \lambda_n$
    * Special case: if A is diagonal, the eigenvectors are the coordinate axes and the ellipsoid is axis-aligned
* A symmetric matrix is:
    * PD: all positive eigenvalues
    * PSD: all nonnegative eigenvalues
    * Indefinite: has at least 1 positive eigenvalue and 1 negative eigenvalue
    * Invertible: has no zero eigenvalue
* The quadratic form $x^\top A^{-2} x$ is PD because the eigenvalues of $A^{-2}$ are squared (hence positive)
* Orthonormal matrix: acts like a rotation or reflection
* Thm: $A = V \Lambda V^\top = \sum\limits^n_{i=1} \lambda_i v_i v_i^\top$
  ![](https://i.imgur.com/2Y8a7Lg.png)
  ![](https://i.imgur.com/SMYxFKv.png)

# Lecture 8: 2/22/21

## Anisotropic Gaussians

* Normal PDF: $f(x) = n(q(x)) = \frac{1}{\sqrt{(2 \pi)^d | \Sigma |}} \exp\left(- \frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$
    * $n(q) = \frac{1}{\sqrt{(2 \pi)^d | \Sigma |}}e^{-q/2}$
        * $\mathbb{R} \rightarrow \mathbb{R}$ (exponential)
    * $q(x) = (x - \mu)^\top \Sigma^{-1} (x - \mu)$
        * $\mathbb{R}^d \rightarrow \mathbb{R}$ (quadratic)
* Covariance matrix: $\Sigma = V \Gamma V^\top$
    * The eigenvalues of $\Sigma$ are the variances along the eigenvectors, $\Gamma_{ii} = \sigma_i^2$
    * $\Sigma ^{\frac{1}{2}} = V \Gamma^{\frac{1}{2}} V^\top$
        * Maps spheres to ellipsoids
        * The eigenvalues of $\Sigma ^{\frac{1}{2}}$ are the stdevs $\Gamma_{ii}^{\frac{1}{2}} = \sigma_i$ (Gaussian widths/ellipsoid radii)

## MLE for Anisotropic Gaussians

* Given sample pts and classes, find the best-fit Gaussians
* QDA: $\hat{\Sigma}_C = \frac{1}{n_C} \sum\limits_{i \colon y_i = C} (X_i - \hat{\mu}_C)(X_i - \hat{\mu}_C)^\top$
    * Conditional covariance (for pts in class C)
    * Sum of outer products, a d×d matrix
    * $\hat{\pi}_C, \hat{\mu}_C$ same as before
    * $\hat{\Sigma}_C$ is always PSD but not always PD
* Choose the class C that maximizes the discriminant fn
  ![](https://i.imgur.com/5KxJkeE.png)
    * The decision fn is quadratic and may be indefinite
* LDA can be interpreted as projecting points onto a line
* For 2 classes:
    * LDA has d+1 parameters (w, $\alpha$)
    * QDA has $\frac{d(d+3)}{2} + 1$ params
    * QDA is more likely to overfit
* With added features, LDA can give nonlinear boundaries and QDA nonquadratic ones
* We don't get the true optimal Bayes classifier (we estimate the distributions from finite data)
* Changing the priors or loss = adding constants to the discriminant fns
* The posterior gives decision boundaries for 10%, 50%, 90% probability
    * Choosing isovalue p is the same as
        * choosing an asymmetric loss of p for false positives and 1-p for false negatives, OR $\pi_C = 1-p, \pi_D = p$

## Terminology

* Let X be the n x d design matrix of sample pts
    * Each row i of X is a sample pt $X_i^\top$
* Centering X: Subtract $\mu^\top$ from each row of X: $X \rightarrow \dot{X}$
* Let R be the uniform distribution on sample pts
* Sample cov matrix is $Var(R) = \frac{1}{n} \dot{X}^\top \dot{X}$
* Decorrelating $\dot{X}$: Apply the rotation $Z = \dot{X}V$, where $Var(R) = V \Lambda V^\top$
    * $Var(Z) = \Lambda$
* Sphering $\dot{X}$: $W = \dot{X}\, Var(R)^{-1/2}$
* Whitening X: Centering + sphering
    * W has covariance matrix I (sketch below)
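A sketch of centering, decorrelating, and sphering the design matrix via the eigendecomposition of the sample covariance. It scales the decorrelated coordinates $Z = \dot{X}V$ by $\Gamma^{-1/2}$; the lecture's $W = \dot{X}\,Var(R)^{-1/2}$ differs only by a final rotation, and both yield identity covariance. Names and the toy data are mine.

```python
import numpy as np

def whiten(X, eps=1e-12):
    """Center, decorrelate, and sphere the n x d design matrix X.
    Returns a whitened matrix with (approximately) identity covariance."""
    Xc = X - X.mean(axis=0)                 # centering: subtract mu^T from each row
    cov = (Xc.T @ Xc) / len(X)              # sample covariance Var(R)
    lam, V = np.linalg.eigh(cov)            # eigendecomposition V Lambda V^T
    Z = Xc @ V                              # decorrelating: rotate onto the eigenvector axes
    W = Z / np.sqrt(lam + eps)              # sphering: divide out the standard deviations
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.3]])
W = whiten(X)
print(np.round((W.T @ W) / len(W), 2))      # ~ identity covariance
```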
# Lecture 9: 2/24/21

* Regression: Fitting curves to data
    * Classification: Given a pt, predict its class
    * Regression: Given a pt, predict a numerical value
* Choose the form of a regression fn $h(x; p)$ with parameters p
    * Like the decision fn in classification
* Choose a cost fn (objective fn) to optimize
    * Usually based on a loss fn
* Regression fns:
    * 1) Linear: $h(x; w, \alpha) = w ^\top x + \alpha$
    * 2) Polynomial: Add features to linear
    * 3) Logistic: $h(x; w, \alpha) = s(w^\top x + \alpha)$
        * $s(\gamma) = \frac{1}{1 + e^{- \gamma}}$
* Loss fns: z is the prediction $h(x)$, y is the true label
    * A) Squared error: $L(z,y) = (z-y)^2$
    * B) Absolute error: $L(z,y) = |z-y|$
    * C) Logistic loss/cross-entropy: $L(z,y) = -y \ln(z) - (1-y) \ln(1-z)$
* Cost fns to minimize:
    * a) Mean loss: $J(h) = \frac{1}{n} \sum\limits^n_{i=1} L(h(X_i), y_i)$
    * b) Max loss: $J(h) = \max\limits_{i=1 \ldots n} L(h(X_i), y_i)$
    * c) Weighted sum: $J(h) = \sum\limits^n_{i=1} \omega_i L(h(X_i), y_i)$
    * d) $l_2$ penalized/regularized: $J(h) = \ldots + \lambda \|w\|_2^2$
        * $\ldots$ is one of the previous options
    * e) $l_1$ penalized/regularized: $J(h) = \ldots + \lambda \|w\|_1$
* Famous regression methods:
    * Least-squares linear regression: 1Aa
        * Quadratic cost, minimize w/ calculus
    * Weighted least-squares linear: 1Ac
        * Quadratic cost, minimize w/ calculus
    * Ridge regression: 1Ad
        * Quadratic cost, minimize w/ calculus
    * LASSO: 1Ae
        * QP w/ an exponential number of constraints
    * Logistic regression: 3Ca
        * Convex cost fn, minimize w/ gradient descent or Newton's method
    * Least absolute deviations: 1Ba
        * LP
    * Chebyshev criterion: 1Bb
        * LP

## Least Squares Linear Regression

* Linear regression fn (1) + squared loss fn (A) + cost fn (a)
* Optimization problem: $\min\limits_{w, \alpha} \sum\limits^n_{i=1} (X_i ^\top w + \alpha - y_i)^2$
    * X is the n x d design matrix of sample pts
    * y is the n-vector of scalar labels
    * $X_i^\top$ transposes the column vector point
    * Feature column: $X_{*j}$
    * Usually $n > d$
* With the fictitious dimension, X is an n x (d+1) matrix and w is a (d+1)-vector
    * $\begin{bmatrix} x_1 & x_2 & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \alpha \end{bmatrix}$
* Residual sum of squares: $RSS(w) = \|Xw-y\|^2$; solve $\min\limits_w RSS(w)$
* Optimize with calculus:
    * If $X^\top X$ is singular, the problem is underconstrained (all pts fall on a common hyperplane)
    * Never overconstrained
    * Use a linear solver to find $w^* = (X^\top X)^{-1} X^\top y$
    * Pseudoinverse (a $(d+1) \times n$ matrix): $X^+ = (X^\top X)^{-1} X^\top$
    * The pseudoinverse is a left inverse: $X^+X = (X^\top X)^{-1} X^\top X = I$
    * Predicted values are $\hat{y} = Xw = XX^+y = Hy$
    * Hat matrix: $H = XX^+$ (an n x n matrix)
* Advantages: Easy to compute, unique/stable solution
* Disadvantages: Very sensitive to outliers because errors are squared; fails if $X^\top X$ is singular

## Logistic Regression

* Logistic regression fn (3) + logistic loss fn (C) + cost fn (a)
* Fits probabilities in the range (0,1)
* Discriminative model (as opposed to QDA/LDA, which are generative models)
* $\min\limits_w J(w) = \min\limits_w \sum\limits^n_{i=1} L(s(X_i^\top w), y_i) = \min\limits_w - \sum\limits^n_{i=1} \left( y_i \ln s(X_i^\top w) + (1-y_i) \ln(1-s(X_i^\top w)) \right)$
* $J(w)$ is convex; solve by gradient descent
    * $s'(\gamma) = s(\gamma)(1 - s(\gamma))$
    * $s_i = s(X_i^\top w)$, $\nabla_w J = -X^\top (y-s(Xw))$, with $s(Xw)$ applied elementwise
    * Gradient descent rule: $w \leftarrow w + \epsilon X^\top (y-s(Xw))$ (see the sketch after this lecture's notes)
    * SGD: $w \leftarrow w + \epsilon (y_i-s(X_i^\top w))X_i$
        * Shuffle the pts into random order, process one by one
        * For large n, sometimes converges before visiting all pts
* Logistic regression always separates linearly separable pts
* $\nabla^2_w J(w) = \sum\limits^n_{i=1} s_i(1-s_i)X_iX_i^\top = X^\top \Omega X$
    * $\Omega = \mathrm{diag}(s_i(1-s_i)) \succ 0$, making it PD
    * So $X^\top \Omega X \succeq 0$, so $J(w)$ is convex
    * Newton's method: solve $(X^\top \Omega X)e = X^\top(y-s)$ for e, then update $w \leftarrow w + e$
    * An example of iteratively reweighted least squares
    * Misclassified points far from the decision boundary have a large influence
    * Points close to the decision boundary also have medium to large influence
* LDA vs logistic regression
    * LDA advantages:
        * For well-separated classes LDA is stable, logreg is unstable
        * $>2$ classes are easy and elegant; logreg needs modifications (softmax regression)
        * LDA is slightly more accurate when the classes are nearly normal, especially if n is small
    * Logreg advantages:
        * More emphasis on the decision boundary; always separates linearly separable pts
        * More robust on some non-Gaussian distributions (e.g. with large skew)
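A sketch of batch gradient descent for logistic regression using the update rule $w \leftarrow w + \epsilon X^\top(y - s(Xw))$ above, with the bias folded in via a fictitious dimension. Labels are in {0, 1}; the learning rate, iteration count, and toy data are assumptions.

```python
import numpy as np

def sigmoid(g):
    g = np.clip(g, -500, 500)              # avoid overflow in exp for extreme inputs
    return 1.0 / (1.0 + np.exp(-g))

def fit_logistic(X, y, epsilon=0.1, iters=2000):
    """Batch gradient descent: w <- w + epsilon * X^T (y - s(Xw)), y in {0, 1}."""
    n, d = X.shape
    X1 = np.hstack([X, np.ones((n, 1))])   # fold the bias alpha into w
    w = np.zeros(d + 1)
    for _ in range(iters):
        w = w + epsilon * X1.T @ (y - sigmoid(X1 @ w))
    return w

def predict_proba(w, X):
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return sigmoid(X1 @ w)

X = np.array([[0.0, 0.5], [1.0, 1.0], [3.0, 2.5], [4.0, 3.0]])
y = np.array([0, 0, 1, 1])
w = fit_logistic(X, y)
print(np.round(predict_proba(w, X), 2))    # approaches y for linearly separable data
```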
# Lecture 10: 3/1/21

## Least Squares Polynomial Regression

* Replace each $X_i$ with a feature vector $\Phi(X_i)$
    * Ex. $\Phi(X_i) = \begin{bmatrix} X_{i1}^2 & X_{i1}X_{i2} & X_{i2}^2 & X_{i1} & X_{i2} & 1 \end{bmatrix}$
* Can also use non-polynomial features
* Logistic regression + quadratic features gives posteriors of the same form as QDA
* Very easy to overfit

## Weighted Least-Squares Regression

* Linear regression fn (1) + squared loss fn (A) + cost fn (c)
* Assign each sample pt a weight $\omega_i$, and form $\Omega = \mathrm{diag}(\omega_i)$
    * A bigger $\omega_i$ means work harder to minimize $|\hat{y}_i - y_i|^2$
* $\min\limits_w (Xw-y)^\top \Omega (Xw-y) = \sum\limits_{i=1}^n \omega_i (X_i^\top w - y_i)^2$
* $X^\top \Omega X w = X^\top \Omega y$; solve for $w$

## Newton's Method

* Iterative optimization method for a smooth fn $J(w)$
* Often much faster than gradient descent
* Approximate $J(w)$ near v by a quadratic fn:
    * $\nabla J(w) \approx \nabla J(v) + (\nabla^2 J(v))(w-v)$
    * Setting this to zero gives $w = v - (\nabla^2 J(v))^{-1} \nabla J(v)$
    * Repeat until convergence
* Warnings:
    * Doesn't know the difference between minima, maxima, and saddle pts
    * The starting pt must be close enough to the desired critical point

## ROC Curves

* For test sets, to evaluate a classifier
* Receiver Operating Characteristic
* Rate of FP vs TP over a range of threshold settings
    * x-axis is the false positive rate = % of negative items incorrectly classified as positive
    * y-axis is the true positive rate = % of positives classified as positive
* Top right: always classify as positive (accept any posterior $\geq 0$)
* Bottom left: always classify as negative (require posterior $> 1$)
* Want a high area under the curve (close to 1)
  ![](https://i.imgur.com/03KlejE.png)

# Lecture 11: 3/3/21

## Statistical Justifications for Regression

* Sample pts come from an unknown prob dist: $X_i \sim D$
* y values are the sum of an unknown nonrandom fn + random noise
    * $y_i = g(X_i) + \epsilon_i$, $\epsilon_i \sim D'$ with $D'$ having mean 0
* Goal of regression: find an h that estimates $g$
* Ideal approach: Choose $h(x) = E_Y[Y|X=x] = g(x) + E[\epsilon] = g(x)$

## Least Squares Regression from MLE

* Suppose $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$
    * Then $y_i \sim \mathcal{N}(g(X_i), \sigma^2)$
* $\ell(g; X, y) = \ln(f(y_1) f(y_2) \cdots f(y_n)) = -\frac{1}{2\sigma^2} \sum (y_i - g(X_i))^2 - \text{const}$
* MLE over g $\Leftrightarrow$ minimizing the sum of squared errors

## Empirical Risk

* The risk for hypothesis h is the expected loss $R(h) = E[L]$
* Discriminative: we don't know the dist D of X
* Empirical distribution: the discrete uniform distribution over the sample pts
* Empirical risk: expected loss under the empirical distribution
    * $\hat{R}(h) = \frac{1}{n} \sum\limits_{i=1}^n L(h(X_i), y_i)$
* So we minimize the sum of loss fns

## Logistic Loss from MLE

* The actual probability that pt $X_i$ is in class C is $y_i$; the predicted prob is $h(X_i)$
* Imagine $\beta$ duplicate copies of $X_i$
    * $y_i \beta$ copies are in class C, $(1 - y_i) \beta$ copies are not
* $\mathcal{L}(h; X,y) = \prod h(X_i)^{y_i \beta} (1-h(X_i))^{(1-y_i)\beta}$
* $\ell(h) = -\beta \sum$ of the logistic losses $L(h(X_i), y_i)$
    * The logistic loss is $-y_i \ln(z_i) - (1-y_i) \ln(1 - z_i)$
* Max likelihood --> min sum of logistic loss fns

## Bias-Variance Decomposition

* 2 sources of error in a hypothesis:
    * bias: error due to the inability of the hypothesis to fit g perfectly
    * variance: error due to fitting random noise in the data
* Model: $X_i \sim D, \epsilon_i \sim D', y_i = g(X_i) + \epsilon_i$
* Fit hypothesis h to X, y
    * h is a RV; its weights are random because they depend on the random X, y
* Consider an arbitrary pt z and $\gamma = g(z) + \epsilon$
    * $E[\gamma] = g(z)$, $Var(\gamma) = Var(\epsilon)$
* Risk fn when loss = squared error: $R(h) = E[L(h(z), \gamma)]$
    * Taking the expectation over all possible training sets X, y and values of $\gamma$
    * $R(h) = E[(h(z)- \gamma)^2]$
    * $R(h) = (E[h(z)]-g(z))^2 + Var(h(z)) + Var(\epsilon)$
        * First term: bias²
        * Second term: variance of the method
        * Third term: irreducible error
* Underfitting = too much bias
* Most overfitting = too much variance
* Training error reflects bias but not variance
* For most distributions, variance $\to 0$ as $n \to \infty$
* If h can fit g exactly, for many distributions bias $\to 0$ as $n \to \infty$
* If h cannot fit g well, bias is large at most points
* Adding a good feature reduces bias; adding a bad feature rarely increases it
* A good feature reduces bias more than it increases variance
* Adding a feature usually increases variance
* Noise in the test set affects only Var($\epsilon$)
* Noise in the training set affects only bias and Var(h)
* Can't precisely measure bias and variance on real-world data
* Can test learning algs by choosing g and making synthetic data (see the sketch below)
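A minimal synthetic experiment of the kind the last bullet suggests: choose a g, repeatedly sample training sets, and estimate the bias² and variance of least squares at one test point z. All the constants (g, n, the noise level, the number of trials) are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):                       # ground-truth function chosen for the experiment
    return 2.0 * x + 1.0

def fit_least_squares(X, y):    # linear least squares with a bias feature
    X1 = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return w

z, n, trials, noise_sd = 0.5, 30, 2000, 1.0
preds = np.empty(trials)
for t in range(trials):
    X = rng.uniform(-1, 1, size=n)                 # X_i ~ D
    y = g(X) + rng.normal(0, noise_sd, size=n)     # y_i = g(X_i) + eps_i
    w = fit_least_squares(X, y)
    preds[t] = w[0] * z + w[1]                     # h(z) for this training set

bias_sq = (preds.mean() - g(z)) ** 2               # (E[h(z)] - g(z))^2, ~0 since h can fit g
variance = preds.var()                             # Var(h(z)), shrinks as n grows
print(bias_sq, variance, noise_sd ** 2)            # third term is the irreducible error
```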
# Lecture 12: 3/8/21

## Ridge Regression

* 1 + A + d
    * $l_2$-penalized mean loss
* $\min \|Xw - y\|^2 + \lambda \|w'\|^2$
    * $w'$ is $w$ with the component $\alpha$ replaced by zero
    * Don't penalize $\alpha$
* Regularization term / penalty term, causes shrinkage
* The penalty guarantees positive definite normal equations (a unique solution)
* Reduces overfitting by reducing variance: penalizes large weights
* Normal equations: $(X^\top X + \lambda I')w = X^\top y$
    * $I' \triangleq I$ with the bottom-right entry set to 0
* Solve for w; return $h(z) = w^\top z$
* Increasing $\lambda$ means more regularization
* For $y=Xv+e$, with e being noise:
    * The variance of ridge regression is $Var(z^\top (X^\top X + \lambda I')^{-1}X^\top e)$

## Bayesian Justification

* Prior probability: $w' \sim N(0, \sigma^2)$
* Apply MLE to the posterior
    * Maximize the log posterior: $\ln \mathcal{L}(w) + \ln f(w') - \text{const}$
    * Equivalent to minimizing $\|Xw - y\|^2 + \lambda \|w'\|^2$

## Feature Subset Selection

* All features increase variance, but not all features reduce bias
* Idea: Identify poorly predictive features and ignore them (give them weight 0)
    * Less overfitting, smaller test errors
    * 2nd motivation: inference
    * Usually hard because features can partly encode the same information
* Algorithm: Try all $2^d - 1$ nonempty subsets of features
    * Too slow; choose the best subset by (cross-)validation
* Heuristic 1: Forward stepwise selection
    * Start with the null model (0 features)
    * Repeatedly add the best feature until validation error increases
    * Trains $O(d^2)$ models
    * Not perfect: it won't find the best 2-feature model if neither of those features yields the best 1-feature model
* Heuristic 2: Backward stepwise selection
    * Start with all d features; repeatedly remove the feature whose removal best reduces validation error
    * Trains $O(d^2)$ models

## LASSO

* Regression w/ regularization: 1 + A + e
    * $l_1$-penalized mean loss
* $\min \|Xw - y\|^2 + \lambda \|w'\|_1$
* Algs: subgradient descent, least-angle regression
* (A small sketch of ridge's normal equations follows)
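A sketch of solving the ridge normal equations from this lecture directly, with $I'$ zeroing out the penalty on the bias $\alpha$ (the fictitious-dimension weight). The toy data and $\lambda$ values are assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lambda * I') w = X^T y, where I' is the identity with its
    bottom-right entry zeroed so the bias alpha is not penalized."""
    n, d = X.shape
    X1 = np.hstack([X, np.ones((n, 1))])       # append the fictitious dimension
    I_prime = np.eye(d + 1)
    I_prime[-1, -1] = 0.0                      # don't penalize alpha
    return np.linalg.solve(X1.T @ X1 + lam * I_prime, X1.T @ y)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])
for lam in (0.0, 1.0, 10.0):                   # more regularization shrinks w'
    print(lam, np.round(ridge_fit(X, y, lam), 3))
```

As $\lambda$ grows the non-bias weights shrink toward zero, which is the variance-reduction effect the ridge bullets describe; LASSO's $l_1$ penalty would instead drive some weights exactly to zero.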