# [CS467 Cheatsheet](https://robinjia.github.io/classes/spring2023-csci467/)

### Gradient Descent

- Gradient: $\nabla_w f(w) = \begin{bmatrix} \frac{\partial f}{\partial w_1} \\ \vdots \\ \frac{\partial f}{\partial w_d} \end{bmatrix} \in \mathbb{R}^d$, where $w \in \mathbb{R}^d$
  - the gradient is the direction of steepest ascent
  - the negative gradient is the direction of steepest descent
- update rule: $w \leftarrow w - \eta \nabla_w L(w)$, where $L(w)$ is the loss function

### Convexity

- convex function $f$: the line segment connecting $(x_1, f(x_1))$ and $(x_2, f(x_2))$ lies on or above the graph of $f$
- If $f''(x) \ge 0$ and exists everywhere, then $f$ is convex
- If $f$ is convex, then $g(x) = f(Ax+b)$ is convex
- If $f(x)$ and $g(x)$ are convex, so is $f(x)+g(x)$
- For a convex function, any local minimum is a global minimum

### Maximum Likelihood Estimation (MLE)

- posit a [probabilistic process](https://robinjia.github.io/assets/csci467/lectures/lecture6.pdf) that generated our data
- find parameters $w$ that make the observed data seem most likely
- $\max_{w} P(D; w)$, where data $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^n$
- view $w$ as constant-valued but unknown
- *Recipe*: take the negative log-likelihood of the data as the loss function, then use [gradient descent](#Gradient-Descent) to find the parameters $w$ that minimize the loss
  - e.g., for [discriminative](#Discriminative) models, $L(w) = - \sum_{i=1}^{n} \log P(y^{(i)}|x^{(i)}; w)$

### Maximum A Posteriori Estimation (MAP)

- assume a prior $P(w)$, which models our prior belief or preference about $w$
- view $w$ as a random variable
- $\max_{w} P(w|D)$: find the parameters $w$ that are most probable after seeing the data $D$
  - $= \max_{w} P(D|w) P(w)$, by Bayes' rule ($P(D)$ does not depend on $w$)

### Linear/Logistic/Softmax Regression

- Linear regression for regression tasks (predict scalars)
  - has a closed-form solution, aka the normal equation:
    - set $\nabla_w L(w)$ to $0$
    - $w^* = (X^\top X)^{-1}X^\top y$, where $X \in \mathbb{R}^{n\times d}, y \in \mathbb{R}^{n}$
- Logistic regression for binary classification tasks
- Softmax regression for multiclass classification tasks

#### Logistic vs. Softmax

- logistic function: $\sigma(z) = \frac{1}{1+e^{-z}}$
  - range: $(0, 1)$
- softmax function: $s(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{C}e^{z_k}}$
  - $\sum_{i=1}^{C}s(z_i) = 1$
- with only two classes (binary classification), softmax regression is equivalent to logistic regression
  - $\frac{e^{z_1}}{e^{z_0}+e^{z_1}} = \frac{1}{e^{(z_0-z_1)}+1} = \frac{1}{1+e^{-z^\prime}}$, where $z^\prime = z_1-z_0$

#### Logistic Regression

- $P(y=1|x) = \sigma(w^{\top}x)$
- $P(y=-1|x) = 1-P(y=1|x) = 1- \frac{1}{1+e^{-w^{\top} x}} = \sigma(-w^{\top}x)$
- Thus, $P(y|x)$ can be written as $\sigma(y\,w^{\top}x)$ for $y \in \{-1, +1\}$
- With MLE, the loss is $L(w) = \sum_{i=1}^{n} -\log \sigma(y^{(i)}\,w^{\top}x^{(i)})$
  - we call $y^{(i)}\,w^{\top}x^{(i)}$ the *margin*; the larger the margin, the lower the loss
  - $-\log\sigma(\cdot)$ is convex $\rightarrow$ the logistic regression loss is convex
- $\nabla_w L(w) = - \sum_{i=1}^{n} \sigma(-y^{(i)}\,w^{\top}x^{(i)}) \cdot y^{(i)} \cdot x^{(i)}$, as used in the sketch below
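A minimal NumPy sketch of the MLE recipe above: train logistic regression by gradient descent on the negative log-likelihood, using the gradient formula from this section. Labels are assumed to be in $\{-1, +1\}$; the function name `logistic_regression_gd` and the defaults `eta`, `n_steps` are illustrative, not from the course.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, eta=0.1, n_steps=1000):
    """Minimize L(w) = sum_i -log sigmoid(y_i * w^T x_i) with gradient descent.
    X is (n, d), y is (n,) with labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        margins = y * (X @ w)                    # margin y_i * w^T x_i for each example
        grad = -(X.T @ (sigmoid(-margins) * y))  # gradient of the negative log-likelihood
        w = w - eta * grad                       # update rule: w <- w - eta * grad L(w)
    return w

# usage sketch (toy data)
# X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]])
# y = np.array([1, 1, -1])
# w = logistic_regression_gd(X, y)
# p_pos = sigmoid(X @ w)   # P(y = 1 | x) for each row of X
```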
### Discriminative vs. Generative Models

#### Discriminative

- directly learn $P(y|x)$
- MLE maximizes the conditional likelihood $P(y|x)$
  - Likelihood of data: $\prod_{i=1}^{n} P(y^{(i)}|x^{(i)}; w)$
  - Log-likelihood: $\sum_{i=1}^{n} \log P(y^{(i)}|x^{(i)}; w)$
- e.g., Logistic Regression

#### Generative

- learn $P(x|y)$ (features given the class)
- learn $P(y)$ (class prior)
- MLE maximizes the joint likelihood $P(x,y) = P(x|y)P(y)$
- prediction: plug the learned $P(x|y)$ and $P(y)$ into Bayes' rule to get $P(y|x)$
  - $P(y|x) = \frac{P(x|y)P(y)}{P(x)}$, where $P(x) = \sum_{k=1}^C P(x|y=k)P(y=k)$
  - e.g., for binary classification, $P(y=1|x) = \frac{P(x|y=1)P(y=1)}{P(x|y=0)P(y=0) + P(x|y=1)P(y=1)}$
- e.g., Naive Bayes

### $k$NN

- keeps the training set around at prediction time ("non-parametric")
- define a metric that measures the distance between two datapoints, e.g., Euclidean distance
- choose a hyperparameter $k$
- To predict the label of a test input $x$ (see the sketch below):
  - find its $k$ nearest neighbors in the training set
  - predict the label that is most common among those neighbors
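A minimal NumPy sketch of the $k$NN prediction procedure above, using Euclidean distance and a majority vote; the function name `knn_predict` and the default `k=3` are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of a single test input x by majority vote
    among its k nearest training points under Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)  # count labels among the neighbors
    return votes.most_common(1)[0][0]             # most common label wins

# usage sketch (toy data)
# X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
# y_train = np.array([0, 0, 1])
# knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=2)   # -> 0
```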