# [CS467 Cheatsheet](https://robinjia.github.io/classes/spring2023-csci467/)
### Gradient Descent
- Gradient: ${\nabla_w f(w)} =\begin{bmatrix} \frac{\partial f}{{\partial{w_1}}} \\ \vdots \\ \frac{\partial f}{{\partial{w_d}}} \end{bmatrix}\in \mathbb{R}^d$, where $w \in \mathbb{R}^d$
- gradient is the direction of steepest ascent
- negative gradient is the direction of steepest descent
- update rule: $w \leftarrow w - \eta \nabla_w L(w),$ where $L(w)$ is the loss function
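
A minimal NumPy sketch of the update rule (illustration only; `loss_grad`, the step size, and the step count are placeholders, not from the lecture):

```python
import numpy as np

def gradient_descent(loss_grad, w0, eta=0.1, num_steps=100):
    """Repeat w <- w - eta * grad L(w); `loss_grad` is assumed to return the gradient at w."""
    w = np.array(w0, dtype=float)
    for _ in range(num_steps):
        w = w - eta * loss_grad(w)
    return w

# Example: L(w) = ||w||^2 has gradient 2w, so gradient descent converges toward w = 0.
w_star = gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0])
```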
### Convexity
- convex function $f$: the line segment connecting $(x_1, f(x_1))$ and $(x_2, f(x_2))$ must lie on or above the graph of $f$
- If $f''(x)$ exists everywhere and $f''(x) \ge 0$, then $f$ is convex
- If $f$ is convex, then $g(x) = f(Ax+b)$ is convex
- If $f(x)$ and $g(x)$ are convex, so is $f(x)+g(x)$
- For a convex function, any local minimum is a global minimum
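
As a quick illustration (not from the lecture), the chord condition can be spot-checked numerically; here for $f(z) = -\log\sigma(z)$, which reappears below as the logistic regression loss:

```python
import numpy as np

def neg_log_sigmoid(z):
    # -log(sigma(z)) written in a numerically stable form: log(1 + exp(-z))
    return np.logaddexp(0.0, -z)

# Chord condition: f(t*z1 + (1-t)*z2) <= t*f(z1) + (1-t)*f(z2) for t in [0, 1].
rng = np.random.default_rng(0)
z1, z2, t = rng.normal(size=1000), rng.normal(size=1000), rng.uniform(size=1000)
lhs = neg_log_sigmoid(t * z1 + (1 - t) * z2)
rhs = t * neg_log_sigmoid(z1) + (1 - t) * neg_log_sigmoid(z2)
assert np.all(lhs <= rhs + 1e-12)  # holds everywhere for a convex function
```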
### Maximum Likelihood Estimation (MLE)
- posit a [probabilistic process](https://robinjia.github.io/assets/csci467/lectures/lecture6.pdf) that generated our data
- find parameters $w$ that make observed data seem most likely
- $\max\limits_{w}P(D;w)$, where data $D=\{(x^{({i})}, y^{({i})})\}_{i=1}^n$
- view $w$ as being constant-valued but unknown
- *Recipe*: take the negative log-likelihood of the data as the loss function, then use [gradient descent](#gradient-descent) to find the parameters $w$ that minimize the loss (see the sketch after this list)
- e.g., for [discriminative](#discriminative) models, $L(w) = - \sum_{i=1}^{n} \log P(y^{(i)}|x^{(i)}; w)$
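
A hedged sketch of the recipe for a discriminative model; `log_prob(x, y, w)` is a hypothetical placeholder for $\log P(y|x;w)$, and the loss below would be minimized with the gradient-descent loop sketched earlier:

```python
def nll_loss(log_prob, X, y, w):
    """Negative log-likelihood L(w) = -sum_i log P(y_i | x_i; w)."""
    return -sum(log_prob(x_i, y_i, w) for x_i, y_i in zip(X, y))
```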
### Maximum A Posteriori Estimation (MAP)
- assume a prior $P(w)$, which models our prior belief or preference about $w$
- view $w$ as a random variable
- $\max\limits_{w} P(w|D)$: find parameters $w$ after seeing the data $D$
- $= \max\limits_{w} P(D|w) P(w)$
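
Equivalently, because the log is monotone, the MAP objective can be rewritten as minimizing the MLE loss plus an extra penalty coming from the prior (a worked rewriting of the line above):

$$\arg\max_{w} P(D|w)\,P(w) \;=\; \arg\min_{w}\; \Big(\underbrace{-\log P(D|w)}_{\text{NLL loss}} \;\; \underbrace{-\,\log P(w)}_{\text{from the prior}}\Big)$$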
### Linear/Logistic/Softmax Regression
- Linear regression for regression tasks (predict scalars)
- has a closed-form solution, a.k.a. the normal equation (see the NumPy sketch after this list):
- set $\nabla_w L(w)$ to $0$
- $w^* = (X^\top X)^{-1}X^\top y \,,$ where $X \in \mathbb{R}^{n\times d}, y \in \mathbb{R}^{n}$
- Logistic regression for binary classification tasks
- Softmax regression for multiclass classification tasks
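
A minimal sketch of the normal equation in NumPy; the data here is synthetic and purely illustrative, and `np.linalg.solve` is used instead of forming the inverse explicitly:

```python
import numpy as np

# Synthetic data for illustration: n = 50 examples, d = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# Normal equation: w* = (X^T X)^{-1} X^T y, computed by solving the system X^T X w = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
```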
#### Logistic vs. Softmax
- logistic function: $\sigma(z) = \frac{1}{1+e^{-z}}$
- range: $(0, 1)$
- softmax function: $s(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{C}e^{z_k}}$
- $\sum_{i=1}^{C}s({z_i}) = 1$
- when there are only two classes (binary classification), softmax regression is equivalent to logistic regression
- $\frac{e^{z_1}}{e^{z_0}+e^{z_1}} = \frac{1}{e^{(z_0-z_1)}+1} = \frac{1}{1+e^{-z^\prime}}\;,$ where $z^\prime = z_1-z_0$
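
A quick numerical check of this two-class equivalence (illustration only; the specific scores are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# With two classes, the softmax probability of class 1 equals sigma(z') with z' = z1 - z0.
z0, z1 = 0.3, 1.7
assert np.isclose(softmax(np.array([z0, z1]))[1], sigmoid(z1 - z0))
```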
#### Logistic Regression
- $P(y=1|x) = \sigma(w^{\top}x)$
- $P(y=-1|x) = 1-P(y=1|x) = 1- \frac{1}{1+e^{-w^{\top} x}} = \sigma(-w^{\top}x)$
- Thus, for labels $y \in \{-1, +1\}$, $P(y|x)$ can be written as $\sigma(y\,w^{\top}x)$
- With MLE, loss $L(w) =\sum_{i=1}^{n} -\log \sigma(y^{(i)}\,w^{\top}x^{(i)})$
- we call $y^{(i)}\,w^{\top}x^{(i)}$ the *margin*; the larger the margin, the lower the loss
- $-\log\sigma(\cdot)$ is convex, and composing with the affine map $w \mapsto y^{(i)}\,w^{\top}x^{(i)}$ and summing preserve convexity $\rightarrow$ the logistic regression loss is convex in $w$
- $\nabla_w L(w) = - \sum_{i=1}^{n} \sigma(-y^{(i)}\,w^{\top}x^{(i)}) \cdot y^{(i)} \cdot x^{(i)}$
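
A minimal NumPy sketch of this loss and gradient, with labels in $\{-1, +1\}$ and a plain gradient-descent loop; the data and hyperparameters are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss_and_grad(w, X, y):
    """L(w) = sum_i -log sigma(y_i w^T x_i) and its gradient, for labels y in {-1, +1}."""
    margins = y * (X @ w)                        # y^(i) w^T x^(i)
    loss = np.sum(np.logaddexp(0.0, -margins))   # -log sigma(m) = log(1 + exp(-m))
    grad = -(X.T @ (sigmoid(-margins) * y))      # matches the gradient formula above
    return loss, grad

# Toy training loop using the gradient-descent update rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
w = np.zeros(2)
for _ in range(200):
    _, grad = logistic_loss_and_grad(w, X, y)
    w -= 0.01 * grad
```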
### Discriminative vs. Generative Models
#### Discriminative
- directly learn $P(y|x)$
- MLE maximizes conditional likelihood $P(y|x)$
- Likelihood of data: $\prod_{i=1}^{n} P(y^{(i)}|x^{(i)}; w)$
- Log-likelihood: $\sum_{i=1}^{n} \log P(y^{(i)}|x^{(i)}; w)$
- e.g., Logistic Regression
#### Generative
- learn $P(x|y)$ (features given the class)
- learn $P(y)$ (class prior)
- MLE maximizes joint likelihood $P(x,y) = P(x|y)P(y)$
- prediction: we plug in the learned $P(x|y)$ and $P(y)$ to get $P(y | x)$
- $P(y | x) = \frac{P(x|y)P(y)}{P(x)}$, where $P(x) = \sum_{k=1}^C P(x|y=k)P(y=k)$
- e.g., for binary classification, $P(y=1 | x) = \frac{P(x|y=1)P(y=1)}{P(x|y=0)P(y=0) + P(x|y=1)P(y=1)}$
- e.g., Naive Bayes
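
A hedged sketch of the prediction step, assuming the class priors $P(y=k)$ and per-class likelihood functions $P(x|y=k)$ have already been estimated (how they are learned, e.g. by Naive Bayes, is not shown here):

```python
import numpy as np

def generative_posterior(x, class_priors, class_likelihoods):
    """
    Bayes rule: P(y=k|x) is proportional to P(x|y=k) P(y=k).
    `class_likelihoods[k]` is a hypothetical callable returning P(x | y=k),
    e.g., a product of per-feature probabilities in Naive Bayes.
    """
    joint = np.array([lik(x) * prior
                      for lik, prior in zip(class_likelihoods, class_priors)])
    return joint / joint.sum()   # divide by P(x) = sum_k P(x|y=k) P(y=k)
```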
### $k$NN
- keep the entire training set and use it at prediction time ("non-parametric")
- define a metric that measures the distance between two datapoints, e.g., Euclidean distance
- choose a hyperparameter $k$
- To predict the label of a test input $x$:
- find its $k$ nearest neighbors in the training set
- predict the label that is most common among the neighbors
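
A minimal NumPy sketch of this prediction rule, using Euclidean distance (illustration only):

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Predict the label most common among the k training points closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    neighbors = np.argsort(dists)[:k]             # indices of the k nearest neighbors
    return Counter(y_train[i] for i in neighbors).most_common(1)[0][0]
```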