
CS467 Cheatsheet

Gradient Descent

  • Gradient: $\nabla_w f(w) = \left[\frac{\partial f}{\partial w_0}, \ldots, \frac{\partial f}{\partial w_d}\right] \in \mathbb{R}^d$, where $w \in \mathbb{R}^d$
  • gradient is the direction of steepest ascent
  • negative gradient is the direction of steepest descent
  • update rule: $w \leftarrow w - \eta\, \nabla_w L(w)$, where $L(w)$ is the loss function (see the sketch below)
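
A minimal sketch of the update rule in Python (the objective, step size, and iteration count below are illustrative choices, not from the course):

```python
import numpy as np

def gradient_descent(grad_L, w0, eta=0.1, n_steps=100):
    """Repeatedly step against the gradient: w <- w - eta * grad L(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_L(w)
    return w

# Example: L(w) = ||w||^2 has gradient 2w, so the minimizer is w = 0.
w_star = gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0])
print(w_star)   # close to [0, 0]
```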

Convexity

  • convex function $f$: the line segment connecting $(x_1, f(x_1))$ and $(x_2, f(x_2))$ must lie on or above the function
  • If $f''(x) \geq 0$ and exists everywhere, then $f$ is convex (see the numerical check below)
  • If $f$ is convex, then $g(x) = f(Ax + b)$ is convex
  • If $f(x)$ and $g(x)$ are convex, so is $f(x) + g(x)$
  • For a convex function, any local minimum is a global minimum
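
A quick numerical check of the second-derivative rule, applied to the logistic loss $-\log\sigma(z) = \log(1 + e^{-z})$ that appears later in this sheet (the finite-difference helper is only an illustration):

```python
import numpy as np

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

# -log(sigma(z)) = log(1 + e^{-z}) has second derivative sigma(z)(1 - sigma(z)) >= 0,
# so the check below should pass at every sampled point.
f = lambda z: np.log1p(np.exp(-z))
assert all(second_derivative(f, z) >= 0 for z in np.linspace(-10, 10, 101))
```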

Maximum Likelihood Estimation (MLE)

  • posit a probabilistic process that generated our data
  • find parameters $w$ that make the observed data seem most likely
    • $\max_w P(D; w)$, where the data $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^n$
    • view $w$ as being constant-valued but unknown
  • Recipe: take the negative log-likelihood of the data as the loss function, then use gradient descent to find the parameters $w$ that minimize the loss (see the sketch below)
    • e.g., for discriminative models, $L(w) = -\sum_{i=1}^n \log P(y^{(i)} \mid x^{(i)}; w)$
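
A toy instance of the recipe (the coin-flip data and Bernoulli model are illustrative assumptions, not from the course): estimate the bias $p$ of a coin by running gradient descent on the negative log-likelihood; the result should match the known closed-form MLE, the sample mean.

```python
import numpy as np

flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # observed data D

def nll_grad(p):
    """Gradient of L(p) = -sum_i [ y_i log p + (1 - y_i) log(1 - p) ]."""
    return -(flips.sum() / p - (1 - flips).sum() / (1 - p))

p = 0.5                       # initial guess
for _ in range(500):          # gradient descent on the negative log-likelihood
    p -= 0.01 * nll_grad(p)

print(p, flips.mean())        # both approach the MLE, here 0.75
```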

Maximum A Posteriori Estimation (MAP)

  • assume a prior $P(w)$, which models our prior belief or preference about $w$
    • view $w$ as a random variable
  • $\max_w P(w \mid D)$: find the most probable parameters $w$ after seeing the data $D$
    • $= \max_w P(D \mid w)\, P(w)$, since by Bayes' rule $P(w \mid D) \propto P(D \mid w)\, P(w)$ and $P(D)$ does not depend on $w$
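
As a worked example (the Gaussian prior below is an assumption for illustration; the cheatsheet does not fix a particular prior), taking $P(w) \propto e^{-\|w\|^2 / (2\sigma^2)}$ and moving to negative logs turns MAP into MLE plus an L2 penalty:

$$
\arg\max_w P(D \mid w)\, P(w)
\;=\; \arg\min_w \Big[\, -\log P(D \mid w) + \tfrac{1}{2\sigma^2}\, \|w\|^2 \,\Big]
$$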

Linear/Logistic/Softmax Regression

  • Linear regression for regression tasks (predict scalars)
    • has a closed-form solution, aka the normal equation (see the sketch after this list):
      • set $\nabla_w L(w)$ to $0$
      • $w = (X^\top X)^{-1} X^\top y$, where $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$
  • Logistic regression for binary classification tasks
  • Softmax regression for multiclass classification tasks
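
A minimal sketch of the normal equation on synthetic data (the random data and shapes are illustrative; solving the linear system is preferred over forming an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # X in R^{n x d}
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)   # y in R^n

# Normal equation: w = (X^T X)^{-1} X^T y, computed by solving (X^T X) w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # close to true_w
```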

Logistic vs. Softmax

  • logistic function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
    • range: $(0, 1)$
  • softmax function: $s(z_i) = \frac{e^{z_i}}{\sum_{k=1}^C e^{z_k}}$
    • $\sum_{i=1}^C s(z_i) = 1$
  • when there are only two classes (binary classification), softmax regression is equivalent to logistic regression (see the check after this list)
    • $\frac{e^{z_1}}{e^{z_0} + e^{z_1}} = \frac{1}{e^{z_0 - z_1} + 1} = \frac{1}{1 + e^{-z}}$, where $z = z_1 - z_0$
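
A small numerical check of the two-class equivalence (the scores $z_0, z_1$ are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))     # subtract max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z0, z1 = 0.3, 1.7
# softmax probability of class 1 equals sigmoid of z = z1 - z0
print(softmax(np.array([z0, z1]))[1], sigmoid(z1 - z0))   # both ~0.802
```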

Logistic Regression

  • $P(y = 1 \mid x) = \sigma(w^\top x)$
  • $P(y = -1 \mid x) = 1 - P(y = 1 \mid x) = 1 - \frac{1}{1 + e^{-w^\top x}} = \sigma(-w^\top x)$
  • Thus, for labels $y \in \{-1, +1\}$, $P(y \mid x)$ can be written as $\sigma(y\, w^\top x)$
  • With MLE, the loss is $L(w) = -\sum_{i=1}^n \log \sigma(y^{(i)} w^\top x^{(i)})$ (see the sketch after this list)
    • we call $y^{(i)} w^\top x^{(i)}$ the margin; the larger the margin, the lower the loss
    • $-\log \sigma(\cdot)$ is convex $\Rightarrow$ the logistic regression loss function is convex
    • $\nabla_w L(w) = -\sum_{i=1}^n \sigma(-y^{(i)} w^\top x^{(i)})\, y^{(i)} x^{(i)}$
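
A minimal sketch of the loss and its gradient for labels $y \in \{-1, +1\}$ (the toy data is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss_and_grad(w, X, y):
    """L(w) = -sum_i log sigma(y_i w^T x_i); grad = -sum_i sigma(-y_i w^T x_i) y_i x_i."""
    margins = y * (X @ w)                      # margins y^(i) w^T x^(i)
    loss = -np.sum(np.log(sigmoid(margins)))
    grad = -(X.T @ (sigmoid(-margins) * y))
    return loss, grad

# One gradient-descent step on toy data.
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.zeros(2)
loss, grad = logistic_loss_and_grad(w, X, y)
w -= 0.1 * grad
```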

Discriminative vs. Generative Models

Discriminative

  • directly learn $P(y \mid x)$
  • MLE maximizes the conditional likelihood $P(y \mid x)$
    • Likelihood of data: $\prod_{i=1}^n P(y^{(i)} \mid x^{(i)}; w)$
    • Log-likelihood: $\sum_{i=1}^n \log P(y^{(i)} \mid x^{(i)}; w)$
  • e.g., Logistic Regression

Generative

  • learn $P(x \mid y)$ (features given the class)
  • learn $P(y)$ (class prior)
  • MLE maximizes the joint likelihood $P(x, y) = P(x \mid y)\, P(y)$
  • prediction: plug the learned $P(x \mid y)$ and $P(y)$ into Bayes' rule to get $P(y \mid x)$ (see the sketch after this list)
    • $P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$, where $P(x) = \sum_{k=1}^C P(x \mid y = k)\, P(y = k)$
    • e.g., for binary classification, $P(y = 1 \mid x) = \frac{P(x \mid y = 1)\, P(y = 1)}{P(x \mid y = 0)\, P(y = 0) + P(x \mid y = 1)\, P(y = 1)}$
  • e.g., Naive Bayes
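
A minimal sketch of generative prediction via Bayes' rule (the 1-D Gaussian class-conditionals and the prior values are illustrative assumptions, not something the cheatsheet specifies):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

prior = {0: 0.6, 1: 0.4}                  # learned class prior P(y)
cond = {0: (0.0, 1.0), 1: (2.0, 1.0)}     # learned (mu, sigma) of P(x|y)

def posterior_y1(x):
    """P(y=1|x) = P(x|y=1)P(y=1) / sum_k P(x|y=k)P(y=k)."""
    joint = {k: gaussian_pdf(x, *cond[k]) * prior[k] for k in (0, 1)}
    return joint[1] / (joint[0] + joint[1])

print(posterior_y1(1.5))   # probability that x = 1.5 belongs to class 1
```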

kNN

  • keep the training set around when making predictions ("non-parametric")
  • define a metric that measures the distance between two datapoints, e.g., Euclidean distance
  • choose a hyperparameter $k$
  • To predict the label of a test input $x$ (see the sketch after this list):
    • find its $k$ nearest neighbors in the training set
    • predict the label that is most common among the neighbors
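
A minimal sketch of the prediction step (the training points and test point are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Majority label among the k training points closest to x (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance from x to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.2, 2.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2.8, 3.1])))   # -> 1
```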