# Regression
## k-nearest-neighbors (knn)
- Given a small dataset shown in a graph and a query point, you will be asked what is predicted for the query point for different values of k.
- Algorithm: take the query point q and find the k nearest points (Euclidean distance). The prediction at q is the average of the values of those k points.
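A minimal sketch of this rule in Python (the dataset and query point below are made up for illustration):

```python
import numpy as np

def knn_regress(X, y, q, k):
    """Predict the value at query point q as the mean of the k nearest training targets."""
    dists = np.linalg.norm(X - q, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]         # indices of the k closest points
    return y[nearest].mean()

# Made-up 1-D example: points on a line with target values.
X = np.array([[1.0], [2.0], [4.0], [7.0]])
y = np.array([3.0, 5.0, 4.0, 10.0])
print(knn_regress(X, y, np.array([3.0]), k=2))  # average of the targets at x=2 and x=4 -> 4.5
```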
## Regression Trees
- Given a small dataset, produce a regression stump that minimizes the MSE
- For each candidate split, divide the dataset into two sides at the threshold. The leaf values are the averages of the targets on each side. Compute the MSE of that stump.
- Over all candidate splits, calculate the MSE and take the split with the smallest one.
- MSE corresponds to the L2 (squared-error) loss.
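A rough sketch of the stump search described above, assuming a 1-D feature and made-up targets; candidate thresholds are placed between consecutive x values:

```python
import numpy as np

def best_stump(x, y):
    """Try a split between every pair of consecutive x values; each side predicts its mean.
    Return the threshold with the smallest total squared error (equivalent to minimizing MSE)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(x)):
        t = (x[i - 1] + x[i]) / 2            # candidate threshold between neighbors
        left, right = y[:i], y[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[1]:
            best = (t, sse)
    return best

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.2, 3.0, 3.1])
print(best_stump(x, y))   # threshold 2.5 separates the low targets from the high ones
```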
## Linear Regression
- Given the following table:
| Weight Vector | MSE | MAE | L2-norm | L1-norm |
|---------------|-----|-----|---------|---------|
| w1 | | | | |
| w2 | | | | |
| w3 | | | | |
Imagine minimizing __ loss with __ regularizer, but $\lambda$ is unknown; then what values of w are a solution? The answer must be given as a set: "which w are a solution for some value of $\lambda$?"
- Watch for the trick of swapping the L2 norm and the squared L2 norm (ridge penalizes the squared L2 norm).
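A sketch of how each column of the table could be filled in, using hypothetical data and weight vectors (none of these numbers come from the exam):

```python
import numpy as np

# Hypothetical data and candidate weight vectors, just to show how each column is computed.
X = np.array([[1.0, 2.0], [2.0, 0.0], [0.0, 1.0]])
y = np.array([3.0, 2.0, 1.0])
candidates = {"w1": np.array([1.0, 1.0]),
              "w2": np.array([0.5, 1.5]),
              "w3": np.array([2.0, 0.0])}

for name, w in candidates.items():
    r = y - X @ w                       # residuals
    mse = np.mean(r ** 2)
    mae = np.mean(np.abs(r))
    l2 = np.linalg.norm(w)              # L2 norm; ridge actually penalizes its square
    l1 = np.sum(np.abs(w))              # L1 norm (lasso penalty)
    print(f"{name}: MSE={mse:.3f} MAE={mae:.3f} L2={l2:.3f} L2^2={l2**2:.3f} L1={l1:.3f}")
```

One useful elimination step for the "which w for some $\lambda$" question: a candidate that is strictly worse than another on the loss column and no better on the regularizer column can never minimize loss $+\ \lambda\cdot$regularizer, for any $\lambda \ge 0$.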
## Cross Validation
- Given a small dataset, what is the cross-validation error using n-fold CV and k-nearest-neighbors?
- Leave-one-out cross-validation: for each point, find its predicted value as if that point were not in the dataset. Compute the MAE/MSE per point and average. For n-fold CV, hold out one fold of points at a time instead of a single point.
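A sketch of leave-one-out CV with k-NN regression on made-up data (switch the squared error to an absolute error for MAE):

```python
import numpy as np

def loo_cv_error(X, y, k):
    """Leave-one-out CV with k-NN regression: predict each point from the rest, then average the error."""
    errs = []
    for i in range(len(X)):
        mask = np.ones(len(X), dtype=bool)
        mask[i] = False                                  # hold point i out
        dists = np.linalg.norm(X[mask] - X[i], axis=1)
        nearest = np.argsort(dists)[:k]
        pred = y[mask][nearest].mean()
        errs.append((pred - y[i]) ** 2)                  # squared error; use abs() for MAE
    return np.mean(errs)

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 2.0, 2.0, 3.0])
print(loo_cv_error(X, y, k=1))
```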
# Classification
## k-nearest-neighbors
- Instead of taking the average of the k nearest points, the prediction for the query point is the majority class among the k nearest points.
## Classification Trees
- Choose the split that minimizes the Gini index.
- Gini index of a leaf: $$1 - \sum_{i=1}^{n}p_{i}^{2}$$ where n is the number of classes and $p_{i}$ is the proportion of class i in that leaf.
- Compute a weighted sum of the Gini indexes of the leaf nodes, with weights (examples in leaf)/(total examples in dataset).
- The smaller the weighted sum, the better the split.
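A sketch of choosing a stump split by weighted Gini, with made-up labels:

```python
import numpy as np

def gini(labels):
    """Gini index of one leaf: 1 - sum_i p_i^2 over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(x, y, threshold):
    """Weighted Gini of a stump split: each leaf's Gini weighted by its share of the examples."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array(["A", "A", "B", "B"])
for t in [1.5, 2.5, 3.5]:
    print(t, weighted_gini(x, y, t))   # t=2.5 gives 0.0, the pure (best) split
```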
## Linear Regression
- Same type of question as in the regression section, with the table.
- The age-old logistic regression question: is logistic regression used for classification? (Yes, despite the name.) **100% on exam**
- NNs: Given an expression graph, evaluate it and compute the gradient (see the sketch after this list). There is a template for this problem on Piazza. You can leave the answer unevaluated and don't need to distribute. **100% on exam**
- L1 norm: lasso regression; L2 norm: ridge regression.
- The objective is basically the loss plus a penalty weighted by $\lambda$; the difference between L1 and L2 is whether you square the weights or take their absolute values.
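For the expression-graph question, here is a sketch of the forward/backward pattern on a made-up expression $f = (wx + b)^2$ (not the exam's actual graph); each backward line is one application of the chain rule:

```python
# Made-up expression graph: f(w, x, b) = (w * x + b) ** 2
w, x, b = 2.0, 3.0, -1.0

# Forward pass: evaluate each node, storing intermediate values.
u = w * x          # u = 6
v = u + b          # v = 5
f = v ** 2         # f = 25

# Backward pass: walk the graph in reverse, multiplying local derivatives.
df_dv = 2 * v      # d(v^2)/dv = 2v = 10
df_du = df_dv * 1  # addition passes the gradient through unchanged
df_db = df_dv * 1  # 10
df_dw = df_du * x  # d(wx)/dw = x, so 10 * 3 = 30
df_dx = df_du * w  # d(wx)/dx = w, so 10 * 2 = 20

print(f, df_dw, df_dx, df_db)   # 25.0 30.0 20.0 10.0
```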
# Kernels
## Regular Ridge Regression
Learning objective of regular ridge regression:
$$R(w) = ||y-Xw||^2 + \lambda||w||^2$$
Now, we calculate the gradient $\nabla R(w)$:
$$\nabla R(w) = -2X^T(y-Xw)+2\lambda w$$
Set this gradient to 0 and solve for $w$
$$w = (X^TX+\lambda I)^{-1}X^Ty$$
Note that $X^TX \in \mathbb{R}^{D\times D}$ and $X^Ty \in \mathbb{R}^{D}$. Thus the result is of dimension $D$, which is what we want.
Now define $C$ as
$$C = \sum_{n=1}^{N}x^{(n)}x^{(n)^T}$$
We can substitute this back into $w$ (since $C = X^TX$) to get the following,
$$w = (C+\lambda I)^{-1}X^Ty$$
Prediction will simply be:
$$\hat{y}=w^Tx$$
- Forming $X^TX$: $D^2N$
- Solving the linear system: $D^3$
- Total complexity: Learning: $D^2N+D^3$, Prediction: $D$
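A sketch of the closed-form solution above on made-up data (the helper name `ridge_primal` is just for illustration):

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Regular (primal) ridge: w = (X^T X + lambda I)^{-1} X^T y.
    Cost ~ D^2 N to form X^T X, plus D^3 to solve the linear system."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Made-up data: N=5 examples, D=2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.normal(size=5)
w = ridge_primal(X, y, lam=0.1)
print(w, X @ w)     # learned weights and predictions X w
```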
## Dual Ridge Regression
Consider the following theorem:
$$(PQ + I)^{-1}P = P(QP+I)^{-1}$$
Now, by setting $P=(1/\lambda)X^T, Q=X$ we get the following:
$$w = X^T(XX^T + \lambda I)^{-1}y$$
Observe that $X^TX$ is a $D\times D$ matrix while $XX^T$ is an $N\times N$ matrix.
Now let
$$\alpha = (XX^T + \lambda I)^{-1}y$$
To predict we simply do the following:
$$\hat{y}=w^Tx = \alpha^TXx = \sum_{n=1}^{N}\alpha_{n}\, x^{(n)}\cdot x$$
Now let $K$ be the following:
$$K_{nm} = x^{(n)} \cdot x^{(m)}$$
We can substitute this back into $\alpha$ to get the following:
$$\alpha = (K+\lambda I)^{-1}y$$
Prediction will then be the following:
$$\hat{y} = \sum_{n=1}^{N}\alpha_{n}x^{(n)}\cdot x$$
- Forming $XX^T$: $DN^2$
- Solving the linear system: $N^3$
- Total complexity: Learning: $DN^2+N^3$, Prediction: $ND$
Here is a helpful table outlining the time complexity of ridge and dual ridge regression:
| | Regular Ridge | Dual Ridge |
|------------|---------------|------------|
| Learning | $D^2N + D^3$ | $DN^2+N^3$ |
| Prediction | D | ND |
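A sketch checking that the dual form recovers the same weights as the primal form on made-up data:

```python
import numpy as np

def ridge_dual(X, y, lam):
    """Dual ridge: alpha = (X X^T + lambda I)^{-1} y, then w = X^T alpha.
    Cost ~ D N^2 to form X X^T, plus N^3 to solve, so it wins when D >> N."""
    N = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), y)
    return X.T @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.normal(size=5)
w_primal = np.linalg.solve(X.T @ X + 0.1 * np.eye(2), X.T @ y)
w_dual = ridge_dual(X, y, lam=0.1)
print(np.allclose(w_primal, w_dual))   # True: both forms give the same weights
```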
## Kernel Ridge Regression
The important idea is that $h(x)\cdot h(x')$, where $h$ is a basis expansion of $x$, can often be computed much faster than computing $h(x)$ itself. We do this by using kernels: define the kernel function
$$k(x,x') = h(x)\cdot h(x')$$
We can substitute $k$ into our dual ridge regression formulas as follows:
Learning:
$$K_{nm} = k(x^{(n)}, x^{(m)})$$
$$\alpha = (K +\lambda I)^{-1}y$$
Prediction:
$$\hat{y} = \sum_{n=1}^{N}\alpha_{n}k(x^{(n)},x)$$
- Mercer's Theorem:
If the kernel function is positive semi-definite (no negative eigenvalues), then it corresponds to a dot product in some feature space, so you can use it as a kernel.
- Representer Theorem:
If the loss function is differentiable and $\lambda$ is not zero, the optimal $w$ can be written as a weighted sum of the training points, $w = \sum_{n}\alpha_{n}x^{(n)}$, so optimizing over the coefficients $\alpha$ reduces the search space.
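A sketch of kernel ridge regression using an RBF kernel as an example choice of $k$ (the data is made up):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """k(x, x') = exp(-gamma ||x - x'||^2): a dot product in a feature space
    that is never built explicitly."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_ridge_fit(X, y, lam, kernel):
    N = len(X)
    K = np.array([[kernel(X[n], X[m]) for m in range(N)] for n in range(N)])  # K_nm = k(x^(n), x^(m))
    return np.linalg.solve(K + lam * np.eye(N), y)                            # alpha = (K + lambda I)^{-1} y

def kernel_ridge_predict(X, alpha, x, kernel):
    return sum(alpha[n] * kernel(X[n], x) for n in range(len(X)))             # sum_n alpha_n k(x^(n), x)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = rng.normal(size=6)
alpha = kernel_ridge_fit(X, y, lam=0.1, kernel=rbf_kernel)
print(kernel_ridge_predict(X, alpha, np.zeros(2), kernel=rbf_kernel))
```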
# Bayes
## Basics
Prior (given): $P(M)$
Likelihood (given): $P(B|M)$
Posterior (calculated): $P(M|B) = \frac{P(M)P(B|M)}{P(B)}$, where $P(B) = \sum_i P(M_i)P(B|M_i)$
Prediction: $P(B'|B) = P(M=a|B)P(B'|M=a) + P(M=b|B)P(B'|M=b)$
1. state a prior $P(M=a) = 1/4, P(M=b) = 3/4$
2. state a likelihood of seeing the data if a model were true. $P(B=g|M=a) = 0.8, P(B=g|M=b) = 0.2$
3. calculate a posterior over models given the data. $P(M=a|B=g) = 4/7$
4. use posterior to make predictions on future data. $P(B'=g|B=g)$
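A quick numeric check of steps 1-4 (same numbers as above):

```python
# Numeric check of the four steps above.
prior = {"a": 1/4, "b": 3/4}
lik_g = {"a": 0.8, "b": 0.2}           # P(B=g | M)

# Step 3: posterior P(M | B=g) via Bayes' rule.
evidence = sum(prior[m] * lik_g[m] for m in prior)            # P(B=g) = sum_m P(M=m) P(B=g|M=m)
posterior = {m: prior[m] * lik_g[m] / evidence for m in prior}
print(posterior["a"])                  # 4/7 ≈ 0.571

# Step 4: predictive P(B'=g | B=g) = sum_m P(M=m | B=g) P(B'=g | M=m).
print(sum(posterior[m] * lik_g[m] for m in prior))
```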
## ERM vs Bayes
- With enough data, they make very similar predictions.
- If model error is high, Bayes can perform poorly.
- Bayes can take advantage of domain knowledge (via the prior).
- Bayes works well for big models.
## Getting the posterior based off of multiple past data
### Step 1: Get posteriors
$$
\begin{align}
P(M=a|B=ggy) &= \frac{1}{P(B=ggy)} P(M=a)[P(B=ggy|M=a)] \\
&= \frac{1}{P(B=ggy)} P(M=a)[P(B=g|M=a)P(B=g|M=a)P(B=y|M=a)]
\end{align}
$$
$$
\begin{align}
P(M=b|B=ggy) &= \frac{1}{P(B=ggy)} P(M=b)[P(B=ggy|M=b)] \\
&= \frac{1}{P(B=ggy)} P(M=b)[P(B=g|M=b)P(B=g|M=b)P(B=y|M=b)]
\end{align}
$$
### Step 2: Predict with Posterior and likelihood.
$$P(B=ggy) = P(M=a)P(B=ggy|M=a) + P(M=b)P(B=ggy|M=b)$$
then get predictions by (if B'=x): $$P(B'=x|B=ggy) = P(M=a|B=ggy)P(B'=x|M=a) + P(M=b|B=ggy)P(B'=x|M=b)$$
Note: posteriors are built from priors and likelihoods of the old data.
So the prediction is just a sum of (posteriors given old data) times (likelihoods of new data).
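A sketch of the multi-observation case; note that the likelihoods for $B=y$ below are assumed to be $1 - P(B=g|M)$, which is not stated in the notes:

```python
import math

prior = {"a": 1/4, "b": 3/4}
# Assumed per-draw likelihoods: P(B=y|M) taken as 1 - P(B=g|M).
lik = {"a": {"g": 0.8, "y": 0.2}, "b": {"g": 0.2, "y": 0.8}}

def posterior_after(data):
    """P(M | B=data): multiply the per-observation likelihoods, weight by the prior, normalize."""
    unnorm = {m: prior[m] * math.prod(lik[m][d] for d in data) for m in prior}
    z = sum(unnorm.values())                                    # P(B=data)
    return {m: unnorm[m] / z for m in unnorm}

post = posterior_after("ggy")
pred_g = sum(post[m] * lik[m]["g"] for m in post)               # P(B'=g | B=ggy)
print(post, pred_g)
```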
## Getting posterior based off multiple things in the likelihood.
### Step 1: Get the posteriors
Using this rule,
$$
P(A|B,C) = \frac{1}{P(B|C)}P(A|C)P(B|A,C)
$$
We can get the posteriors:
$$
\begin{align}
P(M=a|B=ggy, W=rrc) &= \frac{1}{P(B=ggy|W=rrc)}P(M=a|W=rrc)[P(B=ggy|M=a,W=rrc)] \\
&= \frac{1}{P(B=ggy|W=rrc)}P(M=a)[P(B=g|M=a,W=r)P(B=g|M=a,W=r)P(B=y|M=a,W=c)]
\end{align}
$$
$$
\begin{align}
P(M=b|B=ggy, W=rrc) &= \frac{1}{P(B=ggy|W=rrc)}P(M=b|W=rrc)[P(B=ggy|M=b,W=rrc)] \\
&= \frac{1}{P(B=ggy|W=rrc)}P(M=b)[P(B=g|M=b,W=r)P(B=g|M=b,W=r)P(B=y|M=b,W=c)]
\end{align}
$$
### Step 2: Predict with posterior and likelihood
$$
\begin{align}
P(B'=x|W'=r, B=ggy, W=rrc) = P(M=a|B=ggy,W=rrc)P(B'=x|W'=r,M=a) + P(M=b|B=ggy,W=rrc)P(B'=x|W'=r,M=b)
\end{align}
$$
# Unsupervised Learning
## K-means:
Step 1: select K (number of clusters)
Step 2: Initialize K random cluster points
Step 3: Assign each point to its closest cluster point
Step 4: Calculate the mean value of each cluster.
Step 5: Using the mean values as the new cluster centers, repeat steps 3-4 until the means no longer change.
$$C(x_i) = \text{argmin}_k||x_i-m_k||^2$$
$$m_k = \frac{1}{N_k} \sum_{i:C(i)=k}x_i$$
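A sketch of the five steps (Lloyd's algorithm) on made-up 2-D points:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: assign each point to its closest center, recompute the means, repeat.
    Assumes no cluster goes empty, which holds for this tiny example."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step 2: random initial centers
    for _ in range(iters):
        # step 3: C(x_i) = argmin_k ||x_i - m_k||^2
        assign = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # step 4: m_k = mean of the points assigned to cluster k
        new_centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                   # step 5: stop when the means settle
            break
        centers = new_centers
    return centers, assign

X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
print(kmeans(X, k=2))
```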
# Question types