# Regression

## k-nearest-neighbors (knn)
- Given a small dataset in a graph and a query point, you will be asked what is predicted for the query point for different k values.
- Algorithm: take the query point q and find the k nearest points (Euclidean distance). The predicted value for q is the average of the values of those k points.

## Regression Trees
- Given a small dataset, produce a regression stump that minimizes the MSE.
- Each candidate split divides the dataset into two sides. Take the average target value of each side; these are the values of the leaves. Compute the MSE of the split.
- For all candidate splits, calculate the MSE and take the split with the smallest one.
- The loss used is the L2 norm (squared error).

## Linear Regression
- Given the following table, imagine minimizing __ loss with __ regularizer, but $\lambda$ is unknown: which values of w are a solution? Give the answer as a set, i.e. "which w are a solution for some value of $\lambda$?"

| Weight Vector | MSE | MAE | L2-norm | L1-norm |
|---------------|-----|-----|---------|---------|
| w1            |     |     |         |         |
| w2            |     |     |         |         |
| w3            |     |     |         |         |

- Watch for the trick between the L2-norm and the squared L2-norm.

## Cross Validation
- Given a small dataset, what is the cross-validation error using n-fold cross validation and k-nearest neighbors?
- Leave-one-out cross validation: for each point, find the predicted value of that point as if it were not in the dataset. Compute the MAE/MSE of each prediction and average. For n-fold cross validation, split the data into n folds and leave one fold out at a time instead of a single point.

# Classification

## k-nearest-neighbors
- Instead of taking the average of the k nearest points, the prediction for the query point is the majority class among the k nearest points.

## Classification Trees
- Choose the split that minimizes the Gini index.
- Gini index of a leaf: $$1 - \sum_{i=1}^{n}p_{i}^{2}$$ where n is the number of classes and $p_{i}$ is the proportion of class i in that leaf.
- Compute a weighted sum of the Gini indexes of the leaf nodes, weighted by (examples in leaf)/(total examples in dataset).
- The smaller the weighted sum, the better the split.

## Linear Regression
- Same type of table question as in regression.
- The age-old logistic regression question: is logistic regression used for classification? (Yes, despite its name.) **100% on exam**
- NNs: given an expression graph, evaluate it and compute the gradient. There is a template for this problem on Piazza. You can leave the answer unevaluated and don't have to distribute. **100% on exam**
- L1-norm regularizer: Lasso regression; L2-norm regularizer: Ridge regression.
- The objective is basically a weighted sum of loss and regularizer with $\lambda$ as the single weight. The difference between L1 and L2 is whether the weights are squared (L2) or not (L1).

# Kernels

## Regular Ridge Regression
Learning objective of regular ridge regression:
$$R(w) = ||y-Xw||^2 + \lambda||w||^2$$
Now, we calculate the gradient $\nabla R(w)$:
$$\nabla R(w) = -2X^T(y-Xw)+2\lambda w$$
Set this gradient to 0 and solve for $w$:
$$w = (X^TX+\lambda I)^{-1}X^Ty$$
Note that $X^TX \in \mathbb{R}^{D\times D}$ and $X^Ty \in \mathbb{R}^{D}$, so the result has dimension D. This is what we want. Now define $C$ as
$$C = X^TX = \sum_{n=1}^{N}x^{(n)}{x^{(n)}}^{T}$$
We can substitute this back into $w$ to get
$$w = (C+\lambda I)^{-1}X^Ty$$
Prediction is simply:
$$\hat{y}=w^Tx$$
- Time to form $X^TX$: $D^2N$
- Solving the linear system: $D^3$
- Total complexity: learning $D^2N+D^3$, prediction $D$

## Dual Ridge Regression
Consider the following identity:
$$(PQ + I)^{-1}P = P(QP+I)^{-1}$$
Setting $P=(1/\lambda)X^T$ and $Q=X$ gives
$$w = X^T(XX^T + \lambda I)^{-1}y$$
Observe that $X^TX$ is a $D\times D$ matrix while $XX^T$ is an $N\times N$ matrix.
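As a quick sanity check on this identity, here is a minimal NumPy sketch (the toy data, shapes, and `lam` are made up for illustration) that computes $w$ the primal way, $(X^TX+\lambda I)^{-1}X^Ty$, and the dual way, $X^T(XX^T+\lambda I)^{-1}y$, and confirms the two agree.

```python
import numpy as np

# Toy data: N examples, D features (shapes chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
N, D = 8, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
lam = 0.5  # regularization strength lambda

# Primal ridge solution: w = (X^T X + lambda I)^{-1} X^T y   (solves a D x D system)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Dual ridge solution: w = X^T (X X^T + lambda I)^{-1} y     (solves an N x N system)
alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), y)
w_dual = X.T @ alpha

# The two agree up to floating-point error, as the push-through identity predicts.
assert np.allclose(w_primal, w_dual)
print(w_primal)
```

The practical point is which linear system you solve: the $D\times D$ system is cheaper when $D \ll N$, while the $N\times N$ system is cheaper when $N \ll D$ (e.g. after a large basis expansion).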
Now let
$$\alpha = (XX^T + \lambda I)^{-1}y$$
To predict we simply do the following:
$$\hat{y}=w^Tx = \alpha^TXx = \sum_{n=1}^{N}\alpha_{n}\,x^{(n)}\cdot x$$
Now let $K$ be the $N\times N$ matrix of pairwise dot products:
$$K_{nm} = x^{(n)} \cdot x^{(m)}$$
We can substitute this back into $\alpha$ to get
$$\alpha = (K+\lambda I)^{-1}y$$
Prediction is then
$$\hat{y} = \sum_{n=1}^{N}\alpha_{n}\,x^{(n)}\cdot x$$
- Time to form $XX^T$: $DN^2$
- Solving the linear system: $N^3$
- Total complexity: learning $DN^2+N^3$, prediction $ND$

Here is a helpful table outlining the time complexity of ridge and dual ridge regression:

|            | Regular Ridge | Dual Ridge  |
|------------|---------------|-------------|
| Learning   | $D^2N + D^3$  | $DN^2+N^3$  |
| Prediction | $D$           | $ND$        |

## Kernel Ridge Regression
The important idea is that $h(x)\cdot h(x')$, where $h$ is a basis expansion of $x$, can often be computed much faster than computing $h(x)$ itself. We do this by using kernels. Define
$$k(x,x') = h(x)\cdot h(x')$$
We can substitute $k$ into our dual ridge regression formulas as follows.
Learning:
$$K_{nm} = k(x^{(n)}, x^{(m)})$$
$$\alpha = (K +\lambda I)^{-1}y$$
Prediction:
$$\hat{y} = \sum_{n=1}^{N}\alpha_{n}k(x^{(n)},x)$$
- Mercer's Theorem: if the kernel function is positive semi-definite (its eigenvalues cannot be negative), then it is a valid kernel and you can use it.
- Representer Theorem: the optimal $w$ can be written as a weighted combination of the training examples, so you can search over $\alpha$ instead of $w$ without losing solutions; this holds when the loss function is differentiable and $\lambda$ is not zero, and working with $\alpha$ reduces the search space.

# Bayes

## Basics
Prior (given): $P(M)$
Likelihood (given): $P(B|M)$
Posterior (calculated): $P(M|B) = \frac{P(M)P(B|M)}{P(B)}$, where $P(B) = \sum_i P(M_i)P(B|M_i)$
Prediction: $P(B'|B) = P(M=a|B)P(B'|M=a) + P(M=b|B)P(B'|M=b)$

1. State a prior: $P(M=a) = 1/4, P(M=b) = 3/4$
2. State a likelihood of seeing the data if a model were true: $P(B=g|M=a) = 0.8, P(B=g|M=b) = 0.2$
3. Calculate a posterior over models given the data: $P(M=a|B=g) = 4/7$
4. Use the posterior to make predictions on future data: $P(B'=g|B=g)$

## ERM vs Bayes
- With enough data, they make very similar predictions.
- If model error is high, Bayes can perform poorly.
- Bayes can take advantage of domain knowledge (through the prior).
- Bayes works well for big models.

## Getting the posterior based on multiple past data

### Step 1: Get posteriors
$$
\begin{align}
P(M=a|B=ggy) &= \frac{1}{P(B=ggy)} P(M=a)[P(B=ggy|M=a)] \\
&= \frac{1}{P(B=ggy)} P(M=a)[P(B=g|M=a)P(B=g|M=a)P(B=y|M=a)]
\end{align}
$$
$$
\begin{align}
P(M=b|B=ggy) &= \frac{1}{P(B=ggy)} P(M=b)[P(B=ggy|M=b)] \\
&= \frac{1}{P(B=ggy)} P(M=b)[P(B=g|M=b)P(B=g|M=b)P(B=y|M=b)]
\end{align}
$$

### Step 2: Predict with posterior and likelihood
The normalizer is
$$P(B=ggy) = P(M=a)P(B=ggy|M=a) + P(M=b)P(B=ggy|M=b)$$
then get predictions by (if $B'=x$):
$$P(B'=x|B=ggy) = P(M=a|B=ggy)P(B'=x|M=a) + P(M=b|B=ggy)P(B'=x|M=b)$$
Note: posteriors are built from the priors and likelihoods of the old data, so the prediction is just a sum of (posterior from old data) $\times$ (likelihood of new data).
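As a small numerical sketch of this recipe, the snippet below plugs in the prior and likelihoods from the Basics example ($P(M=a)=1/4$, $P(M=b)=3/4$, $P(B=g|M=a)=0.8$, $P(B=g|M=b)=0.2$) and the observed data $B=ggy$. Treating $P(B=y|M)=1-P(B=g|M)$ is an assumption made here for a two-outcome setup, and the variable names are illustrative.

```python
# Prior over models and per-observation likelihoods (values from the Basics example;
# P(B=y|M) is assumed to be 1 - P(B=g|M) for this two-outcome sketch).
prior = {"a": 1 / 4, "b": 3 / 4}
lik = {"a": {"g": 0.8, "y": 0.2}, "b": {"g": 0.2, "y": 0.8}}

data = ["g", "g", "y"]  # observed B = ggy

# Unnormalized posterior: prior times the product of likelihoods of the old data.
unnorm = {}
for m in prior:
    p = prior[m]
    for b in data:
        p *= lik[m][b]
    unnorm[m] = p

# P(B=ggy) is the normalizer: sum over models of prior * likelihood.
evidence = sum(unnorm.values())
posterior = {m: unnorm[m] / evidence for m in unnorm}

# Predictive distribution for a new observation B':
# P(B'=x | B=ggy) = sum_m P(M=m | B=ggy) * P(B'=x | M=m)
predictive = {x: sum(posterior[m] * lik[m][x] for m in posterior) for x in ["g", "y"]}

print(posterior)   # posterior probability of each model given ggy
print(predictive)  # probability the next observation is g or y
```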
## Getting the posterior based on multiple things in the likelihood

### Step 1: Get the posteriors
Using this rule,
$$ P(A|B,C) = \frac{1}{P(B|C)}P(A|C)P(B|A,C) $$
we can get the posteriors:
$$
\begin{align}
P(M=a|B=ggy, W=rrc) &= \frac{1}{P(B=ggy|W=rrc)}P(M=a|W=rrc)[P(B=ggy|M=a,W=rrc)] \\
&= \frac{1}{P(B=ggy|W=rrc)}P(M=a)[P(B=g|M=a,W=r)P(B=g|M=a,W=r)P(B=y|M=a,W=c)]
\end{align}
$$
$$
\begin{align}
P(M=b|B=ggy, W=rrc) &= \frac{1}{P(B=ggy|W=rrc)}P(M=b|W=rrc)[P(B=ggy|M=b,W=rrc)] \\
&= \frac{1}{P(B=ggy|W=rrc)}P(M=b)[P(B=g|M=b,W=r)P(B=g|M=b,W=r)P(B=y|M=b,W=c)]
\end{align}
$$

### Step 2: Predict with posterior and likelihood
$$
\begin{align}
P(B'=x|W'=r, B=ggy, W=rrc) = P(M=a|B=ggy,W=rrc)P(B'=x|W'=r,M=a) + P(M=b|B=ggy,W=rrc)P(B'=x|W'=r,M=b)
\end{align}
$$

# Unsupervised Learning

## K-means
Step 1: Select K (the number of clusters).
Step 2: Initialize K random cluster centers.
Step 3: Assign each point to its closest cluster center.
Step 4: Calculate the mean of each cluster.
Step 5: Using the means as the new cluster centers, repeat steps 3-4 until the means do not change.
$$C(x_i) = \text{argmin}_k||x_i-m_k||^2$$
$$m_k = \frac{1}{N_k} \sum_{i:C(i)=k}x_i$$
where $N_k$ is the number of points assigned to cluster k. (A short code sketch appears at the very end of these notes.)

# Question types
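Referring back to the K-means steps in the Unsupervised Learning section, here is a minimal NumPy sketch of the same loop; the toy data, `K=2`, and the random-point initialization are illustrative choices, and empty clusters are not handled.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Plain K-means: assign points to the nearest center, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize K cluster centers by picking K random data points.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its closest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of the points assigned to it
        # (this sketch does not handle clusters that end up empty).
        new_centers = np.array([X[assign == k].mean(axis=0) for k in range(K)])
        # Step 5: stop once the means no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign

# Toy usage: two obvious clusters in 2D.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centers, assign = kmeans(X, K=2)
print(centers, assign)
```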