# Week 6

## Monday:
- Regularization is used to control the bias & variance trade-off:
  - High bias: a simple model that makes strong assumptions about the data; it tends to underfit, giving high training and test error
  - High variance: a complex model that fits every point in the training data; it tends to overfit, giving low training error but high test error
  - Regularization techniques constrain the model's parameters to avoid overfitting: L1, L2, or elastic net
- Details on regularization techniques (see the scikit-learn sketch after Tuesday's notes):

  **L2 Regularization** (Ridge Regression in linear regression):

  $$ L_2 = ||w||_2^2 = \sum_{j=1}^n w_j^2 $$

  Ridge Regression:

  $$ \min_w \frac{1}{m} \sum_{i=1}^{m}{(\hat{y}^{(i)} - y^{(i)})^2} + \alpha ||w||_2^2 $$

  **L1 Regularization** (Lasso Regression in linear regression):

  $$ L_1 = ||w||_1 = \sum_{j=1}^n |w_j| $$

  Lasso Regression:

  $$ \min_w \frac{1}{m} \sum_{i=1}^{m}{(\hat{y}^{(i)} - y^{(i)})^2} + \alpha ||w||_1 $$

  **Elastic Net** (a middle ground between Ridge Regression and Lasso Regression):

  $$ \min_w \frac{1}{m} \sum_{i=1}^{m}{(\hat{y}^{(i)} - y^{(i)})^2} + \alpha r ||w||_1 + 0.5\,\alpha(1-r)||w||_2^2 $$

  * $\alpha = 0$ is equivalent to ordinary least squares, i.e. plain Linear Regression
  * $0 \leq r \leq 1$: $r = 0 \rightarrow$ Ridge Regression, $r = 1 \rightarrow$ Lasso Regression. For $0 < r < 1$, the penalty is a combination of L1 and L2.
  * This is equivalent to a penalty $a||w||_1 + b||w||_2^2$ with $\alpha = a + b$ and $r = \frac{a}{a+b}$
- K-Nearest Neighbors:
  - Instance-based learning: the label is assigned based on the similarity between a new point and its neighbors
  - Non-parametric model
  - Distance functions:
    - Manhattan distance (L1)
    - Euclidean distance (L2)
- Use k-fold cross-validation to evaluate the performance of an ML model: estimate the average performance of k models trained and evaluated on different subsets of the data
- Curse of dimensionality: in high dimensions, data points lie far away from each other, so distance functions no longer capture meaningful similarity between points

#### Naive Bayes model:

$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$

$P(A)$ is often referred to as the **prior**, $P(A|B)$ as the **posterior**, and $P(B|A)$ as the **likelihood**.
- Strong assumption: conditional independence between the input features given the class
- Use this assumption to simplify Bayes' rule and infer $y$ from the training data $x$:

$$ \begin{aligned} \hat{y} &= \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y) \\ &= \arg\max_y \; \log P(y) + \sum_{i=1}^{n} \log P(x_i \mid y) \end{aligned} $$

## Tuesday:
- SVM: finds the separating hyperplane that maximizes the margin $\gamma$, the distance from the hyperplane to the closest data points (the support vectors).
- Hard margin vs. soft margin: a hard margin penalizes the weights for any data point that lies inside the margin (between the support vectors), while a soft margin allows some data points to sit inside the margin.
  - This is controlled by the C parameter in the sklearn library
  - We use C (a hyperparameter) to decide how much we tolerate points within our margins ('on the street')
  - Higher C = stronger penalty: the more points end up on the street, the bigger the hinge loss becomes. When C is big enough, hardly any point is allowed within the margins; this approaches a hard margin and can lead to overfitting.
  - Lower C = weaker penalty: more points are allowed within the margins (see the sketch below).
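To make the effect of C concrete, here is a minimal sketch assuming scikit-learn; the synthetic blobs dataset and the C values are arbitrary placeholders, not from the course:

```python
# Minimal sketch: effect of the C hyperparameter on a linear SVM (assumes scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy two-class dataset (placeholder data).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    # Small C -> weak penalty, wide/soft margin (more support vectors);
    # large C -> strong penalty, narrow margin, closer to a hard margin.
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"test accuracy={clf.score(X_test, y_test):.2f}")
```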
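Going back to Monday's regularization formulas, a minimal sketch (again assuming scikit-learn; the synthetic data, `alpha=1.0`, and `l1_ratio=0.5` are arbitrary placeholders) of how Ridge, Lasso, and Elastic Net correspond to the `Ridge`, `Lasso`, and `ElasticNet` estimators, with `l1_ratio` playing the role of the mixing ratio $r$. scikit-learn scales the squared-error term a bit differently across these estimators, so treat this as an illustration of the knobs rather than an exact match to the objectives above:

```python
# Minimal sketch of L2 / L1 / elastic net regularization (assumes scikit-learn).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Placeholder regression data.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

models = {
    "OLS (no penalty)":       LinearRegression(),
    "Ridge (L2, alpha=1.0)":  Ridge(alpha=1.0),
    "Lasso (L1, alpha=1.0)":  Lasso(alpha=1.0),
    # l1_ratio plays the role of r: 0 -> pure L2, 1 -> pure L1.
    "ElasticNet (r=0.5)":     ElasticNet(alpha=1.0, l1_ratio=0.5),
}

for name, model in models.items():
    model.fit(X, y)
    n_zero = (model.coef_ == 0).sum()  # L1 tends to drive some weights exactly to 0
    print(f"{name:<24} zero weights: {n_zero}")
```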
## Wednesday:
- Decision Tree
  - Hyperparameters:
    - Tree depth: the number of splits until the samples in each node are pure or fall below the minimum samples per leaf
    - Min samples leaf: the minimum number of samples a leaf must contain; a split is not made if it would leave fewer samples than this in a leaf
  - Does not require normalizing the data
  - Gini impurity: measures how good a split is; the lower the Gini score, the better
- Bagging (a small scikit-learn sketch appears at the end of these notes):
  - Train many weak classifiers on different sampled subsets of the data, then combine their predictions using some sort of voting mechanism.
  * Sample $k$ datasets $D_1,\dots,D_k$, where each $D_i$ is drawn uniformly at random with replacement from $D$.
  * For each $D_i$, train a classifier $h_i()$.
  * The final classifier is $h(\mathbf{x})=\frac{1}{k}\sum_{j=1}^k h_j(\mathbf{x})$
- Boosting:
  - Train classifiers sequentially so that each one learns from the mistakes of the previous learners, focusing on the misclassified data

## Thursday - Friday (off)

## Monday:
- Unsupervised learning (see the sketch below)
  - Clustering
  - Hierarchical Clustering
  - PCA (dimensionality reduction)
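A minimal sketch of the unsupervised pieces above, k-means clustering and PCA, assuming scikit-learn; the Iris data, 2 components, and 3 clusters are arbitrary placeholder choices:

```python
# Minimal sketch: PCA for dimensionality reduction + k-means clustering (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scales

# Reduce 4 features to 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Cluster the reduced data into 3 groups (k chosen arbitrarily here).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_2d)
print("cluster sizes:", [int((kmeans.labels_ == c).sum()) for c in range(3)])
```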
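And, as mentioned under Wednesday's bagging bullet, a minimal bagging sketch assuming scikit-learn: $k$ trees trained on bootstrap samples $D_1,\dots,D_k$ and combined by voting, compared against a single tree. The synthetic dataset and `n_estimators=50` are placeholders:

```python
# Minimal sketch of bagging: k trees on bootstrap samples, combined by voting (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder classification data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# k = 50 bootstrap samples D_1..D_k (drawn with replacement), one tree h_i per sample;
# BaggingClassifier's default base learner is a decision tree, and the ensemble
# prediction votes over the individual trees.
bagged = BaggingClassifier(
    n_estimators=50,
    bootstrap=True,
    random_state=0,
).fit(X_train, y_train)

print("single tree test accuracy:", round(single_tree.score(X_test, y_test), 3))
print("bagged trees test accuracy:", round(bagged.score(X_test, y_test), 3))
```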