# WEEK 6 (21-25/12/20)
## KNN & Naive Bayes (Mon, 21/12/20)
### **Bias-variance tradeoff**
**==Error due to Bias==**: The error due to ==bias== is taken as the difference between the ==expected (or average) prediction== of our model and the correct value which we are trying to predict.
**==Error due to Variance==**: The error due to ==variance== is taken as the variability of a model's prediction for a given data point. It captures ==how much the predictions for a GIVEN POINT vary when using DIFFERENT MODELS== to make these predictions.

---> ==High-variance== learning methods may be able to represent their training set well but are at risk of ==overfitting==
==**Detecting High Bias and High Variance**==
* High Bias: training error is high
  * Use a more complex model (e.g. higher-degree polynomial, or kernel methods)
  * Add features that have strong predictive power
* High Variance: training error is low, but it is much lower than the test error (test error is high)
  * Add more training data
  * Reduce the model complexity (e.g. regularization)
### **Regularizations for linear model**
---> ==**reduce overfit**==
By adding a regularization term to the loss function:
* Try to fit the training data well (by keeping the loss small)
* Try to ==keep the weights small== so that no single weight stands out and causes overfitting
==**L2 Regularization**==

==**Ridge regression**==
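A minimal sketch of the Ridge (L2-penalized) objective, written to match the Elastic Net formula further below (assuming the same squared-error loss and notation; some texts scale the penalty by 1/2):
$$
\min_{w} \frac{1}{m} \sum_{i=1}^{m}{(\hat{y}^{(i)} - y^{(i)})^2} + \alpha ||w||_2^2
$$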

---
**==L1 Regularization==**:

==**Lasso Regression:**==
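Likewise, a minimal sketch of the Lasso (L1-penalized) objective under the same assumptions:
$$
\min_{w} \frac{1}{m} \sum_{i=1}^{m}{(\hat{y}^{(i)} - y^{(i)})^2} + \alpha ||w||_1
$$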

---
**==Elastic Net (a middle ground between Ridge Regression and Lasso Regression)==**
$$
\min_{w} \frac{1}{m} \sum_{i=1}^{m}{(\hat{y}^{(i)} - y^{(i)})^2} + \alpha r ||w||_1 + 0.5\alpha(1-r)||w||_2^2
$$
Alpha is used to penalize the weights:
* When alpha is small ---> less regularization; the model may remain overfit
* When alpha is large ---> more regularization; risk of underfitting
==**Hyperparameters**==: settings you choose before training (they are not learned from the data) that control the model
e.g. r, alpha, learning rate, max_iter, ...
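A minimal sketch of the Elastic Net objective in practice, assuming scikit-learn (its `alpha` and `l1_ratio` arguments correspond to alpha and r above; the data and values are purely illustrative):
```python
# A sketch only: ElasticNet's alpha / l1_ratio play the roles of alpha and r above.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # toy features
y = X @ np.array([1.0, 0.0, 0.0, 2.0, 0.0]) + rng.normal(scale=0.1, size=100)

# alpha = overall regularization strength, l1_ratio = r (mix of L1 vs L2)
model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000)
model.fit(X, y)
print(model.coef_)                                 # uninformative features get small / zero weights
```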
### **K-Nearest Neighbors**
KNN is a **non-parametric** classification algorithm. It simply looks at the K points in the training set that are ==nearest== to the input x, and assigns the most common label among them as the output.
K ---> hyperparameter
==Parametric models==:
* ADVANTAGE: faster to predict
* DISADVANTAGE: make stronger assumptions about the data distribution
==Non-parametric models==:
* ADVANTAGE: more flexible
* DISADVANTAGE: computationally expensive
==**The distance function**==
**Manhattan distance**, or **$L_1$ distance**, or **$L_1$ norm**
$$
dist(x, x') = \sum_{i=1}^{n}{|x_i - x'_i|}
$$
---
**Euclidean distance**, or **$L_2$ distance**, or **$L_2$ norm**
$$
dist(x, x') = \sqrt{\sum_{i=1}^{n}{(x_i - x'_i)^2}}
$$
SMALL K: higher chance of overfit
LARGE K: underfit
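A minimal sketch of KNN in code, assuming scikit-learn (`p=1` gives the Manhattan/$L_1$ distance, `p=2` the Euclidean/$L_2$ distance; the toy data and K values are illustrative):
```python
# A sketch only: p=1 -> Manhattan (L1), p=2 -> Euclidean (L2) distance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 25):                               # small K tends to overfit, large K to underfit
    knn = KNeighborsClassifier(n_neighbors=k, p=2)
    knn.fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))
```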
### **K-fold cross validation**: to find the best set of hyperparameters

* Randomly split the training dataset into k folds without replacement, where $k-1$ folds are used for model training and one fold is used for performance evaluation.
* Repeat this k times to obtain k models and k performance estimates.
* Calculate the average performance of the models (i.e. the final validation score).
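A minimal sketch of k-fold cross validation used to compare hyperparameter candidates, assuming scikit-learn (the dataset and the candidate K values are illustrative):
```python
# A sketch only: 5-fold cross validation to compare candidate values of K.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 5, 15):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())                        # average of the 5 fold scores = validation score
```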
==**Curse of Dimensionality**==
When the number of features (p) is large while the dataset size (n) stays the same, KNN performs worse: in high dimensions points become roughly equidistant, so the "nearest" neighbours carry little information.
## Linear SVM & Kernel SVM (Tue, 22/12/20)
The Support Vector Machine (SVM) is a ==**linear classifier**== that finds the maximum-margin separating hyperplane. In other words, SVM finds the separating hyperplane that **==maximizes the distance to the closest data points from both classes==**.

distance from the closest green point to the blue line = margin of green class
**==FORMULA FOR MARGIN:==**
$$
\Rightarrow
\gamma(\mathbf{w},b)=\min_{\mathbf{x}\in D}\frac{\left | \mathbf{w}^T\mathbf{x}+b \right |}{\left \| \mathbf{w} \right \|_{2}}
$$
---> Fixing the scale so that the closest points satisfy $\left|\mathbf{w}^T\mathbf{x}+b\right| = 1$, maximizing the margin is equivalent to solving:
$$
\begin{align}
&\min_{\mathbf{w},b}\mathbf{w}^T\mathbf{w}&\\
&\textrm{s.t.} \ \ \ \forall i \ y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b) \geq 1 &
\end{align}
$$
==**Hard margin**==: assume training data is linearly separable
**==Support vectors==** are any points that satisfy the condition
$$
y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b) = 1.
$$
---
**==Soft margin==**:
---> The ==hinge loss== function:
$$
\min_{\mathbf{w},b}\underbrace{\mathbf{w}^T\mathbf{w}}_{l_{2}-regularizer} + C\ \sum_{i=1}^{m}\underbrace{\max\left [ 1-y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b),0 \right ]}_{hinge-loss}
$$
**==C==**: penalty parameter that controls how strongly points within the margins (margin violations) are penalized
==Higher C== = stronger penalty: the more points allowed on the street, the bigger the hinge loss becomes. When C is big enough, no point is allowed within the margins ---> hard margin, and this can lead to overfitting
==Lower C== = weaker penalty: more points are allowed within the margins
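A minimal sketch of the effect of C, assuming scikit-learn's linear SVM (the C values and toy data are illustrative):
```python
# A sketch only: how C moves a linear SVM between soft and (nearly) hard margins.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in (0.01, 1.0, 100.0):                       # small C = many margin violations allowed
    svm = LinearSVC(C=C, max_iter=10_000)
    svm.fit(X_train, y_train)
    print(C, svm.score(X_train, y_train), svm.score(X_test, y_test))
```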
---
### ==**Kernel method**==
The main idea of kernel methods is to project the data into a high-dimensional space where it is linearly separable, and find a separating hyperplane in that space.
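A minimal sketch, assuming scikit-learn's `SVC`: an RBF-kernel SVM separates data that a linear SVM cannot (the moons dataset is illustrative):
```python
# A sketch only: an RBF-kernel SVM handles data a linear SVM cannot separate.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)   # kernel trick: no explicit projection needed
print(linear_svm.score(X, y), rbf_svm.score(X, y))            # the RBF score should be clearly higher
```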
## Tree-based Model, Bagging, and Boosting (Wed, 23/12/20)
### **Decision Tree Classifier**

**==Gini Score==**: a measure of impurity (how mixed the classes are within one group); the higher the Gini score, the more evenly the classes are distributed
* MAX: when all targets are equally distributed
* MIN: when all observations belong to one class
==Which feature to split?==
* Select feature, split data based on the feature
* Compute Gini scores of the 2 groups created by the split
* Compute the (size-weighted) average of the 2 Gini scores as that feature's Gini value
* The feature with the lowest Gini value is chosen to split the data
---> Use the feature split with the **==SMALLEST==** Gini score
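A minimal sketch of the split scoring described above, assuming the standard Gini impurity formula $G = 1 - \sum_k p_k^2$ (which is not spelled out in these notes); the toy labels are illustrative:
```python
# A sketch only, assuming Gini impurity G = 1 - sum_k p_k^2.
import numpy as np

def gini(labels):
    """Gini impurity of one group: 0 when pure, maximal when classes are evenly mixed."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_score(left, right):
    """Size-weighted average Gini of the two groups created by a split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

print(gini([0, 0, 0, 0]))                        # 0.0  -> MIN: all observations in one class
print(gini([0, 1, 0, 1]))                        # 0.5  -> MAX for 2 classes: evenly distributed
print(split_score([0, 0, 0, 0], [0, 1, 0, 1]))   # 0.25 -> this candidate split's Gini value
```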
---> A Decision Tree can overfit/underfit (it can "memorize" the data set)
Decision trees are ==**prone to overfitting**==, especially when a tree is particularly deep. This is due to the **==amount of specificity==** we look at, which leads to smaller and smaller samples of events that meet the previous assumptions. These small samples can lead to unsound conclusions.
---> Solution: control the depth (max_depth)
As the max depth of the decision tree increases, the error on the training set goes down while the error on the test set goes up ---> overfitting
* max_features: can be set as a fraction of the features to consider (e.g. 0.5)
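A minimal sketch of the max_depth effect, assuming scikit-learn (the depth values and toy data are illustrative):
```python
# A sketch only: train vs test accuracy as max_depth grows.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, 5, None):                         # None = grow until pure ("memorize" the data)
    tree = DecisionTreeClassifier(max_depth=depth, max_features=0.5, random_state=0)
    tree.fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))
```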
### **Random Forest Classifier**
**STEPS**:
* **Sample $k$ datasets** $D_1,\dots,D_k$, each $D_i$ drawn **==with replacement==** from $D$ (bootstrap sampling)
* For each $D_i$, train a Decision Tree that **subsamples features without replacement** at each split (this increases the variance/diversity of the individual trees, decorrelating them so that averaging reduces the ensemble's variance)
* The final classifier is $h(\mathbf{x})=\frac{1}{k}\sum_{j=1}^k h_j(\mathbf{x})$
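A minimal sketch, assuming scikit-learn's `RandomForestClassifier`, which performs the bootstrap sampling and feature subsampling internally (parameter values and data are illustrative):
```python
# A sketch only: RandomForestClassifier does the bootstrap + feature subsampling for us.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# n_estimators = k trees; max_features controls the feature subsample at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```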
## Unsupervised Learning (Thu, 24/12/20)
(Working with un-labelled data)
### Clustering
#### ==**K-means**==: partitions the data into k groups (clusters)
.fit(X): unsupervised, therefore we only fit X (no labels)
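A minimal sketch, assuming scikit-learn's `KMeans` (the blob data and k=3 are illustrative):
```python
# A sketch only: unsupervised, so .fit() is given X but no labels.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # true labels are ignored

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)                          # only X is passed
print(kmeans.cluster_centers_)         # the 3 group centres
print(kmeans.labels_[:10])             # cluster assignment of the first 10 points
```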
### ==**Principal components analysis PCA**==
PCA reduces the dimensionality of the data set, allowing most of the variability to be explained using fewer variables
---> PURPOSE: find the directions (principal components) along which the variance of the data is maximized
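A minimal sketch, assuming scikit-learn's `PCA` (the iris data and 2 components are illustrative):
```python
# A sketch only: project 4-D iris data onto the 2 directions of highest variance.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # fewer variables...
print(pca.explained_variance_ratio_)    # ...but most of the variability is still explained
```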
## Review & Module Test (Fri, 25/12/20)
### **NAIVE BAYES CLASSIFIER**
#### Conditional probability
$$
P(A|B) = \frac{P(A \cap B)}{P(B)}
$$
**The chain rule** follows from the definition of conditional probability:
$$
P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)
$$
---> **Bayes' rule**:
$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$
$P(A)$: **the prior**
$P(A|B)$: **the posterior**
$P(B|A)$: **the likelihood**.
e.g.
$$
P(label|data)
$$
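A toy numeric check of Bayes' rule, with numbers made up purely for illustration: suppose 20% of emails are spam, the word "ad" appears in 50% of spam emails, and it appears in 5% of non-spam emails. Then
$$
P(\text{spam}\mid\text{"ad"}) = \frac{P(\text{"ad"}\mid\text{spam})\,P(\text{spam})}{P(\text{"ad"})}
= \frac{0.5 \times 0.2}{0.5 \times 0.2 + 0.05 \times 0.8} = \frac{0.10}{0.14} \approx 0.71
$$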
--->==ASSUMPTION==: ==conditional independence== between every pair of features given the value of the class variable
==**ADVANTAGES**==
* Works well for classification tasks such as spam filtering
* Require a small amount of training data to estimate the necessary parameters
==**DISADVANTAGES**==
* Prone to underfitting (the conditional-independence assumption rarely holds exactly)
#### **Multinomial features**
**NAIVE BAYES**
$$
P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}
{P(x_1, \dots, x_n)}
$$
$$
P(\text{"ad"}\mid \text{spam})=P(x_i\mid y)=\frac{\text{# of times the word "ad" appears in all spam emails}}{\text{# of words in all spam emails combined}}.
$$
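A minimal sketch of multinomial Naive Bayes on word counts, assuming scikit-learn (the tiny corpus is made up for illustration):
```python
# A sketch only: multinomial Naive Bayes on word-count features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap ad buy now", "meeting schedule today",
        "ad ad free offer", "project report attached"]
labels = [1, 0, 1, 0]                    # 1 = spam, 0 = not spam (toy data)

vec = CountVectorizer()
X = vec.fit_transform(docs)              # multinomial features: word counts per document
clf = MultinomialNB().fit(X, labels)     # estimates P(y) and P(word | y) from those counts
print(clf.predict(vec.transform(["free ad offer"])))   # -> [1], i.e. spam
```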
# ==**MISCELLANEOUS SHIT**==
np.c_ : concatenates arrays along the second axis (adds new columns) in NumPy
PIPELINE: chains preprocessing and model steps so they run in a fixed order
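A minimal sketch of both, assuming scikit-learn and NumPy (the steps and data are illustrative):
```python
# A sketch only: np.c_ builds columns, a Pipeline runs its steps in the given order.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.c_[np.arange(10.0), np.arange(10.0) ** 2]     # np.c_ stacks the two columns side by side
y = (np.arange(10) > 4).astype(int)

pipe = Pipeline([("scale", StandardScaler()),                     # step 1: scale features
                 ("knn", KNeighborsClassifier(n_neighbors=3))])   # step 2: classify
pipe.fit(X, y)                                       # fit applies the steps in order
print(pipe.predict(X[:3]))                           # -> [0 0 0]
```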
# **==TEST==**
1. Linear regression
weights, loss function, gradient descent steps, ...
how to evaluate model, overfitting, underfitting
**no formulas, no maths**
2. KNN & regularization
bias, variance, trade off,
what can you do with KNN (regression-yes(take average))
when will KNN overfit/underfit?
3. Naive Bayes
formula, assumption?
4. Linear SVM
what is support vector?
dotted lines, hard/soft margin with regards to
C value
5. Kernel SVM
project on a high dimension to classify
kernel trick:
6. Decision tree
score used to split tree? How to split tree
hyperparameter to avoid underfit/overfit
7. Bagging & boosting (combine all models then take the average)
random forest
8. Unsupervised learning
Clustering (kmeans, hierarchical)
HOW TO FINE TUNE SHIT (PIPELINE)