# Support Vector Machines

###### tags: `Machine Learning II`

## Classification SVM's

%% #TODO: Look at the MIT Lecture https://www.youtube.com/watch?v=_PwhiWxHK8o %%

An SVM's target is to ==maximize the separability== between labels. In order to do this, the chosen decision boundary must maximize the **margin**.

>The **margin** is the smallest distance between the classification boundary and a class observation.

#TODO: make a margin plot

This means that data points near the decision boundary are **critical** and must be treated with special care, especially if data points with different labels are similar.

>For example, two similar data points with different labels would be motorcycles and mountain bikes.

>[!info]
>In cases with a limited number of samples, kernel methods such as SVM are a better option than deep learning ones.

#### Functional margin definition

#TODO: Gotta explain this in detail.

The **functional margin** of an observation $(\mathbf{x}_i, y_i)$ is defined as:

$$y_i(\mathbf{w}^T\mathbf{x}_i+b)$$

It is positive when the observation lies on the correct side of the decision boundary, and larger values indicate more confident classifications.

### SVM's for non linear problems

As the majority of real-life problems are non-linear, in order to model them with SVM's we need to transform the input space with what we called in Calculus a '*coordinate change*', so that the classes become linearly separable in the new space.

### Kernel SVM's

Kernel functions take two arguments $\mathbf{x}_1$ and $\mathbf{x}_2$ and map them to a real number. They are also **symmetric**, meaning that the order of the arguments does not change the result of the function, and **positive semidefinite**, meaning that for a sample $\{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}$, the matrix that stores all kernel values $k(\mathbf{x}_i, \mathbf{x}_j)$ is guaranteed to have only non-negative eigenvalues.

>The most used kernels are the **Linear** kernel, the **Radial** kernel and the **Polynomial** kernel. **Combinations of kernels** are also frequently used.

## Types of kernels

### RBF kernel

#TODO: Need to explain this better (this is more or less what the slides contain, but it is rather poor).

$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma\,\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2\right)$$

This kernel displays **radial symmetry**: all observations placed at the same Euclidean distance from a given Support Vector have the same activation value. A lower value of $\gamma$ implies a wider range of activation, which decreases as $\gamma$ is increased. In broad terms, a low value of $\gamma$ gives a Support Vector more *range*, whereas a high one reduces it.

### Polynomial kernel

#TODO

### Sigmoid kernel

#TODO

### Regularization

#TODO: from the slides

In Machine Learning II we want to learn from training data and approximate our target function. It is, however, desirable not to imitate it exactly, as we would then fail at the generalization step by **overfitting**. To that end we introduce a **loss function** and **regularization** techniques to try to prevent this from happening.

### Functions to minimize the zero-one loss

**Zero-one loss**: every correctly classified instance contributes a loss of 0 and every misclassified instance contributes a loss of 1. It is not differentiable.

**Hinge loss**: a loss that is an upper bound to the zero-one loss. It is continuous but not differentiable.

>Hinge loss is the loss used in Classification SVM's

**Square loss**: this function gives a higher penalty to outliers. It is both continuous and differentiable.

## Regression SVM's

#NOTE: should we remove this?

### Linear regression

Uses MSE (mean squared error) to fit a line through the data points.
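Below is a minimal sketch of this idea in Python (the toy data and variable names are illustrative assumptions, not taken from the course material): a linear model fitted by minimizing the MSE over the training points.

```python
# Minimal sketch: a linear model fitted by minimizing the mean squared error
# (ordinary least squares). Toy data and names are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=50)   # noisy linear target

X = np.column_stack([x, np.ones_like(x)])            # add a bias column
w, *_ = np.linalg.lstsq(X, y, rcond=None)            # MSE-minimizing weights
mse = np.mean((X @ w - y) ** 2)
print("weights (slope, bias):", w, " training MSE:", mse)
```

The ridge penalty introduced in the next section would simply add $\lambda \mathbf{w}^T \mathbf{w}$ to this same objective.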
#### Ridge regression

As MSE is very sensitive to outliers, we introduce a **regularization/penalization term** $\lambda \mathbf{w}^T \mathbf{w}$, which translates to adding the **(squared) norm of the weight vector** to the general minimization problem.

##### Duality in ridge regression

If we have too high a value for the penalization parameter $\lambda$, … (slide 28)

### Lack of sparsity in Kernel Ridge Regression

In kernel ridge regression the solution depends on **all** the training points, so the resulting model is not sparse.

## Support Vector Regression

Support Vector Regression adapts the kernel trick, sparsity and soft margin principles of classification SVM's to the regression setting.
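A minimal sketch of how these pieces fit together, assuming scikit-learn's `SVR` with illustrative hyperparameter values (`C`, `epsilon`, `gamma` are not taken from the slides): the RBF kernel supplies the kernel trick, the $\varepsilon$-insensitive tube plays the role of the soft margin, and sparsity shows up as a small number of support vectors.

```python
# Minimal sketch: kernelized Support Vector Regression on toy 1-D data.
# Hyperparameter values (C, epsilon, gamma) are illustrative assumptions.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.1, size=80)      # noisy non-linear target

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5)    # kernel trick + eps-tube
svr.fit(X, y)

# Sparsity: only points lying outside the epsilon-insensitive tube
# become support vectors, usually far fewer than the 80 training points.
print("number of support vectors:", len(svr.support_))
print("prediction at x=0:", svr.predict([[0.0]]))
```

Widening `epsilon` makes the tube more tolerant and typically leaves fewer support vectors, while `gamma` behaves as in the RBF classification kernel discussed above.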