---
title: Support Vector Machine
tags: Machine Learning, CoderSchool
---
# Overview
The objective of the Support Vector Machine algorithm is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points.
*(Figure: several possible separating hyperplanes.)*


To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e. the maximum distance between the data points of both classes. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence.
# Hyperplanes and Support Vectors

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.
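
For reference, such a decision boundary can be written as a linear equation in the feature vector $\mathbf{x}$, and the side of the hyperplane a point falls on determines its predicted class:

$$
\mathbf{w}^T\mathbf{x} + b = 0, \qquad \hat{y} = \operatorname{sign}(\mathbf{w}^T\mathbf{x} + b),
$$

where $\mathbf{w}$ is the normal vector of the hyperplane and $b$ is its offset from the origin.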

Support vectors are the data points closest to the hyperplane; they influence its position and orientation. Using these support vectors, we maximize the margin of the classifier. Deleting the support vectors would change the position of the hyperplane. These are the points that help us build our SVM.
# How to Identify the Right Hyperplane
## Scenario 1

We select the hyperplane that segregates the two classes better. In this scenario, hyperplane B does this job best.
## Scenario 2

Here, maximizing the distance between the nearest data points (of either class) and the hyperplane helps us decide on the right hyperplane: C.
## Scenario 3

Hyperplane B has a higher margin than A. But here is the catch: SVM selects the hyperplane that classifies the classes accurately before maximizing the margin. Here, hyperplane B has a classification error while A classifies everything correctly. Therefore, the right hyperplane is A.
## Scenario 4

The single star at the other end acts as an outlier for the star class. SVM can ignore such outliers and still find the hyperplane with the maximum margin. Hence, we can say that SVM is robust to outliers.

## Scenario 5

We add a new feature $z = x^2 + y^2$; this explicit feature transformation is the idea behind the kernel trick. Now, let's plot the data points on the x and z axes:

In the above plot, the points to consider are:
* All values of z are always positive, because z is the sum of the squares of x and y.
* In the original plot, the red circles appear close to the origin of the x and y axes, leading to lower values of z, while the stars lie relatively far from the origin, resulting in higher values of z (see the sketch below).
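
Since the original scatter plots are not reproduced here, the sketch below uses synthetic data (circles drawn near the origin, stars on an outer ring; both are stand-ins for the original figure) to show how the new feature $z = x^2 + y^2$ makes the classes separable by a straight line:
```python=
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic stand-in for the original figure: circles near the origin,
# stars on a ring farther away.
circles = rng.normal(0, 0.5, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
stars = np.column_stack([2.5 * np.cos(angles), 2.5 * np.sin(angles)])
stars += rng.normal(0, 0.2, size=(50, 2))

def add_z(points):
    # New feature z = x^2 + y^2 (always non-negative)
    return points[:, 0] ** 2 + points[:, 1] ** 2

plt.figure(figsize=(8, 6))
plt.scatter(circles[:, 0], add_z(circles), marker='o', color='red', label='circles')
plt.scatter(stars[:, 0], add_z(stars), marker='*', color='blue', label='stars')
plt.xlabel('x')
plt.ylabel('z = x^2 + y^2')
plt.legend()
plt.show()
```
In the x-z plane the circles sit near z = 0 and the stars sit near z = 6, so a horizontal line separates them.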


# Tuning Parameters
Tuning the parameter values of machine learning algorithms effectively improves model performance. Let's look at the list of parameters available for SVM.
In:
```python=
sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma=0.0, coef0=0.0,
                shrinking=True, probability=False, tol=0.001, cache_size=200,
                class_weight=None, verbose=False, max_iter=-1, random_state=None)
```
The parameters with the highest impact on model performance are “kernel”, “gamma”, and “C” (regularization).
## Margin
The margin is the distance between the separating hyperplane and the closest points of each class.
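
In the standard formulation, where the support vectors lie on the planes $\mathbf{w}^T\mathbf{x} + b = \pm 1$, the margin works out to

$$
\text{margin} = \frac{2}{\lVert\mathbf{w}\rVert},
$$

so maximizing the margin is equivalent to minimizing $\lVert\mathbf{w}\rVert$ while keeping the training points on the correct side.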


## Regularization
The regularization parameter (called the C parameter in Python's sklearn library) tells the SVM optimization how much you want to avoid misclassifying each training example.
For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.

Let's look at an example where we use two features of the Iris data set to classify the species.
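The original comparison figures are not reproduced here, but the effect of C can be sketched with scikit-learn's built-in Iris loader, using the same two features (sepal length and petal length) as the implementation later in this post; the C values below are only illustrative:
```python=
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Two features (sepal length, petal length), all three Iris classes.
iris = load_iris()
X = iris.data[:, [0, 2]]
y = iris.target

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
# A small C tolerates misclassifications (wider margin, more support vectors);
# a large C tries harder to classify every training point correctly.
```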

We should always look at the cross-validation score to find an effective combination of these parameters and avoid over-fitting.
## Kernel
A kernel is a way of computing the dot product of two vectors $\mathbf{x}$ and $\mathbf{y}$ in some (possibly very high-dimensional) feature space, which is why kernel functions are sometimes called a “generalized dot product”.
Suppose we have a mapping $\varphi : \mathbb{R}^n \to \mathbb{R}^m$ that takes our vectors in $\mathbb{R}^n$ to some feature space $\mathbb{R}^m$. Then the dot product of $\mathbf{x}$ and $\mathbf{y}$ in this space is $\varphi(\mathbf{x})^T \varphi(\mathbf{y})$. A kernel is a function $k$ that corresponds to this dot product, i.e. $k(\mathbf{x}, \mathbf{y}) = \varphi(\mathbf{x})^T \varphi(\mathbf{y})$.
Kernels give us a way to compute dot products in some feature space without even knowing what that space or the mapping $\varphi$ is. In effect, they let a linear classifier handle data that is only separable after a non-linear transformation.
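As a concrete check of this identity, take the degree-2 polynomial kernel $k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y})^2$ on $\mathbb{R}^2$, whose explicit feature map is $\varphi(x_1, x_2) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$; the short sketch below verifies numerically that both computations agree:
```python=
import numpy as np

def phi(v):
    # Explicit feature map for the degree-2 polynomial kernel in R^2
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def poly_kernel(x, y):
    # Kernel computed directly in the original 2-D space
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(y)))   # dot product in the 3-D feature space
print(poly_kernel(x, y))        # same value, computed without the mapping
```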
Several kernel options are available, such as “linear”, “rbf”, and “poly” (the default is “rbf”). “rbf” and “poly” are useful for non-linear decision boundaries.


I would suggest going for a linear kernel if you have a large number of features (>1000), because the data is more likely to be linearly separable in a high-dimensional space. You can also use RBF, but do not forget to cross-validate its parameters to avoid over-fitting.
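
A minimal cross-validation sketch for tuning an RBF SVM (the parameter grid below is only illustrative; GridSearchCV picks the combination with the best cross-validation score):
```python=
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()

# Illustrative grid; widen or refine it for a real problem.
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(iris.data, iris.target)

print(search.best_params_)
print(f"best cross-validation score: {search.best_score_:.3f}")
```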
## Gamma
The gamma parameter defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'. In other words, with low gamma, points far away from a plausible separation line are considered when calculating the separation line, whereas with high gamma only the points close to the plausible line are considered.



Too high a gamma value can cause over-fitting.
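A rough way to see this effect (the gamma values below are only illustrative, and the exact scores depend on the data) is to compare training accuracy with cross-validated accuracy as gamma grows:
```python=
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

for gamma in (0.01, 1, 100):
    clf = SVC(kernel='rbf', gamma=gamma)
    cv_score = cross_val_score(clf, X, y, cv=5).mean()
    train_score = clf.fit(X, y).score(X, y)
    print(f"gamma={gamma}: train accuracy {train_score:.2f}, "
          f"cross-validation accuracy {cv_score:.2f}")
# A very large gamma tends to drive training accuracy toward 1.0 while
# cross-validation accuracy drops -- the over-fitting signature described above.
```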
# SVM Implementation in Python
## Data Preparation & Visualization
The dataset we will be using to implement our SVM algorithm is the Iris dataset.
https://www.kaggle.com/jchen2186/machine-learning-with-iris-dataset/data
In:
```python=
import pandas as pd

df = pd.read_csv('/Users/rohith/Documents/Datasets/Iris_dataset/iris.csv')
df = df.drop(['Id'], axis=1)

# Collect the distinct species names
target = df['Species']
s = set()
for val in target:
    s.add(val)
s = list(s)

# Drop the third class (rows 100-149) to get a binary problem
rows = list(range(100, 150))
df = df.drop(df.index[rows])
```
Since the Iris dataset has 3 classes, we remove one of them. This leaves us with a binary classification problem.
Also, there are 4 features available for us to use. We will be using only 2 of them, i.e. sepal length and petal length. We take these 2 features and plot them to visualize the data. From the graph below, you can infer that a straight line can be used to separate the data points.
In:
```python=
import matplotlib.pyplot as plt
x = df['SepalLengthCm']
y = df['PetalLengthCm']
setosa_x = x[:50]
setosa_y = y[:50]
versicolor_x = x[50:]
versicolor_y = y[50:]
plt.figure(figsize=(8,6))
plt.scatter(setosa_x,setosa_y,marker='+',color='green')
plt.scatter(versicolor_x,versicolor_y,marker='_',color='red')
plt.show()
```
Out:

## Split the Data
We extract the required features and split them into training and testing data. 90% of the data is used for training and the remaining 10% is used for testing.
In:
```python=
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
import numpy as np

## Drop the rest of the features and extract the target values
df = df.drop(['SepalWidthCm', 'PetalWidthCm'], axis=1)
Y = []
target = df['Species']
for val in target:
    if(val == 'Iris-setosa'):
        Y.append(-1)
    else:
        Y.append(1)
df = df.drop(['Species'], axis=1)
X = df.values.tolist()

## Shuffle and split the data into training and test set
X, Y = shuffle(X, Y)
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.9)
x_train = np.array(x_train)
y_train = np.array(y_train)
x_test = np.array(x_test)
y_test = np.array(y_test)
y_train = y_train.reshape(90, 1)
y_test = y_test.reshape(10, 1)
```
## Build SVM Model Using the Numpy Library
In:
```python=
## Support Vector Machine trained with gradient descent on the hinge loss
import numpy as np

train_f1 = x_train[:, 0]
train_f2 = x_train[:, 1]
train_f1 = train_f1.reshape(90, 1)
train_f2 = train_f2.reshape(90, 1)

w1 = np.zeros((90, 1))
w2 = np.zeros((90, 1))

epochs = 1
alpha = 0.0001  # learning rate

while(epochs < 10000):
    y = w1 * train_f1 + w2 * train_f2
    prod = y * y_train
    print(epochs)
    count = 0
    for val in prod:
        if(val >= 1):
            # Correctly classified with margin: only apply the regularization term
            cost = 0
            w1 = w1 - alpha * (2 * 1/epochs * w1)
            w2 = w2 - alpha * (2 * 1/epochs * w2)
        else:
            # Inside the margin or misclassified: hinge-loss gradient step
            cost = 1 - val
            w1 = w1 + alpha * (train_f1[count] * y_train[count] - 2 * 1/epochs * w1)
            w2 = w2 + alpha * (train_f2[count] * y_train[count] - 2 * 1/epochs * w2)
        count += 1
    epochs += 1
```
$\alpha$ (0.0001) is the learning rate, and the regularization parameter $\lambda$ is set to $1/\text{epochs}$. Therefore, the regularization strength decreases as the number of epochs increases.
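Written out, the loop above is stochastic gradient descent on the regularized hinge loss. With $\lambda = 1/\text{epochs}$ and prediction $\hat{y}_i = \mathbf{w}^T\mathbf{x}_i$, the per-sample update implemented in the code is

$$
\mathbf{w} \leftarrow
\begin{cases}
\mathbf{w} - \alpha\,(2\lambda\mathbf{w}) & \text{if } y_i\,\hat{y}_i \ge 1,\\
\mathbf{w} + \alpha\,(y_i\mathbf{x}_i - 2\lambda\mathbf{w}) & \text{otherwise,}
\end{cases}
$$

where $\mathbf{w} = (w_1, w_2)$ collects the two feature weights.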
In:
```python=
from sklearn.metrics import accuracy_score

## Clip the weights
index = list(range(10, 90))
w1 = np.delete(w1, index)
w2 = np.delete(w2, index)
w1 = w1.reshape(10, 1)
w2 = w2.reshape(10, 1)

## Extract the test data features
test_f1 = x_test[:, 0]
test_f2 = x_test[:, 1]
test_f1 = test_f1.reshape(10, 1)
test_f2 = test_f2.reshape(10, 1)

## Predict
y_pred = w1 * test_f1 + w2 * test_f2
predictions = []
for val in y_pred:
    if(val > 1):
        predictions.append(1)
    else:
        predictions.append(-1)
print(accuracy_score(y_test, predictions))
```
We now clip the weights since the test data contains only 10 data points. We extract the features from the test data and predict the values. We then compare the predictions with the actual values and print the accuracy of our model.
## SVM in scikit-learn
There is a much simpler way to implement the SVM algorithm: we can use the scikit-learn library and just call the related functions.
In:
```python=
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

clf = SVC(kernel='linear')
clf.fit(x_train, y_train.ravel())  # ravel() flattens the (90, 1) target column
y_pred = clf.predict(x_test)
print(accuracy_score(y_test, y_pred))
```
# Pros
* It works really well when there is a clear margin of separation.
* It is effective in high-dimensional spaces.
* It is effective in cases where the number of dimensions is greater than the number of samples.
* It uses a subset of the training points in the decision function (the support vectors), so it is also memory efficient.
# Cons
* It doesn't perform well on large data sets, because the required training time is high.
* It also doesn't perform very well when the data set has a lot of noise, i.e. when the target classes overlap.
* SVM doesn't directly provide probability estimates; in scikit-learn's SVC these are calculated using an expensive five-fold cross-validation (see the sketch below).
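
For completeness, a minimal sketch of how scikit-learn exposes those estimates; enabling probability=True triggers the internal cross-validation, so fitting becomes noticeably slower:
```python=
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()

# probability=True fits an extra calibration step via internal cross-validation.
clf = SVC(kernel='linear', probability=True, random_state=0)
clf.fit(iris.data, iris.target)

print(clf.predict_proba(iris.data[:3]))  # class membership probabilities
```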