---
title: Spam classification problem
description:
duration: 200
card_type: cue_card
---

### **Problem statement**

<img src="https://drive.google.com/uc?id=1SoIWP9V9xjzoE6MDw7d5ZZCgZYMPOgFh" width=700>

---
title: SVM - support vector machine
description:
duration: 200
card_type: cue_card
---

<img src="https://drive.google.com/uc?id=1avbibjOWa10nLXr-d8ZgAQE7zT9wV2ap" width=700>

---
title: SVM - Geometric intuition
description:
duration: 200
card_type: cue_card
---

### **Geometric intuition behind SVM**

<img src='https://drive.google.com/uc?id=1WpiQhJkbQdvrPlwMp8XgN7-vdo-lL0RB' width='700'>

<img src='https://drive.google.com/uc?id=1SB2ymuy6BJo_S2AnBNlfUWc4xQv1W6pz' width='700'>

<img src='https://drive.google.com/uc?id=1roB26OTVEbXCuqYknXZ6Kjwt0E47y9Cp' width='800'>

---
title: Quiz 1
description:
duration: 60
card_type: quiz_card
---

# Question
What do you mean by generalization in terms of SVM?

# Choices
- [ ] How far the hyperplane is from the training datapoints
- [x] How accurately the SVM can predict outcomes for unseen data
- [ ] How accurately the SVM classifies training datapoints

---
title: SVM - Geometric intuition 2
description:
duration: 200
card_type: cue_card
---

<img src='https://drive.google.com/uc?id=1sw4pKVOJCkaa8oFTvBwfXFi49q7qKxKi' width='800'>

<img src='https://drive.google.com/uc?id=1yFdBZTjjd0cfXOurb5BDSaJN5etsxafx' width='800'>

<img src='https://drive.google.com/uc?id=1_OZwfj1hXMBIWh6FlY-ro-bTmPn6dnJY' width='800'>

<img src='https://drive.google.com/uc?id=1hIc5gVv8wEJ3XDtqfTFuXfAE_1E_MHSS' width='800'>

<img src='https://drive.google.com/uc?id=18vNlvPS5XprZdhEbK2MqJDEoCxQ80crO' width='800'>

---
title: Quiz 2
description:
duration: 60
card_type: quiz_card
---

# Question
If,
- Ο€+ : w^T * x + b = 40
- Ο€- : w^T * x + b = -50

then the margin will be:

# Choices
- [ ] 10/|w|
- [ ] 40/|w|
- [ ] 50/|w|
- [x] 90/|w|

---
title: SVM - Demo
description:
duration: 200
card_type: cue_card
---

https://jgreitemann.github.io/svm-demo

<img src='https://drive.google.com/uc?id=149Xs-dDaEhXH8m90fiSlT0SPlUchUo2T' height='400' width='650'>

---
title: Hard Margin SVM
description:
duration: 200
card_type: cue_card
---

<img src='https://drive.google.com/uc?id=1mbTiyZd6SGRjCNc1ItOvZkcGSp-D5h4i' width='800'>

<img src='https://drive.google.com/uc?id=15vv6vu28pQkfwAGuNpL1LeaBTojBTKhH' width='800'>

<img src='https://drive.google.com/uc?id=1Xhkz2rDmyuiQCE2sRWyacuvrQa9LXlvP' width='800'>

Example -

<img src='https://drive.google.com/uc?id=1VgUbWlosavzPc9ftYrdt0JntpwZP6c1_' width='800'>

<img src='https://drive.google.com/uc?id=1M5-AUwdrAxpqBt9HYwLCbGiwbNNnedK3' width='800'>

<img src='https://drive.google.com/uc?id=1fc5esn73N3G8zpSXU9LannNTGmrVhGXs' width='800'>

---
title: Quiz 3
description:
duration: 60
card_type: quiz_card
---

# Question
What do you mean by a hard margin?

# Choices
- [x] The SVM allows no error in classification.
- [ ] The SVM allows some error in classification.
- [ ] The SVM allows high error in classification.

---
title: Soft Margin SVM
description:
duration: 200
card_type: cue_card
---

<img src='https://drive.google.com/uc?id=1pWeBbjWP3JIG5uD6yYT_I_gaEsH2xF-N' width='800'>

<img src='https://drive.google.com/uc?id=1XQZZ0LtnxBxSaW-Pi3nL1w_DUM0H09zm' width='800'>

Now our optimization problem becomes: $\max \frac{2}{||w||}$, i.e., maximize the margin while also minimizing the errors $\zeta_i$, because we're trying to get the best possible classification.

Can we think of another way to write this?

Taking the reciprocal of the objective above, we get the equivalent problem $\min \frac{||w||}{2}$, together with the $\zeta_i$'s.

<img src='https://drive.google.com/uc?id=1ReU2nrrZYrphbW-QEYeICUMZZrjh03pi' width='800'>
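To make the slack variables concrete before moving on, here is a minimal NumPy sketch. The hyperplane `w`, `b` and the three toy points are made up purely for illustration; it computes each point's slack with the standard soft-margin formula $\zeta_i = \max(0, 1 - y_i(w^T x_i + b))$:

```python=
import numpy as np

# Hypothetical hyperplane (w, b) and toy points -- illustration only
w = np.array([1.0, 1.0])
b = -3.0

X = np.array([[1.0, 1.0],    # -ve point exactly on pi-      -> zeta = 0
              [2.0, 2.0],    # -ve point on pi+ (wrong side) -> zeta = 2
              [1.5, 2.5]])   # +ve point exactly on pi+      -> zeta = 0
y = np.array([-1, -1, 1])

# zeta_i = max(0, 1 - y_i * (w . x_i + b)): zero when the margin
# constraint holds, positive by the amount of the violation otherwise
zeta = np.maximum(0, 1 - y * (X @ w + b))
print(zeta)   # [0. 2. 0.]
```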
---
title: Hyperparameters in SVM
description:
duration: 200
card_type: cue_card
---

<img src='https://drive.google.com/uc?id=1H09wg9b3P9vQJ5IFPAKR5ShSJhZ7jOck' width='800'>

<img src='https://drive.google.com/uc?id=11sfF-vASmq357iU7VVvzhYD21H6Dn6Wb' width='800'>

Therefore, we need to find a balance here.

---
title: Quiz 4
description:
duration: 60
card_type: quiz_card
---

# Question
What would happen if you used a very large value of C?

# Choices
- [x] We can still classify training data correctly for the given value of C.
- [ ] We cannot classify training data correctly for the given value of C.
- [ ] Can’t say for sure

---
title: Algebraic intuition behind SVM
description:
duration: 200
card_type: cue_card
---

<img src='https://drive.google.com/uc?id=1WaNSTM_ks_ztOfykq0ic4-Somgb6pX_c' width='800'>

---
title: Hinge loss
description:
duration: 200
card_type: cue_card
---

<img src='https://drive.google.com/uc?id=1WIuz_R4HWi9bZfNTq9SwfVZddOcI0Pd7' width='800'>

<img src='https://drive.google.com/uc?id=1AyhARksxIXfedCiyxvnaQZ1I42xL9bo-' width='800'>

<img src='https://drive.google.com/uc?id=1e4gGFbMGXggB_tgU2f1KO429oT1I_vNX' width='800'>

---
title: Quiz 5
description:
duration: 60
card_type: quiz_card
---

# Question
x1, x2, x3 are -ve datapoints at distances 0.2, 3.0, and 1.0 (in unit distances) below Ο€-. What will be their respective ΞΆi?

# Choices
- [ ] 0.8, -2.0, 0.0
- [x] 0.2, 3.0, 1.0
- [ ] 0.8, 2.0, 0.0

---
title: Conclude
description:
duration: 200
card_type: cue_card
---

<img src='https://drive.google.com/uc?id=11N7XqR65bSqV4gmCYpEWNm4QSC3MFxvl' width='800'>

---
title: Comparison with Log Loss
description:
duration: 200
card_type: cue_card
---

<img src='https://drive.google.com/uc?id=1QrO3vqgtp2_9S7efvTjTO3_co_522cTG' width='800'>

<img src='https://drive.google.com/uc?id=16I2Ife_Ep2gJr2SW9LTo_oda3v6blnUM' width='800'>

We will not be deriving how we get this equation.

---
title: Data Imbalance
description:
duration: 200
card_type: cue_card
---

<img src='https://drive.google.com/uc?id=1nu0yh8x8p09L3UPbiitQZ5H1KbS2qo1F' width='800'>

---
title: Quiz 6
description:
duration: 60
card_type: quiz_card
---

# Question
Will SVM be impacted if there's an imbalance in the number of datapoints belonging to each class?

# Choices
- [ ] True
- [x] False

---
title: Code
description:
duration: 200
card_type: cue_card
---

```python=
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import feature_extraction, model_selection, naive_bayes, metrics, svm
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

!gdown 1QViUZJ5UIBCgxB_qbOXTLs_2V48w7MWo

df = pd.read_csv('Spam_processed.csv', encoding='latin-1')
df.dropna(inplace=True)
print(df)
```

>Output
```
      type                                            message                                    cleaned_message
0        0  Go until jurong point, crazy.. Available only ...  go jurong point crazy available bugis n great ...
1        0                      Ok lar... Joking wif u oni...                            ok lar joking wif u oni
2        1  Free entry in 2 a wkly comp to win FA Cup fina...  free entry 2 wkly comp win fa cup final tkts 2...
3        0  U dun say so early hor... U c already then say...                u dun say early hor u c already say
4        0  Nah I don't think he goes to usf, he lives aro...         nah nt think goes usf lives around though
...    ...                                                ...                                                ...
5567     1  This is the 2nd time we have tried 2 contact u...  2nd time tried 2 contact u u Γ₯750 pound prize ...
5568     0             Will Ì_ b going to esplanade fr home?                       Γ¬_ b going esplanade fr home
5569     0  Pity, * was in mood for that. So...any other s...                              pity mood suggestions
5570     0  The guy did some bitching but I acted like i'd...  guy bitching acted like interested buying some...
5571     0                         Rofl. Its true to its name
```
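Before splitting the data, it is worth connecting back to the Data Imbalance card by checking how skewed the ham/spam labels actually are. This is a small added check (not one of the original cells) using standard pandas; `type` is 0 for ham and 1 for spam:

```python=
# How imbalanced are the classes? The skew seen here is
# why class_weight is passed to SVC further below.
print(df['type'].value_counts())
print(df['type'].value_counts(normalize=True))
```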
- Performing train-test split
- with [CountVectorization](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- and StandardScaler.

```python=
from sklearn.model_selection import train_test_split

df_X_train, df_X_test, y_train, y_test = train_test_split(df['cleaned_message'], df['type'], test_size=0.25, random_state=47)
print([np.shape(df_X_train), np.shape(df_X_test)])

# CountVectorizer
f = feature_extraction.text.CountVectorizer()
X_train = f.fit_transform(df_X_train)
X_test = f.transform(df_X_test)

# StandardScaler
scaler = StandardScaler(with_mean=False)  # with_mean=True would densify the sparse matrix
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print([np.shape(X_train), np.shape(X_test)])
print(type(X_train))
```

>Output
```
[(4173,), (1392,)]
[(4173, 7622), (1392, 7622)]
<class 'scipy.sparse._csr.csr_matrix'>
```

Let's train a Linear SVM on the given Spam/Ham data.

```python=
# SVC
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

params = {
    'C': [1e-4, 0.001, 0.01, 0.1, 1, 10]  # which value of C do you think will work well?
}
svc = SVC(class_weight={0: 0.1, 1: 0.5}, kernel='linear')
clf = GridSearchCV(svc, params, scoring="f1", cv=3)
clf.fit(X_train, y_train)
```

>Output
```
GridSearchCV(cv=3,
             estimator=SVC(class_weight={0: 0.1, 1: 0.5}, kernel='linear'),
             param_grid={'C': [0.0001, 0.001, 0.01, 0.1, 1, 10]},
             scoring='f1')
```

```python=
res = clf.cv_results_
for i in range(len(res["params"])):
    print(f"Parameters:{res['params'][i]} \n Mean score: {res['mean_test_score'][i]} \n Rank: {res['rank_test_score'][i]}")
```

>Output
```
Parameters:{'C': 0.0001} 
 Mean score: 0.6566305780023073 
 Rank: 6
Parameters:{'C': 0.001} 
 Mean score: 0.7742322485787693 
 Rank: 1
Parameters:{'C': 0.01} 
 Mean score: 0.767533370474547 
 Rank: 2
Parameters:{'C': 0.1} 
 Mean score: 0.7649416969151316 
 Rank: 3
Parameters:{'C': 1} 
 Mean score: 0.7649416969151316 
 Rank: 3
Parameters:{'C': 10} 
 Mean score: 0.7649416969151316 
 Rank: 3
```

As you can see,
- we get the best performance when $C=0.001$,
- with an F1 score of 0.77.

Now let's apply this SVM to the test data.

```python=
svc = SVC(C=0.001, class_weight={0: 0.1, 1: 0.5}, kernel='linear')
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print(metrics.f1_score(y_test, y_pred))
```

>Output
```
0.8835820895522388
```

The Linear SVM performs well
- on the Spam/Ham data
- with an F1 score of 0.88
- when using class weights.
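As a final sanity check, we can peek at which tokens the linear SVM weights most strongly toward the spam class. This is an added sketch, not one of the original cells; it assumes the fitted `svc` and the `CountVectorizer` `f` from above are still in scope, and a scikit-learn version that provides `get_feature_names_out`:

```python=
# For a binary linear SVC, svc.coef_ has shape (1, n_features);
# it may be sparse because the model was fit on sparse input.
coef = svc.coef_
weights = coef.toarray().ravel() if hasattr(coef, "toarray") else np.ravel(coef)

# Ten most spam-indicative tokens (largest positive weights first)
feature_names = f.get_feature_names_out()
for idx in np.argsort(weights)[::-1][:10]:
    print(f"{feature_names[idx]}: {weights[idx]:.4f}")
```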