---
tags: machine-learning
---

# Support Vector Machine

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/support-vector-machine/1.png" height="100%" width="70%">
</div>

> This is my personal notes taken for the course [Machine learning](https://www.coursera.org/learn/machine-learning#syllabus) by Stanford. Feel free to check the [assignments](https://github.com/3outeille/Coursera-Labs).
> Also, if you want to read my other notes, feel free to check them at my [blog](https://ferdinandmom.engineer/machine-learning/).

## I) Intuition

The goal of the SVM (Support Vector Machine) algorithm is to draw a line that divides your training data into positive and negative samples as well as possible, then classify new data by determining on which side of the hyperplane they lie.

Consider the following figure, in which x's represent positive training examples, o's denote negative training examples, a decision boundary (the line given by the equation $\theta^Tx=0$, also called the **separating hyperplane**) is shown, and three points have been labeled A, B and C.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/support-vector-machine/2.png">
</div>

Notice that the point A is very far from the decision boundary. If we are asked to make a prediction for the value of y at A, it seems we should be quite confident that $y = 1$ there. Conversely, the point C is very close to the decision boundary, and while it is on the side of the decision boundary on which we would predict $y = 1$, it seems likely that just a small change to the decision boundary could easily have caused our prediction to be $y = 0$. Hence, we are much more confident about our prediction at A than at C.

<ins>**Remark:**</ins> It seems that the **farther** we are **from the decision boundary**, the **more confident** we are in our **prediction**.

Thus, in order to find a hyperplane that divides our training data into 2 samples as well as possible, we need to make this hyperplane **as far as possible from each sample**. To do so, we will introduce the notion of **margin**.

## II) Notation

Let:
- **x** be a feature vector (i.e., the input of the SVM). $x \in \mathbb{R}^n$, where $n$ is the dimension of the feature vector.
- **y** be the class (i.e., the output of the SVM). $y \in \{-1,1\}$ (i.e., the classification task is binary).
- **w** and **b** be the parameters of the SVM: we need to learn them using the training set.
- ($x^{(i)}$, $y^{(i)}$) be the $i^{th}$ sample in the dataset. Let's assume we have $N$ samples in the training set.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/support-vector-machine/3.png" width="70%">
</div>
<br>

The class $y$ is determined as follows:

$$y^{(i)}=\left\{
\begin{array}{ll}
-1 & if \;\> \mathbf{w^T}\mathbf{x}^{(i)}+b \leq -1 \\
1 & if \;\> \mathbf{w^T}\mathbf{x}^{(i)}+b \ge 1 \\
\end{array}
\right.$$

which can be more concisely written as $y^{(i)} (\mathbf{w^T}\mathbf{x}^{(i)}+b) \ge 1$.

The margins are the hyperplanes with the following equations:

$$\left\{
\begin{array}{ll}
w \cdot x + b = 1 \; (same \;\> as \;\> w^Tx + b = 1) \\
w \cdot x + b = -1 \; (same \;\> as \;\> w^Tx + b = -1)
\end{array}
\right.$$

To make the decision boundary as far as possible from each sample, we need to **maximize the margin**.
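To make the margin condition concrete, here is a minimal NumPy sketch (with hand-picked, hypothetical values for `w`, `b` and the samples) that checks the constraint $y^{(i)}(w^Tx^{(i)}+b) \ge 1$ and computes each sample's distance to the decision boundary, $\frac{y^{(i)}(w^Tx^{(i)}+b)}{\|w\|}$:

```python
import numpy as np

# Hypothetical, hand-picked parameters and samples (for illustration only).
w = np.array([1.0, 1.0])   # normal vector of the separating hyperplane
b = -3.0                   # offset
X = np.array([[3.0, 3.0],  # positive sample, far from the boundary
              [2.5, 1.5],  # positive sample, exactly on the margin
              [0.5, 0.5]]) # negative sample
y = np.array([1, 1, -1])

# Functional margin: y * (w^T x + b); the constraint asks it to be >= 1.
functional_margin = y * (X @ w + b)

# Geometric margin: distance of each sample to the hyperplane w^T x + b = 0.
geometric_margin = functional_margin / np.linalg.norm(w)

for i in range(len(y)):
    print(f"sample {i}: y(w.x+b) = {functional_margin[i]:.2f}, "
          f"distance to boundary = {geometric_margin[i]:.2f}, "
          f"constraint satisfied: {functional_margin[i] >= 1}")
```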
## III) Goal

The SVM aims at satisfying 2 requirements:

1. The SVM should maximize the distance between the two decision boundaries (maximize the margin). Mathematically, that means we want to maximize the distance between the 2 hyperplanes defined by $w^Tx+b=-1$ and $w^Tx+b=1$. This distance is equal to $\frac{2}{\left \| w \right \|}$ [(Explanation of why $\frac{2}{\left \| w \right \|}$)](#1). This means we want to solve $max \frac{2}{\left \| w \right \|}$. Equivalently, we want to solve $min \frac{1}{2} \left \| w \right \|^2$ [(Explanation of why $min \frac{1}{2} \left \| w \right \|^2$)](#2).

2. The SVM should also correctly classify all $x^{(i)}$, which means: $$y^{(i)}(w^Tx^{(i)}+b) \geq 1, \forall i \in \{1,…,N\}$$

Which leads us to the following quadratic optimization problem:

$$ \boxed{
\begin{array}{ll}
\min_{\mathbf{w},b}\quad \frac{1}{2}\|w\|^2 \\
s.t.\quad y^{(i)} (w^Tx^{(i)}+b) \ge 1, \; \forall i \in \{1,\dots,N\}
\end{array}
}$$

This is the **hard-margin SVM**, as this quadratic optimization problem admits a solution only if the data is linearly separable.

One can relax the constraints by introducing so-called **slack variables** $\xi^{(i)}$. Note that each sample of the training set has its own slack variable. This gives us the following quadratic optimization problem:

$$ \boxed{
\begin{array}{ll}
\min_{\mathbf{w},b} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi^{(i)} \\
\\
s.t. \quad y^{(i)} (w^Tx^{(i)}+b) \ge 1 - \xi^{(i)}, & \forall i \in \{1,\dots,N\} \\
\quad \quad \; \xi^{(i)}\ge 0, & \forall i \in \{1,\dots,N\}
\end{array}
}$$

This is the **soft-margin SVM**. C is a hyperparameter called the **penalty of the error term**. [(Explanation of slack variables / C influence)](#3)

One can add even more flexibility by introducing a function $\phi$ that maps the original feature space to a higher-dimensional feature space. This allows non-linear decision boundaries (wiggly shapes). The quadratic optimization problem becomes:

$$\boxed{
\begin{array}{ll}
\min_{\mathbf{w},b}\quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi^{(i)} \\
\\
s.t.\quad y^{(i)} (w^T\phi \left(x^{(i)}\right) +b) \ge 1 - \xi^{(i)}, & \forall i \in \{1,\dots,N\} \\
\quad \quad \; \xi^{(i)}\ge0, & \forall i \in \{1,\dots,N\}
\end{array}
}$$
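As a quick illustration of the soft-margin objective, the sketch below (NumPy, with hypothetical `w`, `b` and data) computes, for a fixed $(w, b)$, the smallest slack $\xi^{(i)} = \max(0, 1 - y^{(i)}(w^Tx^{(i)}+b))$ that each constraint allows, then evaluates $\frac{1}{2}\|w\|^2 + C\sum_i \xi^{(i)}$. A real solver would of course also search over $(w, b)$:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Evaluate the soft-margin objective for a fixed (w, b).

    For a fixed (w, b), the smallest slack satisfying
    y_i (w.x_i + b) >= 1 - xi_i and xi_i >= 0 is
    xi_i = max(0, 1 - y_i (w.x_i + b)).
    """
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * np.sum(slack)

# Hypothetical toy data: the last point violates the margin, so it needs slack.
X = np.array([[2.0, 2.0], [3.0, 1.5], [0.0, 0.5], [1.2, 1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
b = -2.5

for C in (0.1, 1.0, 10.0):
    print(f"C = {C:>4}: objective = {soft_margin_objective(w, b, X, y, C):.3f}")
```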
## IV) Optimization

The quadratic optimization problem can be transformed into another optimization problem named the **Lagrangian dual problem**. [(Explanation of why using the Lagrangian dual problem)](#4)

<ins>**Case 1: Hard Margin SVM**</ins>

We define the Lagrangian $\mathcal{L}(w,b,\alpha)$ as follows:

$$\boxed{
\mathcal{L}(w,b,\alpha)=f(w)+\sum_{i=1}^N\alpha_ih_i(w,b)
}$$

with $\alpha_i$ the Lagrange multipliers.

The function for the **Primal** problem is:

$$\boxed{
\min_{w,b} \quad \mathcal{L}(w,b,\alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N}\alpha_i[y^{(i)} (w^Tx^{(i)}+b)-1]
}$$

In order to minimize it, we differentiate with respect to $w$ and $b$ and set the derivatives to 0, which gives us the function for the **Dual** problem. [(Explanation of how to get the Primal / Dual problem)](#5)

The function for the **Dual** problem (**Wolfe dual problem**) is:

$$\boxed{
\begin{array}{ll}
\max_\alpha \quad \mathcal{W}(\alpha) = \sum_{i=1}^{N}{\alpha_i} - \frac{1}{2}\sum_{i=1}^{N}{\sum_{j=1}^{N}{y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)^T} x^{(j)}}} \\
\text{s.t.} \quad \forall i: \alpha_i \ge 0 \\
\quad \quad \; \sum_{i=1}^{N}{y^{(i)} \alpha_i} = 0
\end{array}
}$$

[(Explanation of why bother with the dual problem + why maximize alpha)](#6)

<ins>**Case 2: Soft Margin SVM**</ins>

We define the Lagrangian $\mathcal{L}(w,b,\xi,\alpha,\beta)$ as follows:

$$\boxed{
\mathcal{L}(w,b,\xi,\alpha,\beta) = f(w,\xi) + \sum_{i=1}^N\alpha_ih_i(w,b,\xi) + \sum_{i=1}^N\beta_i g_i(\xi)
}$$

with $\alpha_i, \beta_i$ the Lagrange multipliers.

The function for the **Primal** problem is:

$$\boxed{
\min_{w,b,\xi} \quad \mathcal{L}(w,b,\xi,\alpha,\beta) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i[y^{(i)} (w^Tx^{(i)}+b)-1 + \xi_i] - \sum_{i=1}^{N}\beta_i\xi_i
}$$

In order to minimize it, we differentiate with respect to $w$, $b$ and $\xi$ and set the derivatives to 0, which gives us the function for the **Dual** problem.

The function for the **Dual** problem (**Wolfe dual problem**) is:

$$\boxed{
\begin{array}{ll}
\max_\alpha \quad \mathcal{W}(\alpha) = \sum_{i=1}^{N}{\alpha_i} - \frac{1}{2}\sum_{i=1}^{N}{\sum_{j=1}^{N}{y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)^T} x^{(j)}}} \\
\text{s.t.} \quad \forall i: 0 \leq \alpha_i \leq C \\
\quad \quad \; \sum_{i=1}^{N}{y^{(i)} \alpha_i} = 0
\end{array}
}$$

## V) Making a prediction

Once the $\alpha^{(i)}$ are learned, one can predict the class of a new sample with the feature vector $x^{(test)}$ as follows:

$$\boxed{
\begin{array}{ll}
y^{\text {test}}&=\text {sign}(w^{*T} x^{\text {test}}+b^*) \\
&= \text {sign}(\sum_{i =1}^{N}\alpha^{(i)}y^{(i)}(x^{(i)})^Tx^{\text {test}}+b^*) \\
&= \text {sign}(\sum_{i =1}^{N}\alpha^{(i)}y^{(i)} \left \langle x^{(i)}, x^{\text {test}} \right \rangle +b^*)
\end{array}
}$$

The samples $x^{(i)}$ whose $\alpha^{(i)} > 0$ are called **support vectors**.
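Here is a minimal NumPy sketch of this prediction rule, assuming the $\alpha^{(i)}$, $b^*$ and the training samples are already available (the values below are hypothetical and hand-picked, not the output of an actual solver):

```python
import numpy as np

def svm_predict(x_test, X_train, y_train, alpha, b):
    """Predict the class of x_test with sign(sum_i alpha_i y_i <x_i, x_test> + b)."""
    decision = np.sum(alpha * y_train * (X_train @ x_test)) + b
    return int(np.sign(decision))

# Hypothetical training samples, multipliers and bias (illustration only).
X_train = np.array([[2.0, 2.0], [1.0, 1.0]])
y_train = np.array([1, -1])
alpha   = np.array([0.5, 0.5])   # only support vectors have alpha > 0
b       = -1.5

print(svm_predict(np.array([3.0, 3.0]), X_train, y_train, alpha, b))  # prints  1
print(svm_predict(np.array([0.0, 0.0]), X_train, y_train, alpha, b))  # prints -1
```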
## VI) Kernel Trick

Now, the Soft Margin SVM can handle non-linearly separable data caused by noise. But what if the non-linear separability is not caused by noise? What if the data are characteristically non-linearly separable? Can we still separate the data using an SVM? The answer is yes, and we'll talk about a technique called the **kernel trick** to deal with this.

Imagine you have a two-dimensional, non-linearly separable dataset that you would like to classify using an SVM. It looks impossible because the data is not linearly separable. However, if we transform the two-dimensional data into a higher dimension, say three or even ten dimensions, we would be able to find a hyperplane that separates the data.

The problem is, if we have a large dataset containing, say, millions of examples, the transformation will take a long time to run, let alone the calculations in the later optimization problem.

Let's revisit the Wolfe dual problem (for the hard-margin SVM case):

$$ \boxed{
\begin{array}{ll}
\max_\alpha \quad \mathcal{W}(\alpha) = \sum_{i=1}^{N}{\alpha_i} - \frac{1}{2}\sum_{i=1}^{N}{\sum_{j=1}^{N}{y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)^T} x^{(j)}}} \\
\text{s.t.} \quad \forall i: \alpha_i \ge 0 \\
\quad \quad \; \sum_{i=1}^{N}{y^{(i)} \alpha_i} = 0
\end{array}
}$$

To solve this problem, we actually only care about the result of the dot product $\left \langle x^{(i)}, x^{(j)} \right \rangle$. If there were a function that could compute this dot product directly, with the same result as if we had first transformed the data into the higher dimension, it would be fantastic. This function is called a **kernel function**.

Let's denote by $\mathcal{K}(x^{(i)},x^{(j)})= \left \langle \phi(x^{(i)}), \phi(x^{(j)}) \right \rangle$ the **kernel function**. We rewrite the Wolfe dual problem:

$$\boxed{
\begin{array}{ll}
\max_\alpha \quad \mathcal{W}(\alpha) = \sum_{i=1}^{N}{\alpha_i} - \frac{1}{2}\sum_{i=1}^{N}{\sum_{j=1}^{N}{y^{(i)} y^{(j)} \alpha_i \alpha_j \mathcal{K}(x^{(i)}, x^{(j)})}} \\
\text{s.t.} \quad \forall i: \alpha_i \ge 0 \\
\quad \quad \; \sum_{i=1}^{N}{y^{(i)} \alpha_i} = 0
\end{array}
}$$

There are multiple kernels, such as:

- **Polynomial kernel (of degree $d$ and coefficient $r$)**: $$K_P(\mathbf{x}_i, \mathbf{x}_j) = (\langle \mathbf{x}_i, \mathbf{x}_j \rangle + r)^d$$

- **Gaussian kernel** (RBF kernel): $$K_{\text{Gauss}}(\mathbf{x}_i, \mathbf{x}_j) = e^{-\gamma\|\mathbf{x}_i - \mathbf{x}_j\|^2} = e^{-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2 \sigma^2}}$$

- **Sigmoid kernel** ($\gamma$ = how much influence single training examples have): $$K_{\text{tanh}}(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\gamma \langle \mathbf{x}_i, \mathbf{x}_j \rangle - r)$$
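A small sanity check of the kernel trick (NumPy, toy 2-D vectors): for the degree-2 polynomial kernel with $r = 1$, one explicit mapping is $\phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1x_2, \sqrt{2}x_1, \sqrt{2}x_2, 1)$, and $\langle\phi(x_i),\phi(x_j)\rangle$ gives exactly the same number as $(\langle x_i, x_j\rangle + 1)^2$, without ever building the 6-dimensional vectors:

```python
import numpy as np

def poly_kernel(x, z, r=1.0, d=2):
    """Polynomial kernel (<x, z> + r)^d, computed in the original 2-D space."""
    return (np.dot(x, z) + r) ** d

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel with r = 1 (2-D input)."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, -1.0])

# Both computations give the same value, but the kernel never builds phi(x).
print(poly_kernel(x_i, x_j))        # 4.0
print(np.dot(phi(x_i), phi(x_j)))   # 4.0
```

This is exactly why the dual formulation matters: only inner products between samples appear, so swapping them for a kernel gives a non-linear SVM without paying for the explicit transformation.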
## VII) Choosing parameters

Choosing C (recall that C = $\frac{1}{\lambda}$):
- If C is large, then we get higher variance / lower bias.
- If C is small, then we get lower variance / higher bias.

For the Polynomial kernel, a degree $d = 1$ is just the linear kernel (SVM without a kernel). A larger value of $d$ will make the decision boundary more complex and might result in overfitting.

The other parameter we must choose is $\gamma = \frac{1}{2\sigma^2}$ from the Gaussian kernel function:
- With a large $\gamma$, the SVM decision boundary will depend only on the points that are closest to it, effectively ignoring points that are farther away (high variance / low bias).
- With a small $\gamma$, the decision boundary will also take into account points that are farther from it (low variance / high bias).

## VIII) Logistic Regression vs SVM

Let's denote by $n$ the number of features and by $m$ the number of training examples.

- If $n$ is large (relative to $m$), then use logistic regression, or SVM without a kernel (the "linear kernel"). (E.g. $n \geq m$, $n = 10000$, $m = 10 \sim 1000$.) We don't have enough examples to use a complex polynomial hypothesis.
- If $n$ is small and $m$ is intermediate, then use SVM with a Gaussian kernel. (E.g. $n = 1 \sim 1000$, $m = 10 \sim 10000$.) We have enough examples that we may need a complex non-linear hypothesis.
- If $n$ is small and $m$ is large, then manually create/add more features, then use logistic regression or SVM without a kernel. (E.g. $n = 1 \sim 1000$, $m = 50000+$.) We want to increase our features so that logistic regression becomes applicable.

## IX) Bonus

- <a id="1">**Explanation of why $\frac{2}{\left \| w \right \|}$:**</a>

Let $x_0$ be a point on the hyperplane $w^Tx + b = -1$. To measure the distance between $w^Tx + b = -1$ and $w^Tx + b = 1$, we only need to compute the perpendicular distance from $x_0$ to the plane $w^Tx + b = 1$, denoted $m$. The unit vector perpendicular to these hyperplanes is $\frac{w}{\|w\|}$.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/support-vector-machine/4.png">
</div>
<br>

Thus, we have:

$$w^T(x_0 + m\frac{w}{\|w\|}) + b = 1$$

Expanding this equation, we have:

$$
\begin{array}{ll}
&w^Tx_0 + m\frac{w^Tw}{\|w\|} + b = 1 \\
&w^Tx_0 + m\frac{\|w\|^2}{\|w\|} + b = 1 \\
&w^Tx_0 + b + m\|w\| = 1 \\
&-1 + m\|w\| = 1 \\
&\boxed{m = \frac{2}{\|w\|}}
\end{array}
$$

<ins>**Remark:**</ins> $w^Tw = \left \langle w, w \right \rangle$ and $\left \langle w, w \right \rangle = \|w\| \cdot \|w\| \cdot \cos(0) = \|w\|^2$.

---

- <a id="2">**Explanation of why $min \frac{1}{2} \left \| w \right \|^2$:**</a>

$max \frac{2}{\|w\|} \sim max \frac{1}{\|w\|} \sim min \|w\| \sim min \frac{1}{2}\|w\|^2$ (we square the norm to get rid of the square root, and the factor $\frac{1}{2}$ is just for convenience when differentiating).

---

- <a id="3">**Explanation of slack variables / C influence:**</a>

Even if the underlying process which generates the features for the two classes is linearly separable, noise can make the data non-separable. The introduction of slack variables to relax the requirement of linear separability solves this problem. The trade-off between accepting some errors and a more complex model is weighted by the parameter C, which gives you control over how the SVM handles errors. If we set C to positive infinity, we get the same result as the Hard Margin SVM. However, if we set C to 0, the slack is no longer penalized, and we end up with a hyperplane that does not classify anything.

The rules of thumb are (see the small experiment sketched below):
- **small values of C** will result in a **wider margin**, at the cost of some misclassifications;
- **large values of C** will give you the **Hard Margin classifier**, which tolerates **zero constraint violation**.

We need to find a value of C which will not make the solution be impacted by the noisy data.
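A quick experiment along these lines (a sketch assuming scikit-learn is installed; the data and the C values are made up): fit a linear SVM with a small and a large C and report the margin width $\frac{2}{\|w\|}$ and the number of support vectors. The small-C model typically ends up with the wider margin:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical 2-D data: two slightly overlapping Gaussian blobs.
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.8, size=(40, 2))
X_neg = rng.normal(loc=[0.0, 0.0], scale=0.8, size=(40, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 40 + [-1] * 40)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin hyperplanes
    print(f"C = {C:>6}: margin width = {margin_width:.2f}, "
          f"support vectors = {len(clf.support_)}")
```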
---

- <a id="4">**Explanation of why using the Lagrangian dual problem:**</a>

Our goal was to solve this quadratic optimization problem:

$$\boxed{
\begin{array}{ll}
\min_{\mathbf{w},b}\quad \frac{1}{2}\|w\|^2 \\
s.t.\quad y^{(i)} (w^Tx^{(i)}+b) \ge 1, \; \forall i \in \{1,\dots,N\}
\end{array}
}$$

In order to find the solution of an optimization problem with constraints, we use **Lagrange multipliers**. They enable us to build a new expression that we can minimize/maximize without having to think about the constraints. Since our goal was to minimize $\frac{1}{2}\|w\|^2$, we have to minimize the function of the primal problem.

---

- <a id="5">**Explanation of how to get the Primal / Dual problem (and how to get $b$):**</a>

Lagrange stated that if we want to find the minimum of $f$ under an inequality constraint $g$, we just need to solve:

$$\begin{array}{ll}
\nabla \mathcal{L}(x^*) = \nabla f(x^*)-\alpha \nabla g(x^*)=0 \\
\\
\; with \> \alpha, \> the \> Lagrange \> multipliers
\end{array}$$

such that $x^*$ (an extremum of the optimization problem) satisfies all the **KKT conditions**.

<ins>**KKT Conditions**</ins>
1. Primal Feasibility: $g(x^*) \leq 0$
2. Dual Feasibility: $\alpha \geq 0$
3. Complementary Slackness: $\alpha g(x^*)=0$
4. Lagrangian Stationarity: $\nabla f(x^*)-\alpha \nabla g(x^*)=0$

Initially, we have this:

$$\boxed{
\begin{array}{ll}
\min_{w,b}\quad \frac{1}{2}\|w\|^2 \\
s.t.\quad y^{(i)} (w^Tx^{(i)}+b) \ge 1, \; \forall i \in \{1,\dots,N\}
\end{array}
}$$

By applying the **Primal Feasibility**, we have the **Primal problem**:

$$\boxed{
\min_{w,b} \quad \mathcal{L}(w,b,\alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N}\alpha_i[y^{(i)} (w^Tx^{(i)}+b)-1]
}$$

By applying the **Lagrangian Stationarity** condition, we have:

$$\begin{array}{ll}
\nabla_w \mathcal{L}(w,b,\alpha) = w - \sum_{i=1}^{N}{\alpha_iy^{(i)}x^{(i)}}=0 \implies \boxed{w = \sum_{i=1}^{N}{\alpha_iy^{(i)}x^{(i)}}} \\
\nabla_b \mathcal{L}(w,b,\alpha) = -\sum_{i=1}^{N}{\alpha_iy^{(i)}}=0 \implies \boxed{\sum_{i=1}^{N}{\alpha_iy^{(i)}} = 0}
\end{array}$$

Substituting these back into the Lagrangian function $\mathcal{L}$, we get the **Wolfe dual problem**:

$$\boxed{
\mathcal{W}(\alpha) = \sum_{i=1}^{N}{\alpha_i} - \frac{1}{2}\sum_{i=1}^{N}{\sum_{j=1}^{N}{y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)^T} x^{(j)}}}
}$$

By applying the **Dual Feasibility**, we thus have:

$$\boxed{
\begin{array}{ll}
\max_\alpha \quad \mathcal{W}(\alpha) = \sum_{i=1}^{N}{\alpha_i} - \frac{1}{2}\sum_{i=1}^{N}{\sum_{j=1}^{N}{y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)^T} x^{(j)}}} \\
\\
\text{s.t.} \quad \forall i: \alpha_i \ge 0 \\
\quad \quad \; \sum_{i=1}^{N}{y^{(i)} \alpha_i} = 0
\end{array}
}$$

After finding the $\alpha$ that maximizes the function $\mathcal{W}$, we can plug it into $w^* = \sum_{i=1}^{N}{\alpha_iy^{(i)}x^{(i)}}$ to find $w^*$. $w$ tells us the direction of the hyperplane. See figure below.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/support-vector-machine/5.png">
</div>
<br>

Now, we need to find $b^*$ because it tells us the position of the hyperplane. To do so, we apply the **Complementary Slackness** condition. For any support vector (i.e. any $i$ with $\alpha_i > 0$), the constraint is active:

$$\begin{array}{ll}
&\alpha_i \left [y^{(i)} (w^{*T} x^{(i)}+b^*)-1 \right] = 0 \\
&y^{(i)} (w^{*T} x^{(i)}+b^*) = 1 \\
&(y^{(i)})^2 (w^{*T}x^{(i)}+b^*) = y^{(i)} \\
&w^{*T}x^{(i)}+b^* = y^{(i)} \\
&\boxed{b^* = \frac{1}{S}\sum_{i=1}^S(y^{(i)}-w^{*T}x^{(i)})}
\end{array}$$

<ins>**Remark:**</ins> $(y^{(i)})^2 = 1$ because $y^{(i)} \in \{-1, 1\}$. $S$ is the number of support vectors; in the last line we average over all support vectors to get a numerically stable value of $b^*$.

Another way is to first find the worst (closest) positive example ($min_{i:y^{(i)}=1} \> w^{*T}x^{(i)}$) and the worst (closest) negative example ($max_{i:y^{(i)}=-1} \> w^{*T}x^{(i)}$). The decision boundary must sit halfway between them, so $b^*$ is the negative of their average:

$$\boxed{b^* = -\frac{max_{i:y^{(i)}=-1} \> w^{*T}x^{(i)} + min_{i:y^{(i)}=1} \> w^{*T}x^{(i)}}{2}}$$

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/support-vector-machine/6.png">
</div>

---

- <a id="6">**Explanation of why bother with the dual problem + why maximize alpha:**</a>

We bother with the dual problem because it gives us an expression that depends only on the $\alpha^{(i)}$, and we know that most of the $\alpha^{(i)}$ are 0, except those of the support vectors ($\alpha^{(i)} > 0$). This term is therefore **very efficient** to calculate when there are only a few support vectors. Further, since we now have a scalar product involving only data vectors, we may apply the **kernel trick**.

By maximizing over $\alpha$ in the dual problem, we are able to minimize the primal problem with respect to $w$ and $b$.
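To see these formulas in action, here is a sketch (assuming scikit-learn is available; the data is made up) that fits a linear SVM, then rebuilds $w^* = \sum_i \alpha_i y^{(i)} x^{(i)}$ and $b^*$ from the support vectors and compares them with the values the library reports. Note that scikit-learn's `dual_coef_` already stores the products $\alpha_i y^{(i)}$ of the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2-D data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

# dual_coef_[0] holds alpha_i * y_i for each support vector.
alpha_times_y = clf.dual_coef_[0]
support_vectors = clf.support_vectors_

# w* = sum_i alpha_i y_i x_i (only support vectors contribute).
w_star = alpha_times_y @ support_vectors

# b* = average over support vectors of (y_i - w*.x_i).
y_sv = y[clf.support_]
b_star = np.mean(y_sv - support_vectors @ w_star)

print("reconstructed w*:", w_star, "  sklearn coef_:     ", clf.coef_[0])
print("reconstructed b*:", b_star, "  sklearn intercept_:", clf.intercept_[0])
```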