# Support Vector Machine(SVM)
###### tags: `Machine Learning` `SVM` `SVR` `林軒田`
## Introduction
In the perceptron learning algorithm (PLA), we update the model using the error of each misclassified sample.

By doing so, we can find a hyperplane that separates all samples. However, this algorithm only changes the slope slightly whenever an error occurs, so the resulting hyperplane might not be the best one.

The three results above are all possible outputs of PLA. Generally, we prefer the rightmost one as our model because of the larger distance between the data and the separating line.

We can consider that the data have a little variance resulting from noise. For the leftmost model, if the data shift a little, the model's prediction may be wrong. In contrast, the rightmost model has a very large margin to tolerate the shift. In other words, it can tolerate more noise and has better robustness.
## SVM Introduction
So, we try to make the margin of the hyperplane as large as possible.
$$
\max_w{margin(w)}\\
subject~to~w~classifies~every~(x_n, y_n)~correctly\\
where~the~margin(w)=min_{n=1,...,N}{distance(x_n,w)}
$$
The margin is the distance between the separating line and the closest samples.

Next, we need to take care of $distance(x_n,w)$. The distance can be calculated by projection. Consider $x',x''$ as points on the hyperplane $w^Tx'+b=0$. Now, we have a new sample $x$. The distance between $\mathbf{x}$ and the hyperplane is
$$
distance(\mathbf{x},b,w)=|\frac{w^T}{||w||}(\mathbf{x}-x')|,~where~w~is~the~normal~vector~of~hyperplane\\
=\frac{1}{||w||}|w^T\mathbf{x}+b|
$$
For binary classification, $y_n$ is +1 or -1, so the correctness of the classification can be replaced by
$$
every~y_nw^Tx_n>0
$$
which means the predicted output $w^Tx_n$ and the real output $y_n$ have the same sign.
However, the normal vector $w$ could belong to an infinite number of hyperplanes. For SVM, we want to find a specific hyperplane $w^Tx+b=0$ that separates the data correctly. So, the constraint can be further replaced by
$$
y_n(w^Tx_n+b)>0
$$
Now, the optimization problem becomes
$$
\max_{b,w}{\min_{n=1,...,N}\frac{1}{||w||}|w^Tx_n+b|}\\
subject~to~y_n(w^Tx_n+b)>0\\
$$
However, the distance, $\frac{1}{||w||}|w^Tx_n+b|$, is hard to optimize directly, so we replace it by
$$
distance(x_n,b,w)=\frac{1}{||w||}y_n(w^Tx_n+b)
$$
Because $y_n=\pm1$ and $y_n(w^Tx_n+b)>0$, we can remove the absolute value operator.
Next, we want to further simplify the objective function. Because scaling $(b,w)$ does not change the hyperplane, we can choose a special scaling factor which makes
$$
\min_{n=1,...,N}{y_n(w^Tx_n+b)}=1
$$
Therefore, the optimization problem becomes
$$
\max_w{\frac{1}{||w||}}\\
subject~to~\min_{n=1,...,N}{y_n(w^Tx_n+b)}=1~or~y_n(w^Tx_n+b)\geq1
$$
Finally, since maximizing $\frac{1}{||w||}$ is equivalent to minimizing $\frac{1}{2}w^Tw$, the problem becomes
$$
\min_w{\frac{1}{2}w^Tw}\\
subject~to~y_n(w^Tx_n+b)\geq1
$$
## SVM solving
The optimization problem above is already solvable in this form, and quadratic programming (QP) is a convenient tool for solving it. The standard form of QP is
$$
optimal~u~by~QP(Q,\mathbf{p},\mathbf{A},\mathbf{c})\\
\min_u{\frac{1}{2}\mathbf{u}^TQ\mathbf{u}}+\mathbf{p}^T\mathbf{u}\\
subject~to~\mathbf{a}_m^T\mathbf{u}\geq{c_m},~for~m=1,2,...,M
$$
[reference]: https://www.mathworks.com/help/optim/ug/quadprog.html "Matlab Quadratic Programming"
Comparing with SVM,
$$
u=\begin{bmatrix} b\\\mathbf{w} \end{bmatrix}\\
Q=\begin{bmatrix}
0 & \mathbf{0}_d^T \\
\mathbf{0}_d & \mathbf{I}_d
\end{bmatrix};
\mathbf{p}=\mathbf{0}_{d+1};\mathbf{a}_n^T=y_n\begin{bmatrix} 1 & \mathbf{x}_n^T \end{bmatrix};c_n=1
$$
Here, $d$ is the dimension of the samples and $n=1,...,N$ indexes the samples.
Feeding those parameters into a QP solver, $\begin{bmatrix}b\\\mathbf{w}\end{bmatrix}$ will be returned.
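Below is a minimal sketch of feeding this mapping to a generic QP solver, assuming the `cvxopt` package (not part of the original note). `cvxopt.solvers.qp` expects constraints in the form $G\mathbf{u}\leq h$, so each constraint $y_n(\mathbf{w}^T\mathbf{x}_n+b)\geq 1$ is rewritten as $-y_n\begin{bmatrix}1 & \mathbf{x}_n^T\end{bmatrix}\mathbf{u}\leq -1$.
```python
import numpy as np
from cvxopt import matrix, solvers  # assumption: cvxopt is installed

def hard_margin_svm_primal(X, y):
    """Primal hard-margin SVM via QP; X is (N, d), y contains +1/-1 labels.

    Returns (b, w). Assumes the data are linearly separable.
    """
    N, d = X.shape
    Q = np.zeros((d + 1, d + 1))
    Q[1:, 1:] = np.eye(d)          # quadratic term acts on w only, not on b
    p = np.zeros(d + 1)
    # y_n * [1, x_n^T] u >= 1  rewritten as  -y_n * [1, x_n^T] u <= -1
    A = y[:, None] * np.hstack([np.ones((N, 1)), X])
    sol = solvers.qp(matrix(Q), matrix(p), matrix(-A), matrix(-np.ones(N)))
    u = np.array(sol['x']).ravel()  # u = [b, w]
    return u[0], u[1:]
```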
:::warning
However, if the dimension of the samples, $d$, is very large or even infinite, solving this QP costs a lot of time. Is it possible to make SVM independent of $d$?
:::
We simplify the optimization problem by introducing Lagrange multipliers, which turn the constrained problem into an unconstrained one.
$$
\mathcal{L}(b,\mathbf{w},\alpha)=\frac{1}{2}\mathbf{w}^T\mathbf{w}+\sum_{n=1}^N{\alpha_n(1-y_n(\mathbf{w}^T\mathbf{x_n}+b))}
$$
Therefore, SVM becomes
$$
SVM\equiv\min_{b,w}(\max_{all~\alpha_n\geq0}\mathcal{L}(b,\mathbf{w},\mathbf{\alpha}))
$$
:::info
### What is the Lagrange multiplier $\alpha$, and how do we interpret its physical meaning?
For an objective function $f(x,y)$ with constraint $g(x,y)\geq c$, the optimum occurs where $\nabla f=\alpha \nabla g$ for some multiplier $\alpha$, which means the two functions share the same normal (gradient) direction at that point. (The sign convention of $\alpha$ depends on whether we minimize or maximize $f$ and on how the Lagrangian is written.)

From the above equation,
$$
\frac{\partial f}{\partial x}=\alpha \frac{\partial g}{\partial x}\\
\frac{\partial f}{\partial y}=\alpha \frac{\partial g}{\partial y}
$$
Now, we want to know how fast the optimal objective value changes with respect to the constraint level $c$. Along the constraint, $x$ and $y$ are functions of $c$, so
$$
\frac{\partial f}{\partial c}=\frac{\partial f}{\partial x}\frac{\partial x}{\partial c}+\frac{\partial f}{\partial y}\frac{\partial y}{\partial c}\\
=\alpha \frac{\partial g}{\partial x}\frac{\partial x}{\partial c}+\alpha \frac{\partial g}{\partial y}\frac{\partial y}{\partial c}
$$
Finally,
$$
\frac{\partial f}{\partial c}=\alpha \frac{\partial g}{\partial c}=\alpha,~since~g=c~on~the~constraint~and~so~\frac{\partial g}{\partial c}=1\\
\alpha=\frac{\partial f}{\partial c}=\frac{\partial f}{\partial g}
$$
The Lagrange multiplier measures how much the objective function changes when the constraint level changes by one unit.
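As a quick check with a made-up example: minimize $f(x,y)=x^2+y^2$ subject to $x+y=c$.
$$
\nabla f=\alpha\nabla g\Rightarrow(2x,2y)=\alpha(1,1)\Rightarrow x=y=\frac{c}{2},~\alpha=c\\
f^*(c)=\frac{c^2}{2}\Rightarrow\frac{\partial f^*}{\partial c}=c=\alpha
$$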
[reference]: https://www.youtube.com/watch?v=q46q7TqXMQM "Interpreting the Lagrange multiplier"
[reference]: https://www.youtube.com/watch?v=8mjcnxGMwFo "Lagrange multipliers"
:::
From the Lagrange duality,
$$
\min_{b,w}(\max_{all~\alpha_n\geq0}\mathcal{L}(b,\mathbf{w},\mathbf{\alpha}))\geq \max_{all~\alpha_n\geq0}(\min_{b,w}\mathcal{L}(b,\mathbf{w},\mathbf{\alpha}))
$$
The right-hand side is the Lagrange dual problem. Equality holds when the optimization problem has strong duality:
- convex primal
- feasible primal(true if samples are separable)
- linear constraints
So, we can solve the right-hand-side problem instead. By doing this, we can eliminate $b,\mathbf{w}$ using the KKT conditions and solve an optimization problem in $\alpha$ only.
:::info
### Karush-Kuhn-Tucker (KKT) conditions
For a constrained optimization problem
$$
\min_{\mathbf{x}}f(\mathbf{x})\\
subject~to~g(\mathbf{x})\le0
$$
We can form a Lagrange equation
$$
\mathcal{L}(\mathbf{x},\alpha)=f(\mathbf{x})+\alpha g(\mathbf{x})
$$
and then we take the partial derivatives of $\mathcal{L}$ with respect to $\mathbf{x}$ and $\alpha$.
$$
\nabla_\mathbf{x}\mathcal{L}=\frac{\partial \mathcal{L}}{\partial \mathbf{x}}=\nabla f+\alpha \nabla g=0\\
\nabla_\alpha \mathcal{L}=\frac{\partial \mathcal{L}}{\partial \alpha}=g(\mathbf{x})=0
$$
The second equation recovers the constraint. Now, we focus on the first equation.
If an optimal solution $\mathbf{x}^*$ satisfies the constraint, there are two situations.
1. Internal solution: the constraint is not active, so the constrained problem becomes an unconstrained one. Therefore, the optimal $\mathbf{x}^*$ satisfies $\nabla f=0$ and $\alpha =0$.
2. Boundary solution: the constraint is active, which means $g(\mathbf{x}^*)=0$. There exists an $\alpha$ such that $\nabla f=-\alpha \nabla g$.
:bulb: The negative sign before $\alpha$ is meaningful. We try to minimize $f$, so $\nabla f$ should point toward the interior of the feasible region $K=\{ \mathbf{x} \in \mathbb{R}^d |g(\mathbf{x})\le0 \}$ (otherwise we could decrease $f$ by moving inward), while $\nabla g$ points toward the exterior of $K$, the $g(\mathbf{x})>0$ region. So $\alpha \geq 0$.
In conclusion, for either an internal or a boundary solution, $\alpha g(\mathbf{x})=0$ holds.
$$
\nabla_x{\mathcal{L}}=\nabla f+\alpha \nabla g=0~~~(stationarity)\\
g(\mathbf{x})\le 0~~~(primal~feasibility)\\
\alpha \geq 0~~~(dual~feasibility)\\
\alpha g(\mathbf{x})=0~~~(complementary~slackness)
$$
The four conditions above are called the KKT conditions. If we want to maximize $f(\mathbf{x})$ subject to $g(\mathbf{x})\le 0$, the dual feasibility condition changes to $\alpha \le 0$.
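A one-dimensional example (made up here) to see the four conditions in action: minimize $f(x)=x^2$ subject to $g(x)=1-x\le 0$, i.e. $x\geq 1$. The unconstrained minimum $x=0$ is infeasible, so the solution lies on the boundary:
$$
\nabla_x\mathcal{L}=2x-\alpha=0\\
x^*=1,~\alpha=2\geq0,~g(x^*)=0,~\alpha g(x^*)=0
$$
All four conditions hold.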
[reference]: https://ccjou.wordpress.com/2017/02/07/karush-kuhn-tucker-kkt-%E6%A2%9D%E4%BB%B6/ "KKT conditions"
:::
For SVM, after applying the KKT conditions:
$$
primal~feasible:y_n(\mathbf{w}^T\mathbf{x}_n +b)\geq 1\\
dual~feasible:\alpha_n \geq0\\
dual-inner~optimal:\sum y_n\alpha_n =0~(from~\frac{\partial \mathcal{L}}{\partial b}=0);\mathbf{w}=\sum \alpha_n y_n\mathbf{x}_n~(from~\frac{\partial \mathcal{L}}{\partial w_i}=0)\\
primal-inner~optimal:\alpha_n(1-y_n(\mathbf{w}^T \mathbf{x}_n +b))=0
$$
The original SVM optimization problem
$$
\max_{all~\alpha_n\geq0}(\min_{b,w}\frac{1}{2}\mathbf{w}^T\mathbf{w}+\sum_{n=1}^N{\alpha_n(1-y_n(\mathbf{w}^T\mathbf{x_n}+b))})
$$
becomes
$$
\max_{all~\alpha_n \geq 0,\sum y_n\alpha_n=0,\mathbf{w}=\sum \alpha_n y_n \mathbf{x}_n}-\frac{1}{2}||\sum_{n=1}^N \alpha_n y_n\mathbf{x}_n||^2+\sum_{n=1}^N \alpha_n
$$
This is now an optimization problem in the single variable $\alpha$ (the inner minimization over $b,\mathbf{w}$ has been eliminated). The optimal $b, \mathbf{w}$ can be recovered from the optimal $\alpha$. Rewriting it as a standard minimization:
$$
\min_\alpha \frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m \mathbf{x}_n^T \mathbf{x}_m-\sum_{n=1}^N \alpha_n\\
subject~to~\sum_{n=1}^N y_n \alpha_n=0;\alpha_n \geq0
$$
:::warning
Do you remember why we transferred to the Lagrange dual problem and studied the KKT conditions? At the beginning, we solved SVM by QP with $d+1$ variables ($\mathbf{w}, b$) and $N$ constraints (one per sample point). By solving the dual problem, the numbers of variables and constraints become $N$ and $N+1$, respectively.
However, are we really independent of the input dimension, $d$?
Actually, the most expensive computation is the inner products in $\alpha_n \alpha_m y_n y_m \mathbf{x}_n^T \mathbf{x}_m$, which still depend on $d$.
:::
Applying QP to optimize $\alpha$:
$$
\min_\alpha \frac{1}{2}\alpha^TQ\alpha+\mathbf{p}^T\alpha\\
subject~to~\mathbf{a}_i^T \alpha \geq c_i\\
q_{n,m}=y_ny_m\mathbf{x}_n^T \mathbf{x}_m\\
\mathbf{p}=-\mathbf{1}_{N}\\
\mathbf{a}_{\geq}=y,\mathbf{a}_{\le}=-y(from~\sum_{n=1}^N y_n \alpha_n=0)\\
c_n=0
$$
After we obtain the optimal $\alpha$,
$$
\mathbf{w}=\sum \alpha_n y_n \mathbf{x}_n\\
b=y_n-\mathbf{w}^T \mathbf{x}_n~for~one~\alpha_n>0
$$
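A minimal sketch of the dual QP above and the recovery of $b,\mathbf{w}$, again assuming `cvxopt` (the equality constraint $\sum y_n\alpha_n=0$ is passed directly to the solver instead of as two inequalities):
```python
import numpy as np
from cvxopt import matrix, solvers  # assumption: cvxopt is installed

def hard_margin_svm_dual(X, y):
    """Dual hard-margin SVM; returns (b, w, alpha)."""
    N, d = X.shape
    y = y.astype(float)
    Yx = y[:, None] * X
    Q = Yx @ Yx.T                                    # q_{n,m} = y_n y_m x_n^T x_m
    p = -np.ones(N)
    G, h = matrix(-np.eye(N)), matrix(np.zeros(N))   # alpha_n >= 0
    A, b_eq = matrix(y.reshape(1, -1)), matrix(0.0)  # sum_n y_n alpha_n = 0
    sol = solvers.qp(matrix(Q), matrix(p), G, h, A, b_eq)
    alpha = np.array(sol['x']).ravel()
    w = (alpha * y) @ X                              # w = sum_n alpha_n y_n x_n
    s = int(np.argmax(alpha))                        # a support vector: alpha_s > 0
    b = y[s] - w @ X[s]
    return b, w, alpha
```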
Here, we can observe that $\mathbf{w}$ is a linear combination of the input data. Also, $\mathbf{w},b$ depend only on the points on the boundary (where the constraint is active, i.e. $\alpha_n>0$), which means the other points do not matter when computing the SVM model. The points with $\alpha_n >0$ are called support vectors.
## Kernel Trick
As mentioned before, QP can help us solve the optimization problem once we transform it into the standard form. Besides the original space $\mathbb{R}^d$, we can apply a linear or nonlinear mapping, $\mathbf{z}=\Phi(\mathbf{x})$, into another space, $\mathbb{R}^{\widetilde d}$.
### Polynomial kernel
For the 2nd-order polynomial transform, $\Phi_2(\mathbf{x})$, taken over all ordered pairs $x_ix_j$,
$$
\Phi_2(\mathbf{x})=(1, x_1, x_2, ..., x_d, x_1^2, x_1x_2, ...,x_1x_d, x_2x_1, x_2^2, ..., x_d^2)
$$
The kernel function, $K_{\Phi}(\mathbf{x},\mathbf{x}')$, will be
$$
K_{\Phi}(\mathbf{x},\mathbf{x}')=\Phi_2(\mathbf{x})^T\Phi_2(\mathbf{x'})=1+\sum_{i=1}^dx_ix_i'+\sum_{i=1}^d \sum_{j=1}^d x_ix_jx_i'x_j'\\
=1+\sum_{i=1}^dx_ix_i'+\sum_{i=1}^dx_ix_i'\sum_{j=1}^dx_jx_j'\\
=1+\mathbf{x}^T\mathbf{x}'+(\mathbf{x}^T\mathbf{x}')(\mathbf{x}^T\mathbf{x}')
$$
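This identity (with the transform over all ordered pairs $x_ix_j$, so the double sum is exact) can be checked numerically; a small numpy sketch:
```python
import numpy as np

def phi2(x):
    """2nd-order transform: constant, linear terms, and all ordered pairs x_i*x_j."""
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

def k2(x, xp):
    """The same quantity computed directly in the original space."""
    return 1.0 + x @ xp + (x @ xp) ** 2

rng = np.random.default_rng(0)
x, xp = rng.normal(size=5), rng.normal(size=5)
print(np.isclose(phi2(x) @ phi2(xp), k2(x, xp)))  # True: O(d^2) transform vs O(d) kernel
```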
So, the kernel combines the transformation and the inner product into one operation. Back to SVM: after the transformation,
$$
q_{n,m}=y_ny_m\mathbf{z}_n^T \mathbf{z}_m=y_ny_mK(\mathbf{x}_n, \mathbf{x}_m)\\
b=y_s-\mathbf{w}^T\mathbf{z}_s=y_s-(\sum_{n=1}^N \alpha_ny_n\mathbf{z}_n)^T\mathbf{z}_s=y_s-\sum_{n=1}^N \alpha_ny_nK(\mathbf{x}_n,\mathbf{x}_s),\\
~where~(\mathbf{x}_s,y_s)~is~a~support~vector~(SV)\\
optimal~hypothesis~g_{SVM}(\mathbf{x})=sign(\mathbf{w}^T\Phi(\mathbf{x})+b)=sign(\sum_{n=1}^N\alpha_ny_nK(\mathbf{x}_n,\mathbf{x})+b)
$$
:::warning
The kernel trick is a way to avoid the dependence on $\widetilde d$ by using an efficient kernel function. The original operation needs $O(d^2)$ for the transformation and inner product, but after simplification, $\Phi_2^T\Phi_2$ can be computed from $\mathbf{x}^T\mathbf{x}'$, which has complexity $O(d)$.
:::
The transformation can be generalized to obtain a more compact form:
$$
\Phi_2(\mathbf{x})=(1,\sqrt{2\gamma}x_1, ..., \sqrt{2\gamma}x_d, \gamma x_1^2, ..., \gamma x_d^2)\\
\Updownarrow\\
K_2(\mathbf{x},\mathbf{x}')=1+2\gamma \mathbf{x}^T\mathbf{x}'+\gamma^2(\mathbf{x}^T\mathbf{x}')^2
$$
After scaling, not only is the geometry of the SVM boundary different, but the set of support vectors may also change.

The kernel can be extended to higher orders:
$$
K_Q(\mathbf{x},\mathbf{x}')=(\xi+\gamma \mathbf{x}^T\mathbf{x}')^Q~with~\gamma>0,\xi\geq0
$$
:::warning
:+1: Less restricted
:+1: strong physical control by degree $Q$
:-1: Numerical difficulty resulting from large $Q$:
$$
|\xi+\gamma \mathbf{x}^T\mathbf{x}'|<1\Rightarrow K\rightarrow0\\
|\xi+\gamma \mathbf{x}^T\mathbf{x}'|>1\Rightarrow K\rightarrow \infty
$$
:-1: Difficult to choose the parameters $(\gamma, \xi, Q)$
:::
:::info
Special case of polynomial kernel : **linear kernel**
$$
K_1(\mathbf{x},\mathbf{x}')=(0+1\cdot\mathbf{x}^T\mathbf{x}')^1
$$
which is just the usual inner product.
:+1: Safe, fast and explainable
:-1: Not always separable
:bulb: **Use simple model first!**
:::
### Gaussian kernel
A higher-dimensional transform may solve more difficult problems. Is it possible to have an infinite-dimensional $\Phi(\mathbf{x})$? Consider the one-dimensional case $x\in\mathbb{R}$:
$$
K(x,x')=\exp(-(x-x')^2)\\
=\exp(-x^2)\exp(-(x')^2)\exp(2xx')\\
=\exp(-x^2)\exp(-(x')^2)\Big(\sum_{i=0}^\infty \frac{(2xx')^i}{i!}\Big)\\
=\sum_{i=0}^\infty \Big(\exp(-x^2) \sqrt{\frac{2^i}{i!}}\, x^i\Big)\Big(\exp(-(x')^2) \sqrt{\frac{2^i}{i!}}\, (x')^i\Big) \\
=\Phi(x)^T\Phi(x')\\
where~\Phi(x)=\exp(-x^2)\cdot (1,\sqrt{\frac{2}{1!}}x, \sqrt{\frac{2^2}{2!}}x^2, ...)
$$
More generally, the Gaussian kernel is
$$
K(\mathbf{x},\mathbf{x}')=\exp(-\gamma||\mathbf{x}-\mathbf{x}'||^2)~with~\gamma>0
$$
Therefore, the hypothesis of Gaussian SVM
$$
g_{SVM}(\mathbf{x})=sign(\sum_{SV}\alpha_ny_nK(\mathbf{x}_n,\mathbf{x})+b)\\
=sign(\sum_{SV}\alpha_ny_n\exp(-\gamma||\mathbf{x}-\mathbf{x}_n||^2)+b)\\
where~SV(\mathbf{x}_n,y_n)
$$
The hypothesis of a Gaussian SVM is a linear combination of Gaussians centered at the support vectors.
The parameter $\gamma$ controls the shape of the Gaussians. A larger $\gamma$ results in sharper Gaussians, which may cause overfitting. So, $\gamma$ should be selected carefully.
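The formula above can be evaluated directly from the support vectors; a small numpy sketch where the support vectors, the coefficients $\alpha_ny_n$, the bias $b$, and $\gamma$ are placeholder values rather than the result of a real training run:
```python
import numpy as np

def gaussian_svm_predict(x, sv_x, sv_coef, b, gamma):
    """g_SVM(x) = sign( sum_n alpha_n y_n exp(-gamma ||x - x_n||^2) + b )."""
    k = np.exp(-gamma * np.sum((sv_x - x) ** 2, axis=1))  # one Gaussian per support vector
    return np.sign(sv_coef @ k + b)

sv_x = np.array([[0.0, 0.0], [1.0, 1.0]])   # placeholder support vectors
sv_coef = np.array([0.7, -0.7])             # placeholder alpha_n * y_n
print(gaussian_svm_predict(np.array([0.2, 0.1]), sv_x, sv_coef, b=0.05, gamma=1.0))
```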

:::warning
:+1: Powerful, less numerical difficulty, and only one parameter
:-1: Hard to explain, slower, and may cause overfitting
:::
A kernel must satisfy Mercer's condition:
- symmetric
- the kernel matrix $K(\mathbf{x}_i, \mathbf{x}_j)$ must be positive semi-definite
## Soft-Margin SVM
SVM may overfit if we always insist on separating the input perfectly. Therefore, can we give SVM some noise tolerance? In the perceptron learning algorithm (PLA), we modify the model when an error occurs and try to minimize the number of errors.
$$
Pocket~algorithm:\min_{b,\mathbf{w}}\sum_{n=1}^N[y_n\neq sign(\mathbf{w}^T \mathbf{z}_n+b)]
$$
Then, we can combine this idea with the hard-margin SVM:
$$
\min_{b,\mathbf{w}} \frac{1}{2} \mathbf{w}^T\mathbf{w}+C\sum_{n=1}^N [y_n\neq sign(\mathbf{w}^T \mathbf{z}_n+b)]\\
subject~to~y_n(\mathbf{w}^T \mathbf{z}_n+b)\geq 1~for~correct~n\\
y_n(\mathbf{w}^T \mathbf{z}_n+b)\geq -\infty(any~number)~for~incorrect~n
$$
The constant $C$ is the penalty on margin violations. If $C$ decreases, SVM prefers a larger margin and can endure more errors; if $C$ increases, SVM tolerates fewer errors. However, $[y_n\neq sign(\mathbf{w}^T \mathbf{z}_n+b)]$ is not a linear constraint, so QP can't be applied. Therefore, a slack variable, $\xi$, is introduced. The slack variable is the amount (distance) of margin violation, and each data point has its own slack variable $\xi_n$. If a data point is classified correctly without violating the margin, $\xi_n=0$; otherwise, $\xi_n>0$.

Soft-margin SVM can be expressed as
$$
\min_{b,\mathbf{w}} \frac{1}{2} \mathbf{w}^T\mathbf{w}+C\sum_{n=1}^N \xi_n\\
s.t.~y_n(\mathbf{w}^T \mathbf{z}_n+b)\geq 1-\xi_n;\xi_n\geq 0
$$
The optimization problem has $\widetilde d+1+N$ variables ($\mathbf{w},b,\xi_n$) and $2N$ constraints. As with the hard-margin SVM, we solve the Lagrange dual problem to remove the dependence on $\widetilde d$. Similarly,
$$
\mathcal{L}(b,\mathbf{w}, \mathbf{\xi}, \alpha, \beta)=\frac{1}{2} \mathbf{w}^T\mathbf{w}+C\sum_{n=1}^N \xi_n\\
+\sum_{n=1}^N \alpha_n \cdot \big (1-\xi_n-y_n(\mathbf{w}^T \mathbf{z}_n+b)\big )+\sum_{n=1}^N\beta_n \cdot (-\xi_n)
$$
and the objective function becomes
$$
\max_{\alpha_n\geq 0,\beta_n \geq 0} \Big ( \min_{b,\mathbf{w},\mathbf{\xi}} \frac{1}{2} \mathbf{w}^T\mathbf{w}+C\sum_{n=1}^N \xi_n
+\sum_{n=1}^N \alpha_n \cdot \big (1-\xi_n-y_n(\mathbf{w}^T \mathbf{z}_n+b)\big )+\sum_{n=1}^N\beta_n \cdot (-\xi_n) \Big )\\
\because \frac{\partial \mathcal{L}}{\partial \xi_n}=0=C-\alpha_n -\beta_n\\
\therefore SVM \equiv \max_{0\leq \alpha_n \leq C,\beta_n=C-\alpha_n} \Big ( \min_{b,\mathbf{w}} \frac{1}{2} \mathbf{w}^T\mathbf{w}+\sum_{n=1}^N \alpha_n \big (1-y_n(\mathbf{w}^T \mathbf{z}_n+b) \big ) \Big )
$$
Note that $\beta_n$ is eliminated, but its constraint needs to be preserved.
$$
\frac{\partial \mathcal{L}}{\partial \xi_n}=0=C-\alpha_n -\beta_n\\
\therefore \alpha_n + \beta_n = C;\alpha_n \geq 0;\beta_n \geq 0\\
\therefore 0 \leq \alpha_n \leq C
$$
Now, we have converted to the Lagrange dual problem. Next, we can simplify the problem with the KKT conditions.
$$
primal~feasible:y_n(\mathbf{w}^T\mathbf{z}_n +b)\geq 1-\xi_n\\
dual~feasible:0 \leq \alpha_n \leq C;\beta_n = C-\alpha_n\\
dual-inner~optimal:\sum y_n\alpha_n =0~(from~\frac{\partial \mathcal{L}}{\partial b}=0);\mathbf{w}=\sum \alpha_n y_n\mathbf{z}_n~(from~\frac{\partial \mathcal{L}}{\partial w_i}=0)\\
primal-inner~optimal:\alpha_n(1-\xi_n-y_n(\mathbf{w}^T \mathbf{z}_n +b))=0
$$
The only difference from the hard-margin SVM is the upper bound on $\alpha_n$. The optimization problem becomes
$$
\min_\alpha \frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m \mathbf{z}_n^T \mathbf{z}_m-\sum_{n=1}^N \alpha_n\\
s.t.~\sum_{n=1}^N y_n \alpha_n=0;0 \leq \alpha_n \leq C\\
implicitly~\mathbf{w}=\sum_{n=1}^N \alpha_n y_n \mathbf{z}_n;\beta_n=C-\alpha_n
$$
:::warning
For the primal problem, there are $\widetilde d+1+N$ variables ($\mathbf{w},b,\xi_n$) and $2N$ constraints. In the dual problem, the numbers of variables and constraints become $N$ and $2N+1$, respectively.
:::
This form can be solved by QP, giving the optimal $\alpha^*$. How do we calculate $\mathbf{w}$ and $b$? Recall that in the hard-margin SVM we use the support vectors (data points with $\alpha_n \neq0$) to calculate $\mathbf{w}$ and $b$, based on complementary slackness. In the soft-margin SVM, we can solve for $\mathbf{w}$ and $b$ similarly, but the complementary slackness conditions are
$$
\alpha_n(1-\xi_n-y_n(\mathbf{w}^T \mathbf{z}_n +b))=0\\
\beta_n \xi_n = (C-\alpha_n)\xi_n=0
$$
From the second complementary slackness condition,
$$
\xi_s = 0~when~\alpha_s<C
$$
so a support vector with $0<\alpha_s<C$ has $\xi_s=0$ and lies exactly on the margin boundary; such points are called free support vectors.
And then $b$ can be solved
$$
b=y_s-y_s\xi_s-\mathbf{w}^T \mathbf{z}_s\\
=y_s-\sum_{SV}\alpha_n y_n K(\mathbf{x}_n,\mathbf{x}_s)~where~(\mathbf{x}_s,y_s)~is~a~free~SV
$$
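Compared with the hard-margin dual sketch earlier, the only changes are the box constraint $0\leq\alpha_n\leq C$ and picking free support vectors for $b$; a sketch again assuming `cvxopt` and a linear kernel:
```python
import numpy as np
from cvxopt import matrix, solvers  # assumption: cvxopt is installed

def soft_margin_svm_dual(X, y, C=1.0):
    """Soft-margin dual SVM (linear kernel); returns (b, w, alpha)."""
    N, d = X.shape
    y = y.astype(float)
    Yx = y[:, None] * X
    Q, p = Yx @ Yx.T, -np.ones(N)
    # 0 <= alpha_n <= C  written as  [-I; I] alpha <= [0; C]
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A, b_eq = matrix(y.reshape(1, -1)), matrix(0.0)
    alpha = np.array(solvers.qp(matrix(Q), matrix(p), G, h, A, b_eq)['x']).ravel()
    w = (alpha * y) @ X
    free = (alpha > 1e-6) & (alpha < C - 1e-6)   # free SVs: 0 < alpha_s < C
    b = float(np.mean(y[free] - X[free] @ w))    # average over free SVs for stability
    return b, w, alpha
```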


A larger $C$ means a higher penalty on violations and less noise tolerance, which may cause overfitting. So, we need to choose $\gamma$ and $C$ carefully.
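In practice, libraries such as scikit-learn implement the soft-margin kernel SVM directly, and $C$, $\gamma$ are usually chosen by cross-validation. A short sketch with made-up data, assuming scikit-learn is installed:
```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # toy nonlinear labels

# C controls noise tolerance, gamma controls the width of the Gaussian kernel
grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'C': [0.1, 1, 10], 'gamma': [0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
print('number of support vectors:', len(grid.best_estimator_.support_))
```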
# Support Vector Regression(SVR)
SVM is a very powerful classification tool, and its concept can also be extended to regression problems. In linear regression, we minimize the squared error (least squares). In SVR, the error measure is different.

The advantage of SVM is its tolerance of noise through the margin. Therefore, the error measure becomes
$$
err(y,s)=\max(0, |s-y|-\epsilon)=\begin{cases} 0, & |s-y|\leq \epsilon \\ |s-y|-\epsilon, & |s-y|>\epsilon\end{cases}
$$
Here, $s$ is the predicted output of the model ($\mathbf{w}^T\mathbf{x}+b$) and $\epsilon$ is the half-width of the tube. If a data point lies inside the tube, the error is $0$; otherwise, the error is its distance to the tube.
Compared with squared-error regression, tube regression behaves similarly when the error $|s-y|$ is small and there are few outliers. Because of the squared error, squared regression is affected more strongly by outliers.

The L2-regularized tube regression can be written as
$$
\min_{b,\mathbf{w}} \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{n=1}^N \max \big ( 0,|\mathbf{w}^T \mathbf{z}_n + b - y_n| - \epsilon \big )
$$
As before, we need to convert to the dual problem to remove the dependence on $\widetilde d$. However, the $\max$ term is not a linear constraint, so we remove the $\max$ operator using inequalities and slack variables $\xi_n^\vee,\xi_n^\wedge$.
$$
\min_{b,\mathbf{w},\xi_n^\vee,\xi_n^\wedge} \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{n=1}^N (\xi_n^\wedge + \xi_n^\vee)\\
s.t. \begin{cases}
-\epsilon-\xi_n^\vee \leq y_n - \mathbf{w}^T \mathbf{z}_n - b \leq \epsilon+\xi_n^\wedge \\
\xi_n^\vee \geq 0; \xi_n^\wedge \geq 0
\end{cases}
$$
where the first inequality can be separated into two inequalities:
$$
\begin{cases}
y_n - \mathbf{w}^T \mathbf{z}_n - b \leq \epsilon+\xi_n^\wedge \\
-\epsilon-\xi_n^\vee \leq y_n - \mathbf{w}^T \mathbf{z}_n - b
\end{cases}
$$
Now, we can utilize QP to solve for the $\widetilde d + 1 + 2N$ variables ($\mathbf{w}, b, \xi_n^\vee, \xi_n^\wedge$) with $2N+2N$ constraints.

Next, we are going to convert to Lagrange dual problem to reduce the complexity.
$$
primal~feasible: -\epsilon-\xi_n^\vee \leq (y_n - \mathbf{w}^T\mathbf{z}_n -b)\leq \epsilon+\xi_n^\wedge\\
dual~feasible:\begin{cases} 0 \leq \alpha_n^\wedge \leq C \\
0 \leq \alpha_n^\vee \leq C \\
\beta_n^\wedge = C-\alpha_n^\wedge \\
\beta_n^\vee = C-\alpha_n^\vee
\end{cases}\\
dual-inner~optimal: \begin{cases} \sum_{n=1}^N(\alpha_n^\wedge - \alpha_n^\vee) =0(from~\frac{\partial \mathcal{L}}{\partial b}=0)\\
\mathbf{w}=\sum (\alpha_n^\wedge - \alpha_n^\vee)\mathbf{z}_n(from~\frac{\partial \mathcal{L}}{\partial w_i}) \end{cases}\\
primal-inner~optimal:\begin{cases} \alpha_n^\wedge(\epsilon+\xi_n^\wedge-y_n+\mathbf{w}^T \mathbf{z}_n +b)=0 \\
\alpha_n^\vee(\epsilon+\xi_n^\vee+y_n-\mathbf{w}^T \mathbf{z}_n -b)=0
\end{cases}
$$
Therefore, the optimization problem becomes
$$
\min \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N (\alpha_n^\wedge - \alpha_n^\vee)(\alpha_m^\wedge - \alpha_m^\vee)k_{n,m}+\sum_{n=1}^N \Big ( (\epsilon-y_n) \cdot \alpha_n^\wedge + (\epsilon+y_n) \cdot \alpha_n^\vee \Big )\\
s.t. \begin{cases} \sum_{n=1}^N1 \cdot(\alpha_n^\wedge - \alpha_n^\vee)=0\\
0 \leq \alpha_n^\wedge \leq C;0 \leq \alpha_n^\vee \leq C\end{cases}
$$
Then, $\alpha_n^\wedge$ and $\alpha_n^\vee$ can be solved by a QP solver, and $\mathbf{w}$ can be calculated as a combination of the data. Define $\beta_n=\alpha_n^\wedge-\alpha_n^\vee$, the coefficient of $\mathbf{z}_n$ in $\mathbf{w}$ (not to be confused with the multipliers $\beta_n^\wedge,\beta_n^\vee$ above). When a data point lies strictly inside the tube,
$$
\begin{cases} \xi_n^\wedge = 0;\xi_n^\vee = 0 (\text{no error}) \\
\alpha_n^\wedge = 0; \alpha_n^\vee=0 (\text{comp. slackness})\\
(\epsilon+\xi_n^\wedge-y_n+\mathbf{w}^T \mathbf{z}_n +b) \neq 0;(\epsilon+\xi_n^\vee+y_n-\mathbf{w}^T \mathbf{z}_n -b) \neq 0\\
\beta_n = 0
\end{cases}
$$
which means that when $\beta_n \neq 0$, the data point lies on or outside the tube. Those points are called the support vectors of SVR.
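scikit-learn's `SVR` implements exactly this $\epsilon$-tube formulation; a short sketch with made-up data (the returned support vectors are the points with $\beta_n\neq 0$, i.e. on or outside the tube):
```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

# epsilon is the half-width of the tube, C the penalty on tube violations
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma=1.0)
svr.fit(X, y)
print('support vectors (on or outside the tube):', len(svr.support_))
print('prediction at x=0:', svr.predict([[0.0]]))
```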