# 2-1: Linear Classifiers

Linear classifiers are classifiers whose decision rule is a linear function of the input.

## Perceptron

We'll cover the most basic classification algorithm, the Perceptron, in this chapter. The Perceptron is an online linear classification algorithm. Let's first set up the binary classification problem:

* $\mathcal{X}=\mathbb{R}^d$, $\mathcal{Y}=\{\pm 1\}$
* Loss function: 0-1 loss, $l(\hat{y}, y)=\mathbb{I}\{\hat{y} \neq y\}$
* Hypothesis space: all linear functions $\mathcal{H}=\{h_{\textbf{w}}\mid \textbf{x} \rightarrow \text{sign}(\langle\textbf{w}, \textbf{x} \rangle) \}$

The dataset $\mathcal{D}^n=\{(\textbf{x}_1, y_1), ..., (\textbf{x}_n, y_n)\}$ is *linearly separable* if $\exists \textbf{w}^*$ s.t. $h_{\textbf{w}^*}(\textbf{x}_i)=y_i, \forall i$. This can be viewed as the realizability assumption on the dataset, i.e., $R_P(h^*_\mathcal{H})=0$.

The Perceptron algorithm works as follows.

![](https://i.imgur.com/WFprzqt.png)

### Stochastic Gradient Descent

What is the Perceptron actually doing? When the prediction does not match the correct label, i.e., $\hat{y}_t \neq y_t$, equivalently $y_t \langle\textbf{w}_t, \textbf{x}_t \rangle<0$, then

$$y_t \langle\textbf{w}_{t+1}, \textbf{x}_t \rangle = y_t \langle\textbf{w}_{t}+y_t\textbf{x}_t, \textbf{x}_t \rangle = y_t \langle\textbf{w}_t, \textbf{x}_t \rangle + ||\textbf{x}_t||^2$$

since $y_t^2=1$. The update moves the vector $\textbf{w}$ toward the "right" side of $\textbf{x}_t$.

It turns out that the Perceptron is an ERM learning algorithm that uses *stochastic gradient descent* (SGD) as its optimization method. The loss function here, however, is not simply the 0-1 loss. Let the output be $\hat{y}=\langle\textbf{w}, \textbf{x} \rangle$ and keep the labels $\mathcal{Y}=\{\pm 1\}$. Consider the function that *penalizes* linearly only when the sign is wrong:

$$l(\hat{y}, y) = \max\{0, -y\hat{y}\}$$

We'll see its variant, the *hinge loss*, shortly. The corresponding risk is

$$R_P(\textbf{w}) = \mathbb{E}[\max\{0, -Y\langle\textbf{w}, \textbf{X} \rangle\}]$$

The stochastic approximation of the gradient is

$$\nabla_\textbf{w} \max\{0, -y\langle\textbf{w}, \textbf{x} \rangle\} = \begin{cases} -y\textbf{x},&y\langle\textbf{w}, \textbf{x} \rangle<0 \\ 0, &y\langle\textbf{w}, \textbf{x} \rangle>0\\ \end{cases}$$

Therefore, the Perceptron can be viewed as SGD with unit learning rate, minimizing the risk of this surrogate loss.

### Convergence

Moreover, it can be proven that the Perceptron converges.

:::success
**Theorem (Novikoff's mistake bound)** For any input sequence $\{(\textbf{x}_1, y_1), ..., (\textbf{x}_T, y_T)\}$ with margin $\gamma$ and radius $r$, that is, $||\textbf{x}_t||\leq r$ for all $t=1, ..., T$ and there exists $\textbf{w}^*\in\mathbb{R}^d$ satisfying the *margin condition*

$$\frac{y_t \langle\textbf{w}^*, \textbf{x}_t \rangle}{||\textbf{w}^*||} \geq \gamma \quad \forall t=1,...,T,$$

the number of corrections that the Perceptron makes over the $T$ rounds is at most $(r/\gamma)^2$.
:::

**[Proof]** Whenever there is a mistake, $\textbf{w}_{t}$ is updated to $\textbf{w}_{t+1} = \textbf{w}_{t}+y_t\textbf{x}_t$, but the change is not too large because $\textbf{x}_t$ is bounded:

$$||\textbf{w}_{t+1}||^2 = ||\textbf{w}_{t}||^2 +2 \langle\textbf{w}_{t}, y_t\textbf{x}_t\rangle + ||\textbf{x}_{t}||^2 \leq ||\textbf{w}_{t}||^2 + r^2$$

Note that the second term $2 \langle\textbf{w}_{t}, y_t\textbf{x}_t\rangle$ is negative precisely because a mistake was made. Let $M$ be the set of rounds on which a mistake is made, with $|M|=m$. Since $\textbf{w}_1=\textbf{0}$ and $\textbf{w}$ only changes on those rounds,

$$||\textbf{w}_{T+1}||^2 \leq ||\textbf{w}_{1}||^2 + mr^2 = mr^2$$

By the definition of the updates, we can use the margin condition as follows:

\begin{align}
m\gamma &\leq \sum_{t\in M}\frac{y_t \langle\textbf{w}^*, \textbf{x}_t \rangle}{||\textbf{w}^*||}=\frac{\left\langle\textbf{w}^*, \sum_{t\in M} y_t\textbf{x}_t \right\rangle}{||\textbf{w}^*||}\\
&\leq \left\|\sum_{t\in M} y_t\textbf{x}_t \right\| = \left\|\sum_{t\in M} (\textbf{w}_{t+1} - \textbf{w}_{t}) \right\| = ||\textbf{w}_{T+1}|| \leq \sqrt{m}\,r
\end{align}

where the second inequality is Cauchy–Schwarz, the telescoping sum uses $\textbf{w}_1=\textbf{0}$, and the last inequality is the bound above. Thus $\sqrt{m} \leq r/\gamma$, i.e., $m\leq(r/\gamma)^2$.

Notice that the Perceptron is an **online** learning algorithm.
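To make the update rule concrete, here is a minimal NumPy sketch of the Perceptron. It is a multi-pass batch variant rather than the strictly online protocol above; the `epochs` cap and the toy dataset are illustrative additions, not part of the original notes. The core of the algorithm is the single line `w += y_t * x_t`, executed whenever a mistake occurs.

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Perceptron: on each mistake, update w <- w + y_t * x_t (unit learning rate)."""
    n, d = X.shape
    w = np.zeros(d)                          # start from the zero vector, as in the proof
    for _ in range(epochs):                  # repeated passes over the finite dataset
        mistakes = 0
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(w, x_t) <= 0:    # prediction disagrees with the label
                w += y_t * x_t               # SGD step on max{0, -y<w, x>} with step size 1
                mistakes += 1
        if mistakes == 0:                    # a clean pass: the data is separated, stop
            break
    return w

# Toy usage on a linearly separable dataset
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(w, np.sign(X @ w))   # all signs should match y
```

On linearly separable data, Novikoff's bound guarantees that the total number of updates is finite, so eventually a full pass makes no mistakes and the loop exits early.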
## Linear Support Vector Machine (SVM)

We now consider a more refined set of functions, the *affine classifiers*:

$$\mathcal{H}=\{h_{\textbf{w},b}\mid \textbf{x} \rightarrow \text{sign}(\langle\textbf{w}, \textbf{x} \rangle + b) \}$$

Since the parameters $(k\textbf{w}, kb)$ define the same hyperplane as $(\textbf{w}, b)$ for any $k>0$, for canonicity we choose the specific $(\textbf{w}, b)$ such that

$$\min_{\textbf{x}_i}|\langle\textbf{w}, \textbf{x}_i \rangle + b|=1$$

We define the *support vectors* as those $\textbf{x}_i$ for which this minimum is achieved.

### Constrained optimization

We define the *margin* as the minimum distance from the classifying hyperplane to any of the sample points. Given two support vectors $(\textbf{x}_+, 1)$ and $(\textbf{x}_-, -1)$, the margin is

$$\mathsf{margin}=\frac{1}{2}\frac{\langle\textbf{x}_+-\textbf{x}_-, \textbf{w}\rangle}{||\textbf{w}||}$$

since the margin is half of the projection of $\textbf{x}_+-\textbf{x}_-$ onto the normal vector of the hyperplane, which is $\textbf{w}$.

Assume that the data is *linearly separable*. The Support Vector Machine (SVM) is the learning algorithm that **maximizes the margin**. See the illustrations below:

![](https://i.imgur.com/ChKx9hQ.png)

![](https://i.imgur.com/b6ROx4X.png)

Why would we want to maximize the margin? We can say that a classifier is safer if it has a larger margin: if the margin is too small, even a small perturbation of some data points could result in a misclassification.

Consider the following constrained optimization problem (a **quadratic program**, to be more specific):

\begin{align}
\min_{\textbf{w}, b}\quad &||\textbf{w}||^2 \\
\text{subject to}\quad &y_i(\langle\textbf{w}, \textbf{x}_i \rangle + b)\geq 1, \quad \forall i=1,...,n
\end{align}

The constraint makes sure that the classifier is indeed *correct* on every training point. It is looser than the canonical normalization above; at the *optimal* solution, however, it is tight, so $\min_{\textbf{x}_i}|\langle\textbf{w}, \textbf{x}_i \rangle + b|=1$. Since $(\textbf{x}_+, 1)$ and $(\textbf{x}_-, -1)$ are support vectors, we have $\langle\textbf{w}, \textbf{x}_+ \rangle + b = 1$ and $\langle\textbf{w}, \textbf{x}_- \rangle + b = -1$, so the margin is actually $1/||\textbf{w}||$. Minimizing $||\textbf{w}||^2$ is equivalent to maximizing $1/||\textbf{w}||^2$, i.e., to maximizing the margin.

### What if...

Now consider the following case: what is the *best* classifier if the data is **not** linearly separable? We simply allow some mistakes and still try to maximize the margin, instead of trying to accommodate every extreme point (which might occur due to noise). In general, there is a tradeoff between the number of mistakes made and the margin; a concrete soft-margin sketch is given at the end of this note.

![](https://i.imgur.com/d1opAdV.png =340x)

###### tags: `machine learning`
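To make the margin-versus-mistakes tradeoff concrete, here is a minimal sketch of the standard soft-margin formulation, i.e., L2 regularization plus the hinge loss mentioned earlier, optimized by stochastic subgradient descent. The function name `soft_margin_svm`, the regularization weight `lam`, the fixed learning rate, and the toy data are illustrative assumptions and not part of the original notes.

```python
import numpy as np

def soft_margin_svm(X, y, lam=0.1, lr=0.01, epochs=200, seed=0):
    """Stochastic subgradient descent on the soft-margin objective
    (lam/2)*||w||^2 + (1/n) * sum_i max{0, 1 - y_i(<w, x_i> + b)}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            violated = y[i] * (X[i] @ w + b) < 1             # inside the margin or misclassified
            # Subgradient of the hinge term is -y_i * x_i only when the constraint is violated.
            w -= lr * (lam * w - (y[i] * X[i] if violated else 0.0))
            b -= lr * (-y[i] if violated else 0.0)
    return w, b

# Toy usage: two overlapping Gaussian clusters, so the data is not linearly separable.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)), rng.normal(-1.0, 1.0, size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
w, b = soft_margin_svm(X, y, lam=0.1)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```

Here `lam` plays the role of the tradeoff knob: a large `lam` keeps $||\textbf{w}||$ small (a wide margin) at the cost of more margin violations, while a small `lam` pushes the classifier to fit the training labels more aggressively.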