# Pattern Recognition and Machine Learning

These notes deal with the second half of the course with the above title. The first half was taught by Prof. Rajesh Hegde, and the notes are available online here. The second half starts with a glimpse of learning from data without assuming any information about the distribution generating the data; essentially, searching for a pattern, if any, in the data. Let us start by looking at the following data: blue points lie above the hyperplane while red points lie below.

![image](https://hackmd.io/_uploads/BykEYnGCa.png)

**Figure 1:** The figure shows a simple data set that can be "linearly separated", of course, by a hyperplane.

Although stylized, studying the above provides a lot of insight. First, note that the data can be generated randomly from a distribution, but with the caveat that it can be separated by a plane. Now we shall ask: what do we mean by separating the data by a hyperplane? What is a hyperplane? Towards this, consider the following picture of a hyperplane. A hyperplane can be described by two things: (a) a point $\mathbf{x}_0$ on the hyperplane, and (b) a normal $\mathbf{w}$ to the hyperplane at that point, as shown in the figure. It is clear that if you take any point $\mathbf{x}$ on the plane, then the line joining $\mathbf{x}$ and $\mathbf{x}_0$ is orthogonal to the normal $\mathbf{w}$. Mathematically, we have
$$\langle \mathbf{w} \; , \mathbf{x} - \mathbf{x}_0\; \rangle = 0.$$
The above leads to $\langle \mathbf{w} \; , \mathbf{x}\; \rangle = -b := \langle \mathbf{w} \; , \mathbf{x}_0\; \rangle$. This leads to the following definition.

*Definition 1:* The hyperplane $\mathcal{P}$ determined by $(\mathbf{w}, b)$ is the set of all points $\mathbf{x} \in \mathbb{R}^d$ such that $\langle \mathbf{w} \; , \mathbf{x}\; \rangle + b = 0$.

Now we return to our other question, i.e., what do we mean by a separating hyperplane? Given a point $\mathbf{x}$, how do we determine whether it is to the left or to the right of the hyperplane? Very easy! If the vector $\mathbf{x} - \mathbf{x}_0$ makes an angle of less than $90^\circ$ with $\mathbf{w}$, then $\mathbf{x}$ is to the left; otherwise, it is to the right. Mathematically, if $\langle \mathbf{w} \; , \mathbf{x}\; \rangle + b > 0$, then $\mathbf{x}$ is to the left, otherwise to the right. Equality is achieved exactly when $\mathbf{x}$ lies on the plane.

A separating hyperplane is defined with respect to labelled data $\{(\mathbf{x}_1,y_1),(\mathbf{x}_2,y_2),\ldots,(\mathbf{x}_m,y_m)\}$, where $y_i \in \{+1,-1\}$ for all $i$. The following is the precise definition of a separable data set.

*Definition $2$:* A data set $\{(\mathbf{x}_1,y_1),(\mathbf{x}_2,y_2),\ldots,(\mathbf{x}_m,y_m)\}$ is said to be *strictly* separable by a hyperplane $(\mathbf{w},b)$ if $\langle \mathbf{w} \; , \mathbf{x}_i\; \rangle + b > 0$ for all data points $(\mathbf{x}_i,y_i)$ with $y_i = +1$, and $\langle \mathbf{w} \; , \mathbf{x}_i\; \rangle + b < 0$ for all other points.

*Note:* If equality is allowed in the above definition, some points may lie on the plane itself; in that case, the data is simply called linearly separable.

<style>
/* Styling for the boxed environments (practice problems, definitions, theorems) used in these notes. */
.boxed { background: #F0F9F2; color: black; border: 3px solid #535353; margin: 0px auto; width: 856px; padding: 10px; border-radius: 10px; }
</style>
<div class="boxed">

## Practice Problems :grinning:

The following assignments are based on maximum likelihood (ML) estimation, MMSE estimation, and the support vector machine (SVM).

**Problem $1$:** Consider a plane in three dimensions given by $f(x,y,z) = 0$, where $f(x,y, z) = 2 x - 3 y + \frac{2}{3} z$. Find the distance between the point $(-2,6,1)$ and the plane.

**Problem $2$:** Let $X_1,X_2,\ldots,X_n$ be i.i.d. from $\texttt{Exp}(\theta)$. Find a maximum likelihood (ML) estimate of $\theta$.

**Problem $3$:** Let $f_{X|\Theta}(x|\theta) = \theta e^{-\theta x}$ for $x \geq 0$ and $0$ otherwise. Let $f_{\Theta}(\theta) = \alpha e^{-\alpha \theta}$ for $\theta \geq 0$. Find the MMSE and ML estimates of $\theta$ given a single observation $X=x$. Read about the MMSE estimate before attempting this question.

**Problem $4$:** Sketch the following sets assuming the dimension $d=2$:
$$\mathbf{a}_i^T \mathbf{x} \leq b_i, \quad i=1,2,\ldots,N,$$
where $\mathbf{a}_i, \mathbf{x} \in \mathbb{R}^d$. In general, what do these sets represent? Explain.

**Problem $5$:** Consider the following linear regression problem with the absolute loss
$$\min_{\mathbf{w}} \sum_{i=1}^m | \langle \mathbf{w} \; , \mathbf{x}_i\; \rangle - y_i|.$$
Convert the above problem into a linear programme. *Hint:* Convert the non-linear $|\cdot|$ function into a linear one by defining $z_i := | \langle \mathbf{w} \; , \mathbf{x}_i\; \rangle - y_i|$, and see what the constraints should be.

</div>

**Question:** How do you find a hyperplane that separates the data points?

To answer this question, we will make our life easy by assuming the homogeneous case, i.e., the $b = 0$ case, and see what is required. First, we need $y_i \langle \mathbf{w} \; , \mathbf{x}_i\; \rangle > 0$ for all $i \in [m]$. Assuming linear separability of the data, there exists a $\mathbf{w}^*$ such that $y_i \langle \mathbf{w}^* \; , \mathbf{x}_i\; \rangle > 0$. Let $\gamma^*:= \min_{i \in [m]} y_i \langle \mathbf{w}^* \; , \mathbf{x}_i\; \rangle$; then $\frac{1}{\gamma^*} y_i \langle \mathbf{w}^* \; , \mathbf{x}_i\; \rangle = y_i \langle \frac{\mathbf{w}^*}{\gamma^*} \; , \mathbf{x}_i\; \rangle\geq 1$. Equivalently, there exists a $\mathbf{w}$ such that
\begin{equation}
y_i \langle \mathbf{w} \; , \mathbf{x}_i\; \rangle \geq 1 \text{ for all } i \in [m].
\end{equation}
Now, the quest is to search for a $\mathbf{w}$ that satisfies the above equation. An easy (practical) way to find such a $\mathbf{w}$ is to make use of existing tools/functions that can solve for it. An immediate tool box that comes to mind is the linear programming tool box, which solves problems of the following form:
\begin{eqnarray}
&\min_{\mathbf{w}}& \langle \mathbf{w} \; , \mathbf{x}\; \rangle \nonumber \\
&\text{subject to }& \mathbf{a}^T_i \mathbf{w} - b_i\leq 0 \text{ for all } i \in [m].
\end{eqnarray}
Casting our problem into the form above is easy! Choose $\mathbf{x} = \mathbf{0}$, the zero vector, which makes the objective function vacuous. Then, choose $\mathbf{a}_i = -y_i \mathbf{x}_i$ and $b_i = -1$. (A small numerical sketch of this off-the-shelf route is given at the end of this discussion.)

But, wait a minute. This does not provide an algorithm; it only enables us to use off-the-shelf solvers. What if we want our own algorithm? To answer this, again, let us start with the homogeneous case, i.e., with $b=0$. Now, let us check each data point, and if the hyperplane makes a mistake on the $i$-th data point, then make a correction.
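As promised, here is a minimal sketch (not part of the original notes) of the off-the-shelf route: the feasibility problem $y_i \langle \mathbf{w}, \mathbf{x}_i \rangle \geq 1$ is handed to `scipy.optimize.linprog` with a vacuous objective. The toy data set is a hypothetical choice of mine, used only for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical, linearly separable data in R^2 (homogeneous case, b = 0).
X = np.array([[2.0, 3.0], [1.0, 2.5], [3.0, 4.0],         # labelled +1
              [-1.0, -2.0], [-2.0, -1.5], [-3.0, -2.0]])  # labelled -1
y = np.array([+1, +1, +1, -1, -1, -1])

m, d = X.shape
c = np.zeros(d)              # vacuous objective: we only care about feasibility
A_ub = -(y[:, None] * X)     # rows a_i = -y_i x_i
b_ub = -np.ones(m)           # a_i^T w <= -1  <=>  y_i <w, x_i> >= 1

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * d)
if res.success:
    w = res.x
    print("separating w:", w, "margins:", y * (X @ w))    # all margins >= 1
else:
    print("infeasible: data not linearly separable through the origin")
```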
Let us look at a possible correction, which gives the following algorithm, famously known as the Perceptron algorithm.

---
**Algorithm $1$ (Perceptron Algorithm):**
* Take $\{(\mathbf{x}_1,y_1),(\mathbf{x}_2,y_2),\ldots,(\mathbf{x}_m,y_m)\}$ as input, and initialize $\mathbf{w} \leftarrow \mathbf{0}$.
* **While** (1): do
    * If ($\exists~i$ such that $(\mathbf{x}_i,y_i)$ results in an error):
    $$\mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i$$
    * Else *Return* $\mathbf{w}$
* **end while**
---

If there is an error on the $i$-th data point, then $y_i \langle \mathbf{w} \; , \mathbf{x}_i\; \rangle \leq 0$. In other words, $\langle \mathbf{w} \; , y_i \mathbf{x}_i\; \rangle \leq 0$. The objective is to make this positive so that the hyperplane makes no mistake on this point. This can be achieved by correcting $\mathbf{w}$, shifting it towards $y_i \mathbf{x}_i$, i.e., $\mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i$, as indicated above. Let us see the impact of the correction. Consider the data point on which there is an error, and see what happens after the correction:
\begin{eqnarray}
y_i \langle \mathbf{w}^{(t+1)} \; , \mathbf{x}_i\; \rangle &=& y_i \langle \mathbf{w}^{(t)} + y_i \mathbf{x}_i \; , \mathbf{x}_i\; \rangle \\
&=& y_i \langle \mathbf{w}^{(t)} \; , \mathbf{x}_i\; \rangle + ||\mathbf{x}_i||^2 \\
&\geq& y_i \langle \mathbf{w}^{(t)} \; , \mathbf{x}_i\; \rangle,
\end{eqnarray}
an improvement on the $i$-th data point! What's the guarantee that the above results in a perfect hyperplane? :smirk: It turns out that the above procedure indeed converges to a "good" hyperplane, as stated in the following theorem.

<div class="boxed">

**Theorem:** Assume the data is linearly separable. Then **Algorithm $1$** stops and returns a hyperplane with zero error after at most $(BR)^2$ rounds, i.e., $T \leq (BR)^2$, where $B:= \min\left\{||\mathbf{w}||: y_i \langle \mathbf{w} \; , \mathbf{x}_i\; \rangle \geq 1 \text{ for all } i \in [m] \right\}$ and $R:=\max \{||\mathbf{x}_i|| : i \in [m] \}$.

</div>

The above theorem indicates that the algorithm stops within finite time and returns a correct hyperplane, as long as the data is linearly separable and the feature vectors in the data set have finite length! The proof can be found in Chapter $9$ of [Shalev-Shwartz and Ben-David](https://www.amazon.in/Understanding-Machine-Learning-Theory-Algorithms-ebook/dp/B00J8LQU8I//). The book is available in the library of IIT Dharwad. (A small numerical sketch of Algorithm $1$ appears a little further below.)

Let us look at a simple two-dimensional example of data below.

![image](https://hackmd.io/_uploads/r14i0tC06.png)

Is the "Decision boundary" shown in the figure the only hyperplane possible? Indeed, there are uncountably many hyperplanes that result in perfect separation. Then, is there any notion of a "good" hyperplane? Intuitively, we want the hyperplane to be in the "middle" of the two classes of the data set. What do we mean by "middle"? One way to think about it is that the "good" hyperplane should be the one that is farthest from every data point (feature vector). In other words, given a hyperplane, compute the distance between the hyperplane and any given data point $\mathbf{x}_i$.
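Before making this notion of distance precise, here is the promised minimal numerical sketch (not part of the original notes) of Algorithm $1$ in Python, run on the same hypothetical toy data used in the LP sketch above. By the theorem, the loop must terminate after at most $(BR)^2$ corrections when the data is linearly separable.

```python
import numpy as np

def perceptron(X, y, max_updates=10_000):
    """Perceptron (homogeneous case, b = 0): return w with y_i <w, x_i> > 0 for all i."""
    w = np.zeros(X.shape[1])                         # initialize w = 0
    for _ in range(max_updates):                     # guard against non-separable input
        mistakes = np.flatnonzero(y * (X @ w) <= 0)  # indices where the hyperplane errs
        if mistakes.size == 0:
            return w                                 # zero errors: done
        i = mistakes[0]
        w = w + y[i] * X[i]                          # the correction step
    raise RuntimeError("no separating hyperplane found within the update budget")

# Hypothetical toy data (same as in the LP sketch above).
X = np.array([[2.0, 3.0], [1.0, 2.5], [3.0, 4.0],
              [-1.0, -2.0], [-2.0, -1.5], [-3.0, -2.0]])
y = np.array([+1, +1, +1, -1, -1, -1])
w = perceptron(X, y)
print("w =", w, "signs:", np.sign(X @ w))            # the signs match the labels y
```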
Returning to the distance: for ease of presentation, let us denote it by $\texttt{dist}(\mathbf{x}_i, \mathcal{H})$, where $\mathcal{H}$ denotes the hyperplane. For a given hyperplane, look at the data point with the shortest distance, i.e., $\min_{i \in [m]} \texttt{dist}(\mathbf{x}_i, \mathcal{H})$. Now, try to make this as large as possible by choosing an appropriate hyperplane. Mathematically, we need to solve the following problem:
\begin{equation}
\max_{\mathcal{H}} \min_{i \in [m]} \texttt{dist}(\mathbf{x}_i, \mathcal{H}).
\end{equation}
How do we solve the above problem? Before that, how do we compute the distance from a point to the hyperplane? Towards this, we will look at the following picture.

### figure will be inserted here!

First, let us define the notion of distance.

<div class="boxed">

*Definition:* The distance between a point $\mathbf{x}$ and the plane is the minimum of $||\mathbf{x} - \mathbf{v}||$ among all the points $\mathbf{v}$ on the plane. Mathematically,
\begin{equation}
\texttt{dist}(\mathbf{x},\mathcal{H}) := \min \left\{||\mathbf{x} - \mathbf{v}||: \mathbf{v} \in \mathcal{H}\right\}.
\end{equation}

</div>

In the above, $\mathcal{H}$ is the hyperplane described by $\langle \mathbf{w} \; , \mathbf{z}\; \rangle + b = 0$, where $\mathbf{w}$ is the normal (orthogonal) vector to the plane. Without loss of generality, let us assume $||\mathbf{w}|| = 1$ (why?). Intuitively, from the figure, among all the points on the plane, the point $\mathbf{v}$ obtained by dropping a perpendicular from $\mathbf{x}$ to the plane is the closest. If you could find such a point $\mathbf{v}$, then the distance $||\mathbf{x} - \mathbf{v}||$ would be the minimum distance. Based on this geometric intuition, we shall make a guess for the choice of $\mathbf{v}$. The foot of the perpendicular dropped from $\mathbf{x}$ is nothing but $\mathbf{v} = \mathbf{x} - \alpha \mathbf{w}$ for some $\alpha$. Now, we need to adjust $\alpha$ so that this is the right choice. Note that $\mathbf{v} \in \mathcal{H}$, and hence $\langle \mathbf{w} \; , \mathbf{v}\; \rangle + b = 0$. Using $||\mathbf{w}|| = 1$, this results in $\alpha = \langle \mathbf{w} \; , \mathbf{x}\; \rangle + b$. Thus, $\mathbf{v} = \mathbf{x} - (\langle \mathbf{w} \; , \mathbf{x}\; \rangle + b) \mathbf{w}$. After making this guess, we need to prove that this is the closest point and find the distance; note that we have not proved anything so far, we are just making intuitive guesses! First, the distance is simply $||\mathbf{x} - \mathbf{v}|| = |\langle \mathbf{w} \; , \mathbf{x}\; \rangle + b|$. Now, let us prove that this choice is indeed the closest point. Take any other point on the plane, say $\mathbf{u} \in \mathcal{H}$.
Now, consider
\begin{eqnarray}
||\mathbf{x} - \mathbf{u}||^2 &=& ||\mathbf{x} - \mathbf{v} + \mathbf{v} - \mathbf{u}||^2 \\
&=& ||\mathbf{x} - \mathbf{v}||^2 + ||\mathbf{v} - \mathbf{u}||^2 + 2 \langle \mathbf{x} -\mathbf{v} \; , \mathbf{v} - \mathbf{u}\; \rangle\\
&\geq& ||\mathbf{x} - \mathbf{v}||^2 + 2 \langle \mathbf{x} -\mathbf{v} \; , \mathbf{v} - \mathbf{u}\; \rangle\\
&=& ||\mathbf{x} - \mathbf{v}||^2 + 2 (\langle \mathbf{w} \; , \mathbf{x}\; \rangle + b)\langle \mathbf{w} \; , \mathbf{v} - \mathbf{u}\; \rangle \\
&=& ||\mathbf{x} - \mathbf{v}||^2,
\end{eqnarray}
where the last equality follows from the fact that $\langle \mathbf{w} \; , \mathbf{v} \; \rangle = \langle \mathbf{w} \; , \mathbf{u}\; \rangle = -b$. Now, it remains to find the distance. The distance is $||\mathbf{x} - \mathbf{v}|| = |\langle \mathbf{w} \; , \mathbf{x} \; \rangle + b|$ since $||\mathbf{w}|| = 1$. This leads to the following result.

<div class="boxed">

**Theorem:** The distance from the plane $\langle \mathbf{w} \; , \mathbf{z}\; \rangle + b = 0$ with $||\mathbf{w}||=1$ to a point $\mathbf{x}$ is given by $|\langle \mathbf{w} \; , \mathbf{x} \; \rangle + b|$.

</div>

(A small numerical check of this formula is given at the end of this discussion.) With the above result, our problem of finding the "good" hyperplane boils down to
\begin{eqnarray}
&&\max_{(\mathbf{w}, b):||\mathbf{w}||=1} \min_{i \in [m]} |\langle \mathbf{w} \; , \mathbf{x}_i\; \rangle + b| \\
&\text{subject to}& y_i(\langle \mathbf{w} \; , \mathbf{x}_i\; \rangle + b) > 0 \text{ for all } i\in [m].
\end{eqnarray}
Now, we have a concrete problem that we can attempt to solve. How to solve this? :grimacing: Whenever you are faced with a problem that seems complicated, turn it into a simpler one (by guess or by intuition). Then show that solving the simpler problem is equivalent to solving the original problem. Here, simpler means that you know how to solve the problem. This leap is not so straightforward here. We will proceed on faith; trust me that the following problem is equivalent to the original problem:
\begin{eqnarray}
&&\max_{(\mathbf{w}, b):||\mathbf{w}||=1} \min_{i \in [m]} y_i(\langle \mathbf{w} \; , \mathbf{x}_i\; \rangle + b).
\end{eqnarray}
A few observations are in order. First, let us see that the above problem will ensure that the plane separates the two classes of data, $+1$ and $-1$. Suppose not; then for some data point $i \in [m]$, the plane results in an error, e.g., $y_i = +1$ but $\langle \mathbf{w} \; , \mathbf{x}_i\; \rangle + b < 0$ (the plane is above the data point). In this case, $\min_{i \in [m]} y_i(\langle \mathbf{w} \; , \mathbf{x}_i\; \rangle + b) < 0$, leading to a small (negative) objective value. Thus, the problem discourages incorrect labelling. If $\mathbf{w}$ and $b$ are chosen so that there is no error, then $y_i(\langle \mathbf{w} \; , \mathbf{x}_i\; \rangle + b) > 0$ for all $i \in [m]$. Hence, $y_i(\langle \mathbf{w} \; , \mathbf{x}_i\; \rangle + b) = |\langle \mathbf{w} \; , \mathbf{x}_i\; \rangle + b|$, which is exactly the distance from $\mathbf{x}_i$ to the plane. Replacing this in the above problem recovers the original problem!!
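As a quick numerical sanity check (my own, not part of the notes) of the distance theorem and of the perpendicular-foot construction above, here is a small sketch with a hypothetical plane and point in $\mathbb{R}^3$ (deliberately different from Problem $1$, so as not to give its answer away).

```python
import numpy as np

# Hypothetical plane <w, z> + b = 0 in R^3 and a hypothetical point x.
w = np.array([1.0, 2.0, 2.0])
b = -3.0
x = np.array([4.0, 1.0, 0.0])

# Rescale so that ||w|| = 1 (the theorem assumes a unit normal).
scale = np.linalg.norm(w)
w_unit, b_unit = w / scale, b / scale

dist_formula = abs(w_unit @ x + b_unit)               # |<w, x> + b| with ||w|| = 1
v = x - (w_unit @ x + b_unit) * w_unit                # foot of the perpendicular
print("formula:", dist_formula)                       # 1.0
print("v on the plane?", np.isclose(w @ v + b, 0.0))  # True
print("||x - v|| =", np.linalg.norm(x - v))           # matches the formula
```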
The equivalence argument above can be made very precise to turn it into a proof. More details in a later version of these notes.

How do we solve our new problem? :smirk: It looks like we need to simplify it even further. Suppose $(\mathbf{w}^*,b^*)$ is the optimal solution to the problem above. Then, the margin obtained is $\texttt{M}^*:= \min_i y_i (\left\langle \mathbf{w}^* \; , \mathbf{x}_i\; \right \rangle + b^*)$. Clearly, by linearity of the inner product, we have
\begin{equation}
y_i\left(\left \langle \frac{\mathbf{w}^*}{\texttt{M}^*} \; , \mathbf{x}_i\; \right \rangle + \frac{b^*}{\texttt{M}^*}\right) \geq 1.
\end{equation}
Thus, the new pair $\left(\frac{\mathbf{w}^*}{\texttt{M}^*}, \frac{b^*}{\texttt{M}^*}\right)$ satisfies the separability condition! Recall that the separability condition is given by $y_i\left(\left \langle \mathbf{w} \; , \mathbf{x}_i\; \right \rangle + b\right) \geq 1$. Naturally, we may pick the $\mathbf{w}$ of *minimum* norm that satisfies the separability condition, i.e., solve the following problem:
\begin{eqnarray}
(\mathbf{w}_0,b_0) \in &\arg \min_{(\mathbf{w},b)}& ||\mathbf{w}||^2 \\
&\text{subject to}& y_i\left(\left \langle \mathbf{w} \; , \mathbf{x}_i\; \right \rangle + b\right) \geq 1. ~~~~~ (\mathbf{P}1)
\end{eqnarray}
Since $\left(\frac{\mathbf{w}^*}{\texttt{M}^*}, \frac{b^*}{\texttt{M}^*}\right)$ is feasible for $\mathbf{P}1$ and $(\mathbf{w}_0,b_0)$ is its minimizer, we have $||\mathbf{w}_0|| \leq \frac{||\mathbf{w}^*||}{\texttt{M}^*}$. Hence, after normalization,
\begin{eqnarray}
y_i \left(\left \langle \mathbf{w}_0/||\mathbf{w}_0|| \; , \mathbf{x}_i\; \right \rangle + b_0/||\mathbf{w}_0|| \right) &=& \frac{y_i \left(\left \langle \mathbf{w}_0 \; , \mathbf{x}_i\; \right \rangle + b_0 \right)}{||\mathbf{w}_0||} \\
&\geq& \frac{1}{||\mathbf{w}_0||} \\
&\geq& \frac{\texttt{M}^*}{||\mathbf{w}^*||} \\
&=& \texttt{M}^*,
\end{eqnarray}
where the last step uses $||\mathbf{w}^*|| = 1$. Thus, $\left(\frac{\mathbf{w}_0}{||\mathbf{w}_0||}, \frac{b_0}{||\mathbf{w}_0||}\right)$ achieves a margin of at least $\texttt{M}^*$, i.e., it is also optimal for the max-margin problem above. Thus, solving the problem $\mathbf{P}1$ (and normalizing) results in a solution to the max-margin problem. A robust/"good" hyperplane can therefore be obtained by running the following Support Vector Machine (SVM) algorithm:

---
**Algorithm: SVM**

---
- **Input:** $m$ labelled data points
- **Solve:**
\begin{eqnarray}
(\mathbf{w}_0,b_0) \in &\arg \min_{(\mathbf{w},b)}& ||\mathbf{w}||^2 \\
&\text{subject to}& y_i\left(\left \langle \mathbf{w} \; , \mathbf{x}_i\; \right \rangle + b\right) \geq 1. ~~~~~ (\mathbf{P}1)
\end{eqnarray}
- **Output:** $\left(\frac{\mathbf{w}_0}{||\mathbf{w}_0||}, \frac{b_0}{||\mathbf{w}_0||}\right)$.

---

Again, the same question arises: how do we solve $\mathbf{P}1$? :angry: (A small numerical sketch that solves $\mathbf{P}1$ with an off-the-shelf convex solver is given at the end of this discussion.) Let us look at the following problem:
\begin{equation}
\max_{\lambda_i \geq 0 \text{ for all } i \in [m]} \sum_{i=1}^m \lambda_i (1-y_i\left(\left \langle \mathbf{w} \; , \mathbf{x}_i\; \right \rangle + b\right)) = \left\{ \begin{array}{ll} +\infty & \text{if } y_i\left(\left \langle \mathbf{w} \; , \mathbf{x}_i\; \right \rangle + b\right) < 1 \text{ for some } i \in [m],\\ 0 & \text{otherwise.} \end{array} \right.
\end{equation}
Can you guess why the above holds? The value is $+\infty$ as soon as even one constraint of $\mathbf{P}1$ is violated (let the corresponding $\lambda_i$ grow without bound), and it is $0$ when all constraints are satisfied (the best choice is then $\mathbf{\lambda} = \mathbf{0}$). Thus, solving the following problem should give a solution to the problem $\mathbf{P}1$:
\begin{equation}
\min_{(\mathbf{w},b)} \max_{\mathbf{\lambda} \succeq \mathbf{0}} \left[||\mathbf{w}||^2 + \sum_{i=1}^m \lambda_i (1-y_i\left(\left \langle \mathbf{w} \; , \mathbf{x}_i\; \right \rangle + b\right))\right].
\end{equation}
The above is the Lagrangian formulation of the constrained problem.
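Before working through the Lagrangian, here is the promised minimal sketch (my own, not from the notes) of solving $\mathbf{P}1$ directly with an off-the-shelf convex solver; it assumes the `cvxpy` package and reuses the hypothetical toy data from the earlier sketches.

```python
import cvxpy as cp
import numpy as np

# Hypothetical toy data (same as in the earlier sketches).
X = np.array([[2.0, 3.0], [1.0, 2.5], [3.0, 4.0],
              [-1.0, -2.0], [-2.0, -1.5], [-3.0, -2.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

w = cp.Variable(X.shape[1])
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]       # y_i (<w, x_i> + b) >= 1
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)  # min ||w||^2
problem.solve()

w0, b0 = w.value, b.value
norm = np.linalg.norm(w0)
print("SVM output:", w0 / norm, b0 / norm)           # the normalized hyperplane
print("margins:", y * (X @ w0 + b0))                 # all >= 1 (up to solver tolerance)
```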
Returning to the Lagrangian formulation: by a simple interchange (weak duality), we have
\begin{equation}
\min_{(\mathbf{w},b)} \max_{\mathbf{\lambda} \succeq \mathbf{0}} \left[||\mathbf{w}||^2 + \sum_{i=1}^m \lambda_i (1-y_i\left(\left \langle \mathbf{w} \; , \mathbf{x}_i\; \right \rangle + b\right))\right] \geq \max_{\mathbf{\lambda} \succeq \mathbf{0}} \min_{(\mathbf{w},b)} \left[||\mathbf{w}||^2 + \sum_{i=1}^m \lambda_i (1-y_i\left(\left \langle \mathbf{w} \; , \mathbf{x}_i\; \right \rangle + b\right))\right].
\end{equation}
Is the reverse inequality above true? In general, the answer is negative. Fortunately, it turns out that it is indeed true for the problem at hand. This is called *strong duality* in the literature. Consider the homogeneous case. Since the bracketed function is convex in $\mathbf{w}$ for a fixed $\mathbf{\lambda}$, setting its gradient with respect to $\mathbf{w}$ to zero gives the solution $\mathbf{w}^* = \sum_{i=1}^m \lambda_i y_i \mathbf{x}_i$ (up to a constant factor that can be absorbed into the multipliers $\lambda_i$). This leads to the following important theorem.

<div class="boxed">

**Theorem:** The solution to the problem $\mathbf{P}1$ is given by $\mathbf{w}^* = \sum_{i=1}^m \lambda_i y_i \mathbf{x}_i$, where $\lambda_i \geq 0$ for all $i \in [m]$ are the Lagrange multipliers that can be chosen to satisfy the constraints.

</div>

Now, it remains to solve for the $\lambda_i$, which can easily be done by substituting this back into the Lagrangian above and optimizing over $\lambda_i$, $i\in[m]$. The details are skipped here.

So far, we have been looking at a scenario where the data is linearly separable. Is this a good assumption? Not really. Most data sets are complicated and are separable only by very complicated boundaries (see the figure below); the two classes are not separated by a plane. In fact, the following is an artificially generated data set. In reality, data sets are even more complicated.

![image](https://hackmd.io/_uploads/r1RBqoLyA.png)

This motivates us to study ways by which we can classify data sets that are not linearly separable. Consider the following one-dimensional example, where a point $x \in \mathbb{R}$ is labelled according to
\begin{equation}
y(x) = \left\{ \begin{array}{ll} +1 & \text{if } |x| \leq 3, \\ -1 & \text{otherwise.} \end{array} \right.
\end{equation}
Clearly, this data set is not linearly separable. However, if we "lift" the data set to a higher dimension, then it is possible to separate the classes linearly. A simple lifting (there are many choices, but this is the one I choose) is to take each point $x$ and construct the two-dimensional vector $\Psi(x) = (x, x^2)$. Now the data points separate nicely: the points with label $+1$ satisfy $x^2 \leq 9$, while the points with label $-1$ satisfy $x^2 > 9$, so a horizontal line at height $c$, for any $c$ between $9$ and the smallest value of $x^2$ among the $-1$ points, separates the two classes. This technique can be generalized in an easy way, as explained next. First,

- The mapping above is $\Psi: \mathbb{R} \rightarrow \mathbb{R}^2$.
- In $\mathbb{R}^2$, a horizontal line separates the two classes.

To generalize, we need to consider a map from the feature space $\mathcal{X}$ to a space that is convenient to work in. In the above example, that space was $\mathbb{R}^2$. This space is a vector space.
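The lifting idea above is easy to see numerically. Here is a minimal sketch (my own, not from the notes) with a hypothetical one-dimensional sample labelled by the rule above: no threshold on $x$ alone separates the classes, but after the lift $\Psi(x) = (x, x^2)$ a horizontal line does.

```python
import numpy as np

# Hypothetical 1-D sample, labelled +1 iff |x| <= 3 (the rule above).
x = np.array([-5.0, -4.0, -2.0, -1.0, 0.0, 1.5, 3.0, 4.5, 6.0])
y = np.where(np.abs(x) <= 3, 1, -1)

# The lift Psi(x) = (x, x^2): each scalar becomes a point in R^2.
Psi = np.column_stack([x, x ** 2])

# A horizontal line at height c separates the lifted classes, since label +1
# means x^2 <= 9 and label -1 means x^2 > 9 (here the -1 points have x^2 >= 16).
c = 10.0
w, b = np.array([0.0, -1.0]), c          # hyperplane <w, z> + b = c - z_2
pred = np.sign(Psi @ w + b)
print("separated after lifting:", np.array_equal(pred, y))   # True
```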
Returning to the generalization: besides being a vector space, $\mathbb{R}^2$ also carries an inner product, namely the standard one, and this inner product induces the Euclidean norm. We need an additional technical condition called completeness, a notion that is useful when dealing with sequences (every Cauchy sequence in the space should converge to a point of the space). A complete inner product space is called a Hilbert space, denoted $\mathcal{H}$. More precisely, we have the following definition.

<div class="boxed">

**Definition:** The set $\mathcal{H}$ is said to be a Hilbert space if (a) $\mathcal{H}$ is a vector space equipped with an inner product $\left \langle *,* \right \rangle: \mathcal{H} \times \mathcal{H} \rightarrow \mathbb{R}$, and (b) $\mathcal{H}$ is complete with respect to the norm induced by the inner product.

</div>

*Examples:* The following two examples illustrate how wide the applicability of Hilbert space theory is:

- The usual vector space $\mathbb{R}^n$ with the dot product $\mathbf{a}^T \mathbf{b}:=\sum_{i=1}^n a_i b_i$, where $a_i$ and $b_i$ are the $i$-th components of $\mathbf{a}$ and $\mathbf{b}$, respectively. Completeness is easy to prove. The norm induced by the above inner product is the usual Euclidean norm.
- Consider the set $\mathcal{F}:= \{f: [0,1]^n \rightarrow \mathbb{R} ~| ~f \text{ is continuous }\}$. With the usual definitions of addition of functions and multiplication of a function by a scalar, along with the fact that the sum of continuous functions is continuous, it is easy to see that the set $\mathcal{F}$ is a vector space over $\mathbb{R}$. Now, we need to define the inner product. Here is a natural choice; for any $f, g \in \mathcal{F}$,
$$\left \langle f,g \right \rangle := \int_{0}^1 \ldots \int_{0}^1 f(x) g(x) dx^n. $$
It is easy to see that the above is linear in both arguments, and if $f=g$, then $\int_{0}^1 \ldots \int_{0}^1 f^2(x) dx^n \geq 0$, with equality if and only if $f(x) = 0$ for all $x$. This clearly shows that the above is an inner product. The last condition, completeness, is more delicate: with this inner product, $\mathcal{F}$ itself is in fact *not* complete, but it can be completed to obtain the space of square-integrable functions $L^2([0,1]^n)$, which is a Hilbert space. The norm induced by the above inner product is
\begin{equation}
||f||_{\mathcal{F}} := \sqrt{\int_{0}^1 \ldots \int_{0}^1 f^2(x) dx^n}.
\end{equation}
It is an easy exercise to show that this is a norm. In fact, it gives us a tool to measure the "size" of a function. I encourage students to see the similarity between the above and Fourier theory.

With the above tool, which helps in "lifting" the feature vectors, it is time to see what the algorithm should be, given a lifting function $\Phi: \mathcal{X} \rightarrow \mathcal{H}$. A natural step is to modify the SVM algorithm to accommodate this change, as follows.

---
**Algorithm: Hilbert SVM**

---
- **Input:** $m$ labelled data points
- **Solve:**
\begin{eqnarray}
(\mathbf{w}_0,b_0) \in &\arg \min_{(\mathbf{w},b) \in \mathcal{H} \times \mathbb{R}}& ||\mathbf{w}||^2 \\
&\text{subject to}& y_i\left(\left \langle \mathbf{w} \; , \Phi(\mathbf{x}_i)\; \right \rangle + b\right) \geq 1. ~~~~~ (\mathbf{H}1)
\end{eqnarray}
- **Output:** $\left(\frac{\mathbf{w}_0}{||\mathbf{w}_0||}, \frac{b_0}{||\mathbf{w}_0||}\right)$.

---

In the above, $\mathbf{w} \in \mathcal{H}$. Now, the question is how to solve $\mathbf{H}1$. The Lagrangian comes to the rescue: the same technique that we used for $\mathbf{P}1$ works here too. I will leave it to the reader to prove the following theorem; all that you need to do is to replace $\mathbf{x}_i$ by $\Phi(\mathbf{x}_i)$.

<div class="boxed">

**Theorem:** The solution to the problem $\mathbf{H}1$ is given by $\mathbf{w}^* = \sum_{i=1}^m \lambda_i y_i \Phi(\mathbf{x}_i)$, where $\lambda_i \geq 0$ for all $i \in [m]$ are the Lagrange multipliers that can be chosen to satisfy the constraints.

</div>

---

Substituting the above optimal weight into the Lagrangian (for the homogeneous case) and simplifying, with the constant factors absorbed into the multipliers, we get the following dual optimization problem:
\begin{equation}
\max_{\mathbf{\lambda} \succeq \mathbf{0}} \left[\sum_{i=1}^m \lambda_i - \frac{1}{2}\sum_{i,j=1}^m \lambda_i \lambda_j y_i y_j \left \langle \Phi(\mathbf{x}_i),\Phi(\mathbf{x}_j) \right \rangle \right].
\end{equation}
Once we solve the above problem, we get the final solution. Notice that the lifted features enter the dual only through the inner products $\left \langle \Phi(\mathbf{x}_i),\Phi(\mathbf{x}_j) \right \rangle$. Despite all this, we have not really come up with ways of finding the map $\Phi(\cdot)$ and a suitable Hilbert space $\mathcal{H}$! Let us focus on this next.

**References**

- Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, "Foundations of Machine Learning," Second Edition, MIT Press, 2018.
- Shai Shalev-Shwartz and Shai Ben-David, "Understanding Machine Learning: From Theory to Algorithms," Cambridge University Press, 2014.

---