# LMNN Algorithm - Large Margin Nearest Neighbor

In this article, I will introduce an algorithm designed to improve the accuracy of the KNN ***(K-Nearest Neighbors)*** algorithm that I discussed in the previous post. Instead of using a standard distance ***(metric)*** such as the ***Manhattan*** or ***Euclidean*** distance, the ***LMNN*** algorithm learns a distance metric tailored specifically to the ***classification*** task at hand.

## 1. Concept of Metric and Pseudometric

A ***metric*** on a set $X$ is a ***mapping***

$d: X \times X \rightarrow \mathbb{R}$

satisfying:

1. $d(x_i, x_j) + d(x_j, x_k) \geq d(x_i, x_k)$
2. $d(x_i, x_j) = d(x_j, x_i)$
3. $d(x_i, x_j) \geq 0$
4. $d(x_i, x_j) = 0 \Leftrightarrow x_i = x_j$

A ***pseudometric*** is a distance measure that only needs to satisfy conditions 1, 2, and 3 above: it allows two distinct points $x_i, x_j$ to be at distance 0 from each other.

## 2. Nature of the Algorithm

The ***LMNN*** algorithm learns a ***pseudometric*** of the form

$d(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$

***where*** $M$ is a positive semi-definite matrix: an $n \times n$ matrix $M$ is called ***positive semi-definite*** if and only if $x^T M x \geq 0, \forall x \in \mathbb{R}^n$.

In the ***LMNN*** algorithm, there are two kinds of neighbors: ***target neighbors*** and ***impostors***. Consider a training dataset $D = \{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$.

### 2.1 Target Neighbor

The target neighbors of a point $x_i$ are the points closest to $x_i$ under the distance ***d*** that belong to the same class as $x_i$; in practice they are fixed before learning, for example as the $k$ nearest same-class points under the Euclidean distance. Note that the relation is not symmetric: if $x_j$ is a target neighbor of $x_i$, it does not necessarily follow that $x_i$ is a target neighbor of $x_j$. The set of target neighbors of $x_i$ is denoted $N_i$.

### 2.2 Impostor

A point $x_l$ is called an ***impostor*** of $x_i$ if $x_l$ is among the nearest neighbors of $x_i$ but belongs to a different class.

![](https://i.imgur.com/yeY4We4.png)

The image, taken from [wiki](https://en.wikipedia.org/wiki/Large_margin_nearest_neighbor), illustrates target neighbors and impostor points. As the figure shows, the goal of the method is to **learn a distance metric** instead of simply using the ***Euclidean*** distance; in practice, the distances between entities are often ***non-Euclidean***. The idea is to pull the target neighbors closer to $x_i$ and to push away the points that do not share $x_i$'s class.

### 2.3 Mechanism of Operation

For the ***KNN*** algorithm to be effective, for each point $x_i$ the ***target neighbors*** under the mapping ***d*** should lie closer than any point whose label differs from $x_i$'s. In other words, we need to learn ***a mapping*** (***a distance function***) such that the ***target neighbors*** of $x_i$ form a region around $x_i$ that contains no point of a ***different class*** than $x_i$. If we can learn such a mapping ***d***, so that ***every training sample*** is surrounded by at least ***k points of the same class*** as that sample, the ***KNN*** prediction error on the training set is driven to zero.

Modeling this desire as an optimization problem therefore means minimizing the number of points that do not share $x_i$'s class ***(impostors)***. The ***LMNN*** algorithm suppresses ***impostors*** by maintaining a large gap, called the ***margin***, between the region of the ***target neighbors*** and the ***impostors***. This is why the algorithm is named ***Large Margin Nearest Neighbor***.
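Before formalizing the margin condition, the definitions so far can be made concrete in code. The following is a minimal NumPy sketch, not taken from any library: the names `lmnn_distance` and `targets_and_impostors`, the placeholder matrix `M`, and the choice `k=3` are illustrative assumptions.

```python
import numpy as np

def lmnn_distance(M, x_i, x_j):
    """Pseudometric d(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j), M assumed PSD."""
    diff = x_i - x_j
    return diff @ M @ diff

def targets_and_impostors(X, y, i, M, k=3):
    """Indices of the k target neighbors of X[i] (closest same-class points)
    and of its impostors (different class, inside the region spanned by the
    target neighbors plus the unit margin)."""
    n = len(X)
    d = np.array([lmnn_distance(M, X[i], X[j]) for j in range(n)])
    same = np.where((y == y[i]) & (np.arange(n) != i))[0]
    targets = same[np.argsort(d[same])[:k]]    # k nearest same-class points
    radius = d[targets].max() + 1.0            # farthest target neighbor + margin 1
    impostors = np.where((y != y[i]) & (d <= radius))[0]
    return targets, impostors
```

A convenient sanity check: with `M = np.eye(X.shape[1])` the pseudometric reduces to the squared Euclidean distance, so the target neighbors coincide with the ordinary same-class nearest neighbors.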
Formally, we redefine $x_l$ to be an ***impostor*** of $x_i$ if

\begin{equation} d(x_i, x_l) \leq d(x_i, x_j) + 1 \text{ for some } x_j \in N_i \end{equation}

In other words, an ***impostor*** $x_l$ of $x_i$ is a point of a different class than $x_i$ that lies within the region occupied by the ***target neighbors***, enlarged by a certain ***margin*** (here chosen as 1).

![](https://i.imgur.com/95PIUka.png)

As the figure shows, before learning, the ***k nearest neighbors*** of a point $x_i$ may include both ***impostors*** and ***target neighbors***. After learning, the ***target neighbors*** have been ***pulled*** closer to $x_i$, the ***impostors*** have been pushed away, and a ***margin*** region separates the ***impostors*** from the ***target neighbors***.

### 2.4 Constructing the Loss Function in LMNN

As mentioned above, the LMNN algorithm improves ***KNN*** by learning a mapping ***d*** such that, for a given value of $k$, each point $x_i$ is ideally surrounded only by its ***k target neighbors***. The loss function combines two competing objectives:

+ ***pull*** the ***target neighbors*** closer to $x_i$;
+ ***push*** the ***impostors*** away from $x_i$.

Pulling the ***target neighbors*** closer to $x_i$ corresponds to minimizing:

\begin{equation} f_1(d) = \sum\limits_{i} \sum\limits_{j \in N_i} d(x_i, x_j) \end{equation}

where ***d*** is the ***pseudometric*** that the ***LMNN*** algorithm needs to learn.

Before learning, the ***impostors*** lie within the region of the ***target neighbors*** plus the ***margin***. Pushing them away from $x_i$ corresponds to enforcing:

\begin{equation} d(x_i, x_j) + 1 - d(x_i, x_l) \leq 0, \quad \forall x_j \in N_i \end{equation}

This ***pushing*** is achieved by minimizing the hinge-type loss:

\begin{equation} f_2(d) = \sum\limits_{i} \sum\limits_{j \in N_i} \sum\limits_{l:\, y_l \neq y_i} \big[ d(x_i, x_j) + 1 - d(x_i, x_l) \big]_+ \end{equation}

**Note:**

+ $[x]_+ = \max(x, 0)$, so a term contributes to the loss only while the impostor $x_l$ still violates the margin.
+ Fixing the ***margin*** at 1 entails no loss of generality: rescaling the matrix ***M*** in the mapping ***d*** rescales all distances, so any other positive margin value is equivalent.

Combining the two objectives, the overall loss function to minimize is:

\begin{equation} L(d) = \mu f_1(d) + (1 - \mu) f_2(d), \quad \mu \in [0, 1] \end{equation}

where $\mu$ is a hyperparameter chosen beforehand that controls the relative weight of the two components of $L(d)$.
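The loss translates almost line by line into code. Below is a minimal NumPy sketch of evaluating $L(d)$; the name `lmnn_loss`, the target-neighbor structure `N`, and the default `mu=0.5` are illustrative assumptions, not part of any library API.

```python
import numpy as np

def lmnn_loss(M, X, y, N, mu=0.5):
    """Evaluate L(d) = mu * f1(d) + (1 - mu) * f2(d) with a unit margin.

    N[i] lists the target-neighbor indices N_i of X[i]; M is assumed PSD.
    """
    def d(a, b):
        diff = X[a] - X[b]
        return diff @ M @ diff          # pseudometric (x_a - x_b)^T M (x_a - x_b)

    f1, f2 = 0.0, 0.0
    for i in range(len(X)):
        diff_class = np.where(y != y[i])[0]       # candidate impostors of x_i
        for j in N[i]:
            f1 += d(i, j)                         # pull term
            for l in diff_class:                  # push term: hinge [x]_+
                f2 += max(d(i, j) + 1.0 - d(i, l), 0.0)
    return mu * f1 + (1.0 - mu) * f2
```

Note that this sketch only evaluates the loss; actually learning the metric means minimizing $L$ over the entries of the positive semi-definite matrix $M$, which the original LMNN formulation does with semidefinite programming. In practice one would typically reach for an existing implementation, such as the `LMNN` estimator in the `metric-learn` package.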