Soft $k$-NN

Kyunghyun Cho (New York University)

TL;DR: after all, isn't $k$-NN all we do?

in my course, i use $k$-NN as a bridge between a linear softmax classifier and a deep neural net via an adaptive radial basis function network. until this year, i've been considering the special case of $k=1$, i.e., 1-NN, only and from there on moved to the adaptive radial basis function network. i decided however to show them how $k$-NN with $k > 1$ could be implemented as a sequence of computational layers this year, hoping that this would facilitate students understanding the spectrum spanning between linear softmax classification and deep learning.

we are given $D=\left\{ (x_1, y_1), \ldots, (x_N, y_N) \right\}$, where $x_n \in \mathbb{R}^d$ and $y_n$ is an associated label represented as a one-hot vector. let us construct a layer that computes the nearest neighbour of a new input $x$. this can be implemented by first computing the activation of each training instance:
\begin{align} h^1_n = \frac{\exp(-\beta \| x_n - x \|^2)} {\sum_{n'=1}^N \exp(-\beta \| x_{n'} - x \|^2)}. \end{align}

in the limit of $\beta \to \infty$, we notice that this activation saturates to either $0$ or $1$:
\begin{align} h^1_n {\to}_{\beta \to \infty} \begin{cases} 1, &\text{if $x_n$ is the nearest neighbour of $x$} \\ 0, &\text{otherwise} \end{cases} \end{align}

the output from this 1-NN is then computed as
\begin{align} \hat{y}^1 = \sum_{n=1}^N h^1_n y_n = Y^\top h^1, \end{align}

where $h^1$ is a vector stacking $h^1_n$'s and
\begin{align} Y=\left[ \begin{array}{c} y_1 \\ \vdots \\ y_N \end{array} \right]. \end{align}

this was relatively straightforward with 1-NN. how do we extend it to 2-NN? to do so, we define a new computational layer that computes the following activation for each training instance:
\begin{align} h^2_n = \frac{\exp(-\beta (\| x_n - x \|^2 + \gamma h^1_n))} {\sum_{n'=1}^N \exp(-\beta (\| x_{n'} - x \|^2 + \gamma h^1_n))}. \end{align}

now we consider the limit of both $\beta\to \infty$ and $\gamma \to \infty$, at which this new activation also saturates to either 0 or 1:
\begin{align} h^2_n \to_{\beta, \gamma \to \infty} \begin{cases} 1, \text{if $x_n$ is the second nearest neighbour of $x$} \\ 0, \text{otherwise} \end{cases} \end{align}

this magical property comes from the fact that $\gamma h_n^1$ effectively kills the first nearest neighbour's activation when $\gamma \to \infty$. this term does not affect any non-nearest neighbour instances, because $h_n^1=0$ for those instances.

the output from this 2-NN is then
\begin{align} \hat{y}^2 = \frac{1}{2} \sum_{k=1}^2 \sum_{n=1}^N h^k_n y_n. \end{align}

now you see where i'm getting at, right? let me generalizie this to the $k$-th nearest neighbour:
\begin{align} h^k_n = \frac{\exp(-\beta (\| x_n - x \|^2 + \gamma \sum_{k'=1}^{k-1} h^{k'}_n))} {\sum_{n'=1}^N \exp(-\beta (\| x_{n'} - x \|^2 + \gamma \sum_{k'=1}^{k-1} h^{k'}_n))}, \end{align}

where we see some resemblance to residual connections (add the previous layers' activations directly.)

In the limit of $\beta\to\infty$ and $\gamma \to \infty$,
\begin{align} h^k_n \to_{\beta, \gamma \to \infty} \begin{cases} 1, \text{if $x_n$ is the $k$-th nearest neighbour of $x$} \\ 0, \text{otherwise} \end{cases} \end{align}

the output from this $K$-NN is then
\begin{align} \hat{y}^K = \frac{1}{K} \sum_{k=1}^K \sum_{n=1}^N h_n^K y_n, \end{align}

which is reminiscent of so-called deeply supervised nets from a few years back.

it is not difificult to imagine not taking the infinite limits of $\beta$ and $\gamma$, which leads to soft $k$-NN.

In summary, soft $k$-NN consists of $k$ nonlinear layers. Each nonlinear layer consists of radial basis functions with training instances as bases (nonlinear activation), and further takes as input the sum of the previous layers' activations (residual connection.) each layer's activation is used to compute the softmax output (self-normalized) using the one-hot label vectors associated with the training instances, and we average the predictions from all the layers (deeply supervised).

of course, this perspective naturally leads us to think of generalization in which we replace training instances with learnable bases across all $k$ layers and learn them using backpropagation. this is what we call deep learning.

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`	在筆記中貼入程式碼
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.

Soft \(k\)-NN