owned this note
owned this note
Soft $k$-NN
==
Kyunghyun Cho (New York University)
TL;DR: after all, isn't $k$-NN all we do?
in my course, i use $k$-NN as a bridge between a linear softmax classifier and a deep neural net via an adaptive radial basis function network. until this year, i've been considering the special case of $k=1$, i.e., 1-NN, only and from there on moved to the adaptive radial basis function network. i decided however to show them how $k$-NN with $k > 1$ could be implemented as a sequence of computational layers this year, hoping that this would facilitate students understanding the spectrum spanning between linear [softmax](https://link.springer.com/chapter/10.1007/978-3-642-76153-9_28) classification and deep learning.
we are given $D=\left\{ (x_1, y_1), \ldots, (x_N, y_N) \right\}$, where $x_n \in \mathbb{R}^d$ and $y_n$ is an associated label represented as a one-hot vector. let us construct a layer that computes the nearest neighbour of a new input $x$. this can be implemented by first computing the activation of each training instance:
\begin{align}
h^1_n =
\frac{\exp(-\beta \| x_n - x \|^2)}
{\sum_{n'=1}^N \exp(-\beta \| x_{n'} - x \|^2)}.
\end{align}
in the limit of $\beta \to \infty$, we notice that this activation saturates to either $0$ or $1$:
\begin{align}
h^1_n {\to}_{\beta \to \infty}
\begin{cases}
1, &\text{if $x_n$ is the nearest neighbour of $x$} \\
0, &\text{otherwise}
\end{cases}
\end{align}
the output from this 1-NN is then computed as
\begin{align}
\hat{y}^1 = \sum_{n=1}^N h^1_n y_n = Y^\top h^1,
\end{align}
where $h^1$ is a vector stacking $h^1_n$'s and
\begin{align}
Y=\left[
\begin{array}{c}
y_1 \\
\vdots \\
y_N
\end{array}
\right].
\end{align}
this was relatively straightforward with 1-NN. how do we extend it to 2-NN? to do so, we define a new computational layer that computes the following activation for each training instance:
\begin{align}
h^2_n =
\frac{\exp(-\beta (\| x_n - x \|^2 + \gamma h^1_n))}
{\sum_{n'=1}^N \exp(-\beta (\| x_{n'} - x \|^2 + \gamma h^1_n))}.
\end{align}
now we consider the limit of both $\beta\to \infty$ and $\gamma \to \infty$, at which this new activation also saturates to either 0 or 1:
\begin{align}
h^2_n \to_{\beta, \gamma \to \infty}
\begin{cases}
1, \text{if $x_n$ is the second nearest neighbour of $x$} \\
0, \text{otherwise}
\end{cases}
\end{align}
this magical property comes from the fact that $\gamma h_n^1$ effectively kills the *first* nearest neighbour's activation when $\gamma \to \infty$. this term does not affect any non-nearest neighbour instances, because $h_n^1=0$ for those instances.
the output from this 2-NN is then
\begin{align}
\hat{y}^2 = \frac{1}{2} \sum_{k=1}^2 \sum_{n=1}^N h^k_n y_n.
\end{align}
now you see where i'm getting at, right? let me generalizie this to the $k$-th nearest neighbour:
\begin{align}
h^k_n =
\frac{\exp(-\beta (\| x_n - x \|^2 + \gamma \sum_{k'=1}^{k-1} h^{k'}_n))}
{\sum_{n'=1}^N \exp(-\beta (\| x_{n'} - x \|^2 + \gamma \sum_{k'=1}^{k-1} h^{k'}_n))},
\end{align}
where we see some resemblance to [residual connections](http://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html) (add the previous layers' activations directly.)
In the limit of $\beta\to\infty$ and $\gamma \to \infty$,
\begin{align}
h^k_n \to_{\beta, \gamma \to \infty}
\begin{cases}
1, \text{if $x_n$ is the $k$-th nearest neighbour of $x$} \\
0, \text{otherwise}
\end{cases}
\end{align}
the output from this $K$-NN is then
\begin{align}
\hat{y}^K = \frac{1}{K} \sum_{k=1}^K \sum_{n=1}^N h_n^K y_n,
\end{align}
which is reminiscent of so-called [deeply supervised](http://www.jmlr.org/proceedings/papers/v38/lee15a.pdf) nets from a few years back.
it is not difificult to imagine not taking the infinite limits of $\beta$ and $\gamma$, which leads to soft $k$-NN.
In summary, soft $k$-NN consists of $k$ nonlinear layers. Each nonlinear layer consists of radial basis functions with training instances as bases (nonlinear activation), and further takes as input the sum of the previous layers' activations (residual connection.) each layer's activation is used to compute the softmax output (self-normalized) using the one-hot label vectors associated with the training instances, and we average the predictions from all the layers (deeply supervised).
of course, this perspective naturally leads us to think of generalization in which we replace training instances with learnable bases across all $k$ layers and learn them using backpropagation. this is what we call *deep learning*.