# High-Dimensional Space
## Introduction
High-dimensional data is becoming increasingly prevalent. High-dimensional spaces exhibit unique characteristics that differ significantly from familiar two- and three-dimensional spaces. This note introduces the concept of generating random points in high-dimensional spaces and the observation that distances between these points tend to become uniform as dimensionality increases. Additionally, we explore the counterintuitive phenomenon of the volume of the unit ball diminishing as the dimension grows, with most of the volume concentrating near the surface and the equator.
## The Law of Large Numbers
The Law of Large Numbers states that as we increase the number of independent random variables and take their average, this average will tend to get closer and closer to the expected value of those variables. In the context of high-dimensional spaces, this has an interesting implication: the distances between randomly generated points tend to become almost the same.
**Theorem (Law of Large Numbers)** Let $x_1, x_2, \dots, x_n$ be $n$ independent samples of a random variable $x$. Then
$$
\text{Prob} \left( |\frac{x_1 + x_2 + \dots+ x_n}{n} - \mathbb{E}(x)| \geq \epsilon \right) \leq \frac{\text{Var}(x)}{n \epsilon^2}
$$
where $\epsilon$ is a positive number representing the deviation from the expected value.
* **In simpler words:** The probability that the average of $n$ samples deviates from the true expected value by more than $\epsilon$ is bounded by the variance of the random variable divided by $n$ times $\epsilon^2$.
#### Implication in High Dimensions:
* The squared distance between two points in a high-dimensional space can be thought of as the sum of many independent random variables (the squared differences in each dimension).
* As the dimensionality increases ($n$ gets larger), the Law of Large Numbers tells us that the average of these squared differences will converge to their expected value.
* This results in distances between random points becoming more and more uniform.
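To make this concrete, here is a small numerical sketch (not from the source; the point count, dimensions, and seed are arbitrary choices) that samples random points from the unit cube and measures how the relative spread of their pairwise distances shrinks as the dimension grows.
```python=
import numpy as np

rng = np.random.default_rng(0)
n = 200  # number of random points per experiment (arbitrary choice)

for d in [2, 10, 100, 1000, 10000]:
    # n random points in the unit cube [0, 1]^d.
    pts = rng.random((n, d))
    # Squared pairwise distances via the Gram matrix: |x - y|^2 = |x|^2 + |y|^2 - 2 x.y
    gram = pts @ pts.T
    sq = np.diag(gram)
    d2 = sq[:, None] + sq[None, :] - 2 * gram
    dist = np.sqrt(np.maximum(d2[np.triu_indices(n, k=1)], 0.0))
    # The relative spread of the distances shrinks as d grows.
    print(f"d={d:6d}  mean distance={dist.mean():8.3f}  std/mean={dist.std() / dist.mean():.4f}")
```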
### Inequalities Used in the Proof
We leverage two key inequalities to prove the Law of Large Numbers:
#### Markov’s Inequality
Theorem (Markov’s inequality) If $x$ is a non-negative random variable and $a > 0$, then
$$
\text{Prob}(x \geq a) \leq \frac{\mathbb{E}(x)}{a}
$$
**Proof.** For a continuous nonnegative random variable $x$ with probability density $p$,
$$
\begin{split}
\mathbb{E}(x) = \int^{\infty}_0 xp(x)dx
&\geq \int^{\infty}_a xp(x)dx \\
&\geq a \int^{\infty}_a p(x)dx
= a \cdot \text{Prob}(x \geq a) \quad \Box
\end{split}
$$
#### Chebyshev’s Inequality
Theorem (Chebyshev’s inequality) For any random variable $x$ and any $c > 0$,
$$
\text{Prob}\left(|x − \mathbb{E}(x)| \geq c \right) \leq \frac{\text{Var}(x)}{c^2}
$$
**Proof.** Markov’s inequality implies
$$
\begin{split}
\text{Prob}\left(|x − \mathbb{E}(x)| \geq c \right)
&= \text{Prob}\left(|x − \mathbb{E}(x)|^2 \geq c^2 \right) \\
&\leq \frac{\mathbb{E}(|x − \mathbb{E}(x)|^2)}{c^2}
= \frac{\text{Var}(x)}{c^2} \quad \Box
\end{split}
$$
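As a quick sanity check of both inequalities (a sketch, not part of the source; the exponential distribution and the thresholds are arbitrary choices), the snippet below compares Monte Carlo estimates of the tail probabilities with the corresponding bounds, using a random variable with $\mathbb{E}(x) = \text{Var}(x) = 1$.
```python=
import numpy as np

rng = np.random.default_rng(0)
# Exponential(1) samples: non-negative, with E(x) = 1 and Var(x) = 1.
x = rng.exponential(scale=1.0, size=1_000_000)

for a in [2.0, 4.0, 8.0]:
    empirical = np.mean(x >= a)
    print(f"Markov    a={a}: Prob(x >= a) ~ {empirical:.4f}  <=  E(x)/a = {1.0 / a:.4f}")

for c in [2.0, 4.0, 8.0]:
    empirical = np.mean(np.abs(x - 1.0) >= c)
    print(f"Chebyshev c={c}: Prob(|x - E(x)| >= c) ~ {empirical:.4f}  <=  Var(x)/c^2 = {1.0 / c**2:.4f}")
```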
### Proof of the Law of Large Numbers
1. **Expectation of the Sample Mean:** The expected value of the sample mean is equal to the expected value of the underlying random variable:
$$
\mathbb{E}\left[\frac{x_1 + x_2 + \dots + x_n}{n}\right]
= \frac{\mathbb{E}(x_1) + \mathbb{E}(x_2) + \dots + \mathbb{E}(x_n)}{n}
= \mathbb{E}(x)
$$
2. **Apply Chebyshev's Inequality:** We apply Chebyshev's inequality to $X = \frac{x_1 + x_2 + \dots + x_n}{n}$ and $c = \epsilon$. Then we have: $$\text{Prob}\left(|\frac{x_1 + x_2 + \dots + x_n}{n} - \mathbb{E}[\frac{(x_1 + x_2 + \dots + x_n)}{n}] | \geq \epsilon \right) \leq \frac{\text{Var}(\frac{x_1 + x_2 + \dots + x_n}{n})}{\epsilon^2}$$
3. **Simplify using properties of variance:**
* We know from step 1 that $\mathbb{E}[(x_1 + x_2 + \dots + x_n)/n] = \mathbb{E}(x)$. Also:
* **Variance of a sum of independent variables:** If $X$ and $Y$ are independent, $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
* **Variance scales quadratically:** $\text{Var}(aX) = a^2\text{Var}(X)$ for any constant $a$
* Using these, we simplify the right side of the inequality: $$\begin{split} \text{Var}\left(\frac{x_1 + x_2 + \dots + x_n}{n}\right) &= \frac{\text{Var}(x_1 + x_2 + \dots + x_n)}{n^2} \\ &= \frac{\text{Var}(x_1) + \text{Var}(x_2) + \dots + \text{Var}(x_n)}{n^2} \\ &= \frac{n \text{Var}(x)}{n^2} = \frac{\text{Var}(x)}{n} \end{split}$$
4. **Conclusion:** Combining steps 2 and 3, we get the desired result: $$\text{Prob} \left( |\frac{x_1 + x_2 + \dots+ x_n}{n} - \mathbb{E}(x)| \geq \epsilon \right) \leq \frac{\text{Var}(x)}{n \epsilon^2}$$
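The bound can also be checked empirically (a minimal sketch, not from the source; $\epsilon$, the distribution, and the trial count are arbitrary choices): the deviation probability of the sample mean should fall at least as fast as $\frac{\text{Var}(x)}{n\epsilon^2}$.
```python=
import numpy as np

rng = np.random.default_rng(0)
eps = 0.3          # deviation threshold (arbitrary choice)
trials = 20_000    # number of sample means per n
# Exponential(1): E(x) = 1 and Var(x) = 1.
for n in [10, 100, 1000]:
    means = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
    empirical = np.mean(np.abs(means - 1.0) >= eps)
    bound = 1.0 / (n * eps**2)
    print(f"n={n:5d}  Prob(|mean - 1| >= {eps}) ~ {empirical:.4f}  <=  {bound:.4f}")
```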
### Tail Bounds Table
| | Condition | Tail bound |
|--------|-----------|------------|
| Markov | $x \geq 0$ | $\text{Prob}(x \geq a) \leq \frac{\mathbb{E}(x)}{a}$ |
| Chebyshev | Any $x$ | $\text{Prob} (\mid x − \mathbb{E}(x) \mid \geq c ) \leq \frac{\text{Var}(x)}{c^2}$ |
| Chernoff | $x = x_1 + x_2 + \dots+ x_n$<br>$x_i \in [0, 1]$ i.i.d. Bernoulli | $\text{Prob}(\vert x − \mathbb{E}(x)\vert \geq \epsilon \mathbb{E}(x)) \leq 3e^{-c \epsilon^2 \mathbb{E}(x)}$ |
| Higher Moments | $r$ positive even integer | $\text{Prob}(\vert x\vert \geq a) \leq \frac{\mathbb{E}(x^r)}{a^r}$ |
| Gaussian Annulus | $x = \sqrt{x_1^2 + x_2^2 +\dots+ x_n^2}$<br>$x_i \sim N(0, 1)$ indep.; $\beta \leq \sqrt{n}$ | $\text{Prob}(\vert x − \sqrt{n}\vert \geq \beta) \leq 3e^{-c \beta^2}$ |
| Power Law for $x_i$; order $k \geq 4$ | $x = x_1 + x_2 + \dots + x_n$<br>$x_i$ i.i.d.; $\epsilon \leq \frac{1}{k^2}$ | $\text{Prob} \left(\vert x − \mathbb{E}(x)\vert \geq \epsilon \mathbb{E}(x) \right) \leq \left(\frac{4}{\epsilon^2 kn}\right)^{\frac{k-3}{2}}$ |
## Properties of High-Dimensional Unit Balls
High-dimensional unit balls possess some surprising properties that set them apart from their lower-dimensional counterparts.
### Diminishing Surface Area and Volume
As the dimension $d$ increases, the surface area and volume of the unit ball shrink dramatically, eventually approaching zero. The surface area $A(d)$ and volume $V(d)$ are given by:
$$
A(d) = \frac{2 \pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2})}, \quad V(d) = \frac{2\pi^{\frac{d}{2}}}{d \Gamma(\frac{d}{2})}
$$
where $\Gamma(x)$ is the Gamma function (a generalization of the factorial function for non-integer values of $x$).
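These formulas are easy to evaluate numerically; the short sketch below (not from the source) uses `scipy.special.gamma` to tabulate $A(d)$ and $V(d)$ and shows both heading toward zero once $d$ is moderately large.
```python=
import numpy as np
from scipy.special import gamma

def unit_sphere_area(d):
    # A(d) = 2 * pi^(d/2) / Gamma(d/2)
    return 2 * np.pi ** (d / 2) / gamma(d / 2)

def unit_ball_volume(d):
    # V(d) = 2 * pi^(d/2) / (d * Gamma(d/2))
    return 2 * np.pi ** (d / 2) / (d * gamma(d / 2))

for d in [1, 2, 3, 5, 10, 20, 50]:
    print(f"d={d:3d}  A(d)={unit_sphere_area(d):14.6e}  V(d)={unit_ball_volume(d):14.6e}")
```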
### Volume Concentration near the Surface
Most of the volume of a high-dimensional unit ball is located near its surface. If we shrink the ball slightly, we lose a significant portion of its volume. Mathematically:
* For a unit ball $S$ and a shrink factor $\epsilon$, the volume of the shrunken ball $(1-\epsilon)S$ is $$\text{volume}((1-\epsilon)S) = (1-\epsilon)^d \text{volume}(S) \leq e^{-d \epsilon} \text{volume}(S)$$
* As $d$ grows, the shrunken ball's volume becomes negligible compared to the original, indicating that most of the volume resides in a thin shell near the surface.
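A one-line computation makes this concrete (a sketch, not from the source; the shell width $\epsilon$ is an arbitrary choice): even a thin shell of width $\epsilon = 0.01$ contains essentially all of the volume once $d$ is in the hundreds.
```python=
import numpy as np

eps = 0.01  # shell width (arbitrary choice)
for d in [10, 100, 1000]:
    inner = (1 - eps) ** d          # volume fraction of the shrunken ball (1 - eps)S
    print(f"d={d:5d}  (1-eps)^d = {inner:.2e}  <=  e^(-eps*d) = {np.exp(-eps * d):.2e}"
          f"  ->  shell fraction = {1 - inner:.4f}")
```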
### Volume Concentration near the Equator
The volume also concentrates near the "equator" of the ball. For any unit vector $\mathbf{v}$ (think of it as pointing to the "north pole"), most of the volume lies in a thin slab of points whose dot product with $\mathbf{v}$ is small (on the order of $1/\sqrt{d}$).
**Theorem** For $c \geq 1$ and $d \geq 3$, at least a $1 − \frac{2}{c} e^{\frac{−c^2}{2}}$ fraction of the volume of the $d$-dimensional unit ball has $|x_1| \leq \frac{c}{\sqrt{d−1}}$.

* This means most of the volume is where the first coordinate $x_1$ is small, corresponding to the region near the equator if $\mathbf{v}$ is the first coordinate vector.
**Proof.**
* Due to the symmetry of the ball, it suffices to prove that at most a $\frac2c e^{-c^2/2}$ fraction of the *half* of the ball with $x_1 \geq 0$ has $x_1 \geq \frac{c}{\sqrt{d-1}}$.
* Let's define:
* $A$: the portion of the ball with $x_1 \geq \frac{c}{\sqrt{d-1}}$
* $H$: the upper hemisphere (where $x_1 \geq 0$)
* We aim to show that the ratio of the volume of $A$ to the volume of $H$ is small: $$\frac{\text{volume}(A)}{\text{volume}(H)} \leq \frac2c e^{-\frac{c^2}{2}}$$
* **Calculate the volume of $A$:** $$\text{volume}(A) = \int_{\frac{c}{\sqrt{d-1}}}^{1}(1-x_1^2)^{\frac{d-1}{2}} V(d-1) dx_1$$
* **Upper bound on $\text{volume}(A)$:**
* We use the inequality $1 - x \leq e^{-x}$ and extend the integral to infinity to obtain an upper bound $$\begin{split} \text{volume}(A) &= \int_{\frac{c}{\sqrt{d-1}}}^{1}\left(1-x_1^2\right)^{\frac{d-1}{2}}V(d-1)dx_1 \\ &\leq \int_{\frac{c}{\sqrt{d-1}}}^{\infty} e^{-\frac{d-1}{2}x_1^2} V(d-1)dx_1 \end{split}$$
* To simplify the integral, we make a substitution:
* Let $u = \frac{\sqrt{d-1}}{c} x_1$. Then $du = \frac{\sqrt{d-1}}{c} dx_1$, the lower limit $x_1 = \frac{c}{\sqrt{d-1}}$ becomes $u = 1$, and the exponent becomes $-\frac{d-1}{2}x_1^2 = -\frac{c^2u^2}{2}$, so $$\text{volume}(A) \leq \frac{c}{\sqrt{d-1}} V(d-1) \int_{1}^{\infty} e^{-\frac{c^2u^2}{2}} du$$
* Further, since $u \geq 1$ over the range of integration, we have $$\begin{split} \text{volume}(A) &\leq \frac{c}{\sqrt{d-1}} V(d-1) \int_{1}^{\infty} u\, e^{-\frac{c^2u^2}{2}} du \\ &= \frac{c}{\sqrt{d-1}} V(d-1) \left[ -\frac{1}{c^2}e^{-\frac{c^2u^2}{2}} \right]_{1}^{\infty} = \frac{V(d-1)}{c\sqrt{d-1}} e^{-\frac{c^2}{2}} \end{split}$$
* **Lower bound on $\text{volume}(H)$:**
* The volume of the hemisphere $H$ is half the volume of the unit ball: $\text{volume}(H) = 1/2 V(d)$
* We can also get a lower bound on $\text{volume}(H)$ by considering a cylinder inside the hemisphere, below the plane $x_1 = \frac{1}{\sqrt{d-1}}$. This cylinder has:
* height $\frac{1}{\sqrt{d-1}}$ (so it stays below that plane and fits within the hemisphere)
* radius $\sqrt{1-\frac{1}{d-1}}$ (Using the Pythagorean theorem on the cross-section)
* The volume of this cylinder is: $V(d-1)(1-\frac{1}{d-1})^{\frac{d-1}{2}}\frac{1}{\sqrt{d-1}}$
* Using the inequality $(1-x)^a \geq 1 - ax$ for $a \geq 1$, we simplify the cylinder's volume:
$$
\begin{split}
V(d-1) \left(1-\frac{1}{d-1} \right)^{\frac{d-1}{2}} \frac{1}{\sqrt{d-1}}
&\geq V(d-1) \left(1 - \frac{d-1}{2(d-1)} \right) \frac{1}{\sqrt{d-1}} \\
&= \frac{V(d-1)}{2 \sqrt{d-1}}
\end{split}
$$
* **Final Ratio:** We now have an upper bound for $\text{volume}(A)$ and a lower bound for $\text{volume}(H)$. We can use these to bound the ratio: $$\frac{\text{volume}(A)}{\text{volume}(H)}
\leq \frac{\frac{V(d-1)}{c\sqrt{d-1}}e^{\frac{-c^2}{2}}}{\frac{V(d-1)}{2\sqrt{d-1}}} = \frac2c e^{-\frac{c^2}{2}}$$
$\Box$
### Near Orthogonality
A consequence of the above is that two random points inside the unit ball are likely to be nearly orthogonal (perpendicular) to each other. Their vectors tend to lie close to the surface and near the equator.
**Theorem** If we randomly pick $n$ points $\mathbf{x}_1, \dots, \mathbf{x}_n$ from the unit ball, with probability $1 − O(1/n)$:
* All points are close to the surface: $|\mathbf{x}_i| \geq 1 - \frac{2 \ln n}{d}$ for all $i$
* All pairs of points are nearly orthogonal: $|\mathbf{x}_i \cdot \mathbf{x}_j| \leq \frac{\sqrt{6 \ln n}}{\sqrt{d-1}}$ for all $i \neq j$
**Proof.**
* **Proximity to the surface:** For any fixed $i$, the probability that $|\mathbf{x}_i| < 1 -\epsilon$ is less than $e^{- \epsilon d}$ (from the analysis in the previous section). Setting $\epsilon = \frac{2 \ln n}{d}$, we have $$\text{Prob} \left(|\mathbf{x}_i| < 1-\frac{2\ln n}{d} \right) \leq e^{- \frac{2\ln n}{d} \cdot d} = e^{-2 \ln n} = \frac{1}{n^2}$$ By applying the union bound, the probability that there exists an $i$ such that $|\mathbf{x}_i| < 1 - \frac{2\ln n}{d}$ is at most $n \cdot \frac{1}{n^2} = \frac{1}{n}$.
* **Near orthogonality:**
* The theorem on volume concentration near the equator states that, for a random point in the unit ball, the probability that its coordinate along any fixed unit vector exceeds $\frac{c}{\sqrt{d-1}}$ in magnitude is at most $\frac2c e^{− \frac{c^2}{2}}$.
* There are $\binom n2$ pairs $i$ and $j$. For each pair, if we designate $\mathbf{x}_i$ as "north," the probability that the projection of $\mathbf{x}_j$ onto the "north" direction is more than $\frac{\sqrt{6 \ln n}}{\sqrt{d-1}}$ is at most $O(e^{−6 \frac{\ln n}{2}}) = O(n^{−3})$.
* By the union bound, the probability that the dot product condition is violated for any pair is at most $O\left(\binom n2 n^{-3} \right) = O(1/n)$.
$\Box$
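These bounds can be checked empirically with a short simulation (a sketch, not from the source; the dimension, point count, and seed are arbitrary choices). It samples uniform points from the ball using the Gaussian-based recipe described in the next section and compares the observed norms and dot products against the two bounds in the theorem.
```python=
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 50  # dimension and number of points (arbitrary choices)

# Uniform points in the unit ball (see the next section): normalize Gaussian
# vectors onto the sphere, then scale by U^(1/d) to get the radius.
g = rng.normal(size=(n, d))
pts = g / np.linalg.norm(g, axis=1, keepdims=True)
pts *= rng.uniform(size=(n, 1)) ** (1 / d)

norms = np.linalg.norm(pts, axis=1)
dots = np.abs((pts @ pts.T)[np.triu_indices(n, k=1)])

print("smallest norm:", norms.min(), " vs bound", 1 - 2 * np.log(n) / d)
print("largest |dot|:", dots.max(), " vs bound", np.sqrt(6 * np.log(n)) / np.sqrt(d - 1))
```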
## Generating Uniform Random Points within a Ball
Generating points uniformly at random within a high-dimensional ball is more nuanced than it might initially appear.
### Challenges of the Naive Approach
In two dimensions, a simple approach is to generate points within a square encompassing the unit circle and then project those points onto the circle. However, this leads to a non-uniform distribution on the circle.
* **Why it fails:** Points lying on a line from the origin to a square's vertex are more likely than those on a line to an edge's midpoint, due to the difference in line length. While discarding points outside the circle and projecting the rest can achieve uniformity in 2D, this method becomes highly inefficient in higher dimensions. The ball's volume shrinks dramatically compared to the enclosing cube, resulting in most generated points being discarded.
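The inefficiency is easy to quantify (a sketch, not from the source): the acceptance rate of rejection sampling from the enclosing cube $[-1,1]^d$ is the ratio of the ball's volume $V(d)$ to the cube's volume $2^d$, which collapses rapidly.
```python=
import numpy as np
from scipy.special import gamma

# Acceptance rate of rejection sampling from the enclosing cube [-1, 1]^d:
# ratio of the unit-ball volume V(d) to the cube volume 2^d.
for d in [2, 3, 5, 10, 20]:
    ball = 2 * np.pi ** (d / 2) / (d * gamma(d / 2))
    print(f"d={d:3d}  acceptance rate = {ball / 2 ** d:.2e}")
```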

### Solution: Leveraging the Spherical Gaussian Distribution
We can efficiently generate uniform points within a ball by harnessing the properties of the spherical Gaussian distribution. Here's the breakdown:
1. **Generate from a Spherical Gaussian:**
* Create a point $\mathbf{x}$ in d-dimensional space.
* Each coordinate $x_i$ is an independent standard normal random variable (mean 0, variance 1).
* The probability density function of $\mathbf{x}$ is: $$p(\mathbf{x}) = \frac{1}{(2\pi)^{\frac{d}{2}}} e^{-\frac{|\mathbf{x}|^2}{2}}$$
2. **Project onto the Unit Sphere:**
* Normalize $\mathbf{x}$ to get a unit vector: $\frac{\mathbf{x}}{|\mathbf{x}|}$.
* This ensures the point lies on the surface of the unit sphere.
3. **Scale to Fit within the Ball:**
* To get a point $\mathbf{y}$ uniformly inside the whole ball, scale the unit vector by a random radius $\rho$.
* $\rho$ is drawn from the interval $[0, 1]$ with a density proportional to $\rho^{d-1}$. This accounts for the increasing volume of shells at larger radii in higher dimensions.
* The final point is: $\mathbf{y} = \rho \frac{\mathbf{x}}{|\mathbf{x}|}$
#### Implementation
```python=
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
def random_point_in_unit_ball(dimension, num_points):
    """
    Generate random points uniformly distributed in a unit ball.

    Parameters:
        dimension (int): The dimension of the space
        num_points (int): The number of points to generate

    Returns:
        numpy.ndarray: Array of shape (num_points, dimension) containing the generated points
    """
    # Steps 1-2: spherical Gaussian samples, normalized onto the unit sphere.
    pts = np.random.normal(0, 1, (num_points, dimension))
    pts = pts / np.linalg.norm(pts, axis=1, keepdims=True)
    # Step 3: radii with density proportional to rho^(d-1), obtained as U^(1/d).
    sca = np.random.uniform(0, 1, (num_points, 1))
    sca = sca**(1/dimension)
    return pts * sca
```
```python=
# Generate 5000 points in 2D
P = random_point_in_unit_ball(2, 5000)
plt.figure(figsize=(6, 6))
plt.scatter(P[:, 0], P[:, 1], marker='.')
plt.xlim(-1.1, 1.1)
plt.ylim(-1.1, 1.1)
plt.show()
```

```python=
# Generate 5000 points in 3D
P = random_point_in_unit_ball(3, 5000)
# Create a 3D scatter plot
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
# Scatter plot
ax.scatter(P[:, 0], P[:, 1], P[:, 2], c=P[:, 2], cmap='viridis', s=20)
# Set labels and title
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
ax.set_title('5000 Random Points in 3D Unit Ball')
# Set axis limits
ax.set_xlim(-1, 1)
ax.set_ylim(-1, 1)
ax.set_zlim(-1, 1)
# Add a unit sphere wireframe
u = np.linspace(0, 2 * np.pi, 100)
v = np.linspace(0, np.pi, 100)
x = np.outer(np.cos(u), np.sin(v))
y = np.outer(np.sin(u), np.sin(v))
z = np.outer(np.ones(np.size(u)), np.cos(v))
ax.plot_wireframe(x, y, z, color='r', alpha=0.1)
plt.tight_layout()
plt.show()
```

## Gaussians in High Dimension
High-dimensional Gaussian distributions exhibit behavior that might surprise us based on our experience with lower dimensions. Instead of being concentrated around the mean, the probability mass tends to cluster in a thin, ring-like region called an annulus, located at a specific distance from the center.
* **Radius of the Gaussian**
* The mean squared distance of a point from the center is $d$.
* The square root of this, $\sqrt{d}$, is called the **radius of the Gaussian**.
### The Gaussian Annulus Theorem
This phenomenon is captured by the Gaussian Annulus Theorem:
**Theorem (Gaussian Annulus Theorem)**
For a $d$-dimensional spherical Gaussian with unit variance in each direction, and for any $\beta \leq \sqrt{d}$, all but at most $3e^{-c\beta^2}$ of the probability mass lies within the annulus
$$
\sqrt{d} - \beta \leq |\mathbf{x}| \leq \sqrt{d} + \beta
$$
where $c$ is a fixed positive constant.
* **Interpretation**
* Most of the probability mass is concentrated within a thin annulus (a ring) around the radius $\sqrt{d}$.
* $\beta$ controls the width of this annulus.
* The probability of a point falling outside this annulus decreases exponentially as $\beta$ increases.
**Proof.**
* Let $\mathbf{x} = (x_1,x_2,...,x_d)$ be a point selected from a unit variance Gaussian centered at the origin.
* Let $r= |\mathbf{x}|$. Then, the statement $\sqrt{d}-\beta \leq \vert \mathbf{x} \vert \leq \sqrt{d}+\beta$ is equivalent to $\vert r-\sqrt{d} \vert \leq \beta$, so it suffices to bound the probability of the complementary event $\vert r-\sqrt{d} \vert \geq \beta$.
* If $\vert r-\sqrt{d} \vert \geq \beta$, then multiplying both sides by $r+\sqrt{d}$ gives $\vert r^2-d \vert \geq \beta(r+\sqrt{d}) \geq \beta\sqrt{d}$.
* So, it suffices to bound the probability that $\vert r^2-d \vert \geq \beta\sqrt{d}$
* **Change of variables:**
* We have $r^2 -d = x^2_1+x^2_2+ \dots +x^2_d-d = (x^2_1-1)+(x^2_2-1)+ \dots +(x^2_d-1)$.
* Let $y_i = x^2_i - 1$.
* Notice that $\mathbb{E}(y_i) = \mathbb{E}(x_i^2-1) = 1-1 =0$.
* **Bounding moments:** To apply Theorem 12.5 (a tail bound from Chapter 12 of the reference text) to bound the probability that $|y_1 + y_2 + \dots + y_d| \geq \beta \sqrt{d}$, we need to bound the $s$-th moments of $y_i$.
* For $|x_i| \leq 1$, we have $|y_i|^s \leq 1$.
* For $|x_i| \geq 1$, we have $|y_i|^s \leq |x_i|^{2s}$.
* Therefore,
$$
\begin{split}
|\mathbb{E}(y_i^s)|
&\leq \mathbb{E}(|y_i|^s) \leq \mathbb{E}(1 + x_i^{2s}) = 1 + \mathbb{E}(x_i^{2s}) \\
&= 1 + \sqrt{\frac{2}{\pi}} \int_{0}^{\infty} x^{2s} e^{-x^2/2} dx
\end{split}
$$
* Let $x^2 = 2z$ in the integral. Then, $x= \sqrt{2z}$ and $dx= \frac{1}{\sqrt{2z}}dz$. Substituting these into the integral, we get $$|\mathbb{E}(y_i^s)| \leq 1 + \frac{2^{s}}{\sqrt{\pi}} \int_{0}^{\infty} z^{s - 1/2} e^{-z} dz \leq 2^s s!$$ The last inequality follows from the Gamma integral $\int_{0}^{\infty} z^{s - 1/2} e^{-z} dz = \Gamma(s+\tfrac{1}{2}) \leq s!$.
* **Scaling again:**
* We have $\text{Var}(y_i) = \mathbb{E}(y_i^2)\leq 2^2\, 2!=8$
* To apply Theorem 12.5, we need $|\mathbb{E}(y_i^s)| \leq \sigma^2 s! = 8\, s!$, but the bound $2^s s!$ derived above exceeds this once $s \geq 4$.
* To address this, we introduce another change of variables: Let $w_i = \frac{y_i}{2 \sqrt{2}}$.
* Then, $\text{Var}(w_i) = \text{Var}(y_i)/8 \leq 1$ and $|\mathbb{E}(w_i^s)| = \frac{1}{2^{3s/2}} |\mathbb{E}(y_i^s)| \leq \frac{2^s s!}{2^{3s/2}} \leq s!$.
* Our goal is now to bound the probability that $|w_1 + w_2 + ... + w_d| \geq \frac{\beta \sqrt{d}}{2\sqrt{2}}$.
* Applying Theorem 12.5 with $\sigma^2 = 1$, $n = d$, and $a = \frac{\beta \sqrt{d}}{2\sqrt{2}}$ (so that $\frac{a^2}{12 n \sigma^2} = \frac{\beta^2}{96}$), we get $$\mathbb{P}\left(|w_1 + w_2 + \dots + w_d| \geq \frac{\beta \sqrt{d}}{2\sqrt{2}} \right) \leq 3e^{-\frac{\beta^2}{96}}$$ which proves the theorem with $c = \frac{1}{96}$.
$\Box$
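An empirical check of the annulus behavior (a sketch, not from the source; the sample size and the choice $\beta = 3$ are arbitrary): the norms of standard Gaussian vectors cluster tightly around $\sqrt{d}$, and virtually all samples fall inside the annulus of half-width $\beta$.
```python=
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000
beta = 3.0  # annulus half-width (arbitrary choice; must satisfy beta <= sqrt(d))
for d in [10, 100, 1000]:
    x = rng.normal(size=(n_samples, d))
    r = np.linalg.norm(x, axis=1)
    inside = np.mean(np.abs(r - np.sqrt(d)) <= beta)
    print(f"d={d:5d}  mean |x| = {r.mean():8.3f}  (sqrt(d) = {np.sqrt(d):7.3f})"
          f"  fraction inside annulus: {inside:.4f}")
```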
## Random Projection and the Johnson-Lindenstrauss Lemma
**Random projection** is a powerful technique for dimensionality reduction. It allows us to project data from a high-dimensional space into a lower-dimensional one while approximately preserving the distances between points. This is particularly valuable in tasks like nearest neighbor search, where high dimensionality can lead to a computational bottleneck.
### The Core Idea
The essence of random projection is to use a random linear transformation to map data points from a $d$-dimensional space to a $k$-dimensional subspace (where $k$ is much smaller than $d$).
* **The Transformation:** This transformation is defined by $k$ randomly chosen vectors $\mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_k$ in $\mathbb{R}^d$, each with coordinates drawn independently from a standard normal distribution (mean 0, variance 1).
* **The Projection:** The projection $f$ of a vector $\mathbf{v}$ is computed by taking the dot product of $\mathbf{v}$ with each of the random vectors: $$f(\mathbf{v}) = (\mathbf{u}_1 \cdot \mathbf{v}, \mathbf{u}_2 \cdot \mathbf{v}, \dots, \mathbf{u}_k \cdot \mathbf{v})$$
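In code, the projection is just multiplication by a $k \times d$ Gaussian matrix whose rows play the role of $\mathbf{u}_1, \dots, \mathbf{u}_k$. The sketch below (not from the source; the sizes and the helper name `random_projection` are illustrative) also shows the projected length concentrating around $\sqrt{k}|\mathbf{v}|$, which is the content of the theorem stated next.
```python=
import numpy as np

def random_projection(k, d, rng=None):
    """Return f mapping R^d to R^k via k random Gaussian vectors (the rows of U)."""
    rng = np.random.default_rng() if rng is None else rng
    U = rng.normal(0.0, 1.0, size=(k, d))   # u_1, ..., u_k as rows
    return lambda v: U @ v                  # f(v) = (u_1 . v, ..., u_k . v)

rng = np.random.default_rng(0)
d, k = 10_000, 200
f = random_projection(k, d, rng)
v = rng.normal(size=d)
# The ratio |f(v)| / (sqrt(k) |v|) is close to 1 with high probability.
print(np.linalg.norm(f(v)) / (np.sqrt(k) * np.linalg.norm(v)))
```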
### Random Projection Theorem
This theorem provides a probabilistic guarantee that the length of a projected vector is close to its expected value.
**Theorem (Random Projection Theorem)** Let $\mathbf{v}$ be a fixed vector in $\mathbb{R}^d$ and $f$ be the random projection defined above. Then, there exists a constant $c > 0$ such that for any $0 < \epsilon < 1$:
$$
\text{Prob} \left(|(|f(\mathbf{v})| - \sqrt{k}|\mathbf{v}|)| \geq \epsilon \sqrt{k}|\mathbf{v}| \right) \leq 3e^{-ck\epsilon^2}
$$
* **In simpler terms:** The probability that the length of the projected vector deviates from its typical value $\sqrt{k}|\mathbf{v}|$ by more than $\epsilon \sqrt{k}|\mathbf{v}|$ is exponentially small in $k$ (the dimension of the target subspace).
**Proof.**
* Without loss of generality, we can assume $|\mathbf{v}| = 1$ by scaling both sides of the inner inequality by $|\mathbf{v}|$.
* The dot product $\mathbf{u}_i \cdot \mathbf{v} = \sum_{j=1}^d u_{ij} v_j$ is a linear combination of independent Gaussian random variables $u_{ij}$ (the components of $u_i$).
* Recall that the sum of independent normal random variables is itself normally distributed, with mean and variance equal to the sum of the individual means and variances.
* Since each $u_{ij}$ has zero mean and unit variance, we have: $$\text{Var}(\mathbf{u}_i \cdot \mathbf{v}) = \text{Var}(\sum_{j=1}^d u_{ij} v_j) = \sum_{j=1}^d v_j^2 \text{Var}(u_{ij}) = \sum_{j=1}^d v_j^2 = 1$$
* Therefore, the random variable $\mathbf{u}_i \cdot \mathbf{v}$ follows a standard normal distribution (Gaussian with zero mean and unit variance).
* Since $\mathbf{u}_1 \cdot \mathbf{v}, \mathbf{u}_2 \cdot \mathbf{v}, \dots, \mathbf{u}_k \cdot \mathbf{v}$ are independent standard normal random variables, the projection $f(\mathbf{v})$ is a random vector drawn from a $k$-dimensional spherical Gaussian with unit variance in each coordinate. We can now apply the Gaussian Annulus Theorem with $d$ replaced by $k$ and the result follows.
$\Box$
### Johnson-Lindenstrauss Lemma
This lemma extends the idea to multiple points, showing that pairwise distances are approximately preserved with high probability.
**Theorem (Johnson-Lindenstrauss Lemma)** For any $0 < \epsilon < 1$ and any integer $n$, let $k \geq \frac{3 \ln n}{c\epsilon^2}$, where $c$ is the constant from the Random Projection Theorem. For any set of $n$ points in $\mathbb{R}^d$, the random projection $f$ has the property that for all pairs of points $\mathbf{v}_i$ and $\mathbf{v}_j$, with probability at least $1 - \frac{3}{2n}$:
$$
(1-\epsilon)\sqrt{k} |\mathbf{v}_i - \mathbf{v}_j| \leq |f(\mathbf{v}_i) - f(\mathbf{v}_j)| \leq (1 + \epsilon) \sqrt{k} |\mathbf{v}_i - \mathbf{v}_j|
$$
**Proof.**
* Applying the Random Projection Theorem to $\mathbf{v}_i - \mathbf{v}_j$, the probability that $|f(\mathbf{v}_i − \mathbf{v}_j)|$ is outside the range $$\left[(1-\epsilon)\sqrt{k} |\mathbf{v}_i - \mathbf{v}_j|, (1 + \epsilon) \sqrt{k} |\mathbf{v}_i - \mathbf{v}_j| \right]$$ is at most $3 e^{−ck\epsilon^2} \leq 3/n^3$ for $k \geq \frac{3\ln n}{c\epsilon^2}$.
* Since there are $\binom n2 \leq \frac{n^2}{2}$ pairs of points, by the union bound, the probability that any pair has a large distortion is less than $\frac{3}{2n}$.
$\Box$
#### Significance
* This lemma is crucial for applications like nearest neighbor search.
* It allows us to project data into a much lower-dimensional space while ensuring that the relative distances between points are approximately maintained.
* This leads to significant computational savings without sacrificing much accuracy.
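A small experiment illustrates the lemma (a sketch, not from the source; the sizes $n$, $d$, $k$ are arbitrary choices): after projecting, every pairwise distance, rescaled by $\sqrt{k}$, stays within a small factor of the original.
```python=
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 5000, 500                  # arbitrary sizes
X = rng.normal(size=(n, d))              # n points in R^d
U = rng.normal(size=(k, d))              # random projection matrix (rows are u_i)
Y = X @ U.T                              # f applied to every point

iu = np.triu_indices(n, k=1)
orig = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)[iu]
proj = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)[iu]
ratio = proj / (np.sqrt(k) * orig)
# All ratios should lie in a narrow band around 1, i.e. within [1 - eps, 1 + eps].
print("distortion range:", ratio.min(), ratio.max())
```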
## Separating Gaussians in High Dimensions
This section explores the problem of separating data points generated from a mixture of two Gaussian distributions in high-dimensional space. The key insight is that if the means (centers) of these Gaussians are sufficiently far apart, we can accurately classify each data point back to its originating Gaussian.
### The Scenario
We consider a mixture of two spherical Gaussians:
* Both have unit variance in each dimension.
* Their means are separated by a distance $\Delta$.
We'll show that if this separation $\Delta$ is on the order of $\Omega(d^{1/4})$ (where $d$ is the dimensionality), we can correctly identify the source Gaussian for each data point with high probability.
### Intuition
* **Points from the same Gaussian:** Distances between points from the same Gaussian are roughly $\sqrt{2d}$. This is because:
* Points tend to lie near a sphere of radius $\sqrt{d}$ (Gaussian Annulus Theorem).
* In high dimensions, these points are nearly orthogonal (perpendicular) to each other.
* **Points from different Gaussians:** Distances between points from different Gaussians (centers separated by $\Delta$) are approximately $\sqrt{\Delta^2 + 2d}$.
* The vectors connecting points to their Gaussian centers and the vector between the centers are nearly perpendicular.

### Separation Condition
For accurate separation, the maximum distance within a Gaussian should be less than the minimum distance between Gaussians:
$$
\sqrt{2d} + O(1) \leq \sqrt{\Delta^2 + 2d} - O(1)
$$
This holds when $\Delta$ is in $\omega(d^{1/4})$.
* **Multiple Comparisons:** If we have $n$ points and want high probability of correctly separating all of them, we need a slightly larger separation: $\Delta \in \omega(d^{1/4} \sqrt{\log n})$
### Algorithm for Separation
1. **Calculate Pairwise Distances:** Compute distances between all pairs of points
2. **Identify Clusters:** The cluster with the smallest pairwise distances likely comes from one Gaussian
3. **Separate:** Remove this cluster; the remaining points belong to the other Gaussian
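Below is a minimal sketch of this three-step procedure (an illustration under assumptions, not from the source: the dimension, sample sizes, and the generous constant in front of $d^{1/4}$ are arbitrary choices).
```python=
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 200                        # dimension and points per Gaussian (arbitrary)
delta = 10 * d ** 0.25                  # center separation on the order of d^(1/4)

center2 = np.zeros(d)
center2[0] = delta
A = rng.normal(size=(n, d))             # points from the first Gaussian (centered at 0)
B = center2 + rng.normal(size=(n, d))   # points from the second Gaussian
X = np.vstack([A, B])

# Step 1: pairwise squared distances via the Gram matrix.
gram = X @ X.T
sq = np.diag(gram)
d2 = sq[:, None] + sq[None, :] - 2 * gram

# Steps 2-3: within-Gaussian squared distances are about 2d, between-Gaussian
# about delta^2 + 2d, so threshold halfway between and group accordingly.
threshold = 2 * d + delta ** 2 / 2
labels = (d2[0] < threshold).astype(int)   # cluster of points "close to" point 0
true = np.array([1] * n + [0] * n)
accuracy = max(np.mean(labels == true), np.mean(labels != true))
print("clustering accuracy:", accuracy)
```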
## Fitting a Spherical Gaussian to Data
This section tackles the problem of finding the spherical Gaussian distribution that best fits a given set of data points in a high-dimensional space. We employ the **Maximum Likelihood Estimator (MLE)** to determine the parameters (mean and variance) of the Gaussian that maximize the likelihood of observing the given data.
### The Setup
* We have $n$ sample points: $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n$ in a $d$-dimensional space
* We want to find the best-fitting spherical Gaussian distribution $f$
* Mean: $\pmb{\mu}$
* Variance in each direction: $\sigma^2$
### The Likelihood Function
The probability density of observing these points under the Gaussian $f$ is:
$$
\frac{1}{(2\pi\sigma^2)^{\frac{nd}{2}}} \cdot \exp \left(- \frac{(\mathbf{x}_1 - \pmb{\mu})^2 + (\mathbf{x}_2 - \pmb{\mu})^2 + \dots + (\mathbf{x}_n - \pmb{\mu})^2}{2\sigma^2} \right)
$$
The MLE of $f$ is the one that maximizes this probability density.
### Finding the MLE
Two helpful lemmas guide us:
**Lemma** The sum of squared distances
$$
(\mathbf{x}_1 - \pmb{\mu})^2 + (\mathbf{x}_2 - \pmb{\mu})^2 + \dots + (\mathbf{x}_n - \pmb{\mu})^2
$$
is minimized when $\pmb{\mu}$ is the centroid (average) of the data points.
**Lemma** The maximum likelihood estimate for the variance $\sigma^2$ is the average squared deviation, per coordinate, of the sample points from the true mean (i.e., the total squared distance divided by $nd$).
**Proof.**
* To determine the maximum likelihood estimate of $\sigma^2$ for $f$, set $\pmb{\mu}$ to the true centroid.
* Next, we show that the maximizing $\sigma$ is the standard deviation of the sample about this centroid.
* Substitute $\nu = \frac{1}{2\sigma^2}$ and $a = (\mathbf{x}_1 − \pmb{\mu})^2 + (\mathbf{x}_2 − \pmb{\mu})^2 + \dots +(\mathbf{x}_n − \pmb{\mu})^2$ into the likelihood function (the probability density of observing the data points): $$\frac{e^{-a \nu}}{\left[{\int_{x}e^{-x^2 \nu}dx}\right]^n}$$
* Now, $a$ is fixed, and we want to find the value of $\nu$ that maximizes this expression.
* **Maximizing the Likelihood:**
* Taking the natural logarithm of the likelihood function simplifies the expression and doesn't change the location of the maximum: $$-a\nu -n\ln \left[{\int_{x}e^{-\nu x^2}dx} \right]$$
* To find the maximum, we differentiate this expression with respect to $\nu$ and set the derivative equal to zero: $$0 = -a+n\frac{\int_{x}|x|^2e^{-\nu x^2}dx}{\int_{x}e^{-\nu x^2}dx}$$
* We make a change of variables in the integrals: let $\mathbf{y}=\sqrt{\nu}\,\mathbf{x}$. Then we have $$0 = -a+\frac{n}{\nu}\frac{\int_{y}y^2e^{-y^2}dy}{\int_{y}e^{-y^2}dy}$$
* The ratio of the two integrals represents the expected squared distance of a $d$-dimensional spherical Gaussian with standard deviation $\frac{1}{\sqrt{2}}$ to its center. This is known to be $\frac{d}{2}$. So, we get $$−a+\frac{nd}{2\nu}$$
* Substituting back $\sigma^2$ for $\frac{1}{2 \nu}$ gives $$−a+n d\sigma^2$$
* Setting $−a+n d\sigma^2 = 0$ shows that the maximum occurs when $$\sigma = \frac{\sqrt{a}}{\sqrt{nd}}$$
$\Box$
### Conclusion
Combining these lemmas, we get:
* **The maximum likelihood spherical Gaussian for a set of samples is the one with:**
* Center equal to the sample mean
* Standard deviation equal to the standard deviation of the sample points from their true mean
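As a closing sketch (not from the source; the true parameters and sizes are arbitrary choices), the estimator is two lines of NumPy: the centroid of the samples, and $\sigma = \sqrt{a/(nd)}$ with $a$ the total squared distance to that centroid.
```python=
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 10_000
true_mu = rng.normal(size=d)
true_sigma = 2.0
X = true_mu + true_sigma * rng.normal(size=(n, d))

# MLE for a spherical Gaussian:
mu_hat = X.mean(axis=0)                        # centroid of the samples
a = np.sum((X - mu_hat) ** 2)                  # total squared distance to the centroid
sigma_hat = np.sqrt(a / (n * d))               # sigma = sqrt(a / (n d)), as derived above
print("||mu_hat - true_mu|| =", np.linalg.norm(mu_hat - true_mu))
print("sigma_hat            =", sigma_hat)
```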
## Reference
* Chapter 2 & 12 of Blum A, Hopcroft J, Kannan R. Foundations of Data Science. Cambridge University Press; 2020.
## Further Reading
* Vershynin R. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press; 2018. [link](https://www.math.uci.edu/~rvershyn/papers/HDP-book/HDP-book.pdf)
* Dasgupta S. Learning Mixtures of Gaussians. In: 40th Annual Symposium on Foundations of Computer Science (FOCS 1999), New York, NY; 1999. p. 634. doi: 10.1109/SFFCS.1999.814639
* Nearest neighbor search: https://en.wikipedia.org/wiki/Nearest_neighbor_search
* Gaussian mixture models: https://scikit-learn.org/stable/modules/mixture.html