---
title: '[Layout] Multi-column'
---
<style>
.three-column-layout {
column-count: 3; /* Set column number */
column-gap: 20px;
max-width: 100%;
overflow: hidden;
}
/* Media query for mobile devices */
@media (max-width: 768px) {
.three-column-layout {
column-count: 1; /* Switch to single column on small screens */
column-gap: 0; /* Optional: Set gap to 0 for single column */
}
}
.markdown-body, .ui-infobar {
max-width: unset !important;
}
.three-column-layout ul,
.three-column-layout ol {
margin-top: 0;
padding-left: 20px;
}
.three-column-layout strong {
font-weight: bold;
}
.three-column-layout em {
font-style: italic;
}
.two-column-layout h1,
.two-column-layout h2,
.two-column-layout h3,
.two-column-layout h4,
.two-column-layout h5,
.two-column-layout h6 {
margin-top: 0;
}
</style>
# Lecture Notes for UVA Neural Networks
Author: Chen, Po-Wei
Instructor: Hadi Daneshmand
Citation: @book{hadi2025,
title={UVA Lecture Notes for "Neural Networks: A Theory Lab"},
author={Po-Wei Chen and Hadi Daneshmand},
year={2025}
}
This lecture note starts with an introduction to the curse of dimensionality, including a specific example that illustrates why the sample size $n$ needs to grow exponentially as the dimension $d$ increases. It then delves into lab observations in high-dimensional settings. Finally, the lecture provides a detailed explanation and proof of how and why the sample size $n$ grows exponentially with the dimension $d$.
## Curse of Dimensionality
The increasing complexity of modern datasets, such as ImageNet and large-scale natural language models, necessitates a deeper understanding of the curse of dimensionality. Traditional datasets, like the 1988 Heart Disease dataset, contained a limited number of features (e.g., 13 features), whereas modern datasets can have hundreds of thousands of features. While deep learning models leverage these features for improved performance, the high-dimensional nature of data introduces significant challenges in function approximation and optimization.
To illustrate the impact of high dimensionality, consider vectors $v_1, \dots, v_n$ whose entries are sampled i.i.d. from $\mathcal{N}(0, 1/d)$ and the weight vector $w = \left[ \frac{1}{2}, \frac{1}{2}, 0, \dots, 0 \right] \in \mathbb{R}^d$. In high-dimensional spaces, a fundamental challenge arises because most random vectors $v_i$ are nearly orthogonal to the weight vector $w$, making their inner products $\langle v_i, w \rangle^2$ very small on average. The contributions from the $v_i$ become negligible, leading to ineffective predictions or approximations. The condition $\mathbb{E}[\langle v_i, w \rangle^2] \geq \frac{1}{2}$ ensures that the random vectors $v_i$ are sufficiently aligned with $w$ to provide meaningful contributions.
In the lecture 3 slides, we have provided the following result:
\begin{aligned}
&\frac{\log(n)}{d} \leq \mathbb{E}\left[\max_{i \leq n} \langle w, v_i \rangle^2 \right] \leq \frac{\log(n) + 1}{d}, \quad (1)
\end{aligned}
To ensure that at least one $v_i$ is sufficiently aligned with $w$ and can effectively address the sparsity inherent in high-dimensional spaces, we must satisfy the condition $\mathbb{E}[\max_{i \leq n} \langle v_i, w \rangle^2] \geq \frac{1}{2}$. Combined with inequality (1), this requirement implies that the number of vectors must satisfy $n \geq \exp(d/2)$. As the dimension $d$ increases, the number of vectors $n$ required to maintain this alignment must therefore grow exponentially to counteract the sparsity of high-dimensional spaces.
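As a quick illustration, here is a minimal Monte Carlo sketch (not part of the original notes; the dimensions and sample sizes below are arbitrary choices) that estimates $\mathbb{E}\left[\max_{i \leq n} \langle w, v_i \rangle^2\right]$ for the setting above and compares it with the $\log(n)/d$ scale: even the best-aligned of $n$ random vectors has squared alignment of only this order.
```python
# Monte Carlo sketch: estimate E[max_i <w, v_i>^2] for v_i with N(0, 1/d) entries
# and w = [1/2, 1/2, 0, ..., 0], and compare with the log(n)/d scale.
import numpy as np

rng = np.random.default_rng(0)

def max_alignment(d, n, trials=200):
    """Average, over trials, of max_i <w, v_i>^2 with n random vectors v_i in R^d."""
    w = np.zeros(d)
    w[:2] = 0.5
    estimates = []
    for _ in range(trials):
        V = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d))  # rows are the v_i
        estimates.append(np.max((V @ w) ** 2))
    return float(np.mean(estimates))

n = 1000
for d in (10, 50, 100):
    print(f"d={d:4d}   estimate={max_alignment(d, n):.4f}   log(n)/d={np.log(n) / d:.4f}")
```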
# Lab Observation
## Recap: Universal Non-Linear Feature Approximation
We want to design universal non-linear features to approximate any function efficiently.
A universal function approximator follows the form:
$$y_i = \sum_{j} w_j \varphi_j(x_i),$$
<div style="display: flex; justify-content: center;">
<img src="https://hackmd.io/_uploads/B11iqPd_ye.png" width="30%">
<img src="https://hackmd.io/_uploads/SyCjcwOdye.png" width="30%">
<img src="https://hackmd.io/_uploads/ryVn5P__1x.png" width="30%">
</div>
We discussed the representer theorem, which allows us to design $n$ features for $n$ samples to approximately fit all three of the above functions using a linear combination of the same features. However, a key challenge is feature selection: how do we construct a function approximation without requiring an exponential number of basis functions?
Now, we aim to reduce the number of features from $n$ to 20. To approximate a target function efficiently, we utilize random features:
$$\phi(x) = \cos(\langle v, x \rangle + b).$$
The goal is to improve function approximation:
$$y \approx \sum_{i} w_i \phi_i(x).$$
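To make this concrete, here is a minimal random-features sketch (a toy 1D target, a hypothetical frequency scale, and ridge-regularized least squares for the weights are my own choices, not the lab's exact code): it builds 20 features $\phi_i(x) = \cos(\langle v_i, x \rangle + b_i)$ with random $v_i, b_i$ and fits the weights $w_i$ linearly.
```python
# Random cosine features: phi_i(x) = cos(<v_i, x> + b_i), weights fit by least squares.
import numpy as np

rng = np.random.default_rng(1)

def random_cosine_features(x, V, b):
    """x: (n_samples, d), V: (n_features, d), b: (n_features,) -> (n_samples, n_features)."""
    return np.cos(x @ V.T + b)

d, n_features, n_samples = 1, 20, 200
x = rng.uniform(-np.pi, np.pi, size=(n_samples, d))
y = np.cos(1.5 * x[:, 0])                          # toy target to approximate

V = rng.normal(0.0, 1.5, size=(n_features, d))     # random frequencies v_i (scale is a guess)
b = rng.uniform(0.0, 2 * np.pi, size=n_features)   # random phases b_i

Phi = random_cosine_features(x, V, b)
lam = 1e-6                                          # tiny ridge term for numerical stability
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ y)

print(f"training MSE with {n_features} random features: {np.mean((Phi @ w - y) ** 2):.5f}")
```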
Random features exhibit a striking universality. The following plots show the performance of random features.
<div style="display: flex; justify-content: center;">
<img src="https://hackmd.io/_uploads/Byhg1_uuJg.png" width="30%">
<img src="https://hackmd.io/_uploads/SJmZkudOkg.png" width="30%">
<img src="https://hackmd.io/_uploads/HkwZJuuOJx.png" width="30%">
</div>
This approach is motivated by the limitations of kernel methods, as discussed in the [previous lecture notes](https://hackmd.io/@hadidanesh/SJJYw3Lvke).
## Observing the Curse of Dimensionality
However, our lab observations show that even random feature methods suffer from the curse of dimensionality:
$$f(x) = \cos(\langle w, x \rangle) \approx f_n(x) = \sum_{i=1}^n \alpha_i \cos(\langle v_i, x \rangle + b_i).$$
The approximation $f_n(x)$ is chosen to minimize the mean squared error between the true function $f(x)$ and the approximation, and we require $\mathbb{E} \left[ (f(x) - f_n(x))^2 \right] \leq 0.0005.$
:::info
[In-class activity]
Complete the [colab](https://colab.research.google.com/drive/1SzLEF2EEMhjAXZkx86GuRxIwSkeTKbvs) and plot the smallest $n$ for which $\mathbb{E} \left[ (f(x) - f_n(x))^2 \right] \leq 0.0005$ while varying $d$.
:::
The following plots show that as d increases, n grows rapidly, emphasizing the challenge posed by the curse of dimensionality ([solution](https://colab.research.google.com/drive/12tj2k3i00D924fwTqmxvzMQ0JKcZgfc2)).

# Mathematical Derivation
We first show that if $v_i$ hits a neighborhood of $w$, then we can approximate $f(x) = \cos(\langle w, x\rangle)$ well with $\sum_i \alpha_i \cos(\langle v_i,x \rangle)$ (we omit the $b_i$ for simplicity).
$$\cos(\langle v_i, x \rangle ) - \cos( \langle w, x \rangle) = -2 \sin\left( \frac{\langle v_i + w, x\rangle }{2}\right) \sin\left(\frac{\langle v_i - w, x \rangle }{2}\right)$$
Thus
$$|\cos(\langle v_i, x \rangle ) - \cos( \langle w, x \rangle)| \leq \| x \| \max \{\| w - v_i \|, \| w + v_i\| \} \quad \text{(why?)}$$
To ensure $\| w - v_i \|<\epsilon$ or $\| w + v_i \|<\epsilon$, it is easy to check (see the expansion sketched below) that we need $|\langle w, v_i \rangle|\geq 1 - \epsilon$ to hold, or equivalently $\langle w, v_i \rangle^2 \geq (1 - \epsilon)^2$. Notably, we need only one $v_i$ for which this inequality holds.
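For intuition, treating $w$ and $v_i$ as approximately unit-norm (an approximation left implicit in the notes), expanding the squared norm gives
$$\|w \mp v_i\|^2 = \|w\|^2 + \|v_i\|^2 \mp 2\langle w, v_i \rangle \approx 2\bigl(1 \mp \langle w, v_i \rangle\bigr),$$
so one of $\|w - v_i\|$, $\|w + v_i\|$ is small precisely when $|\langle w, v_i \rangle|$ is close to $1$.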
We prove that $\mathbb{E} \left[ \max_{i\leq n} \langle w, v_i \rangle^2\right] \approx \log(n)/d$, which means that we need on the order of $n = e^{d}$ samples to ensure there is an $i$ such that $\langle w, v_i \rangle^2 \geq (1 - \epsilon)^2$.
Below, we derive $\mathbb{E} \left[ \max_{i\leq n} \langle w, v_i \rangle^2\right] \approx \log(n)/d$ (please check the slides if you cannot follow the steps here).
## 1. Analyze the random variable $z_i = \langle w, v_i \rangle^2$:
Each component $v_{ij}$ of $v_i$ follows a normal distribution $\mathcal{N}\left(0, \frac{1}{d}\right)$ . Since the inner product is given by
$$\langle w, v_i \rangle = \frac{1}{2} v_{i1} + \frac{1}{2} v_{i2} + 0 \cdot v_{i3} + \dots + 0 \cdot v_{id} = \frac{v_{i1} + v_{i2}}{2},$$
it follows that $v_{i1} + v_{i2}$ is normally distributed as $\mathcal{N}\left(0, \frac{2}{d}\right)$ . Consequently, the inner product
$$\langle w, v_i \rangle \sim \mathcal{N}\left(0, \frac{2}{d} \times \left(\frac{1}{2}\right)^2\right) = \mathcal{N}\left(0, \frac{1}{2d}\right).$$
Since squaring a centered normal variable results in a scaled chi-squared distribution with one degree of freedom, i.e., if $X \sim \mathcal{N}(0, \sigma^2)$ then $X^2 \sim \sigma^2 \chi^2_1$, and $\chi^2_1$ has the same $e^{-t/2}$ tail as $\text{Exp}\left(\frac{1}{2}\right)$, we approximate
$$z_i = \langle w, v_i \rangle^2 \sim \frac{1}{2d} \cdot \text{Exp}\left(\frac{1}{2}\right).$$
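A quick numerical sanity check of the variance computation above (a sketch added here, not part of the proof; the values of $d$ and the sample count are arbitrary):
```python
# Check that <w, v_i> has variance 1/(2d) when v_i has i.i.d. N(0, 1/d) entries.
import numpy as np

rng = np.random.default_rng(3)
d, n = 100, 50_000

w = np.zeros(d)
w[:2] = 0.5                          # w = [1/2, 1/2, 0, ..., 0]
V = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d))
inner = V @ w

print(f"empirical Var(<w, v_i>) = {inner.var():.5f}   theory 1/(2d) = {1 / (2 * d):.5f}")
```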
## 2. Expectation of the Maximum:
To compute $\mathbb{E}\bigl[\max(Z_1,\dots,Z_n)\bigr]$, we first compute $\mathbb{E}\bigl[\max(X_1,\dots,X_n)\bigr]$ for i.i.d. exponential variables $X_i \sim \mathrm{Exp}(\lambda)$.
We first determine the distribution of $X_{(1)}$, the minimum of $n$ i.i.d. exponential random variables $X_1, X_2, \dots, X_n$ with rate $\lambda$, by computing its cumulative distribution function (CDF). By definition, $X_{(1)} = \min(X_1, X_2, \dots, X_n)$, so for any $x \geq 0$, the event that $X_{(1)}$ is greater than $x$ means that all $X_i$ must be larger than $x$.
Since the $X_i$ are independent,
$$P(X_{(1)} > x) = P(X_1 > x, X_2 > x, \dots, X_n > x) = \prod_{i=1}^{n} P(X_i > x) = e^{-n\lambda x}.$$
The CDF of $X_{(1)}$ is then given by
$$F_{X_{(1)}}(x) = P(X_{(1)} \leq x) = 1 - P(X_{(1)} > x) = 1 - e^{-n\lambda x}, \quad x \geq 0.$$
Using the memoryless property of the exponential distribution, which states $P(X > s+t \mid X > s) = P(X > t)$, we see that $X_{(2)} - X_{(1)} \sim \text{Exp}((n-1)\lambda)$. Iterating this argument, we establish that the gaps $X_{(k)} - X_{(k-1)}$ follow an exponential distribution with decreasing rates, $(n-k+1)\lambda$, for $k = 2, \dots, n$. Summing these “gaps” yields
$$\mathbb{E}[X_{(n)}]
= \sum_{k=2}^{n} \mathbb{E}[X_{(k)} - X_{(k-1)}] + \mathbb{E}[X_{(1)}]
= \sum_{k=1}^{n} \frac{1}{(n-k+1)\,\lambda}
= \frac{1}{\lambda} \sum_{i=1}^n \frac{1}{i}.$$
Thus, with $\lambda = \tfrac{1}{2}$, $\mathbb{E}\bigl[\max(Z_1,\dots,Z_n)\bigr] = \frac{1}{2d}\,\mathbb{E}\bigl[\max(X_1,\dots,X_n)\bigr] = \frac{1}{d} \sum_{i=1}^n \frac{1}{i}.$
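The harmonic-sum formula for the maximum of exponentials can be checked numerically; here is a minimal Monte Carlo sketch (the rate, $n$, and trial count are arbitrary choices of mine):
```python
# Check E[max of n i.i.d. Exp(lambda)] against (1/lambda) * H_n.
import numpy as np

rng = np.random.default_rng(4)
lam, n, trials = 0.5, 50, 100_000

samples = rng.exponential(scale=1.0 / lam, size=(trials, n))  # numpy uses scale = 1/rate
empirical = samples.max(axis=1).mean()
harmonic = sum(1.0 / i for i in range(1, n + 1)) / lam

print(f"empirical E[max] = {empirical:.3f}   (1/lambda) * H_n = {harmonic:.3f}")
```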
## 3. Prove $\log(n)\leq\sum_{i=1}^{n}\frac{1}{i}\leq \log(n)+1$
To establish the inequality, we use the integral approximation of the harmonic series:
$$\int_1^n \frac{1}{x}dx \leq \sum_{i=1}^{n} \frac{1}{i} \leq 1 + \int_1^n \frac{1}{x}dx.$$
Since
$\int_1^n \frac{dx}{x} = \log(n)$, it follows that $\log(n) \leq \sum_{i=1}^{n} \frac{1}{i} \leq \log(n) + 1.$
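As a quick numerical confirmation (a small check added here, not from the original notes), the harmonic sum indeed sits between $\log(n)$ and $\log(n) + 1$:
```python
# Confirm log(n) <= H_n <= log(n) + 1 for a few values of n.
import numpy as np

for n in (10, 100, 1000, 10_000):
    H_n = sum(1.0 / i for i in range(1, n + 1))
    print(f"n={n:6d}   log(n)={np.log(n):.4f}   H_n={H_n:.4f}   log(n)+1={np.log(n) + 1:.4f}")
```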

Thus, we have completed the proof that
\begin{aligned}
&\frac{\log(n)}{d} \leq \mathbb{E}\left[\max_{i \leq n} \langle w, v_i \rangle^2 \right] \leq \frac{\log(n) + 1}{d}.
\end{aligned}