---
title: '[Layout] Multi-column'
---
<style>
.three-column-layout {
column-count: 3; /* Set column number */
column-gap: 20px;
max-width: 100%;
overflow: hidden;
}
/* Media query for mobile devices */
@media (max-width: 768px) {
.three-column-layout {
column-count: 1; /* Switch to single column on small screens */
column-gap: 0; /* Optional: Set gap to 0 for single column */
}
}
.markdown-body, .ui-infobar {
max-width: unset !important;
}
.three-column-layout ul,
.three-column-layout ol {
margin-top: 0;
padding-left: 20px;
}
.three-column-layout strong {
font-weight: bold;
}
.three-column-layout em {
font-style: italic;
}
.two-column-layout h1,
.two-column-layout h2,
.two-column-layout h3,
.two-column-layout h4,
.two-column-layout h5,
.two-column-layout h6 {
margin-top: 0;
}
</style>
# Lecture Notes for UVA Neural Networks
Author: Chen, Po-Wei
Instructor: Hadi Daneshmand
Citation: @book{hadi2025,
title={UVA Lecture Notes for "Neural Networks: A Theory Lab"},
author={Po-Wei Chen and Hadi Daneshmand},
year={2025}
}
This lecture note starts with an introduction to the curse of dimensionality, including a specific example that illustrates why the sample size $n$ needs to grow exponentially as the dimension $d$ increases. It then delves into lab observations in high-dimensional settings. Finally, the lecture provides a detailed explanation and proof of how and why the sample size $n$ grows exponentially with the dimension $d$.
## Curse of Dimensionality
The increasing complexity of modern datasets, such as ImageNet and large-scale natural language models, necessitates a deeper understanding of the curse of dimensionality. Traditional datasets, like the 1988 Heart Disease dataset, contained a limited number of features (e.g., 13 features), whereas modern datasets can have hundreds of thousands of features. While deep learning models leverage these features for improved performance, the high-dimensional nature of data introduces significant challenges in function approximation and optimization.
To illustrate the impact of high dimensionality, consider vectors $v_1, \dots, v_n$ whose entries are sampled i.i.d. from $\mathcal{N}(0, 1/d)$ and the weight vector $w = \left[ \frac{1}{2}, \frac{1}{2}, 0, \dots, 0 \right] \in \mathbb{R}^d$. In high-dimensional spaces, a fundamental challenge arises because most random vectors $v_i$ are nearly orthogonal to the weight vector $w$, making their inner products $\langle v_i, w \rangle^2$ very small on average. The contributions from the $v_i$ become negligible, leading to ineffective predictions or approximations. The condition $\mathbb{E}[\langle v_i, w \rangle^2] \geq \frac{1}{2}$ ensures that the random vectors $v_i$ are sufficiently aligned with $w$ to provide meaningful contributions.
In the lecture 3 slides, we have provided the following result:
\begin{aligned}
&\frac{\log(n)}{d} \leq \mathbb{E}\left[\max_{i \leq n} \langle w, v_i \rangle^2 \right] \leq \frac{\log(n) + 1}{d}, \quad (1)
\end{aligned}
To ensure that at least one $v_i$ is sufficiently aligned with $w$ and can effectively address the sparsity inherent in high-dimensional spaces, we must satisfy the condition $\mathbb{E}[\max_{i \leq n} \langle v_i, w \rangle^2] \geq \frac{1}{2}$. Combined with inequality (1), this requirement implies that the number of vectors must satisfy $n \geq \exp(d/2)$. As the dimension $d$ increases, the number of vectors $n$ required to maintain this alignment must therefore grow exponentially to counteract the sparsity of high-dimensional spaces.
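As a quick illustration, here is a minimal Monte Carlo sketch (not part of the original notes; the dimensions and sample sizes below are arbitrary choices) that estimates $\mathbb{E}\left[\max_{i \leq n} \langle w, v_i \rangle^2\right]$ for the setting above and compares it with the $\log(n)/d$ scale: even the best-aligned of $n$ random vectors has squared alignment of only this order.
```python
# Monte Carlo sketch: estimate E[max_i <w, v_i>^2] for v_i with N(0, 1/d) entries
# and w = [1/2, 1/2, 0, ..., 0], and compare with the log(n)/d scale.
import numpy as np

rng = np.random.default_rng(0)

def max_alignment(d, n, trials=200):
    """Average, over trials, of max_i <w, v_i>^2 with n random vectors v_i in R^d."""
    w = np.zeros(d)
    w[:2] = 0.5
    estimates = []
    for _ in range(trials):
        V = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d))  # rows are the v_i
        estimates.append(np.max((V @ w) ** 2))
    return float(np.mean(estimates))

n = 1000
for d in (10, 50, 100):
    print(f"d={d:4d}   estimate={max_alignment(d, n):.4f}   log(n)/d={np.log(n) / d:.4f}")
```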
# Lab Observation
## Recap: Universal Non-Linear Feature Approximation
We want to design universal non-linear features to approximate any function efficiently.
A universal function approximator follows the form:
$$y_i = \sum_{j} w_j \varphi_j(x_i),$$
<div style="display: flex; justify-content: center;">
<img src="https://hackmd.io/_uploads/B11iqPd_ye.png" width="30%">
<img src="https://hackmd.io/_uploads/SyCjcwOdye.png" width="30%">
<img src="https://hackmd.io/_uploads/ryVn5P__1x.png" width="30%">
</div>
We discussed the representer theorem, which allows us to design $n$ features for $n$ samples to approximately fit all three of the above functions using a linear combination of the same features. However, a key challenge is feature selection: how do we construct a function approximation without requiring an exponential number of basis functions?
Now, we aim to reduce the number of features from $n$ to 20. To approximate a target function efficiently, we utilize random features:
$$\phi(x) = \cos(\langle v, x \rangle + b).$$
The goal is to improve function approximation:
$$y \approx \sum_{i} w_i \phi_i(x).$$
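To make this concrete, here is a minimal random-features sketch (a toy 1D target, a hypothetical frequency scale, and ridge-regularized least squares for the weights are my own choices, not the lab's exact code): it builds 20 features $\phi_i(x) = \cos(\langle v_i, x \rangle + b_i)$ with random $v_i, b_i$ and fits the weights $w_i$ linearly.
```python
# Random cosine features: phi_i(x) = cos(<v_i, x> + b_i), weights fit by least squares.
import numpy as np

rng = np.random.default_rng(1)

def random_cosine_features(x, V, b):
    """x: (n_samples, d), V: (n_features, d), b: (n_features,) -> (n_samples, n_features)."""
    return np.cos(x @ V.T + b)

d, n_features, n_samples = 1, 20, 200
x = rng.uniform(-np.pi, np.pi, size=(n_samples, d))
y = np.cos(1.5 * x[:, 0])                          # toy target to approximate

V = rng.normal(0.0, 1.5, size=(n_features, d))     # random frequencies v_i (scale is a guess)
b = rng.uniform(0.0, 2 * np.pi, size=n_features)   # random phases b_i

Phi = random_cosine_features(x, V, b)
lam = 1e-6                                          # tiny ridge term for numerical stability
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ y)

print(f"training MSE with {n_features} random features: {np.mean((Phi @ w - y) ** 2):.5f}")
```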
Random features exhibit a striking universality. The following plots show the performance of random features.
<div style="display: flex; justify-content: center;">
<img src="https://hackmd.io/_uploads/Byhg1_uuJg.png" width="30%">
<img src="https://hackmd.io/_uploads/SJmZkudOkg.png" width="30%">
<img src="https://hackmd.io/_uploads/HkwZJuuOJx.png" width="30%">
</div>
This approach is motivated by the limitations of kernel methods, as discussed in the [previous lecture notes](https://hackmd.io/@hadidanesh/SJJYw3Lvke).
## Observing the Curse of Dimensionality
However, our lab observations show that even random feature methods suffer from the curse of dimensionality:
$$f(x) = \cos(\langle w, x \rangle) \approx f_n(x) = \sum_{i=1}^n \alpha_i \cos(\langle v_i, x \rangle + b_i).$$
The approximation $f_n(x)$ is chosen to minimize the mean squared error between the true function $f(x)$ and the approximation, and we require $\mathbb{E} \left[ (f(x) - f_n(x))^2 \right] \leq 0.0005.$
:::info
[In-class activity]
Complete the [colab](https://colab.research.google.com/drive/1SzLEF2EEMhjAXZkx86GuRxIwSkeTKbvs) and plot the smallest $n$ for which $\mathbb{E} \left[ (f(x) - f_n(x))^2 \right] \leq 0.0005$ while varying $d$.
:::
The following plots show that as d increases, n grows rapidly, emphasizing the challenge posed by the curse of dimensionality ([solution](https://colab.research.google.com/drive/12tj2k3i00D924fwTqmxvzMQ0JKcZgfc2)).

# Mathematical Derivation
We first show that if $v_i$ hits a neighborhood of $w$, then we can approximate $f(x) = \cos(\langle w, x\rangle)$ well with $\sum_i \alpha_i \cos(\langle v_i,x \rangle)$ (we omit the $b_i$ for simplicity).
$$\cos(\langle v_i, x \rangle ) - \cos( \langle w, x \rangle) = -2 \sin\left( \frac{\langle v_i + w, x\rangle }{2}\right) \sin\left(\frac{\langle v_i - w, x \rangle }{2}\right)$$
Thus
$$|\cos(\langle v_i, x \rangle ) - \cos( \langle w, x \rangle)| \leq \| x \| \max \{\| w - v_i \|, \| w + v_i\| \} \quad \text{(why?)}$$
To ensure $\| w - v_i \|<\epsilon$ or $\| w + v_i \|<\epsilon$, it is easy to check (see the expansion sketched below) that we need $|\langle w, v_i \rangle|\geq 1 - \epsilon$ to hold, or equivalently $\langle w, v_i \rangle^2 \geq (1 - \epsilon)^2$. Notably, we need only one $v_i$ for which this inequality holds.
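For intuition, treating $w$ and $v_i$ as approximately unit-norm (an approximation left implicit in the notes), expanding the squared norm gives
$$\|w \mp v_i\|^2 = \|w\|^2 + \|v_i\|^2 \mp 2\langle w, v_i \rangle \approx 2\bigl(1 \mp \langle w, v_i \rangle\bigr),$$
so one of $\|w - v_i\|$, $\|w + v_i\|$ is small precisely when $|\langle w, v_i \rangle|$ is close to $1$.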
We prove that $\mathbb{E} \left[ \max_{i\leq n} \langle w, v_i \rangle^2\right] \approx \log(n)/d$, which means that we need on the order of $n = e^{d}$ samples to ensure there is an $i$ such that $\langle w, v_i \rangle^2 \geq (1 - \epsilon)^2$.
Below, we derive $\mathbb{E} \left[ \max_{i\leq n} \langle w, v_i \rangle^2\right] \approx \log(n)/d$ (please check the slides if you cannot follow the steps here).
## 1. Analyze the random variable $z_i = \langle w, v_i \rangle^2$:
Each component $v_{ij}$ of $v_i$ follows a normal distribution $\mathcal{N}\left(0, \frac{1}{d}\right)$ . Since the inner product is given by
$$\langle w, v_i \rangle = \frac{1}{2} v_{i1} + \frac{1}{2} v_{i2} + 0 \cdot v_{i3} + \dots + 0 \cdot v_{id} = \frac{v_{i1} + v_{i2}}{2},$$
it follows that $v_{i1} + v_{i2}$ is normally distributed as $\mathcal{N}\left(0, \frac{2}{d}\right)$ . Consequently, the inner product
$$\langle w, v_i \rangle \sim \mathcal{N}\left(0, \frac{2}{d} \times \left(\frac{1}{2}\right)^2\right) = \mathcal{N}\left(0, \frac{1}{2d}\right).$$
Since squaring a centered normal variable results in a scaled chi-squared distribution with one degree of freedom, i.e., if $X \sim \mathcal{N}(0, \sigma^2)$ then $X^2 \sim \sigma^2 \chi^2_1$, and $\chi^2_1$ has the same $e^{-t/2}$ tail as $\text{Exp}\left(\frac{1}{2}\right)$, we approximate
$$z_i = \langle w, v_i \rangle^2 \sim \frac{1}{2d} \cdot \text{Exp}\left(\frac{1}{2}\right).$$
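A quick numerical sanity check of the variance computation above (a sketch added here, not part of the proof; the values of $d$ and the sample count are arbitrary):
```python
# Check that <w, v_i> has variance 1/(2d) when v_i has i.i.d. N(0, 1/d) entries.
import numpy as np

rng = np.random.default_rng(3)
d, n = 100, 50_000

w = np.zeros(d)
w[:2] = 0.5                          # w = [1/2, 1/2, 0, ..., 0]
V = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d))
inner = V @ w

print(f"empirical Var(<w, v_i>) = {inner.var():.5f}   theory 1/(2d) = {1 / (2 * d):.5f}")
```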
## 2. Expectation of the Maximum:
To compute $\mathbb{E}\bigl[\max(Z_1,\dots,Z_n)\bigr]$, we first compute $\mathbb{E}\bigl[\max(X_1,\dots,X_n)\bigr]$ for i.i.d. exponential variables $X_i \sim \mathrm{Exp}(\lambda)$.
We first determine the distribution of $X_{(1)}$, the minimum of $n$ i.i.d. exponential random variables $X_1, X_2, \dots, X_n$ with rate $\lambda$, by computing its cumulative distribution function (CDF). By definition, $X_{(1)} = \min(X_1, X_2, \dots, X_n)$, so for any $x \geq 0$, the event that $X_{(1)}$ is greater than $x$ means that all $X_i$ must be larger than $x$.
Since the $X_i$ are independent,
$$P(X_{(1)} > x) = P(X_1 > x, X_2 > x, \dots, X_n > x) = \prod_{i=1}^{n} P(X_i > x) = e^{-n\lambda x}.$$
The CDF of $X_{(1)}$ is then given by
$$F_{X_{(1)}}(x) = P(X_{(1)} \leq x) = 1 - P(X_{(1)} > x) = 1 - e^{-n\lambda x}, \quad x \geq 0.$$
Using the memoryless property of the exponential distribution, which states $P(X > s+t \mid X > s) = P(X > t)$, we see that $X_{(2)} - X_{(1)} \sim \text{Exp}((n-1)\lambda)$. Iterating this argument, we establish that the gaps $X_{(k)} - X_{(k-1)}$ follow an exponential distribution with decreasing rates, $(n-k+1)\lambda$, for $k = 2, \dots, n$. Summing these “gaps” yields
$$\mathbb{E}[X_{(n)}]
= \sum_{k=2}^{n} \mathbb{E}[X_{(k)} - X_{(k-1)}] + \mathbb{E}[X_{(1)}]
= \sum_{k=1}^{n} \frac{1}{(n-k+1)\,\lambda}
= \frac{1}{\lambda} \sum_{i=1}^n \frac{1}{i}.$$
Thus, with $\lambda = \tfrac{1}{2}$, $\mathbb{E}\bigl[\max(Z_1,\dots,Z_n)\bigr] = \frac{1}{2d}\,\mathbb{E}\bigl[\max(X_1,\dots,X_n)\bigr] = \frac{1}{d} \sum_{i=1}^n \frac{1}{i}.$
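The harmonic-sum formula for the maximum of exponentials can be checked numerically; here is a minimal Monte Carlo sketch (the rate, $n$, and trial count are arbitrary choices of mine):
```python
# Check E[max of n i.i.d. Exp(lambda)] against (1/lambda) * H_n.
import numpy as np

rng = np.random.default_rng(4)
lam, n, trials = 0.5, 50, 100_000

samples = rng.exponential(scale=1.0 / lam, size=(trials, n))  # numpy uses scale = 1/rate
empirical = samples.max(axis=1).mean()
harmonic = sum(1.0 / i for i in range(1, n + 1)) / lam

print(f"empirical E[max] = {empirical:.3f}   (1/lambda) * H_n = {harmonic:.3f}")
```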
## 3. Prove $\log(n)\leq\sum_{i=1}^{n}\frac{1}{i}\leq \log(n)+1$
To establish the inequality, we use the integral approximation of the harmonic series:
$$\int_1^n \frac{1}{x}dx \leq \sum_{i=1}^{n} \frac{1}{i} \leq 1 + \int_1^n \frac{1}{x}dx.$$
Since
$\int_1^n \frac{dx}{x} = \log(n)$, it follows that $\log(n) \leq \sum_{i=1}^{n} \frac{1}{i} \leq \log(n) + 1.$
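As a quick numerical confirmation (a small check added here, not from the original notes), the harmonic sum indeed sits between $\log(n)$ and $\log(n) + 1$:
```python
# Confirm log(n) <= H_n <= log(n) + 1 for a few values of n.
import numpy as np

for n in (10, 100, 1000, 10_000):
    H_n = sum(1.0 / i for i in range(1, n + 1))
    print(f"n={n:6d}   log(n)={np.log(n):.4f}   H_n={H_n:.4f}   log(n)+1={np.log(n) + 1:.4f}")
```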

Thus, we have completed the proof that
\begin{aligned}
&\frac{\log(n)}{d} \leq \mathbb{E}\left[\max_{i \leq n} \langle w, v_i \rangle^2 \right] \leq \frac{\log(n) + 1}{d}.
\end{aligned}