# Kernel Density Estimation
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
```
## What is Density Estimation?
The goal of **density estimation** is to approximate the unknown probability density function (PDF), $f(x)$, of a continuous random variable $X$ using only a collection of random samples of $X$.
The simplest form of density estimation is the **histogram** of the samples, which gives a stepped approximation of the underlying PDF.
For example, let $Y = X_1 + X_2$, where $X_1 \sim \text{Exp}(\lambda_1)$ and $X_2 \sim \text{Exp}(\lambda_2)$. We can approximate the PDF of $Y$ by generating samples and plotting the resulting histogram (below, $\lambda_1 = 1$ and $\lambda_2 = 1/2$, since SciPy's `scale` parameter is $1/\lambda$):
```python=
N = 1000
X1 = stats.expon.rvs(scale=1,size=N)   # Exp(lambda_1 = 1); scale = 1/lambda
X2 = stats.expon.rvs(scale=2,size=N)   # Exp(lambda_2 = 1/2)
Y = X1 + X2
freqs,bins,_ = plt.hist(Y,bins=np.arange(0,15,0.5),density=True,alpha=0.5)
mids = (bins[:-1] + bins[1:])/2
plt.plot(mids,freqs,'.-')
plt.grid(True)
plt.show()
```
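Since $X_1$ and $X_2$ are independent exponentials with distinct rates, the exact PDF of $Y$ follows from the convolution formula, $f_Y(y) = \frac{\lambda_1 \lambda_2}{\lambda_1 - \lambda_2}\left(e^{-\lambda_2 y} - e^{-\lambda_1 y}\right)$ for $y \ge 0$, so we can overlay the true density on the histogram as a check. A short sketch, reusing `Y` from the block above:
```python
# Exact PDF of Y = X1 + X2 with rates lam1 = 1, lam2 = 0.5 (lam = 1/scale)
lam1, lam2 = 1.0, 0.5
y = np.linspace(0,15,200)
f_exact = lam1*lam2/(lam1 - lam2)*(np.exp(-lam2*y) - np.exp(-lam1*y))

plt.hist(Y,bins=np.arange(0,15,0.5),density=True,alpha=0.5)
plt.plot(y,f_exact,'k--',label='exact PDF')
plt.legend(), plt.grid(True)
plt.show()
```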

## Kernel Density Functions
**[Kernel Density Estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) (KDE)** is a powerful, non-parametric method that produces a smooth, continuous approximation of the true PDF $f(x)$, in contrast to the stepped approximation given by a histogram.
### Kernel Functions
A **kernel** is a function $K : \mathbb{R} \rightarrow \mathbb{R}$ that serves as a symmetric probability density function. It must satisfy three conditions:
1. $K(x) \ge 0$ for all $x \in \mathbb{R}$.
2. $K(x) = K(-x)$ (symmetric about $0$).
3. $\int_{-\infty}^{\infty} K(x) \, dx = 1$.
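These conditions are easy to verify numerically; a minimal sketch for the Gaussian kernel, using `scipy.integrate.quad`:
```python
from scipy.integrate import quad

K = lambda x: 1/np.sqrt(2*np.pi)*np.exp(-x**2/2)   # Gaussian kernel

x = np.linspace(-3,3,100)
print(np.all(K(x) >= 0))          # condition 1: non-negative
print(np.allclose(K(x), K(-x)))   # condition 2: symmetric about 0
area,_ = quad(K, -np.inf, np.inf)
print(np.isclose(area, 1))        # condition 3: integrates to 1
```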
Common kernel functions include:
| Kernel Type | Function Definition |
| :--- | :--- |
| Triangular | $K(x) = 1 - \|x\|$, for $\|x\| \le 1$ |
| Rectangular | $K(x) = 1/2$, for $\|x\| \le 1$ |
| Gaussian | $K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ |
| Parabolic | $K(x) = \frac{3}{4}(1 - x^2)$, for $\|x\| \le 1$ |
```python=
# u(x): indicator function, 1 for |x| < 1 and 0 for |x| > 1
u = lambda x: np.heaviside(x + 1,1) - np.heaviside(x - 1,1)
x = np.linspace(-2,2,200)
K1 = (1 - np.abs(x))*u(x)
K2 = 0.5*u(x)
K3 = 1/np.sqrt(2*np.pi)*np.exp(-x**2/2)
K4 = 0.75*(1 - x**2)*u(x)
plt.plot(x,K1,label = 'Triangular')
plt.plot(x,K2,label = 'Rectangular')
plt.plot(x,K3,label = 'Gaussian')
plt.plot(x,K4,label = 'Parabolic')
plt.title('Kernel Functions'), plt.legend(), plt.grid(True)
plt.show()
```

### The KDE Formula
Given $N$ samples $x_1, \dots, x_N$, the **kernel density estimate** $\hat{f}_h(x)$ is defined by summing the contribution of a kernel placed at each sample point $x_i$:
$$
\hat{f}_h(x) = \frac{1}{Nh} \sum \limits_{i=1}^N K \left( \frac{x - x_i}{h} \right)
$$
The resulting function $\hat{f}_h(x)$ is a valid PDF that estimates $f(x)$.
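To make the formula concrete, here is a tiny worked evaluation of $\hat{f}_h(x)$ at a single point, using three hypothetical sample values and a Gaussian kernel:
```python
# Evaluate the KDE at x = 1 with h = 0.5 for three made-up samples
samples = np.array([0.5, 1.2, 2.0])   # hypothetical data
K = lambda x: 1/np.sqrt(2*np.pi)*np.exp(-x**2/2)
x, h = 1.0, 0.5

terms = K((x - samples)/h)            # one kernel contribution per sample
fhat = terms.sum()/(len(samples)*h)   # (1/Nh) * sum of contributions
print(terms, fhat)
```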
### The Bandwidth Parameter ($h$)
The parameter $h > 0$ is the **bandwidth**. It controls the smoothness of the estimated density function:
1. **Increasing** $h$ yields a smoother estimate but may obscure fine details or the true shape of the distribution (oversmoothing).
2. **Decreasing** $h$ yields a rougher estimate that better reflects local data variations (undersmoothing).
```python=
N = 1000
X1 = stats.expon.rvs(scale=1,size=N)
X2 = stats.expon.rvs(scale=2,size=N)
Y = X1 + X2
```
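Rather than guessing $h$, a common starting point is **Silverman's rule of thumb** for Gaussian kernels, $h \approx 0.9 \min(\hat{\sigma}, \text{IQR}/1.34)\, N^{-1/5}$. This heuristic is not part of the original example, but it is easy to compute for the samples `Y`:
```python
# Silverman's rule-of-thumb bandwidth for the samples Y
sigma = np.std(Y, ddof=1)
iqr = np.subtract(*np.percentile(Y, [75, 25]))
h_silverman = 0.9*min(sigma, iqr/1.34)*len(Y)**(-1/5)
print(h_silverman)
```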
```python=+
plt.hist(Y,bins=np.arange(0,15,0.5),density=True,alpha=0.5)
K = lambda x: 1/np.sqrt(2*np.pi)*np.exp(-x**2/2)   # Gaussian kernel
# KDE formula: sum of kernel contributions, scaled by 1/(N*h)
fh = lambda x,h: 1/(len(Y)*h)*sum([K((x - Y[i])/h) for i in range(len(Y))])
x = np.linspace(0,15,100)
plt.plot(x,fh(x,1),label='h = 1.0')
plt.plot(x,fh(x,0.5),label='h = 0.5')
plt.plot(x,fh(x,0.2),label='h = 0.2')
plt.title('Kernel Density Functions (Gaussian)'), plt.legend(), plt.grid(True)
plt.show()
```

```python=+
plt.hist(Y,bins=np.arange(0,15,0.5),density=True,alpha=0.5)
K = lambda x: 0.5*u(x)   # rectangular kernel (u defined in the kernel plot code above)
fh = lambda x,h: 1/(len(Y)*h)*sum([K((x - Y[i])/h) for i in range(len(Y))])
x = np.linspace(0,15,100)
plt.plot(x,fh(x,0.5),label='h = 0.5')
plt.plot(x,fh(x,0.2),label='h = 0.2')
plt.title('Kernel Density Functions (Rectangular)'), plt.legend(), plt.grid(True)
plt.show()
```
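The examples above evaluate the sum with a Python-level loop over all samples. The same estimate can be computed in one vectorized expression using NumPy broadcasting; a minimal sketch with the Gaussian kernel and the same data `Y`:
```python
K = lambda x: 1/np.sqrt(2*np.pi)*np.exp(-x**2/2)   # Gaussian kernel

def fh_vec(x, h):
    # Pairwise (x_j - Y_i) matrix via broadcasting, then average over samples
    return K((x[:,None] - Y[None,:])/h).sum(axis=1)/(len(Y)*h)

x = np.linspace(0,15,100)
plt.plot(x,fh_vec(x,0.5),label='h = 0.5 (vectorized)')
plt.legend(), plt.grid(True)
plt.show()
```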

## KDE with SciPy
The `scipy.stats` module provides a convenient way to perform KDE with the Gaussian kernel via the [`scipy.stats.gaussian_kde`](https://docs.scipy.org/doc/scipy/tutorial/stats/kernel_density_estimation.html) class.
* By default, `gaussian_kde` chooses the bandwidth automatically using **Scott's rule**.
* You can control the bandwidth via the `bw_method` parameter, which accepts `'scott'`, `'silverman'`, a scalar, or a callable. Note that a scalar sets the kernel's scaling *factor* relative to the data's standard deviation, not $h$ itself, as the sketch below illustrates.
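For one-dimensional data, Scott's factor is $N^{-1/5}$, and the kernel's standard deviation (the effective $h$) is that factor times the sample standard deviation. A quick sketch to confirm this, using made-up normal data (`kde.factor` and `kde.covariance` are attributes of `gaussian_kde`):
```python
# Scott's rule for 1-D data: factor = N**(-1/5); the kernel's standard
# deviation is factor * std(data)
data = stats.norm.rvs(size=1000)
kde = stats.gaussian_kde(data)                  # default bw_method='scott'
print(kde.factor, len(data)**(-1/5))            # should match
print(np.sqrt(kde.covariance), kde.factor*data.std(ddof=1))  # effective h
```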
The example below demonstrates using `gaussian_kde` on the same sum of exponential random variables, showing the fit for the automatically chosen bandwidth factor and two user-specified values.
```python=
N = 1000
X1 = stats.expon.rvs(scale=1,size=N)
X2 = stats.expon.rvs(scale=2,size=N)
Y = X1 + X2
plt.hist(Y,bins=np.arange(0,15,0.5),density=True,alpha=0.5)
kde1 = stats.gaussian_kde(Y)
kde2 = stats.gaussian_kde(Y,bw_method=1)
kde3 = stats.gaussian_kde(Y,bw_method=0.1)
x = np.linspace(0,15,100)
plt.plot(x,kde1(x),label='factor = {:.2f} (Scott)'.format(kde1.factor))
plt.plot(x,kde2(x),label='factor = 1.00')
plt.plot(x,kde3(x),label='factor = 0.10')
plt.title('Kernel Density Functions (Gaussian)'), plt.legend(), plt.grid(True)
plt.show()
```
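The fitted `gaussian_kde` object is more than a plotting helper: it also provides methods such as `resample` (draw new samples from the estimated density) and `integrate_box_1d` (probability mass over an interval). A short usage sketch with `kde1` from above:
```python
new_samples = kde1.resample(size=500)   # draw 500 samples from the KDE
p = kde1.integrate_box_1d(0, 5)         # estimate P(0 <= Y <= 5)
print(new_samples.shape, p)
```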
