# Kernel Density Estimation

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
```

## What is Density Estimation?

The goal of **density estimation** is to approximate the unknown probability density function (PDF), $f(x)$, of a continuous random variable $X$ using only a collection of random samples of $X$.

The simplest form of density estimation is to plot the **histogram** of the samples, which gives a stepped approximation of the underlying PDF.

For example, let $Y = X_1 + X_2$, where $X_1 \sim \text{Exp}(\lambda_1)$ and $X_2 \sim \text{Exp}(\lambda_2)$ are independent. Note that SciPy's `scale` parameter is $1/\lambda$, so the code below draws $X_1$ with $\lambda_1 = 1$ and $X_2$ with $\lambda_2 = 1/2$; for these rates the exact PDF is $f_Y(y) = e^{-y/2} - e^{-y}$. We can approximate the PDF of $Y$ by generating samples and plotting the resulting histogram:

```python=
N = 1000
X1 = stats.expon.rvs(scale=1,size=N)
X2 = stats.expon.rvs(scale=2,size=N)
Y = X1 + X2
freqs,bins,_ = plt.hist(Y,bins=np.arange(0,15,0.5),density=True,alpha=0.5)
mids = (bins[:-1] + bins[1:])/2
plt.plot(mids,freqs,'.-')
plt.grid(True)
plt.show()
```

![image](https://hackmd.io/_uploads/Sy-4TTRall.png)

## Kernel Density Functions

**[Kernel Density Estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) (KDE)** is a non-parametric method that provides a smoother, continuous approximation of the true PDF, $f(x)$, than a histogram.

### Kernel Functions

A **kernel** is a function $K : \mathbb{R} \rightarrow \mathbb{R}$ that serves as a symmetric probability density function. It must satisfy three conditions:

1. $K(x) \ge 0$ for all $x \in \mathbb{R}$.
2. $K(x) = K(-x)$ (symmetric about $0$).
3. $\int_{-\infty}^{\infty} K(x) \, dx = 1$.

Common kernel functions include:

| Kernel Type | Function Definition |
| :--- | :--- |
| Triangular | $K(x) = 1 - \|x\|$, for $\|x\| \le 1$ |
| Rectangular | $K(x) = 1/2$, for $\|x\| \le 1$ |
| Gaussian | $K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ |
| Parabolic | $K(x) = \frac{3}{4}(1 - x^2)$, for $\|x\| \le 1$ |

```python=
# indicator function: u(x) = 1 for -1 <= x < 1, else 0
u = lambda x: np.heaviside(x + 1,1) - np.heaviside(x - 1,1)

x = np.linspace(-2,2,200)
K1 = (1 - np.abs(x))*u(x)
K2 = 0.5*u(x)
K3 = 1/np.sqrt(2*np.pi)*np.exp(-x**2/2)
K4 = 0.75*(1 - x**2)*u(x)
plt.plot(x,K1,label = 'Triangular')
plt.plot(x,K2,label = 'Rectangular')
plt.plot(x,K3,label = 'Gaussian')
plt.plot(x,K4,label = 'Parabolic')
plt.title('Kernel Functions'), plt.legend(), plt.grid(True)
plt.show()
```

![image](https://hackmd.io/_uploads/B1m2ApR6xx.png)

### The KDE Formula

Given $N$ samples $x_1, \dots, x_N$, the kernel density function $\hat{f}_h(x)$ is defined by summing the contribution of a kernel placed at each sample point $x_i$:

$$ \hat{f}_h(x) = \frac{1}{Nh} \sum\limits_{i=1}^N K \left( \frac{x - x_i}{h} \right) $$

The resulting function $\hat{f}_h(x)$ is itself a valid PDF that estimates $f(x)$; the sketch after the next subsection verifies this numerically.

### The Bandwidth Parameter ($h$)

The parameter $h > 0$ is the **bandwidth**. It controls the smoothness of the estimated density function:

1. **Increasing** $h$ yields a smoother estimate but may obscure fine details or the true shape of the distribution (oversmoothing).
2. **Decreasing** $h$ yields a rougher estimate that better reflects local data variations (undersmoothing).
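Regardless of the bandwidth chosen, $\hat{f}_h$ remains a valid PDF: each scaled kernel $\frac{1}{h}K\left(\frac{x - x_i}{h}\right)$ integrates to $1$, so their average does too. The sketch below is a self-contained numerical check of this claim; the sample set and grid are arbitrary choices, and `np.trapezoid` is called `np.trapz` on NumPy < 2.0:

```python=
import numpy as np

rng = np.random.default_rng(0)
samples = rng.exponential(scale=1, size=200)  # any sample set works here

# Gaussian kernel and the KDE formula, vectorized over the evaluation grid
K = lambda x: 1/np.sqrt(2*np.pi)*np.exp(-x**2/2)
fh = lambda x, data, h: K((x[:,None] - data[None,:])/h).sum(axis=1)/(len(data)*h)

x = np.linspace(-10, 20, 2000)  # grid wide enough to capture the tails
for h in [1.0, 0.5, 0.2]:
    area = np.trapezoid(fh(x, samples, h), x)  # np.trapz on NumPy < 2.0
    print('h = {}: integral = {:.4f}'.format(h, area))  # ~1.0 for every h
```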
The examples below apply the KDE formula to the earlier sum of exponentials: first with the Gaussian kernel at three bandwidths, then with the rectangular kernel.

```python=
N = 1000
X1 = stats.expon.rvs(scale=1,size=N)
X2 = stats.expon.rvs(scale=2,size=N)
Y = X1 + X2
```

```python=+
plt.hist(Y,bins=np.arange(0,15,0.5),density=True,alpha=0.5)

# Gaussian kernel
K = lambda x: 1/np.sqrt(2*np.pi)*np.exp(-x**2/2)
# KDE formula: average of kernels centered at the samples, scaled by 1/h
fh = lambda x,h: 1/(len(Y)*h)*sum([K((x - Y[i])/h) for i in range(len(Y))])

x = np.linspace(0,15,100)
plt.plot(x,fh(x,1),label='h = 1.0')
plt.plot(x,fh(x,0.5),label='h = 0.5')
plt.plot(x,fh(x,0.2),label='h = 0.2')
plt.title('Kernel Density Functions (Gaussian)'), plt.legend(), plt.grid(True)
plt.show()
```

![image](https://hackmd.io/_uploads/BkZbkCA6xe.png)

The same construction with the rectangular kernel (reusing the indicator `u` defined earlier) produces a noticeably blockier estimate:

```python=+
plt.hist(Y,bins=np.arange(0,15,0.5),density=True,alpha=0.5)

# rectangular kernel
K = lambda x: 0.5*u(x)
fh = lambda x,h: 1/(len(Y)*h)*sum([K((x - Y[i])/h) for i in range(len(Y))])

x = np.linspace(0,15,100)
plt.plot(x,fh(x,0.5),label='h = 0.5')
plt.plot(x,fh(x,0.2),label='h = 0.2')
plt.title('Kernel Density Functions (Rectangular)'), plt.legend(), plt.grid(True)
plt.show()
```

![image](https://hackmd.io/_uploads/SyszyC0all.png)

## KDE with SciPy

The `scipy.stats` module provides a convenient way to perform Gaussian-kernel KDE via the [`scipy.stats.gaussian_kde`](https://docs.scipy.org/doc/scipy/tutorial/stats/kernel_density_estimation.html) class.

* By default, `gaussian_kde` computes the bandwidth automatically using Scott's rule.
* You can specify the bandwidth manually via the `bw_method` parameter. Note that a scalar `bw_method` sets the kernel's scaling *factor* (exposed as the `factor` attribute); the effective bandwidth is this factor times the sample standard deviation.

The example below demonstrates `gaussian_kde` on the same sum of exponential random variables, showing the fit for the automatically chosen bandwidth and two user-specified bandwidths.

```python=
N = 1000
X1 = stats.expon.rvs(scale=1,size=N)
X2 = stats.expon.rvs(scale=2,size=N)
Y = X1 + X2

plt.hist(Y,bins=np.arange(0,15,0.5),density=True,alpha=0.5)

kde1 = stats.gaussian_kde(Y)
kde2 = stats.gaussian_kde(Y,bw_method=1)
kde3 = stats.gaussian_kde(Y,bw_method=0.1)

x = np.linspace(0,15,100)
plt.plot(x,kde1(x),label='h = {:.2f}'.format(kde1.factor))
plt.plot(x,kde2(x),label='h = 1.00')
plt.plot(x,kde3(x),label='h = 0.10')
plt.title('Kernel Density Functions (Gaussian)'), plt.legend(), plt.grid(True)
plt.show()
```

![image](https://hackmd.io/_uploads/BJXd10R6gx.png)
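For this example the true density is known in closed form ($f_Y(y) = e^{-y/2} - e^{-y}$, as noted earlier), so we can overlay it to judge how well the automatically chosen bandwidth performs. A minimal sketch, assuming `Y` and `kde1` from the previous block are still in scope:

```python=
x = np.linspace(0,15,100)
fY = np.exp(-x/2) - np.exp(-x)  # exact PDF of Exp(1) + Exp(1/2)

plt.hist(Y,bins=np.arange(0,15,0.5),density=True,alpha=0.5)
plt.plot(x,kde1(x),label='KDE (factor = {:.2f})'.format(kde1.factor))
plt.plot(x,fY,'--',label='exact PDF')
plt.title('KDE vs. Exact PDF'), plt.legend(), plt.grid(True)
plt.show()
```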
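A fitted `gaussian_kde` object also provides a few utilities beyond point evaluation. The sketch below (again assuming `kde1` from above is in scope) uses `integrate_box_1d` to integrate the estimated density over an interval and `resample` to draw new samples from it:

```python=
# probability the estimated density assigns to the interval [0, 5]
p = kde1.integrate_box_1d(0, 5)
print('P(0 <= Y <= 5) ~ {:.3f}'.format(p))

# draw 500 new samples from the fitted density; result has shape (1, 500)
Y_new = kde1.resample(500)
plt.hist(Y_new[0],bins=np.arange(0,15,0.5),density=True,alpha=0.5)
plt.plot(np.linspace(0,15,100),kde1(np.linspace(0,15,100)))
plt.title('Samples Drawn from the KDE'), plt.grid(True)
plt.show()
```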