Distribution - HackMD

# Probability Distributions ```python import numpy as np import matplotlib.pyplot as plt import scipy.special as sps import pandas as pd ``` ## Variables and Densities ### Random Variables A **random variable** is a function that assigns a numerical value to the outcome of a random experiment (e.g., a coin flip or measuring wind speed). | Type | Definition | Example | | :---: | :--- | :--- | | **Discrete** | The range of possible values is finite or countable (e.g., integers). | The total number of points scored in a basketball game (a positive integer). | | **Continuous** | The range of possible values is uncountably infinite (e.g., real numbers, $\mathbb{R}$). | Temperature measured at a specific time and location (positive real numbers). | ### Probability Distributions and Density Functions A **probability density function (PDF)**, denoted $f_X(x)$, is a non-negative function ($f(x) \ge 0$ for all $x \in \mathbb{R}$) used to define the probability distribution of a continuous random variable $X$. The probability that $X$ falls within an interval $[a, b]$ is given by the integral of the PDF over that interval: $$ P(a \le X \le b) = \int_a^b f(x) \, dx $$ For $f(x)$ to be a valid PDF, the total area under the curve must equal 1:$$\int_{-\infty}^{\infty} f(x) \, dx = 1$$ ### Mean and Variance Given a continuous random variable $X$ with PDF $f_X(x)$: * **Mean $\mu$**: Describes the central value of the distribution. $$ \mu = \int_{-\infty}^{\infty} x f_X(x) \, dx $$ * **Variance $\sigma^2$**: Describes how spread out the distribution is; it's the average squared distance from the mean. $$ \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x) \, dx $$ ### Scaling and Shifting PDFs Transformations of a PDF $f(x)$ maintain its property as a valid density function: * **Shifting**: The shifted function $f(x - b)$ is a valid PDF for any shift $b \in \mathbb{R}$. The integral remains 1 because $f(x) \ge 0$ and the area under the curve is simply translated. * **Scaling**: The scaled function $a f(ax)$ is a valid PDF for any scaling factor $a > 0$. The scaling factor $a$ ensures the total integral remains 1, compensating for the compression or stretching of the $x$-axis. ### Parameter Estimation When you have a sample of $N$ observed values ($x_1, \dots, x_N$) of a random variable $X$, you can estimate its true mean and variance: * **Sample Mean $\hat{\mu}$**: An estimate of the mean μ. $$ \hat{\mu} = \frac{1}{N} \sum_{i=1}^N x_i $$ * **Sample Variance $\hat{\sigma}^2$:** The unbiased sample estimate of the variance $\sigma^2$. $$ \hat{\sigma}^2 = \frac{1}{N - 1} \sum_{i=1}^N (x_i - \hat{\mu})^2 $$ ## Normal Distribution ### Density Function The standard normal distribution has a mean $\mu$ of 0 and a variance $\sigma^2$ of 1, and is defined by the probability density function (PDF): $$ f(x) = \frac{1}{\sqrt{2 \pi}} e^{-x^2/2} $$ The constant $\frac1{\sqrt{2\pi}}$ ensures that the total area under the curve is 1, a result derived from the Gaussian integral formula: $\int_0^{\infty} e^{-x^2} dx = \frac{\sqrt{\pi}}{2}$. #### General Normal Distribution If the standard normal PDF is scaled by $1/\sigma$ and shifted by $\mu$, the resulting function is the normal distribution, denoted $N(\mu,\sigma^2)$, with mean $\mu$ and variance $\sigma^2$: $$ f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-(x - \mu)^2/2\sigma^2} $$ ```python= normal = lambda x,mu,sigma: 1/(sigma*np.sqrt(2*np.pi))*np.exp(-(x - mu)**2/(2*sigma**2)) ``` #### Effect of Parameters The parameters $\mu$ and $\sigma^2$ control the position and spread of the distribution: * **Mean $\mu$**: Increasing the value of $\mu$ simply shifts the entire curve along the $x$-axis. * **Variance $\sigma^2$**: Increasing the value of $\sigma^2$ (or $\sigma$, the standard deviation) causes the curve to become flatter and wider, indicating greater spread in the data. ```python=+ x = np.linspace(-5,10,200) plt.figure(figsize=(10,4)) for mu in range(0,6): y = normal(x,mu,1) plt.plot(x,y) plt.title("Normal Distribution $N(\mu,\sigma^2)$ for $\sigma=1$") plt.legend(['$\mu=0$','$\mu=1$','$\mu=2$','$\mu=3$','$\mu=4$']) plt.grid(True) plt.show() ``` ![image](https://hackmd.io/_uploads/ryRH5qRpge.png) ```python=+ x = np.linspace(-10,10,200) plt.figure(figsize=(10,4)) for sigma2 in range(1,6): sigma = np.sqrt(sigma2) y = normal(x,0,sigma) plt.plot(x,y) plt.title('Normal Distribution $N(\mu,\sigma^2)$ for $\mu=0$') plt.legend(['$\sigma^2=1$','$\sigma^2=2$','$\sigma^2=3$','$\sigma^2=4$','$\sigma^2=5$']) plt.grid(True) plt.show() ``` ![image](https://hackmd.io/_uploads/rJKwqqC6ll.png) ### Example: Temperature Distribution The normal distribution is often a good fit for naturally occurring data, like temperature. The following analysis uses daily average temperature data measured at the Vancouver Airport from 1995 to 2023. ```python= df = pd.read_csv('temperature.csv') df.head() ``` ``` day month year dayofyear avg_temperature 0 13 4 2023 103 7.10 1 12 4 2023 102 5.19 2 11 4 2023 101 8.00 3 10 4 2023 100 7.69 4 9 4 2023 99 9.30 ``` #### Data Visualization and Subsetting A scatter plot of average temperature versus the day of the year shows temperature variation throughout the year. To focus on a stable subset, we examine the distribution of temperatures specifically in July. ```python=+ df.plot('dayofyear','avg_temperature',kind='scatter',alpha=0.1,lw=0,figsize=(10,4)) plt.ylim([-15,35]) plt.xlabel('Day of the Year'), plt.ylabel('Temperature (Celsius)') plt.title('Average Daily Temperature (1995-2023)') plt.grid(True) plt.show() ``` ![image](https://hackmd.io/_uploads/HJXZo5Cael.png) The histogram of July temperatures suggests a distribution that can be approximated by a normal curve. ```python=+ temp7 = df[df['month'] == 7]['avg_temperature'] temp7.hist(bins=40) plt.show() ``` ![image](https://hackmd.io/_uploads/H1dfj5Cpge.png) #### Parameter Estimation and Fit We estimate the mean and variance of the July temperature data using the sample mean $\hat{\mu}$ and sample variance $\hat{\sigma}^2$: * Sample mean $\hat{\mu}$ $\approx 18.34$ * Sample variance $\hat{\sigma}^2$ $\approx 4.00$ The estimated normal distribution $N(18.34, 4.00)$ can then be plotted over the histogram for comparison. ```python=+ mu = temp7.mean() sigma2 = temp7.var() print('mean =',mu,', variance =',sigma2) ``` ``` mean = 18.33942652329749 , variance = 4.003297756855476 ``` ```python=+ temp7.hist(bins=40,density=True) x = np.linspace(10,30,200) y = normal(x,mu,sigma2**.5) plt.plot(x,y) plt.show() ``` ![image](https://hackmd.io/_uploads/HJ1HjcApge.png) ## Uniform Distribution The uniform distribution, denoted $U(a,b)$, describes a continuous random variable $X$ where all outcomes are equally likely within a specified interval $[a, b]$. ### Density Function The probability density function (PDF) for the uniform distribution, with parameters $a < b$, is constant over the interval and zero elsewhere: $$ f(x) = \left\{ \begin{array}{ccc} \displaystyle \frac{1}{b-a} & , & a \le x \le b \\ 0 & , & \text{otherwise} \end{array} \right. $$ The height of the rectangle, $\frac{1}{b-a}$, ensures that the total area under the PDF (which is the probability) is equal to 1. ### Mean and Variance The mean and variance of the uniform distribution are directly determined by the parameters $a$ and $b$: * **Mean $\mu$**: The average value is the midpoint of the interval. $$\mu = \frac{b+a}{2}$$ * **Variance $\sigma^2$**: $$\sigma^2 = \frac{(b-a)^2}{12}$$ ```python= plt.figure(figsize=(10,4)) x = np.linspace(-2,6,1000) uniform = lambda x,a,b: 1/(b-a)*(np.heaviside(x-a,1) - np.heaviside(x-b,1)) for a,b in [(0,1),(1,3),(2,5),(-1,4)]: y = uniform(x,a,b) plt.plot(x,y) plt.title('Uniform Distribution $U(a,b)$') plt.legend(['$a=0,b=1$','$a=1,b=3$','$a=2,b=5$','$a=-1,b=4$']) plt.grid(True) plt.show() ``` ![image](https://hackmd.io/_uploads/S15E25Cael.png) ## Exponential Distribution The **exponential distribution**, denoted $\text{Exp}(\lambda)$, is typically used to model the waiting time until an event occurs or the lifetime of a device. It is characterized by a single rate parameter, $\lambda$. ### Density Function The probability density function (PDF) for the exponential distribution is defined for non-negative values ($x \ge 0$): $$ f(x) = \left\{ \begin{array}{ccc} \lambda e^{- \lambda x} & , & x \ge 0 \\ 0 & , & x < 0 \end{array} \right. $$ ```python= exponential = lambda x,lam: lam*np.exp(-lam*x)*np.heaviside(x,1) ``` ### Mean and Variance The mean $\mu$ and variance $\sigma^2$ are simple functions of the rate parameter $\lambda$: * **Mean $\mu$**: The expected waiting time or average value. $$\mu = \frac{1}{\lambda}$$ * **Variance $\sigma^2$**: $$\sigma^2 = \frac{1}{\lambda^2}$$ ```python=+ plt.figure(figsize=(10,4)) x = np.linspace(-1,4,1000) for lam in [.25,.5,1,2]: y = exponential(x,lam) plt.plot(x,y) plt.title('Exponential Distribution $Exp(\lambda)$') plt.legend(['$\lambda=1/4$','$\lambda=1/2$','$\lambda=1$','$\lambda=2$']) plt.grid(True) plt.show() ``` ![image](https://hackmd.io/_uploads/SyWjT5RTgl.png) As shown in the plot, increasing the value of $\lambda$ causes the distribution to decay more rapidly and become concentrated closer to the origin. ### Example: Precipitation Data The exponential distribution is often a good model for the magnitude of natural events, such as rainfall. This example uses daily precipitation data from the Vancouver Airport (1995–2023). ```python= df = pd.read_csv('precipitation.csv') df.head() ``` ``` day month year dayofyear precipitation 0 13 4 2023 103 0.0 1 12 4 2023 102 0.0 2 11 4 2023 101 6.2 3 10 4 2023 100 0.0 4 9 4 2023 99 9.1 ``` #### Fitting the Data 1. **Filter and Shift**: The histogram of daily precipitation data is zero-inflated (many days have $0 \text{mm}$ of rain). To fit the exponential distribution, we focus on days with a minimum of $2 \text{mm}$ of precipitation and then shift the data by subtracting $2 \text{mm}$ to model the excess precipitation. ```python=+ df['precipitation'].hist(bins=np.arange(0,40.5,0.5),density=True) plt.xlabel('Frequency'), plt.ylabel('Precipitation (mm)') plt.title('Daily Precipitation (1995-2023)') plt.grid(True) plt.show() ``` ![image](https://hackmd.io/_uploads/S1tOA5R6ee.png) ```python=+ df = df[df['precipitation'] > 2] df['precipitation'].hist(bins=np.arange(0,40.5,0.5),density=True) plt.ylabel('Precipitation (mm)'), plt.xlabel('Frequency') plt.title('Days with Precipitation above 2mm (1995-2023)') plt.grid(True) plt.show() ``` ![image](https://hackmd.io/_uploads/r1DqCc0pgx.png) ```python=+ df['precipitation2'] = df['precipitation'] - 2 df['precipitation2'].hist(bins=np.arange(0,40.5,0.5),density=True) plt.xlabel('Precipitation in excess of 2mm (mm)'), plt.ylabel('Frequency') plt.title('Days with Precipitation above 2mm (1995-2023)') plt.grid(True) plt.show() ``` ![image](https://hackmd.io/_uploads/r1I3RqR6gl.png) 2. **Estimate Parameter**: The sample mean $\hat{\mu}$ of the shifted data (precipitation in excess of $2 \text{mm}$) is calculated. This provides an estimate for the exponential parameter $\lambda$: $$\hat{\mu} \approx 8.065 \text{ mm}, \quad \lambda = \frac{1}{\hat{\mu}} \approx 0.124$$ ```python=+ mu = df['precipitation2'].mean() sigma2 = df['precipitation2'].var() print('mean =',mu,', variance =',sigma2) ``` ``` mean = 8.064954486345904 , variance = 71.13641043695192 ``` ```python=+ lam = 1/mu print('lambda =',lam) ``` ``` lambda = 0.12399326018429686 ``` 3. **Model Comparison:** The resulting exponential PDF is then plotted against the histogram of the original precipitation data (shifted back by $2 \text{mm}$) to evaluate the goodness of fit. ```python=+ df['precipitation2'].hist(bins=np.arange(0,40.5,0.5),density=True) x = np.linspace(0,40,200) y = exponential(x,lam) plt.plot(x,y) plt.xlabel('Precipitation in excess of 2mm (mm)'), plt.ylabel('Frequency') plt.title('Days with Precipitation above 2mm (1995-2023)') plt.grid(True) plt.show() ``` ![image](https://hackmd.io/_uploads/ryxPJj0plx.png) ```python=+ df['precipitation'].hist(bins=np.arange(0,40.5,0.5),density=True) x = np.linspace(0,40,400) y = exponential(x-2,lam) plt.plot(x,y) plt.title('Days with Precipitation above 2mm (1995-2023)') plt.xlabel('Precipitation (mm)'), plt.ylabel('Frequency') plt.grid(True) plt.show() ``` ![image](https://hackmd.io/_uploads/Skvdys06lg.png) ## Gamma Distribution The gamma distribution, denoted $\Gamma(\alpha,\beta)$, is a flexible distribution defined for non-negative values and characterized by two parameters: the shape parameter $\alpha$ and the rate parameter $\beta$. It's commonly used to model non-negative, continuous variables, particularly those that are skewed, such as waiting times or insurance claims. ### Density Function The probability density function (PDF) for the gamma distribution is: $$ f(x) = \left\{ \begin{array}{ccc} \displaystyle \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} & , & x \ge 0 \\ 0 & , & x < 0 \end{array} \right. $$ where $\Gamma(\alpha)$ is the [Gamma function](https://en.wikipedia.org/wiki/Gamma_function) (a generalization of the factorial function to real and complex numbers). ```python= gamma = lambda x,alpha,beta: beta**alpha/sps.gamma(alpha)*x**(alpha - 1)*np.exp(-beta*x)*np.heaviside(x,1) ``` #### Relationship to Exponential Distribution Note that the exponential distribution is a special case of the gamma distribution: $\text{Exp}(\lambda) = \Gamma(1,\lambda)$. ### Mean and Variance The mean $\mu$ and variance $\sigma^2$ are expressed in terms of $\alpha$ and $\beta$: * **Mean $\mu$**: $$\mu = \frac{\alpha}{\beta}$$ * **Variance $\sigma^2$**: $$\sigma^2 = \frac{\alpha}{\beta^2}$$ #### Estimating Parameters If you have estimates for the sample mean $\hat{\mu}$ and sample variance $\hat{\sigma}^2$ from observed data, you can solve for the distribution parameters: $$ \alpha = \frac{\mu^2}{\sigma^2} \quad \text{and} \quad \beta = \frac{\mu}{\sigma^2} $$ ```python=+ plt.figure(figsize=(10,4)) x = np.linspace(-1,8,1000) for alpha,beta in [(2,1),(3,2),(7,4),(7,2)]: y = gamma(x,alpha,beta) plt.plot(x,y) plt.title('Gamma Distribution $\Gamma(a b)$') plt.legend(['$a=2,b=1$','$a=3,b=2$','$a=7,b=4$','$a=7,b=2$']) plt.grid(True) plt.show() ``` ![image](https://hackmd.io/_uploads/H1dYxsRple.png) ### Example: Wind Speed Distribution The gamma distribution is often a suitable model for wind speed, which is a non-negative quantity that tends to be positively skewed. This example uses average daily wind speed data from the Vancouver Airport (1995–2023). ```python= df = pd.read_csv('wind.csv') df.head() ``` ``` day month year dayofyear avg_wind_speed 0 13 4 2023 103 13.0 1 12 4 2023 102 13.0 2 11 4 2023 101 17.5 3 10 4 2023 100 11.5 4 9 4 2023 99 19.5 ``` #### Data Visualization The histogram of the wind speed data shows a distribution concentrated near zero and skewed to the right, which is characteristic of the gamma distribution. ```python=+ df['avg_wind_speed'].hist(bins=range(40),density=True) plt.ylabel('Frequency'), plt.xlabel('Wind Speed (m/s)') plt.title('Average Daily Wind Speed (1995-2023)') plt.grid(True) plt.show() ``` ![image](https://hackmd.io/_uploads/BkwAeo0axg.png) To fit a gamma distribution, you would calculate the sample mean and sample variance of the wind speed data and use the formulas above to estimate the parameters $\alpha$ and $\beta$. ```python=+ mu = df['avg_wind_speed'].mean() sigma2 = df['avg_wind_speed'].var() print('mean =',mu,', variance =',sigma2) ``` ``` mean = 13.9063 , variance = 33.23244355435653 ``` ```python=+ alpha = mu**2/sigma2 beta = mu/sigma2 df['avg_wind_speed'].hist(bins=range(40),density=True) x = np.linspace(0,40,200) y = gamma(x,alpha,beta) plt.plot(x,y) plt.xlabel('Wind Speed (m/s)'), plt.ylabel('Frequency') plt.title('Average Daily Wind Speed (1995-2023)') plt.grid(True) plt.show() ``` ![image](https://hackmd.io/_uploads/r1uZ-jC6ge.png)