# Probability Distributions
```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.special as sps
import pandas as pd
```
## Variables and Densities
### Random Variables
A **random variable** is a function that assigns a numerical value to the outcome of a random experiment (e.g., a coin flip or measuring wind speed).
| Type | Definition | Example |
| :---: | :--- | :--- |
| **Discrete** | The range of possible values is finite or countable (e.g., integers). | The total number of points scored in a basketball game (a positive integer). |
| **Continuous** | The range of possible values is uncountably infinite (e.g., real numbers, $\mathbb{R}$). | Temperature measured at a specific time and location (positive real numbers). |
### Probability Distributions and Density Functions
A **probability density function (PDF)**, denoted $f_X(x)$, is a non-negative function ($f(x) \ge 0$ for all $x \in \mathbb{R}$) used to define the probability distribution of a continuous random variable $X$.
The probability that $X$ falls within an interval $[a, b]$ is given by the integral of the PDF over that interval:
$$
P(a \le X \le b) = \int_a^b f(x) \, dx
$$
For $f(x)$ to be a valid PDF, the total area under the curve must equal 1:$$\int_{-\infty}^{\infty} f(x) \, dx = 1$$
### Mean and Variance
Given a continuous random variable $X$ with PDF $f_X(x)$:
* **Mean $\mu$**: Describes the central value of the distribution.
$$
\mu = \int_{-\infty}^{\infty} x f_X(x) \, dx
$$
* **Variance $\sigma^2$**: Describes how spread out the distribution is; it's the average squared distance from the mean.
$$
\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x) \, dx
$$
### Scaling and Shifting PDFs
Transformations of a PDF $f(x)$ maintain its property as a valid density function:
* **Shifting**: The shifted function $f(x - b)$ is a valid PDF for any shift $b \in \mathbb{R}$. The integral remains 1 because $f(x) \ge 0$ and the area under the curve is simply translated.
* **Scaling**: The scaled function $a f(ax)$ is a valid PDF for any scaling factor $a > 0$. The scaling factor $a$ ensures the total integral remains 1, compensating for the compression or stretching of the $x$-axis.
### Parameter Estimation
When you have a sample of $N$ observed values ($x_1, \dots, x_N$) of a random variable $X$, you can estimate its true mean and variance:
* **Sample Mean $\hat{\mu}$**: An estimate of the mean μ.
$$
\hat{\mu} = \frac{1}{N} \sum_{i=1}^N x_i
$$
* **Sample Variance $\hat{\sigma}^2$:** The unbiased sample estimate of the variance $\sigma^2$.
$$
\hat{\sigma}^2 = \frac{1}{N - 1} \sum_{i=1}^N (x_i - \hat{\mu})^2
$$
## Normal Distribution
### Density Function
The standard normal distribution has a mean $\mu$ of 0 and a variance $\sigma^2$ of 1, and is defined by the probability density function (PDF):
$$
f(x) = \frac{1}{\sqrt{2 \pi}} e^{-x^2/2}
$$
The constant $\frac1{\sqrt{2\pi}}$ ensures that the total area under the curve is 1, a result derived from the Gaussian integral formula: $\int_0^{\infty} e^{-x^2} dx = \frac{\sqrt{\pi}}{2}$.
#### General Normal Distribution
If the standard normal PDF is scaled by $1/\sigma$ and shifted by $\mu$, the resulting function is the normal distribution, denoted $N(\mu,\sigma^2)$, with mean $\mu$ and variance $\sigma^2$:
$$
f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-(x - \mu)^2/2\sigma^2}
$$
```python=
normal = lambda x,mu,sigma: 1/(sigma*np.sqrt(2*np.pi))*np.exp(-(x - mu)**2/(2*sigma**2))
```
#### Effect of Parameters
The parameters $\mu$ and $\sigma^2$ control the position and spread of the distribution:
* **Mean $\mu$**: Increasing the value of $\mu$ simply shifts the entire curve along the $x$-axis.
* **Variance $\sigma^2$**: Increasing the value of $\sigma^2$ (or $\sigma$, the standard deviation) causes the curve to become flatter and wider, indicating greater spread in the data.
```python=+
x = np.linspace(-5,10,200)
plt.figure(figsize=(10,4))
for mu in range(0,6):
y = normal(x,mu,1)
plt.plot(x,y)
plt.title("Normal Distribution $N(\mu,\sigma^2)$ for $\sigma=1$")
plt.legend(['$\mu=0$','$\mu=1$','$\mu=2$','$\mu=3$','$\mu=4$'])
plt.grid(True)
plt.show()
```

```python=+
x = np.linspace(-10,10,200)
plt.figure(figsize=(10,4))
for sigma2 in range(1,6):
sigma = np.sqrt(sigma2)
y = normal(x,0,sigma)
plt.plot(x,y)
plt.title('Normal Distribution $N(\mu,\sigma^2)$ for $\mu=0$')
plt.legend(['$\sigma^2=1$','$\sigma^2=2$','$\sigma^2=3$','$\sigma^2=4$','$\sigma^2=5$'])
plt.grid(True)
plt.show()
```

### Example: Temperature Distribution
The normal distribution is often a good fit for naturally occurring data, like temperature. The following analysis uses daily average temperature data measured at the Vancouver Airport from 1995 to 2023.
```python=
df = pd.read_csv('temperature.csv')
df.head()
```
```
day month year dayofyear avg_temperature
0 13 4 2023 103 7.10
1 12 4 2023 102 5.19
2 11 4 2023 101 8.00
3 10 4 2023 100 7.69
4 9 4 2023 99 9.30
```
#### Data Visualization and Subsetting
A scatter plot of average temperature versus the day of the year shows temperature variation throughout the year. To focus on a stable subset, we examine the distribution of temperatures specifically in July.
```python=+
df.plot('dayofyear','avg_temperature',kind='scatter',alpha=0.1,lw=0,figsize=(10,4))
plt.ylim([-15,35])
plt.xlabel('Day of the Year'), plt.ylabel('Temperature (Celsius)')
plt.title('Average Daily Temperature (1995-2023)')
plt.grid(True)
plt.show()
```

The histogram of July temperatures suggests a distribution that can be approximated by a normal curve.
```python=+
temp7 = df[df['month'] == 7]['avg_temperature']
temp7.hist(bins=40)
plt.show()
```

#### Parameter Estimation and Fit
We estimate the mean and variance of the July temperature data using the sample mean $\hat{\mu}$ and sample variance $\hat{\sigma}^2$:
* Sample mean $\hat{\mu}$ $\approx 18.34$
* Sample variance $\hat{\sigma}^2$ $\approx 4.00$
The estimated normal distribution $N(18.34, 4.00)$ can then be plotted over the histogram for comparison.
```python=+
mu = temp7.mean()
sigma2 = temp7.var()
print('mean =',mu,', variance =',sigma2)
```
```
mean = 18.33942652329749 , variance = 4.003297756855476
```
```python=+
temp7.hist(bins=40,density=True)
x = np.linspace(10,30,200)
y = normal(x,mu,sigma2**.5)
plt.plot(x,y)
plt.show()
```

## Uniform Distribution
The uniform distribution, denoted $U(a,b)$, describes a continuous random variable $X$ where all outcomes are equally likely within a specified interval $[a, b]$.
### Density Function
The probability density function (PDF) for the uniform distribution, with parameters $a < b$, is constant over the interval and zero elsewhere:
$$
f(x) = \left\{ \begin{array}{ccc} \displaystyle \frac{1}{b-a} & , & a \le x \le b \\ 0 & , & \text{otherwise} \end{array} \right.
$$
The height of the rectangle, $\frac{1}{b-a}$, ensures that the total area under the PDF (which is the probability) is equal to 1.
### Mean and Variance
The mean and variance of the uniform distribution are directly determined by the parameters $a$ and $b$:
* **Mean $\mu$**: The average value is the midpoint of the interval. $$\mu = \frac{b+a}{2}$$
* **Variance $\sigma^2$**: $$\sigma^2 = \frac{(b-a)^2}{12}$$
```python=
plt.figure(figsize=(10,4))
x = np.linspace(-2,6,1000)
uniform = lambda x,a,b: 1/(b-a)*(np.heaviside(x-a,1) - np.heaviside(x-b,1))
for a,b in [(0,1),(1,3),(2,5),(-1,4)]:
y = uniform(x,a,b)
plt.plot(x,y)
plt.title('Uniform Distribution $U(a,b)$')
plt.legend(['$a=0,b=1$','$a=1,b=3$','$a=2,b=5$','$a=-1,b=4$'])
plt.grid(True)
plt.show()
```

## Exponential Distribution
The **exponential distribution**, denoted $\text{Exp}(\lambda)$, is typically used to model the waiting time until an event occurs or the lifetime of a device. It is characterized by a single rate parameter, $\lambda$.
### Density Function
The probability density function (PDF) for the exponential distribution is defined for non-negative values ($x \ge 0$):
$$
f(x) = \left\{ \begin{array}{ccc} \lambda e^{- \lambda x} & , & x \ge 0 \\ 0 & , & x < 0 \end{array} \right.
$$
```python=
exponential = lambda x,lam: lam*np.exp(-lam*x)*np.heaviside(x,1)
```
### Mean and Variance
The mean $\mu$ and variance $\sigma^2$ are simple functions of the rate parameter $\lambda$:
* **Mean $\mu$**: The expected waiting time or average value. $$\mu = \frac{1}{\lambda}$$
* **Variance $\sigma^2$**: $$\sigma^2 = \frac{1}{\lambda^2}$$
```python=+
plt.figure(figsize=(10,4))
x = np.linspace(-1,4,1000)
for lam in [.25,.5,1,2]:
y = exponential(x,lam)
plt.plot(x,y)
plt.title('Exponential Distribution $Exp(\lambda)$')
plt.legend(['$\lambda=1/4$','$\lambda=1/2$','$\lambda=1$','$\lambda=2$'])
plt.grid(True)
plt.show()
```

As shown in the plot, increasing the value of $\lambda$ causes the distribution to decay more rapidly and become concentrated closer to the origin.
### Example: Precipitation Data
The exponential distribution is often a good model for the magnitude of natural events, such as rainfall. This example uses daily precipitation data from the Vancouver Airport (1995–2023).
```python=
df = pd.read_csv('precipitation.csv')
df.head()
```
```
day month year dayofyear precipitation
0 13 4 2023 103 0.0
1 12 4 2023 102 0.0
2 11 4 2023 101 6.2
3 10 4 2023 100 0.0
4 9 4 2023 99 9.1
```
#### Fitting the Data
1. **Filter and Shift**: The histogram of daily precipitation data is zero-inflated (many days have $0 \text{mm}$ of rain). To fit the exponential distribution, we focus on days with a minimum of $2 \text{mm}$ of precipitation and then shift the data by subtracting $2 \text{mm}$ to model the excess precipitation.
```python=+
df['precipitation'].hist(bins=np.arange(0,40.5,0.5),density=True)
plt.xlabel('Frequency'), plt.ylabel('Precipitation (mm)')
plt.title('Daily Precipitation (1995-2023)')
plt.grid(True)
plt.show()
```

```python=+
df = df[df['precipitation'] > 2]
df['precipitation'].hist(bins=np.arange(0,40.5,0.5),density=True)
plt.ylabel('Precipitation (mm)'), plt.xlabel('Frequency')
plt.title('Days with Precipitation above 2mm (1995-2023)')
plt.grid(True)
plt.show()
```

```python=+
df['precipitation2'] = df['precipitation'] - 2
df['precipitation2'].hist(bins=np.arange(0,40.5,0.5),density=True)
plt.xlabel('Precipitation in excess of 2mm (mm)'), plt.ylabel('Frequency')
plt.title('Days with Precipitation above 2mm (1995-2023)')
plt.grid(True)
plt.show()
```

2. **Estimate Parameter**: The sample mean $\hat{\mu}$ of the shifted data (precipitation in excess of $2 \text{mm}$) is calculated. This provides an estimate for the exponential parameter $\lambda$: $$\hat{\mu} \approx 8.065 \text{ mm}, \quad \lambda = \frac{1}{\hat{\mu}} \approx 0.124$$
```python=+
mu = df['precipitation2'].mean()
sigma2 = df['precipitation2'].var()
print('mean =',mu,', variance =',sigma2)
```
```
mean = 8.064954486345904 , variance = 71.13641043695192
```
```python=+
lam = 1/mu
print('lambda =',lam)
```
```
lambda = 0.12399326018429686
```
3. **Model Comparison:** The resulting exponential PDF is then plotted against the histogram of the original precipitation data (shifted back by $2 \text{mm}$) to evaluate the goodness of fit.
```python=+
df['precipitation2'].hist(bins=np.arange(0,40.5,0.5),density=True)
x = np.linspace(0,40,200)
y = exponential(x,lam)
plt.plot(x,y)
plt.xlabel('Precipitation in excess of 2mm (mm)'), plt.ylabel('Frequency')
plt.title('Days with Precipitation above 2mm (1995-2023)')
plt.grid(True)
plt.show()
```

```python=+
df['precipitation'].hist(bins=np.arange(0,40.5,0.5),density=True)
x = np.linspace(0,40,400)
y = exponential(x-2,lam)
plt.plot(x,y)
plt.title('Days with Precipitation above 2mm (1995-2023)')
plt.xlabel('Precipitation (mm)'), plt.ylabel('Frequency')
plt.grid(True)
plt.show()
```

## Gamma Distribution
The gamma distribution, denoted $\Gamma(\alpha,\beta)$, is a flexible distribution defined for non-negative values and characterized by two parameters: the shape parameter $\alpha$ and the rate parameter $\beta$. It's commonly used to model non-negative, continuous variables, particularly those that are skewed, such as waiting times or insurance claims.
### Density Function
The probability density function (PDF) for the gamma distribution is:
$$
f(x) = \left\{ \begin{array}{ccc} \displaystyle \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} & , & x \ge 0 \\ 0 & , & x < 0 \end{array} \right.
$$
where $\Gamma(\alpha)$ is the [Gamma function](https://en.wikipedia.org/wiki/Gamma_function) (a generalization of the factorial function to real and complex numbers).
```python=
gamma = lambda x,alpha,beta: beta**alpha/sps.gamma(alpha)*x**(alpha - 1)*np.exp(-beta*x)*np.heaviside(x,1)
```
#### Relationship to Exponential Distribution
Note that the exponential distribution is a special case of the gamma distribution: $\text{Exp}(\lambda) = \Gamma(1,\lambda)$.
### Mean and Variance
The mean $\mu$ and variance $\sigma^2$ are expressed in terms of $\alpha$ and $\beta$:
* **Mean $\mu$**: $$\mu = \frac{\alpha}{\beta}$$
* **Variance $\sigma^2$**: $$\sigma^2 = \frac{\alpha}{\beta^2}$$
#### Estimating Parameters
If you have estimates for the sample mean $\hat{\mu}$ and sample variance $\hat{\sigma}^2$ from observed data, you can solve for the distribution parameters:
$$
\alpha = \frac{\mu^2}{\sigma^2} \quad \text{and} \quad \beta = \frac{\mu}{\sigma^2}
$$
```python=+
plt.figure(figsize=(10,4))
x = np.linspace(-1,8,1000)
for alpha,beta in [(2,1),(3,2),(7,4),(7,2)]:
y = gamma(x,alpha,beta)
plt.plot(x,y)
plt.title('Gamma Distribution $\Gamma(a b)$')
plt.legend(['$a=2,b=1$','$a=3,b=2$','$a=7,b=4$','$a=7,b=2$'])
plt.grid(True)
plt.show()
```

### Example: Wind Speed Distribution
The gamma distribution is often a suitable model for wind speed, which is a non-negative quantity that tends to be positively skewed. This example uses average daily wind speed data from the Vancouver Airport (1995–2023).
```python=
df = pd.read_csv('wind.csv')
df.head()
```
```
day month year dayofyear avg_wind_speed
0 13 4 2023 103 13.0
1 12 4 2023 102 13.0
2 11 4 2023 101 17.5
3 10 4 2023 100 11.5
4 9 4 2023 99 19.5
```
#### Data Visualization
The histogram of the wind speed data shows a distribution concentrated near zero and skewed to the right, which is characteristic of the gamma distribution.
```python=+
df['avg_wind_speed'].hist(bins=range(40),density=True)
plt.ylabel('Frequency'), plt.xlabel('Wind Speed (m/s)')
plt.title('Average Daily Wind Speed (1995-2023)')
plt.grid(True)
plt.show()
```

To fit a gamma distribution, you would calculate the sample mean and sample variance of the wind speed data and use the formulas above to estimate the parameters $\alpha$ and $\beta$.
```python=+
mu = df['avg_wind_speed'].mean()
sigma2 = df['avg_wind_speed'].var()
print('mean =',mu,', variance =',sigma2)
```
```
mean = 13.9063 , variance = 33.23244355435653
```
```python=+
alpha = mu**2/sigma2
beta = mu/sigma2
df['avg_wind_speed'].hist(bins=range(40),density=True)
x = np.linspace(0,40,200)
y = gamma(x,alpha,beta)
plt.plot(x,y)
plt.xlabel('Wind Speed (m/s)'), plt.ylabel('Frequency')
plt.title('Average Daily Wind Speed (1995-2023)')
plt.grid(True)
plt.show()
```
