# Statistics with Meteorological Applications
>[name=Po-Hui Yeh] [time=Apr 13, 2024]
Content
---
[TOC]
## Lecture 1 (02/20): Introduction
- storytelling
- remove trend
- remove seasonal cycle
- remove diurnal cycle
- TREND (linear trend)
- Climatology of Seasonal cycle
- Climatology of Diurnal cycle
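A minimal sketch of these preprocessing steps on a synthetic daily series (the trend and cycle amplitudes are made up for illustration):
```python
import numpy as np

# synthetic 10-year daily series: linear trend + seasonal cycle + noise
rng = np.random.default_rng(0)
t = np.arange(3650)                                   # days
x = 0.001 * t + 5 * np.sin(2 * np.pi * t / 365) + rng.normal(size=3650)

# remove the linear trend (least-squares fit)
a, b = np.polyfit(t, x, 1)
x_detrended = x - (a * t + b)

# remove the seasonal cycle: climatology = multi-year mean for each day of year
doy = t % 365
clim = np.array([x_detrended[doy == d].mean() for d in range(365)])
anomaly = x_detrended - clim[doy]
```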
### Statistics
- Measures of central tendency
    - **3M**: Mean, Median, Mode
> **Mean**$$
> \bar{x}=\sum^n_{i=1}\frac{x_i}{n}
> $$**Median**: the middle number of the sorted data
> **Mode**: the value that appears most often
- Measures of relative location
    - Percentile
- Measures of variability (dispersion)
    - Range, Variance, STD
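These summary statistics map directly onto NumPy/SciPy calls; a quick sketch:
```python
import numpy as np
from scipy import stats

data = np.array([2, 3, 3, 5, 7, 9, 3])
print(np.mean(data))                # mean
print(np.median(data))              # median: middle number of the sorted data
print(stats.mode(data).mode)        # mode: most frequent value -> 3
print(np.ptp(data))                 # range (max - min)
print(np.var(data, ddof=1), np.std(data, ddof=1))  # sample variance and STD
```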
## Lecture 2 (02/27): Population & Sample
> Population: the whole group sharing a common characteristic
> Sample: individuals extracted from the population
> Sampling: the method of extraction
### Sampling Method
> Before any calculation, we should ask: where does the data come from?
> A sample is not the population, so an uncertainty estimate should be provided
#### Simple Random Sampling
- The basic method; every element is selected independently (no relation between the selected elements)
- For a population of size $N$, the selection probability of each element is $\frac{1}{N}$
- For an infinite population, selection follows the probability distribution of the population
#### Systematic Sampling
- Separate the population into a few groups and select following the same rule
1. Cyclic selection (draw a random number first, then group)
2. Divisible selection (group first, then select individuals from each group)
#### 分層抽樣(Stratified sampling)
For population with strong difference. Averaging for each seperated level first than conducting whole averaging.
#### Cluster Sampling
Separate the population into several groups (clusters) and select some of them to investigate. (The variability within the selected clusters should represent the population.)
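A toy illustration of three of the schemes above with NumPy (the population and group sizes are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(1000)          # toy population, N = 1000

# simple random sampling: each element equally likely, drawn without replacement
srs = rng.choice(population, size=50, replace=False)

# systematic sampling: random start, then every k-th element
k = len(population) // 50
systematic = population[rng.integers(k)::k]

# stratified sampling: sample within each stratum, then combine
strata = np.array_split(population, 5)
stratified = np.concatenate([rng.choice(s, size=10, replace=False) for s in strata])
```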
## Lecture 3 (02/27): Expectation & Variance
### Basic concepts
$$
\begin{align}
\mu &\equiv E[X]=\sum_{i\in S}x_i\text{Pr}(x_i)\\
\sigma^2 &\equiv Var[X]\equiv E[(X-\mu)^2]=\sum_{i\in S}(x_i-\mu)^2\text{Pr}(x_i)\\
&= E[X^2]-(E[X])^2
\end{align}
$$
> $$
> \begin{align}
> E[aX+b]&=aE[X]+b\\
> E[X+Y]&=E[X]+E[Y]\\
> E[XY]&=E[X]E[Y]\text{, for independent }X,Y\text{ only}\\
> Var[aX+b]&=a^2Var[X]\\
> Var[X+Y]&=Var[X]+Var[Y]\text{, for independent }X,Y\text{ only}
> \end{align}
> $$
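These identities are easy to check by simulation; a quick sketch with two independent normal variables (all parameters arbitrary):
```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(2.0, 3.0, size=1_000_000)
Y = rng.normal(-1.0, 2.0, size=1_000_000)   # independent of X
a, b = 2.0, 5.0

print((a * X + b).mean(), a * X.mean() + b)   # E[aX+b] = aE[X]+b
print((X * Y).mean(), X.mean() * Y.mean())    # E[XY] = E[X]E[Y] (independence)
print((a * X + b).var(), a**2 * X.var())      # Var[aX+b] = a^2 Var[X]
print((X + Y).var(), X.var() + Y.var())       # Var[X+Y] = Var[X]+Var[Y] (independence)
```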
### Sample mean
| Type | Population (fixed) | Sample (random) |
|:--------:| :-------------------------------------------- | :-------------------------------------------------------- |
| Mean | $\mu=\sum^N_{i=1}\frac{x_i}{N}$ | $\bar{x} = \sum^n_{i=1}\frac{x_i}{n}$ |
| Variance | $\sigma^2=\sum^N_{i=1}\frac{(x_i-\mu)^2}{N}$ | $s^2 = \sum^n_{i=1}\frac{(x_i-\bar{x})^2}{n-\textbf{1}}$ |
> **Why $n-1$?**
> Dividing by $n$ underestimates the population variance, because deviations are measured from $\bar{x}$ instead of $\mu$; dividing by $n-1$ makes $s^2$ unbiased.
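The underestimation shows up directly in simulation: draw many small samples from a population with known variance and average the two estimators (a sketch with made-up parameters):
```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 4.0                                                   # true population variance
samples = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, 5))  # many samples, n = 5

print(samples.var(axis=1, ddof=0).mean())   # divide by n:   ~3.2, biased low
print(samples.var(axis=1, ddof=1).mean())   # divide by n-1: ~4.0, unbiased
```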
## Lecture 4 (03/05): Sampling
| Type | Sample (random) | Sampling distribution of $\bar{x}$ (const.) |
| :--------: | --------------------------------------------------- | ------------------------------------------------------- |
| Mean | $\bar{x} =\frac{1}{n}\sum^{n}_{i=1}X_i$ | $E[\bar{x}] =\mu_{\bar{x}}=\mu$ |
| Variance | $s^2 = \frac{1}{n-1}\sum^{n}_{i=1}(X_i -\bar{x})^2$ | $Var[\bar{x}] = \sigma_{\bar{x}}^2= \frac{\sigma^2}{n}$ |
| STD | | $\sigma_{\bar{x}}= \frac{\sigma}{\sqrt{n}}$ |
> $\sigma_{\bar{x}}$ is also the **standard error** of the sample mean
> [Experiment](http://onlinestatbook.com/stat_sim/sampling_dist/index.html)
### Probability Theory
#### PDF
Describes the characteristics (shape) of a probability distribution.
#### Normal distribution (ND)
Most values lie close to the center.
Defined by the mean and STD:
$$
\begin{align}
&f(x\mid \mu,\sigma) = \frac{e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}}{\sigma\sqrt{2\pi}}, x\in (-\infty,\infty)\\
&X\sim \text{ND}(\mu,\sigma)\\
\end{align}
$$
#### Standard normal distribution
Standardize the variable and use a table (or software) to find the probability.
$$
Z = \frac{X-\mu}{\sigma}\sim \text{ND}(0,1)
$$
> Z-Table: left area $P(Z\leq z)$
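In practice `scipy.stats.norm` replaces the printed Z-table:
```python
from scipy.stats import norm

print(norm.cdf(1.96))    # left-tail area P(Z <= 1.96) ~ 0.975
print(norm.ppf(0.975))   # inverse lookup: the z with left area 0.975 -> ~1.96
```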
#### Sampling from normal distribution
$$
\bar{x}\sim \text{ND}\left(\mu,\frac{\sigma}{\sqrt{n}}\right)
$$
### Central Limit Theorem
The sample mean is approximately ND (no matter how the population is distributed) when the sample size is large enough ($n\geq 25$).
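A quick simulation of the theorem, using a strongly skewed (exponential) population; the sample means still come out approximately normal:
```python
import numpy as np

rng = np.random.default_rng(3)
# exponential population: mean = 1, std = 1, far from normal
means = rng.exponential(scale=1.0, size=(100_000, 30)).mean(axis=1)

print(means.mean())  # ~1.0, the population mean
print(means.std())   # ~1/sqrt(30) ~ 0.18, i.e. sigma/sqrt(n)
# a histogram of `means` would look bell-shaped despite the skewed population
```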
#### Distribution of the Sample Variance from an ND population
With $\text{DOF}=n-1$:
$$
\chi^2_{n-1} = \frac{\sum^{n}_{i=1}(x_i -\bar{x})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2}
$$
When the sample size is large, the distribution approaches the ND.
### Standard Error (std of the sample mean)
> std describes how **dispersed** the data are
> standard error describes the **error of the estimate**
>
$$
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
$$
- Larger sample size → smaller standard error
- The standard error reflects sampling fluctuation
## Lecture 5 (03/12): Estimation
### Estimate population mean with known population variance
**Point estimation:** a single value used to estimate a population parameter
**Interval estimation:** use the point estimate together with the sampling distribution to construct an interval for the population parameter
> How to construct a 95% confidence interval
> 1. Calculate the std of the sample mean: $\sigma_{\overline{X}}=\frac{\sigma}{\sqrt{n}}$
> 2. Derive the interval around $\overline{X}$ (each half-width is about two times $\frac{\sigma}{\sqrt{n}}$, precisely $1.96\,\sigma_{\overline{X}}$)
#### Summary
- $a<\mu <b$, where $a \And b$ are <font color="#f00">random</font> and determined by the sampling distribution of $\bar{x}$.
- As the sample size $n$ increases, $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ becomes smaller, so the interval shrinks and the estimate gets closer to the population $\mu$.
$$
P(a<\mu <b) = 1 - \alpha
$$
> Confidence level:$1-\alpha$
#### Confidence interval (CI)
$$
(\overline{X}- z_{\frac{\alpha}{2}}\times \frac{\sigma}{\sqrt{n}},\ \overline{X}+ z_{\frac{\alpha}{2}}\times \frac{\sigma}{\sqrt{n}})
$$
> Confidence interval = point estimation $\pm$ margin of error
The confidence level is not the probability that the parameter lies in one particular interval. It is the percentage of intervals, constructed this way over repeated sampling, that cover the population mean.
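A sketch of the known-$\sigma$ interval with hypothetical measurements (the data and $\sigma$ are made up):
```python
import numpy as np
from scipy.stats import norm

x = np.array([24.1, 25.3, 23.8, 24.9, 25.6, 24.4])   # hypothetical sample
sigma = 0.8                                           # population std, assumed known
alpha = 0.05

z = norm.ppf(1 - alpha / 2)                           # z_{alpha/2} ~ 1.96
margin = z * sigma / np.sqrt(len(x))                  # margin of error
print(x.mean() - margin, x.mean() + margin)           # 95% CI for mu
```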
### Estimating the population mean with unknown population variance
Replace the population std ($\sigma$) with $s$ (the sample std)?
#### Student's t-distribution
With degree of freedom $n-1$.
Symmetric, centered at 0, bell-shaped.
As $n$ increases, it approaches the ND.
$$
(\overline{X}- t_{\frac{\alpha}{2}}\times \frac{s}{\sqrt{n}},\ \overline{X}+ t_{\frac{\alpha}{2}}\times \frac{s}{\sqrt{n}})
$$
- Use the t-distribution only when the population variance is unknown
- When $n>30$, some use the standard normal distribution instead (the two become nearly identical), but ==the t-distribution is more accurate==
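The same interval with $\sigma$ unknown, using the t-distribution (same hypothetical data as above):
```python
import numpy as np
from scipy.stats import t

x = np.array([24.1, 25.3, 23.8, 24.9, 25.6, 24.4])
ci = t.interval(0.95, df=len(x) - 1,
                loc=x.mean(), scale=x.std(ddof=1) / np.sqrt(len(x)))
print(ci)   # wider here, since s replaces the known sigma
```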
## Lecture 6,7,8 (03/19,03/26,04/02): Hypothesis Test (I,II,III)
### hypothesis testing
1. Null Hypothesis ($H_0$): assume it is true. If the collected data would be nearly impossible when $H_0$ is true, reject the assumption.
2. Alternative Hypothesis ($H_a$ or $H_1$): usually what we are trying to prove. Significant evidence is needed to establish it.
> Not rejecting $H_0$ does not mean it is true (only that the evidence is insufficient)
> A rejection means the observed evidence is significant, which rejects $H_0$
### test of significance
> Step 1: Establish the null hypothesis
> Step 2: Calculate the test statistic
> $$
> Z =\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}
> $$
> Step 3: Reject or not, by one of:
> 1. interval $|\bar{x}-\mu|< d$
> 2. critical value $[\mu-d,\mu+d]$
> 3. calculate the p-value (reject when $p<\alpha$)
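A worked one-sample z-test following these steps (the data, $\mu_0$, and $\sigma$ are hypothetical):
```python
import numpy as np
from scipy.stats import norm

mu0, sigma = 20.0, 2.0          # H0: mu = 20, population std assumed known
x = np.array([21.2, 20.8, 22.1, 21.5, 20.9, 21.8, 21.1, 20.6])

z = (x.mean() - mu0) / (sigma / np.sqrt(len(x)))   # Step 2
p = 2 * (1 - norm.cdf(abs(z)))                     # two-tailed p-value
print(z, p, p < 0.05)                              # Step 3: reject when p < alpha
```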
#### Type I Error & P-value
| Decision | $H_0$ is True | $H_0$ is False |
| ---------------- | ------------ | ------------- |
| Not reject $H_0$ | no error | **Type II error** |
| Reject $H_0$ | **Type I error** ($p<\alpha$) | no error |
##### Significance Level ($\alpha$)
> the probability of making a Type I error
> Confidence level: $1-\alpha$
>
$$
P(\text{Type I error}) = P(\text{reject }H_0|H_0\text{ is True})
$$
##### p-value
* The probability of obtaining data at least as extreme as observed, **assuming $H_0$ is true**. (It is not the probability that $H_0$ is true or false.)
* Provides the confidence behind a rejection.
* The significance level actually observed in the test.
* The more credible $H_0$ is, the smaller the p-value we require to reject it.
* ==The p-value is the area under the curve:==
> sum the tail area under the curve with `t.cdf()` or `norm.cdf()`
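For example, the tail areas behind a p-value (the z and t values here are arbitrary):
```python
from scipy.stats import norm, t

z_obs = 2.1
print(1 - norm.cdf(z_obs))          # one-tailed (right): area beyond z_obs
print(2 * (1 - norm.cdf(z_obs)))    # two-tailed: both tails

t_obs, dof = 2.1, 9
print(2 * (1 - t.cdf(t_obs, dof)))  # same idea with t.cdf() and n-1 dof
```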
#### Two-Tailed Test vs One-Tailed Test
Two-tailed: deviation in either direction from the center.
One-tailed: smaller (left) or larger (right) only.
> For the same test statistic, the two-tailed p-value is twice the one-tailed one, so a two-tailed test requires more extreme evidence.
### For TWO or more populations
Difference in **means**:
$$\mu_1-\mu_2\text{ or }\overline{x_1}-\overline{x_2}$$
and its Standard Error (SE).
#### ND, large sample ($>30$)
\begin{align}
E(\overline{X}-\overline{Y})=\mu_{\overline{X}-\overline{Y}} = \mu_{\overline{X}}-\mu_{\overline{Y}} = \mu_X-\mu_Y\\
Var(\overline{X}-\overline{Y})=\sigma^2_{\overline{X}-\overline{Y}} = \sigma^{2}_{\overline{X}}+\sigma^2_{\overline{Y}} = \frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}\\
(\overline{X}-\overline{Y})\sim ND\left(\mu_X-\mu_Y,\sqrt{\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}}\right)
\end{align}
#### ND, small sample ($<30$, t-distribution)
1. population is ND, known population variance:
$$(\overline{X}-\overline{Y})\sim ND(\mu_X-\mu_Y,\sqrt{\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}} )$$
2. Unknown population variance
* same variance: pooled sample variance (dof = $n+m-2$)
$$\begin{gather}
S_p^2=\frac{(n-1)s_X^2+(m-1)s_Y^2}{n+m-2}\\
s_X^2=\frac{\sum_{i=1}^n(X_i-\overline{X})^2}{n-1}\land s_Y^2=\frac{\sum_{i=1}^m(Y_i-\overline{Y})^2}{m-1}\\
\frac{\overline{X}-\overline{Y}}{\sqrt{S_p^2\left(n^{-1}+m^{-1}\right)}}\sim t(n+m-2)
\end{gather}$$
* different variances: *Welch's t-test*
$$\begin{gather}
S=\sqrt{\frac{{S_1}^2}{n_1}+\frac{{S_2}^2}{n_2}},\ DOF=\frac{\left(\frac{{S_1}^2}{n_1}+\frac{{S_2}^2}{n_2}\right)^2}{\frac{\left({S_1}^2/n_1\right)^2}{n_1-1}+\frac{\left({S_2}^2/n_2\right)^2}{n_2-1}}\\
\frac{\overline{X}-\overline{Y}}{\sqrt{\frac{{S_1}^2}{n_1}+\frac{{S_2}^2}{n_2}}}\sim t\left(DOF\right)
\end{gather}$$
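Both two-sample variants are available in `scipy.stats.ttest_ind`; a sketch with synthetic station temperatures (all numbers made up):
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(15.0, 2.0, size=20)   # hypothetical station A temperatures
y = rng.normal(16.5, 3.5, size=25)   # hypothetical station B temperatures

# pooled-variance t-test (assumes equal population variances, dof = n+m-2)
print(stats.ttest_ind(x, y, equal_var=True))
# Welch's t-test (unequal variances), matching the DOF formula above
print(stats.ttest_ind(x, y, equal_var=False))
```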
> **Example**
> How do we determine the temperature difference between two cycles?
> If the difference is small, we can **remove the seasonal cycle**. If the **correlation** is established, it is due to the seasonal effect.
### Paired and unpaired test
1. Independent (unpaired) samples t-test: tests for a significant difference between two independent groups
2. Paired samples t-test: tests the mean of the within-pair differences (see the sketch below)
In the atmospheric sciences, beware of **auto-correlation** (samples are often not independent).
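A toy comparison of the two tests on paired data (the shift and noise levels are made up); the paired test exploits the shared station effect:
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
before = rng.normal(25.0, 1.0, size=12)          # e.g. same stations, period 1
after = before + rng.normal(0.4, 0.5, size=12)   # period 2, paired with period 1

print(stats.ttest_rel(before, after))   # paired: tests the mean of the differences
print(stats.ttest_ind(before, after))   # unpaired: ignores the pairing, less power
```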
### Testing two population variances (F-test)
#### Chi-square distribution
With $n-1$ degrees of freedom (the distribution of the sample variance).
Closer to the ND for larger $n$ (sample size).
\begin{align}
\chi^2_{n-1} = \frac{\sum_{i=1}^n(x_i-\bar{x})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2}
\end{align}
> $s^2$: sample variance; $\sigma^2$: population variance
* If $\alpha=0.05$: reject $H_0$ for $\chi^2>16.9$ (one-tailed), or for $\chi^2\notin[16.1,45.7]$ (two-tailed)
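The critical values can be reproduced with `scipy.stats.chi2`; assuming $n=30$ (dof $=29$), the two-tailed interval comes out near the one quoted above:
```python
from scipy.stats import chi2

n = 30                                   # assumed sample size, dof = n - 1 = 29
print(chi2.ppf(0.95, n - 1))             # one-tailed critical value
print(chi2.ppf([0.025, 0.975], n - 1))   # two-tailed interval ~ [16.05, 45.72]

# the test statistic itself, for a hypothetical s^2 under H0: sigma^2 = 4
s2, sigma2_0 = 6.1, 4.0
print((n - 1) * s2 / sigma2_0)
```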
#### F distribution for two population
\begin{align}
F = \frac{\chi_1^2/(n-1)}{\chi_2^2/(m-1)} = \frac{s_1^2\sigma_2^2}{s_2^2\sigma_1^2}
\end{align}
> F distribution with $n-1$, $m-1$ dof
Note: if the degrees of freedom are swapped, the distribution is different.
> $$
> F_{(1-\alpha),df_1,df_2} = \left(F_{\alpha,df_2,df_1}\right)^{-1}
> $$
> Some textbooks put the smaller sample variance in the denominator (as population 2) so that the F value is always larger than 1
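A sketch of the variance F-test on synthetic data, computed directly from the F distribution in `scipy.stats`:
```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(6)
x = rng.normal(0.0, 2.0, size=15)
y = rng.normal(0.0, 2.5, size=12)

F = x.var(ddof=1) / y.var(ddof=1)        # H0: sigma_1^2 = sigma_2^2
dof1, dof2 = len(x) - 1, len(y) - 1
p = 2 * min(f.cdf(F, dof1, dof2), f.sf(F, dof1, dof2))   # two-tailed p-value
print(F, p)
```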
## Lecture 9 (04/02,07) Regression
### Correlation & Covariance
#### Covariance
\begin{align}
\text{Var}(X) &= \frac{\sum_{i=1}^N(x_i-\bar{x})^2}{N}\\
\text{Cov}(X,Y) = \sigma_{xy} &= \frac{\sum_{i=1}^N(x_i-\bar{x})(y_i-\bar{y})}{N}
\end{align}
> sign represents tendency:
> $+$ quadrant 1,3
> $-$ quadrant 2,4
>
* the joint variation of two variables
* variance is a special case (when the two variables are the same)
* it **CANNOT** guarantee a causal relationship
#### Correlation
> measures ONLY linear dependence
##### Correlation Coefficient
- normalizes the covariance by the variability of each variable
$$
r\equiv {\sigma_{xy}\over\sigma_x\sigma_y}
$$
> $r \in [-1,1]$
> $>0$: positive correlation; $<0$: negative correlation; $=0$: no linear correlation
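NumPy computes both directly; a sketch with a synthetic linear relationship:
```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(scale=0.8, size=200)   # linear relation plus noise

print(np.cov(x, y)[0, 1])        # covariance sigma_xy
print(np.corrcoef(x, y)[0, 1])   # correlation coefficient r, in [-1, 1]
```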
### Regression Analysis
Linear Regression: predict or explain one variable using others
#### Linear Estimation
$$
y=ax+b
$$
> y: predictand, dependent variable, response variable
> x: predictor, independent variable, explanatory variable
- Simple vs. multiple linear regression depends on the number of independent variables
- Error / residual: $\epsilon_i=y_i-(ax_i+b)$
#### Least square approx.
> the best result is unique
>
$$
\text{Min }S_r = \sum^n_{i=1}\epsilon^2_i=\sum^n_{i=1} (y_i-ax_i-b)^2
$$
$a={n\sum x_iy_i-\sum x_i\sum y_i\over{ n\sum x_i^2}-(\sum x_i)^2}$, $b=\bar{y}-a\bar{x}$
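The closed-form slope and intercept, checked against `np.polyfit` (toy data):
```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n = len(x)
a = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b = y.mean() - a * x.mean()
print(a, b)                    # same as np.polyfit(x, y, 1)
print(np.polyfit(x, y, 1))
```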
#### Multiple linear regression
> for more than one independent variable.
$$
\text{Var}(\hat{\beta}_1) = {\text{spread of residuals}\over{\text{spread of }x}}={\sigma^2\over{\sum(x_i-\bar{x})^2}} = {{SSE\over{n-2}}\over{\sum(x_i-\bar{x})^2}}
$$
> use SSE for the estimate because $\sigma$ is unknown
Overall quality: measured by $R^2$
Significance of regression coefficients: [Lecture 10]
#### Model Validation (SSE, SST, $R^2$)
$$
SST = SSR+SSE
$$
> SST: total sum of squares
> SSR: regression sum of squares
> SSE: residual sum of squares $=\sum_{i=1}^n(y_i-\widehat{y_i})^2$
>
##### Coefficient of determination ($R^2$)
> the percentage of variation in y that is reduced (explained) by using x
> Note: this implies **NO** causation between x and y
>
>
$$
R^2 = {SSR\over SST} = 1- {SSE\over SST}
$$
> $R^2=1$: perfect fit
> $R^2=r^2$ when there is only one independent variable
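Continuing the toy fit above, the decomposition and $R^2$:
```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

a, b = np.polyfit(x, y, 1)
y_hat = a * x + b
sse = np.sum((y - y_hat) ** 2)           # residual sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)    # regression sum of squares
sst = np.sum((y - y.mean()) ** 2)        # total sum of squares = SSR + SSE

print(1 - sse / sst, ssr / sst)          # R^2, two equivalent forms
print(np.corrcoef(x, y)[0, 1] ** 2)      # equals r^2 for one predictor
```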
## Lecture 10 (04/07) Significance test for regression coefficients
### Inference in Linear Regression
population: $y=\beta_0+\beta_1x$
estimate: $\widehat{y}=\widehat{\beta_0}+\widehat{\beta_1}x$
> The means of the sampling distributions of $\widehat{\beta_0}$, $\widehat{\beta_1}$ are $\beta_0$, $\beta_1$
> The standard errors of $\widehat{\beta_0}$, $\widehat{\beta_1}$ are related to SSE
Estimated variance (with $s^2 = \frac{SSE}{n-2}$):
\begin{align}
\text{Var}(\hat{\beta_1}) &= {{SSE\over{n-2}}\over{\sum(x_i-\bar{x})^2}}\\
SE_{\hat{\beta_1}}&=\frac{s}{\sqrt{s_{xx}}}=\frac{s}{\sqrt{\sum(x_i-\bar{x})^2}}
\end{align}
#### Hypothesis testing for regression model coefficients
Null ($H_0$): There is **no** relationship between X and Y
Alternative ($H_1$): There is **some** relationship between X and Y
> $H_0$: $\beta_1 =0$; $H_1$: $\beta_1 \neq 0$
How far from zero must $\widehat{\beta_1}$ be to prove $\beta_1 \neq 0$? It depends on the **SE** (a larger SE demands a larger difference).
> **t-statistic**: $t={\hat{\beta_1}-0\over{SE(\hat{\beta_1})}}$
Assuming $\beta_1 =0$, the **p-value** can be calculated from the t-value (a smaller p-value indicates an association exists), as in the sketch below.
Thus, for many variables, we can use a **correlation matrix**.
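`scipy.stats.linregress` returns the slope, its standard error, and this two-tailed p-value in one call (toy data):
```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.9, 4.2, 4.8, 6.1, 6.4, 7.9, 8.3])

res = stats.linregress(x, y)
print(res.slope, res.stderr)   # beta_1_hat and SE(beta_1_hat)
print(res.pvalue)              # two-tailed p-value for H0: beta_1 = 0
```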
### Practice of regression map and correlation map
regression map: plot $\beta_1$ at each grid point
correlation map: plot $r$ at each grid point
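A minimal sketch of both maps, regressing a hypothetical (time, lat, lon) field onto a climate index (all data synthetic; the shapes are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(8)
index = rng.normal(size=100)                          # climate index, 100 time steps
field = 0.5 * index[:, None, None] + rng.normal(size=(100, 20, 30))  # (time, lat, lon)

xa = index - index.mean()                             # index anomalies
ya = field - field.mean(axis=0)                       # field anomalies at each point

beta1 = (xa[:, None, None] * ya).sum(axis=0) / (xa ** 2).sum()           # regression map
r = (xa[:, None, None] * ya).mean(axis=0) / (xa.std() * ya.std(axis=0))  # correlation map
print(beta1.shape, r.shape)                           # both (20, 30)
```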