# Statistics with Meteorological Applications

> [name=Po-Hui Yeh] [time=Apr 13, 2024] Content

---

[TOC]

## Lecture 1 (02/20): Introduction

- story telling
- remove trend
- remove seasonal cycle
- remove diurnal cycle
- TREND (linear trend)
- Climatology of seasonal cycle
- Climatology of diurnal cycle

### Statistics

- 集中統計量 (measures of central tendency)
    - **3M**: Mean, Median, Mode
    > **Mean**
    > $$
    > \bar{x}=\sum^n_{i=1}\frac{x_i}{n}
    > $$
    > **Median**: the middle number
    > **Mode**: the number that appears most often
- 位置統計量 (measures of relative location)
    - Percentile
- 變異統計量 (measures of dispersion)
    - Range, Variance, STD

## Lecture 2 (02/27): Population & Sample

> Population: the whole group sharing a common characteristic
> Sample: individuals extracted from the population
> Sampling: the method of extraction

### Sampling Methods

> Before any calculation, we should ask: where does the data come from?
> A sample is not equal to the population, so uncertainty should be provided.

#### 簡單隨機抽樣 (Simple Random Sampling)
- The basic method; every selection is independent (no relation between selected elements)
- For a population of size $N$, the probability of selection is $\frac{1}{N}$
- For an infinite population, the probability of selection follows the probability distribution of the population

#### 系統抽樣 (Systematic Sampling)
- Separate the population into a few groups and select according to the same rule:
1. 循環抽取法 (cycling: draw a number first, then group)
2. 整除選取法 (group first, then select individuals from each group)

#### 分層抽樣 (Stratified Sampling)
For a population with strong internal differences: average within each stratum first, then combine into the overall average.

#### 整群/部落抽樣 (Cluster Sampling)
Separate the population into a few random groups and select some of them to investigate. (The variation within the selected groups should represent the population.)

## Lecture 3 (02/27): Expectation & Variance

### Basic concepts

$$
\begin{align}
\mu &\equiv E[X]=\sum_{i\in S}x_i\,\text{Pr}(x_i)\\
\sigma^2 &\equiv Var[X]\equiv E[(X-\mu)^2]=\sum_{i\in S}(x_i-\mu)^2\,\text{Pr}(x_i)\\
&= E[X^2]-(E[X])^2
\end{align}
$$

> $$
> \begin{align}
> E[aX+b]&=aE[X]+b\\
> E[X+Y]&=E[X]+E[Y]\\
> E[XY]&=E[X]E[Y]\text{, for independent $X$, $Y$ only}\\
> Var[aX+b]&=a^2Var[X]\\
> Var[X+Y]&=Var[X]+Var[Y]\text{, for independent $X$, $Y$ only}
> \end{align}
> $$

### Sample mean

| Type | Population (fixed) | Sample (random) |
|:--------:| :-------------------------------------------- | :-------------------------------------------------------- |
| Mean | $\mu=\sum^N_{i=1}\frac{x_i}{N}$ | $\bar{x} = \sum^n_{i=1}\frac{x_i}{n}$ |
| Variance | $\sigma^2=\sum^N_{i=1}\frac{(x_i-\mu)^2}{N}$ | $s^2 = \sum^n_{i=1}\frac{(x_i-\bar{x})^2}{n-\textbf{1}}$ |

> **Why $n-1$?**
> Dividing by $n$ would systematically underestimate the population variance, because the deviations are measured from $\bar{x}$ instead of $\mu$; dividing by $n-1$ corrects this bias.

## Lecture 4 (03/05): Sampling

| Type | Sample (random) | Sample Mean (constant) |
| :--------: | --------------------------------------------------- | ------------------------------------------------------- |
| Mean | $\bar{x} =\frac{1}{n}\sum^{n}_{i=1}X_i$ | $E[\bar{x}] =\mu_{\bar{x}}=\mu$ |
| Variance | $s^2 = \frac{1}{n-1}\sum^{n}_{i=1}(X_i -\bar{x})^2$ | $Var[\bar{x}] = \sigma_{\bar{x}}^2= \frac{\sigma^2}{n}$ |
| STD | | $\sigma_{\bar{x}}= \frac{\sigma}{\sqrt{n}}$ |

> $\sigma_{\bar{x}}$ is also called the **standard error** of the sample mean
> [Experiment](http://onlinestatbook.com/stat_sim/sampling_dist/index.html)
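A minimal simulation of the experiment linked above (a sketch assuming NumPy): draw many samples from a skewed population and check that the spread of the sample means matches $\sigma/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_trials = 25, 100_000

# Skewed (exponential) population with mean = 1 and std = 1
samples = rng.exponential(scale=1.0, size=(n_trials, n))
sample_means = samples.mean(axis=1)

print(sample_means.std())  # empirical std of the sample mean
print(1.0 / np.sqrt(n))    # theoretical standard error sigma/sqrt(n) = 0.2
```

The histogram of `sample_means` is also close to a normal distribution even though the population is skewed, previewing the Central Limit Theorem below.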
### Probability Theory

#### PDF
Describes the characteristics of a probability distribution.

#### Normal distribution (ND)
Most values are close to the center. With mean and STD:
$$
\begin{align}
&f(x\mid \mu,\sigma) = \frac{e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}}{\sigma\sqrt{2\pi}}, x\in (-\infty,\infty)\\
&X\sim \text{ND}(\mu,\sigma)\\
\end{align}
$$

#### Standard normal distribution
Standardize the variable and use a table to find the probability.
$$
Z = \frac{X-\mu}{\sigma}\sim \text{ND}(0,1)
$$
> Z-table: left-tail area $P(Z\leq z)$

#### Sampling from a normal distribution
$$
\bar{x}\sim \text{ND}\left(\mu,\frac{\sigma}{\sqrt{n}}\right)
$$

### Central Limit Theorem
The sample mean will be approximately ND (no matter how the population is distributed) when the sample size is large enough ($n\geq 25$).

#### Distribution of the sample variance from an ND population
With $\text{DOF}=n-1$:
$$
\chi^2_{n-1} = \frac{\sum^{n}_{i=1}(x_i -\bar{x})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2}
$$
When the sample size is large, the distribution approaches ND.

### Mean and Variance of the Sample Mean

#### Standard Error (std of the sample mean)
> std measures how **spread out** the data are
> standard error measures the **error of the estimate**
> $$
> \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
> $$
- The larger the sample size, the smaller the standard error
- Standard error reflects sampling fluctuation

## Lecture 5 (03/12): Estimation

### Estimating the population mean with known population variance
**Point estimation:** a single value used to estimate a population parameter
**Interval estimation:** use the point estimate and the properties of the sampling distribution to construct an interval for the population parameter

> How to construct a 95% confidence interval:
> 1. Calculate the std of the sample mean: $\sigma_{\overline{X}}=\frac{\sigma}{\sqrt{n}}$
> 2. Derive the interval around the sample mean (each leg is about two times $\frac{\sigma}{\sqrt{n}}$, since $z_{0.025}\approx 1.96$)

#### Summary
- $a<\mu <b$, where $a$ and $b$ are <font color="#f00">random</font> and determined by the sampling distribution of $\bar{x}$.
- If the sample size $n$ increases, $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ becomes smaller, so the estimate gets closer to the population $\mu$ (a narrower interval).

$$
P(a<\mu <b) = 1 - \alpha
$$
> Confidence level: $1-\alpha$

#### Confidence interval (CI)
$$
(\overline{X}- z_{\frac{\alpha}{2}}\times \frac{\sigma}{\sqrt{n}},\ \overline{X}+ z_{\frac{\alpha}{2}}\times \frac{\sigma}{\sqrt{n}})
$$
> Confidence interval = point estimate $\pm$ margin of error

The confidence level is **not** the probability that the parameter lies in one particular interval; it is the percentage of intervals, constructed this way over repeated sampling, that cover the population mean.

### Estimating the population mean with unknown population variance
Replace the population std ($\sigma$) with the sample std ($s$)?

#### Student's t-distribution
With degrees of freedom $n-1$. Symmetric, centered at 0, bell-shaped; as $n$ grows it approaches the ND.
$$
(\overline{X}- t_{\frac{\alpha}{2}}\times \frac{s}{\sqrt{n}},\ \overline{X}+ t_{\frac{\alpha}{2}}\times \frac{s}{\sqrt{n}})
$$
- Use the t-distribution only when the population variance is unknown
- When $n>30$, some use the standard normal distribution instead ($\because$ the t-distribution is very close to it by then), but ==the t-distribution is more accurate==

:::warning
![Pasted Graphic](https://hackmd.io/_uploads/H1D0N_66p.jpg =80%x)
:::

## Lecture 6, 7, 8 (03/19, 03/26, 04/02): Hypothesis Test (I, II, III)

### Hypothesis testing
1. Null Hypothesis ($H_0$): assume it is true; if the collected data would be nearly impossible under $H_0$, reject the assumption.
2. Alternative Hypothesis ($H_a$ or $H_1$): usually what we are trying to prove; significant evidence is needed to establish it.

> Failing to reject $H_0$ does not mean it is true (only that the evidence is insufficient)
> Rejecting means the observed evidence is significant, which is what lets us reject $H_0$

### Test of significance
> Step 1: Establish the null hypothesis
> Step 2: Calculate
> $$
> Z =\frac{\bar{x}-\mu}{\sigma_{\text{sample}}/\sqrt{N}}
> $$
> Step 3: Reject or not, by one of:
> 1. interval: $|\bar{x}-\mu|< d$
> 2. critical values: $[\mu-d,\mu+d]$
> 3. p-value (reject when $p<\alpha$)
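A minimal sketch of these three steps in Python (the data are synthetic, and `mu0` is an ad-hoc name for the hypothesized mean):

```python
import numpy as np
from scipy import stats

# Synthetic sample: 30 daily temperature anomalies (degC)
rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=2.0, size=30)

mu0 = 0.0  # Step 1 -- H0: the population mean is 0
z = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(x.size))  # Step 2

# Step 3 -- two-tailed p-value from the standard normal CDF
p = 2 * stats.norm.cdf(-abs(z))
print(f"z = {z:.2f}, p = {p:.4f}")  # reject H0 when p < alpha (e.g. 0.05)
```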
#### Type I Error & p-value

| Decision | $H_0$ is True | $H_0$ is False |
| ---------------- | ------------ | ------------- |
| Not reject $H_0$ | no error | **Type II error** |
| Reject $H_0$ | **Type I error** ($p<\alpha$) | no error |

##### Significance Level ($\alpha$)
> the rate at which we make a Type I error
> Confidence level: CI $=1-\alpha$
> $$
> P(\text{Type I error}) = P(\text{reject }H_0\mid H_0\text{ is True})
> $$

##### p-value
* The probability of obtaining data at least as extreme as observed, **assuming $H_0$ is true**; it is *not* the probability that $H_0$ is true.
* Provides the confidence of a rejection
* The significance level observed in the test
* The more credible $H_0$ is, the smaller the p-value we require before rejecting it.
* ==The p-value is the area under the curve:==
> sum the tail area under the curve with `t.cdf()` or `norm.cdf()`

#### Two-Tailed Test vs. One-Tailed Test
Two-tailed: rejection regions in both tails. One-tailed: left tail (smaller) or right tail (larger) only.
> At the same significance level, a two-tailed test needs a more extreme statistic to reject.

### For TWO or more populations
Difference in **means**:
$$\mu_1-\mu_2\text{ or }\overline{x_1}-\overline{x_2}$$
Standard Error (SE)

#### ND, large sample ($n>30$)
\begin{align}
E(\overline{X}-\overline{Y})&=\mu_{\overline{X}-\overline{Y}} = \mu_{\overline{X}}-\mu_{\overline{Y}} = \mu_X-\mu_Y\\
Var(\overline{X}-\overline{Y})&=\sigma^2_{\overline{X}-\overline{Y}} = \sigma^{2}_{\overline{X}}+\sigma^2_{\overline{Y}} = \frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}\\
(\overline{X}-\overline{Y})&\sim ND\left(\mu_X-\mu_Y,\sqrt{\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}}\right)
\end{align}

#### ND, small sample ($n<30$, t-distribution)
1. Population is ND, known population variance:
$$(\overline{X}-\overline{Y})\sim ND\left(\mu_X-\mu_Y,\sqrt{\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}}\right)$$
2. Unknown population variance
    * Same variance: pooled sample variance (DOF $= n+m-2$)
$$\begin{gather}
S_p^2=\frac{(n-1)s_X^2+(m-1)s_Y^2}{n+m-2}\\
s_X^2=\frac{\sum_{i=1}^n(X_i-\overline{X})^2}{n-1}\land s_Y^2=\frac{\sum_{i=1}^m(Y_i-\overline{Y})^2}{m-1}\\
\frac{\overline{X}-\overline{Y}}{\sqrt{S_p^2\left(n^{-1}+m^{-1}\right)}}\sim t(n+m-2)
\end{gather}$$
    * Different variance: *Welch's t-test*
$$\begin{gather}
S=\sqrt{\frac{{S_1}^2}{n_1}+\frac{{S_2}^2}{n_2}},\ DOF=\frac{\left(\frac{{S_1}^2}{n_1}+\frac{{S_2}^2}{n_2}\right)^2}{\frac{\left({S_1}^2/n_1\right)^2}{n_1-1}+\frac{\left({S_2}^2/n_2\right)^2}{n_2-1}}\\
\frac{\overline{X}-\overline{Y}}{\sqrt{\frac{{S_1}^2}{n_1}+\frac{{S_2}^2}{n_2}}}\sim t\left(DOF\right)
\end{gather}$$

> **Example**
> How do we determine the temperature difference between two cycles?
> If the difference is small, we can **remove the seasonal cycle**. If a **correlation** is established, it is due to the seasonal effect.

### Paired and unpaired tests
1. Independent (unpaired) samples t-test: tests for a significant difference between two independent groups
2. Paired samples t-test: tests the mean difference between paired measurements

In the atmospheric sciences, beware of **auto-correlation**: it reduces the effective sample size.
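Both tests are available in `scipy.stats`; a minimal sketch on synthetic data (the station setup and values are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Unpaired: temperatures at two independent stations
a = rng.normal(25.0, 2.0, size=40)
b = rng.normal(26.0, 2.5, size=35)
t_u, p_u = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test

# Paired: the same station before/after some event
before = rng.normal(25.0, 2.0, size=30)
after = before + rng.normal(0.8, 1.0, size=30)
t_p, p_p = stats.ttest_rel(before, after)

print(f"unpaired: t = {t_u:.2f}, p = {p_u:.4f}")
print(f"paired:   t = {t_p:.2f}, p = {p_p:.4f}")
```

Setting `equal_var=True` instead would give the pooled-variance test with DOF $n+m-2$ described above.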
### Testing two population variances (F-test)
:::warning
![image](https://hackmd.io/_uploads/B15X3FIgA.png)
:::

#### 卡方分配 (Chi-square distribution) with $n-1$ degrees of freedom (distribution of the sample variance)
Closer to ND for larger $n$ (sample size)
\begin{align}
\chi^2_{n-1} = \frac{\sum_{i=1}^n(x_i-\bar{x})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2}
\end{align}
> $s^2$: sample variance; $\sigma^2$: population variance

* Lecture example for $\alpha=0.05$: reject $H_0$ for $\chi^2>16.9$ (one-tailed) or $\chi^2\notin[16.1,45.7]$ (two-tailed); the critical values depend on the DOF.

#### F-distribution for two populations
\begin{align}
F = \frac{\chi_1^2/(n-1)}{\chi_2^2/(m-1)} = \frac{s_1^2\sigma_2^2}{s_2^2\sigma_1^2}
\end{align}
> F-distribution with $n-1$ and $m-1$ DOF

Notice: if the degrees of freedom are swapped, the distribution is different.
> $$
> F_{(1-\alpha),df_1,df_2} = \left(F_{\alpha,df_2,df_1}\right)^{-1}
> $$
> Some textbooks put the smaller sample variance in the denominator (population 2) so that the F value is always larger than 1.

## Lecture 9 (04/02, 04/07): Regression

### Correlation & Covariance

#### Covariance
\begin{align}
\text{Var}(X) &= \frac{\sum_{i=1}^N(x_i-\bar{x})^2}{N}\\
\text{Cov}(X,Y) = \sigma_{xy} &= \frac{\sum_{i=1}^N(x_i-\bar{x})(y_i-\bar{y})}{N}
\end{align}
> The sign represents the tendency:
> $+$: points mostly in quadrants 1, 3
> $-$: points mostly in quadrants 2, 4

* joint deviation of two variables
* variance is the special case where the two variables are the same
* it **CANNOT** guarantee a causal relationship

#### Correlation
> measures ONLY how linearly dependent two variables are

##### Correlation coefficient
- compares the covariance with the variability of the variables
$$
r\equiv {\sigma_{xy}\over\sigma_x\sigma_y}
$$
> $r \in [-1,1]$
> $>0$: positive correlation; $<0$: negative correlation; $=0$: no (linear) correlation

### Regression Analysis
Linear regression: predict or evaluate one variable with others

#### Linear Estimation
$$
y=ax+b
$$
> $y$: predictand, dependent variable, response variable
> $x$: predictor, independent variable, explanatory variable
- Simple vs. multiple linear regression depends on the number of independent variables
- Error / residuals: $\epsilon_i=y_i-(ax_i+b)$

#### Least-squares approximation
> the best result is unique
> $$
> \text{Min }S_r = \sum^n_{i=1}\epsilon^2_i=\sum^n_{i=1} (y_i-ax_i-b)^2
> $$

$a={n\sum x_iy_i-\sum x_i\sum y_i\over{ n\sum x_i^2}-(\sum x_i)^2}$, $b=\bar{y}-a\bar{x}$
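A quick numerical check of the closed-form $a$ and $b$ above (synthetic data; `np.polyfit` is used only for comparison):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 50)
y = 1.5 * x + 3.0 + rng.normal(0.0, 1.0, size=x.size)  # true slope 1.5, intercept 3

n = x.size
a = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b = y.mean() - a * x.mean()

print(a, b)                 # closed-form least-squares estimates
print(np.polyfit(x, y, 1))  # [slope, intercept], should match
```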
#### Multiple linear regression
> for more than one independent variable

$$
\text{Var}(\hat{\beta_1}) = {\text{spread of residuals}\over{\text{spread of }x}}={\sigma^2\over{\sum(x_i-\bar{x})^2}} = {{SSE\over{n-2}}\over{\sum(x_i-\bar{x})^2}}
$$
> use SSE for the estimate because $\sigma$ is unknown

Overall quality: assessed by $R^2$
Significance of the regression coefficients: [Lecture 10]

#### Model Validation (SSE, SST, $R^2$)
$$
SST = SSR+SSE
$$
> SST: total sum of squares (總平方和)
> SSR: regression sum of squares (回歸值平方和)
> SSE: residual sum of squares (殘差平方和) $=\sum_{i=1}^n(y_i-\widehat{y_i})^2$
> ![image](https://hackmd.io/_uploads/r1wR3QUlR.png =40%x)

##### Coefficient of determination ($R^2$)
> the fraction of the variation in $y$ that is reduced by using $x$
> Notice: **NO** causation between $x$ and $y$ is implied
> $$
> R^2 = {SSR\over SST} = 1- {SSE\over SST}
> $$
> $R^2=1$: perfect fit
> $R^2=r^2$ when there is only one independent variable

## Lecture 10 (04/07): Significance test for regression coefficients

### Inference in Linear Regression
population: $y=\beta_0+\beta_1x$
estimate: $\widehat{y}=\widehat{\beta_0}+\widehat{\beta_1}x$
> The means of the sampling distributions of $\widehat{\beta_0}$, $\widehat{\beta_1}$ are $\beta_0$, $\beta_1$
> The standard errors of $\widehat{\beta_0}$, $\widehat{\beta_1}$ are related to SSE

Estimated variance:
\begin{align}
\text{Var}(\hat{\beta_1}) &= {{SSE\over{n-2}}\over{\sum(x_i-\bar{x})^2}}\\
SE_{\hat{\beta_1}}&=\frac{s}{\sqrt{s_{xx}}}=\frac{s}{\sqrt{\sum(x_i-\bar{x})^2}}
\end{align}

#### Hypothesis testing for regression model coefficients
Null ($H_0$): there is **no** relationship between X and Y
Alternative ($H_1$): there is **some** relationship between X and Y
> $H_0$: $\beta_1 =0$; $H_1$: $\beta_1 \neq 0$

How far from zero must $\hat{\beta_1}$ be to establish $\beta_1 \neq 0$? It depends on the **SE** (a larger SE demands a larger difference).
> **t-statistic**: $t={\hat{\beta_1}-0\over{SE(\hat{\beta_1})}}$

Assuming $\beta_1 =0$, the **p-value** can be calculated from the t-value (a smaller p-value indicates that an association exists).
Thus, for many variables, we can use a **correlation matrix**.

### Practice: regression map and correlation map
regression map: $\beta_1$
correlation map: $r$
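A minimal sketch of how the two maps might be computed in practice (all shapes and names here are hypothetical: a `(time, lat, lon)` anomaly field regressed onto a 1-D climate index):

```python
import numpy as np

rng = np.random.default_rng(3)
nt, ny, nx = 120, 30, 40               # e.g. 120 months on a 30x40 grid
index = rng.normal(size=nt)            # hypothetical climate index
field = rng.normal(size=(nt, ny, nx))  # hypothetical anomaly field

idx = index - index.mean()
fld = field - field.mean(axis=0)

# Regression map: the slope beta_1 of the field onto the index, per grid point
reg_map = np.einsum('t,tyx->yx', idx, fld) / np.sum(idx**2)

# Correlation map: r per grid point
cor_map = np.einsum('t,tyx->yx', idx, fld) / (
    np.sqrt(np.sum(idx**2)) * np.sqrt(np.sum(fld**2, axis=0)))

print(reg_map.shape, cor_map.shape)  # both (30, 40)
```

The regression map carries the field's physical units per unit of the index, while the correlation map is dimensionless in $[-1,1]$; each grid point's $r$ can then be tested for significance as in Lecture 10.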